TL;DR: You can have roxygen automatically build your R data documentation from an external CSV file.
Documenting data is an important part of the R package publishing process. Most packages document their data with a named list, giving a roxygen block that looks like this:
#' \describe{
#' \item{One}{Description of the One variable}
#' \item{Two}{Description of the Two variable}
#' }
Writing out named lists this way is fine for small, infrequently updated data. However, entering the information by hand is tedious, and because the list is non-tabular it is difficult to reuse in other contexts such as data dictionaries.
We can make storing and entering our data documentation cleaner, and enable reuse, by keeping the data dictionary in a CSV file separate from the R source code. We can then pass the data dictionary (as a data.frame object) through a function that converts it into a LaTeX-style \tabular block, which can be pasted by hand into a roxygen block (see http://r-pkgs.had.co.nz/man.html):
# modified from http://r-pkgs.had.co.nz/man.html
tabular <- function(df, ...) {
  stopifnot(is.data.frame(df))
  align <- function(x) if (is.numeric(x)) "r" else "l"
  col_align <- vapply(df, align, character(1))
  cols <- lapply(df, format, ...)
  contents <- do.call("paste",
    c(cols, list(sep = " \\tab ", collapse = "\\cr\n ")))
  col_names <- paste0("\\bold{",
    do.call("paste",
      c(names(df), list(sep = "} \\tab \\bold{", collapse = "\\cr\n "))),
    "} \\cr")
  paste("\\tabular{", paste(col_align, collapse = ""), "}{\n",
    col_names,
    "\n",
    contents, "\n}\n", sep = "")
}
The output, which looks like this, could then be copied and pasted into a roxygen block by hand:
cat(tabular(dictionary))
## \tabular{ll}{
## \bold{name} \tab \bold{description} \cr
## One \tab Description of the One variable\cr
## Two \tab Description of the Two variable
## }
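The dictionary object passed to tabular() above is not defined anywhere in this post. A minimal sketch of one way to build it, assuming the two-column layout (name, description) implied by the output shown:

```r
# Hypothetical stand-in for the `dictionary` object used above:
# one row per variable, with a `name` and a `description` column.
dictionary <- data.frame(
  name = c("One", "Two"),
  description = c("Description of the One variable",
                  "Description of the Two variable"),
  stringsAsFactors = FALSE
)

# Neither column is numeric, so tabular() would left-align both,
# which is where the "ll" in \tabular{ll} comes from.
vapply(dictionary, is.numeric, logical(1))
```

A numeric column would be right-aligned ("r") instead, following the align() helper inside tabular().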
The problem with this approach is that copying by hand becomes extremely cumbersome when you have many datasets to document. A project I’m working on has 10 frequently updated datasets with over 20 variables each. I discovered an automated solution using the little-known @eval roxygen tag.
What follows is a step-by-step guide to creating automated roxygen data documentation that pulls from an external data dictionary. For this demonstration we will use the population dataset contained in the tidyr R package. The strategy I describe is implemented in a fully functioning demonstration package at https://github.com/jsta/autodatadoc.
Step 1: Create a CSV data dictionary
Most people will use a spreadsheet program to create their data dictionary by hand. Alternatives include more specialized tools such as the dataMeta R package.
Our example data has a data dictionary that looks like this:
name | description |
---|---|
country | Country name |
year | Year |
population | Population |
We should probably store our data dictionary under version control alongside our package. This is conventionally done by storing it in the data-raw folder.
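If you would rather not type the variable names into a spreadsheet, a skeleton dictionary can be generated from the dataset itself. This is only a sketch: the small population data frame below is a stand-in with the same column names as tidyr::population, and the file is written to a temporary directory rather than to data-raw.

```r
# Stand-in for the dataset being documented; only the column names matter here.
population <- data.frame(
  country = c("Afghanistan", "Brazil"),
  year = c(1999L, 1999L),
  population = c(19987071, 172006362),
  stringsAsFactors = FALSE
)

# Skeleton dictionary: one row per column, descriptions left to fill in by hand.
dictionary <- data.frame(
  name = names(population),
  description = "",
  stringsAsFactors = FALSE
)

# Written to a temporary file so the sketch is safe to run anywhere;
# in a package the target would be something like data-raw/dictionary.csv.
path <- file.path(tempdir(), "dictionary.csv")
write.csv(dictionary, path, row.names = FALSE)
```

The descriptions still have to be written by a person, but the name column can no longer drift from the data through typos.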
Step 2: Write a function that returns a tabular representation of our data dictionary as individual lines
We need to write a function to locate our data dictionary, read the contents, pass it through our tabular function above, and return the output as individual roxygen lines:
get_table_metadata <- function(path) {
  dt <- read.csv(path, stringsAsFactors = FALSE)
  # split the single string from tabular() into one element per line
  con <- textConnection(tabular(dt))
  on.exit(close(con))
  readLines(con)
}
get_table_metadata("data-raw/dictionary.csv")
## [1] "\\tabular{ll}{"
## [2] "\\bold{name} \\tab \\bold{description} \\cr"
## [3] "country \\tab Country name\\cr"
## [4] " year \\tab Year \\cr"
## [5] " population \\tab Population "
## [6] "}"
## [7] ""
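The important step in get_table_metadata() is turning the single multi-line string from tabular() into a character vector with one element per line. A minimal sketch of that step in isolation, using a shortened hand-written string in place of real tabular() output:

```r
# Hand-written stand-in for the string tabular() returns (shortened).
tabular_text <- paste0(
  "\\tabular{ll}{\n",
  "\\bold{name} \\tab \\bold{description} \\cr\n",
  "country \\tab Country name\\cr\n",
  " year \\tab Year\n",
  "}\n"
)

# A text connection lets readLines() treat the string like a file,
# splitting it at each newline.
con <- textConnection(tabular_text)
lines <- readLines(con)
close(con)

lines
```

Each element of lines now holds exactly one line of the table, with no embedded newlines.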
Step 3: Automatically include this output in a roxygen block using the @eval tag
To have this function evaluated when roxygen builds the documentation, we include an @eval tag in our roxygen code:
#' World Health Organization TB data
#'
#' A subset of data from the World Health Organization Global Tuberculosis
#' Report, and accompanying global populations.
#'
#' @eval c("@format", get_table_metadata("data-raw/dictionary.csv"))
#'
"population"
Step 4: Run devtools::document()
Now when we build the documentation with devtools::document() (or roxygen2::roxygenise()), the @eval tag will evaluate the function and include a nicely formatted data dictionary table in our docs!
population | R Documentation |
World Health Organization TB data
Description
A subset of data from the World Health Organization Global Tuberculosis Report, and accompanying global populations.
Usage
population
Format
name | description |
---|---|
country | Country name |
year | Year |
population | Population |
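To see what the @eval tag from Step 3 contributes, it helps to look at the vector it receives. As far as I understand the roxygen2 documentation, each element of the returned character vector is treated as if it had been typed as its own roxygen line. The sketch below hard-codes a stand-in for the output of get_table_metadata() so that it runs on its own:

```r
# Stand-in for the output of get_table_metadata(); hard-coded so the
# example does not depend on the CSV file or on tabular().
table_lines <- c(
  "\\tabular{ll}{",
  "\\bold{name} \\tab \\bold{description} \\cr",
  "country \\tab Country name\\cr",
  "year \\tab Year\\cr",
  "population \\tab Population",
  "}"
)

# The vector handed to @eval: the tag first, then its content.
eval_result <- c("@format", table_lines)

# Prefixing each element with "#' " prints the roughly equivalent
# roxygen lines you would otherwise have to paste in by hand.
cat(paste0("#' ", eval_result), sep = "\n")
```

Putting "@format" first is what places the table under the Format heading of the help page.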