Automated roxygen documentation of R package data

TL;DR: You can have roxygen automatically build your R data documentation from an external CSV file.

Documenting data is an important part of publishing an R package. Most packages document their data with a named list, giving a roxygen block that looks like this:

#' \describe{
#'   \item{One}{Description of the One variable}
#'   \item{Two}{Description of the Two variable}
#' }

Writing out named lists this way is fine for small, infrequently updated datasets. However, every entry has to be typed by hand, and because the list is not tabular it is difficult to reuse in other contexts such as data dictionaries.

We can make storing and entering our data documentation cleaner, and enable reuse, by keeping the data dictionary in a CSV file separate from the R source code. We could then pass the data dictionary (as a data.frame object) through a function that converts it into a LaTeX-style \tabular object, which can be pasted by hand into a roxygen block (see http://r-pkgs.had.co.nz/man.html):

# modified from http://r-pkgs.had.co.nz/man.html
tabular <- function(df, ...) {
  stopifnot(is.data.frame(df))

  align <- function(x) if (is.numeric(x)) "r" else "l"
  col_align <- vapply(df, align, character(1))

  cols <- lapply(df, format, ...)
  contents <- do.call("paste",
                      c(cols, list(sep = " \\tab ", collapse = "\\cr\n  ")))
  col_names <- paste0("\\bold{",
                      do.call("paste",
                              c(names(df), list(sep = "} \\tab \\bold{", collapse = "\\cr\n  "))),
                      "} \\cr")

  paste("\\tabular{", paste(col_align, collapse = ""), "}{\n",
        col_names,
        "\n",
        contents, "\n}\n", sep = "")
}

The output, which looks like this, could then be copied and pasted by hand into the roxygen block:

cat(tabular(dictionary))

## \tabular{ll}{
## \bold{name} \tab \bold{description} \cr
## One \tab Description of the One variable\cr
##   Two \tab Description of the Two variable
## }
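
The dictionary object passed to tabular() above is not defined in this post. A minimal sketch, assuming it is a two-column data frame whose name and description columns match the output shown:

```r
# Assumed structure of `dictionary`: one row per variable, all text columns.
dictionary <- data.frame(
  name        = c("One", "Two"),
  description = c("Description of the One variable",
                  "Description of the Two variable"),
  stringsAsFactors = FALSE
)

# Both columns are character, so tabular() would left-align them ("ll").
str(dictionary)
```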

The problem with this approach is that copying the output by hand becomes extremely cumbersome when you have many datasets to document. A project I’m working on has 10 datasets, each with more than 20 variables, and they are updated frequently. I discovered an automated solution using the little-known @eval roxygen tag.

What follows is a step-by-step guide to creating automated roxygen data documentation that pulls from an external data dictionary. For this demonstration we will use the population dataset contained in the tidyr R package. The strategy I describe is implemented in a fully functioning demonstration package at https://github.com/jsta/autodatadoc.

Step 1: Create a csv data dictionary

Most people will create their data dictionary by hand in a spreadsheet program. An alternative is to use a more specialized tool such as the dataMeta R package.

Our example data has a data dictionary that looks like this:

name	description
country	Country name
year	Year
population	Population

It is a good idea to keep the data dictionary under version control alongside the package. This is conventionally done by storing it in the data-raw folder.
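
As a sketch of what that file could look like on disk, the snippet below writes the example dictionary to a CSV file and reads it back. The file name dictionary.csv matches the path used later in this post; writing to a temporary directory instead of data-raw is only an assumption made so the snippet can be run anywhere.

```r
# Build the example data dictionary (one row per variable in `population`).
dictionary <- data.frame(
  name        = c("country", "year", "population"),
  description = c("Country name", "Year", "Population"),
  stringsAsFactors = FALSE
)

# In a package this would be "data-raw/dictionary.csv"; a temp dir is used
# here so the example leaves the working directory untouched.
path <- file.path(tempdir(), "dictionary.csv")
write.csv(dictionary, path, row.names = FALSE)

# Read it back to confirm the round trip preserves names and contents.
roundtrip <- read.csv(path, stringsAsFactors = FALSE)
print(roundtrip)
```

Passing row.names = FALSE keeps the file to exactly the two dictionary columns, which is what a spreadsheet export would normally contain.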

Step 2: Write a function to return a tabular representation of our data dictionary as individual lines

We need to write a function to locate our data dictionary, read the contents, pass it through our tabular function above, and return the output as individual roxygen lines:

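get_table_metadata <- function(path){
  # read the data dictionary and convert it to an Rd \tabular string
  dt <- read.csv(path, stringsAsFactors = FALSE)
  # split the string into one element per line, closing the connection on exit
  con <- textConnection(tabular(dt))
  on.exit(close(con))
  readLines(con)
}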

get_table_metadata("data-raw/dictionary.csv")

## [1] "\\tabular{ll}{"                             
## [2] "\\bold{name} \\tab \\bold{description} \\cr"
## [3] "country    \\tab Country name\\cr"          
## [4] "  year       \\tab Year        \\cr"        
## [5] "  population \\tab Population  "            
## [6] "}"                                          
## [7] ""
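
Splitting a multi-line string into one element per line can also be done with strsplit(). This is an alternative sketch, not what the function above does; the input string is a hand-written stand-in for tabular() output. Unlike the readLines() approach shown above, it does not produce the trailing empty string.

```r
# Hand-written stand-in for the string returned by tabular().
rd_table <- "\\tabular{ll}{\n\\bold{name} \\tab \\bold{description} \\cr\nyear \\tab Year\n}\n"

# fixed = TRUE treats "\n" as a literal newline rather than a regex.
rd_lines <- strsplit(rd_table, "\n", fixed = TRUE)[[1]]
print(rd_lines)
```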

Step 3: Automatically include this output in a roxygen block using the `@eval` tag.

To have this function evaluated during the course of roxygen building, we include an @eval tag in our roxygen code:

#' World Health Organization TB data
#'
#' A subset of data from the World Health Organization Global Tuberculosis
#' Report, and accompanying global populations.
#'
#' @eval c("@format", get_table_metadata("data-raw/dictionary.csv"))
#'
"population"
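
To make the mechanics concrete: roxygen2 evaluates the expression after @eval and treats the resulting character vector as if those lines had been written in the roxygen block. The sketch below prints the lines that the block above would effectively contain. The table lines are hard-coded from the Step 2 output so that the snippet runs without the package.

```r
# Table lines copied from the get_table_metadata() output shown earlier.
table_lines <- c(
  "\\tabular{ll}{",
  "\\bold{name} \\tab \\bold{description} \\cr",
  "country    \\tab Country name\\cr",
  "  year       \\tab Year        \\cr",
  "  population \\tab Population  ",
  "}"
)

# The same kind of vector that the @eval expression builds.
roxy_lines <- c("@format", table_lines)

# Prefix with #' to show the roughly equivalent hand-written roxygen lines.
cat(paste("#'", roxy_lines), sep = "\n")
```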

Step 4: Run `devtools::document()`

Now when we build the documentation with devtools::document() (or roxygen2::roxygenise()), the @eval tag will evaluate the function and include a nicely formatted data dictionary table in our docs!

population

R Documentation

World Health Organization TB data

Description

A subset of data from the World Health Organization Global Tuberculosis Report, and accompanying global populations.

Usage

population

Format

name	description
country	Country name
year	Year
population	Population