The Ecological Society of America (ESA) has recently adopted a new Open Science (and Open Data policy):
In general, I think this new policy is great and should go a long way in promoting open data, code, and reproduciblity in Ecology.
The purpose of this blog post is to talk about A) ESA data policy requirements that I view as too prescriptive such that they are too onerous for newcomers or risk becoming outdated as digital services (hopefully) come-and-go and B) requirements that I view as ambiguous such that people may have trouble following them.
My goal is to dissect the former in the hopes that the policy will be flexible and expound on the latter to hopefully bring some clarity. All of this is bearing in mind the difficulty inherent in arriving at the right amount of detail in these types of policy documents without going overboard. I think some ambiguity is unavoidable and I plan to amend this post with more resources so that people can be informed. What follows are merely my opinions as of the writing of this post and I’d love to hear from people where I might be off base.
The relevant text from the policy document is set in quoted blocks under each requirement with emphasis mine
1. Are graphical point-and-click figure making programs banned?
[Authors must formally archive the original script that generated] every number, every p-value, and every graph [in a given submission]
Despite the policy above, I think it is overly prescriptive to require code for every graph. In my experience, code to create most figures is not particularly novel and I don’t think the intent of this requirement is to ban point-and-click graph making tools like Sigmaplot, Excel, and the like. It is good enough to simply provide the code which produces the csv file used to make a given graph.
2. Do we have to use Zenodo for archiving Github material?
A Zenodo DOI must be obtained for GitHub material to ensure permanency and versioning.
I am fan of Zenodo but I disagree with codifying them as the required Github archival service. There other ways to get a DOI for Github material. For example, Figshare also has Github integration. I would also argue that we shouldn’t be promoting Github by name but that’s a topic for another day.
3. What is a proprietary data format? Are shapefiles banned?
Data should be provided in non-proprietary, machine-readable formats (e.g., “.csv”).
This is a great part of the policy in my opinion. While the policy gives an example of a non-proprietary format, there are differences in practice between what is technically proprietary and practically proprietary. For example, Access databases and raster ESRI geodatabases require paid software to open and are both technically and practically proprietary. Conversely, shapefiles are technically proprietary but they are not practically proprietary. The format has an open specification so as to not require paid software. Although I recommend using Geopackages rather than Shapefiles because they are technically non-proprietary (see prior post), I think it is OK to use and archive shapefiles.
4. What is and is not novel code?
Archiving of novel code is required for each submission. Code that is not novel (e.g., from a standard statistical package or publicly available model) must only be properly cited and referenced.
This is certainly a worthwhile requirement. However, I feel that it can be difficult to distinguish between what is and is not novel code. My interpretation of the requirement is that if you used a publicly available software package (e.g. an R package on CRAN, even if “non-standard”), you should cite it but you don’t need to archive its source code. In my experience, the biggest challenge to providing code with a paper is corralling disparate coding efforts across large multi-author teams. So, yes, the code your collaborator wrote in an unfamiliar (to you) language to produce one component of the paper counts as novel code. It doesn’t need to be integrated with your code in a single repository but it does need to be archived.
5. What does it mean to archive a derived data product?
Raw data and metadata used to generate tables, figures, plots, videos/animations [is required for each submission]. Derived data products [are required for each submission]. Derived data products constitutes data which are collected from database source(s) and collated and/or processed specific to the current research.
I think the Details sections of the policy does a good job defining raw data. I don’t see a lot of ambiguity there. However, I do see some potential for confusion with the requirement to archive derived data products. As an example, consider a study where elevation data is collected from something like the National Elevation Dataset (NED). My interpretation is that to satisfy the policy you would not need to archive the entire NED or even the entirety of any NED tiles you used in your paper. Because the NED is a very public dataset, you would provide a formal citation reference and details (or code) for data processing steps. Then, if you extracted point measurements from NED tiles, these would be archived as part of your “raw” data product.
I am excited for the future of Open Science. It is great for education and has the potential to at least somewhat puncture the Ivory Tower. One of the sticking points in past Open Science discussions I’ve seen is the question of enforcement. Who will be responsible for telling authors they are required to do the extra work of making their data and code open? Will it be granting agencies, university leadership, journal reviewers? On this point, I was particularly interested to see the promise that “ESA Publications staff will verify data and code archiving is complete before releasing files to the publisher”.