The dataone R package enables users to construct programmatic queries to DataONE data repositories. This should be very helpful for people interested in making their analyses reproducible! Here I show my workflow for locating, analyzing, and plotting Oneida Lake benthic invertebrate data all without leaving my RStudio session.
Setup
Let’s assume that I know my target data is on KNB, so I start by setting up dataone for a node-specific KNB search:
library(dataone)
cn <- CNode("PROD")
mn <- getMNode(cn, "urn:node:KNB")
Find data id
Let’s also assume that, at least initially, I do not know the id of my target dataset. I only know that the dataset title should include the benthic and Oneida keywords. I choose to use the Solr method to query the database on the Oneida keyword and filter the results to entries containing the benthic keyword. I limit the result fields to id, title, and dateModified. For more explanation of the Solr query method see this dataone vignette.
(qy <- dataone::query(cn, list(
rows = "300",
q = "title:*Oneida*",
fq = "(title:*benthic*)",
fl = "id,title,dateModified"),
as = "data.frame"))
## id dateModified
## 1 kgordon.4.51 2016-02-01
## 2 kgordon.4.52 2016-09-01
## 3 kgordon.4.56 2016-12-07
## 4 kgordon.4.38 2016-02-01
## 5 kgordon.4.37 2013-11-14
## 6 doi:10.5063/AA/kgordon.4.24 2015-01-05
## 7 doi:10.5063/AA/kgordon.4.9 2015-01-06
## 8 doi:10.5063/AA/kgordon.4.3 2015-01-06
## 9 doi:10.5063/AA/kgordon.4.34 2015-01-06
## 10 doi:10.5063/AA/kgordon.4.4 2015-01-06
## title
## 1 Benthic invertebrates in Oneida Lake, New York, 1956-
## 2 Benthic invertebrates in Oneida Lake, New York, 1956
## 3 Benthic invertebrates in Oneida Lake, New York, 1956
## 4 Ekman sampling of benthic invertebrates in Oneida Lake, New York,
## 5 Ekman sampling of benthic invertebrates in Oneida Lake, New York,
## 6 Ekman sampling of benthic invertebrates in Oneida Lake, New York,
## 7 Eckman sampling of benthic invertebrates in Oneida Lake, NY,
## 8 Eckman sampling of benthic invertebrates in Lake Oneida, NY,
## 9 Ekman sampling of benthic invertebrates in Oneida Lake, New York,
## 10 Eckman sampling of benthic invertebrates in Oneida Lake, NY,
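If I wanted to reuse this query pattern for other keyword pairs, the parameter list could be built by a small function. The sketch below is only an illustration: `title_query` is a hypothetical helper of my own and not part of dataone, and it simply reproduces the `rows`, `q`, `fq`, and `fl` entries used above.

```r
# Hypothetical helper (not part of dataone): builds the Solr parameter
# list for a title search on one keyword, filtered on a second keyword.
title_query <- function(keyword, filter_keyword, rows = 300,
                        fields = c("id", "title", "dateModified")) {
  list(
    rows = as.character(rows),
    q    = paste0("title:*", keyword, "*"),
    fq   = paste0("(title:*", filter_keyword, "*)"),
    fl   = paste(fields, collapse = ",")
  )
}

params <- title_query("Oneida", "benthic")
str(params)

# With a live connection, params can take the place of the inline list:
# qy <- dataone::query(cn, params, as = "data.frame")
```

The call to `dataone::query` is left commented out because it needs network access.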
As far as I know, the results returned by a KNB website query include only the entries not prefixed by a doi designation. Let’s filter our results to exclude the doi entries, then sort on the id and dateModified fields to find the most recent version:
library(dplyr)
(qy <- slice(qy, grep("^doi", id, invert = TRUE)))
## id dateModified
## 1 kgordon.4.51 2016-02-01
## 2 kgordon.4.52 2016-09-01
## 3 kgordon.4.56 2016-12-07
## 4 kgordon.4.38 2016-02-01
## 5 kgordon.4.37 2013-11-14
## 6 kgordon.4.55 2016-09-01
## 7 kgordon.4.57 2016-12-07
## title
## 1 Benthic invertebrates in Oneida Lake, New York, 1956-t
## 2 Benthic invertebrates in Oneida Lake, New York, 1956 t
## 3 Benthic invertebrates in Oneida Lake, New York, 1956 t
## 4 Ekman sampling of benthic invertebrates in Oneida Lake, New York,
## 5 Ekman sampling of benthic invertebrates in Oneida Lake, New York,
## 6 Benthic invertebrates in Oneida Lake, New York, 1956 t
## 7 Benthic invertebrates in Oneida Lake, New York, 1956 t
(qy <- arrange(qy, desc(id), desc(dateModified)))
## id dateModified
## 1 kgordon.4.57 2016-12-07
## 2 kgordon.4.56 2016-12-07
## 3 kgordon.4.55 2016-09-01
## 4 kgordon.4.52 2016-09-01
## 5 kgordon.4.51 2016-02-01
## 6 kgordon.4.38 2016-02-01
## 7 kgordon.4.37 2013-11-14
## title
## 1 Benthic invertebrates in Oneida Lake, New York, 1956 t
## 2 Benthic invertebrates in Oneida Lake, New York, 1956 t
## 3 Benthic invertebrates in Oneida Lake, New York, 1956 t
## 4 Benthic invertebrates in Oneida Lake, New York, 1956 t
## 5 Benthic invertebrates in Oneida Lake, New York, 1956-t
## 6 Ekman sampling of benthic invertebrates in Oneida Lake, New York,
## 7 Ekman sampling of benthic invertebrates in Oneida Lake, New York,
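One caveat with sorting on id: it is a character column, so the ordering is alphabetical rather than numeric. That happens to work here because every remaining id ends in a two-digit number. The sketch below shows a more defensive ordering, under my assumption that the digits after the final period are a revision counter (the ids are copied from the output above).

```r
# ids copied from the filtered query result
ids <- c("kgordon.4.51", "kgordon.4.52", "kgordon.4.56", "kgordon.4.38",
         "kgordon.4.37", "kgordon.4.55", "kgordon.4.57")

# Assumption: the digits after the final period are a revision number
revision <- as.integer(sub("^.*\\.", "", ids))

# Order numerically, newest revision first
ids_sorted <- ids[order(revision, decreasing = TRUE)]
latest <- ids_sorted[1]
latest

# Character sorting compares digit by digit, so a single-digit
# revision such as 9 would be ranked above 57
sort(c("kgordon.4.9", "kgordon.4.57"), decreasing = TRUE)
```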
Get data package
Next, I download the data package with the getPackage command. This command returns the location of a zip file in the machine’s temporary file system, which can be fed to the unzip function.
resource_path <- paste0("resourceMap_", qy[1,"id"])
dt <- getPackage(mn, id = resource_path)
unzip(dt)
# the unzipped folder has underscores as separators rather than periods
package_path <- file.path(gsub("\\.", "_", resource_path), "data")
(flist <- list.files(package_path))
## [1] "cbfs.140.3-Oneida_Benthos_1956_to_present.csv.csv"
## [2] "cbfs.141.3-Taxa_list_Oneida_Benthos.csv.csv"
## [3] "cbfs.27.8-Benthos_locations.csv.csv"
## [4] "kgordon.4.57-METADATA.pdf"
## [5] "kgordon.4.57-METADATA.xml"
## [6] "resourceMap_kgordon.4.57-RESOURCE.rdf"
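Picking a file by position works as long as the listing order never changes. As a rough alternative, the file can be matched by name instead. This sketch needs no download: the file names and the resource map id are copied from the output above, and the search pattern is my own choice.

```r
# File listing copied from the output above
flist <- c(
  "cbfs.140.3-Oneida_Benthos_1956_to_present.csv.csv",
  "cbfs.141.3-Taxa_list_Oneida_Benthos.csv.csv",
  "cbfs.27.8-Benthos_locations.csv.csv",
  "kgordon.4.57-METADATA.pdf",
  "kgordon.4.57-METADATA.xml",
  "resourceMap_kgordon.4.57-RESOURCE.rdf"
)

# Periods in the resource map id become underscores in the folder name
resource_path <- "resourceMap_kgordon.4.57"
package_path <- file.path(gsub(".", "_", resource_path, fixed = TRUE), "data")

# Select the benthos table by name rather than by position
data_file <- grep("Oneida_Benthos_1956", flist, value = TRUE, fixed = TRUE)
stopifnot(length(data_file) == 1)  # fail early if the pattern is ambiguous

fpath <- file.path(package_path, data_file)
fpath
```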
fpath <- file.path(package_path, flist[1])
dt <- read.csv(fpath, stringsAsFactors = FALSE, na.strings = "-999.0")
head(dt)
## Year DepthGroup Season X.SamplingEvents Chironomidae Ephemeroptera
## 1 1957 deep Spring 1 359.9 121.7
## 2 1958 deep Spring 1 34.3 500.0
## 3 1959 deep Spring 1 822.7 34.8
## 4 1962 deep Spring 4 939.8 0.0
## 5 1963 deep Spring 2 454.2 0.0
## 6 1965 deep Spring 2 1782.5 0.0
## Trichoptera Other.insects Amphipoda Isopoda Leeches Oligochaeta Planaria
## 1 0.0 17.4 8.7 0.0 0.0 NA NA
## 2 0.0 0.0 0.0 0.0 0.0 NA NA
## 3 0.0 0.0 0.0 0.0 0.0 NA NA
## 4 0.0 6.5 0.0 0.0 39.1 NA NA
## 5 0.0 0.0 52.2 0.0 0.0 4.3 NA
## 6 4.3 0.0 8.7 69.6 13.0 21.7 NA
## Snails Clams Quagga.Mussels Zebra.Mussels
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
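A note on the na.strings argument: it matches the text of each field exactly, so "-999.0" is converted to NA but a differently formatted sentinel such as "-999" would not be. I have not checked whether this dataset mixes the two forms, so the toy CSV below is purely illustrative. It shows what would happen if a file did.

```r
# Toy CSV, illustrative only: the second row writes the sentinel as -999
csv_text <- paste(
  "Year,Oligochaeta,Snails",
  "1963,4.3,-999.0",
  "1965,21.7,-999",
  sep = "\n"
)

toy <- read.csv(text = csv_text, na.strings = "-999.0")
toy$Snails  # the "-999.0" field is NA, the "-999" field is still a number

# Catch any remaining sentinel values numerically after reading
toy$Snails[toy$Snails %in% -999] <- NA
toy
```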
Plot data
Now that I have the dataset located, downloaded, and read into a data.frame object, I go ahead and plot the result while faceting on organism class.
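library(dplyr)
library(ggplot2)
# average every taxon column (columns 5 to 17) within each year;
# summarise_each() and funs() are deprecated in recent dplyr releases
dt <- group_by(dt, Year)
dt <- summarise_each(dt, funs(mean), 5:17)
# reshape to long format: one row per Year and taxon
dt <- reshape2::melt(dt, "Year")
# one line-plot panel per taxon
gg <- ggplot(data = dt) +
  geom_line(aes(x = Year, y = value)) +
  facet_wrap(~variable)
gg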
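Because mean returns NA whenever any input is missing, a taxon that was not recorded in one sampling event of a year ends up with no value for that whole year. Here is a minimal base R sketch of the same summarise-then-reshape step with missing values skipped. The toy numbers are made up for illustration and are not taken from the dataset.

```r
# Toy data, illustrative values only: two sampling events per year
toy <- data.frame(
  Year         = c(1963, 1963, 1965, 1965),
  Chironomidae = c(454.2, 400.0, 1782.5, 1000.0),
  Oligochaeta  = c(4.3, NA, 21.7, 10.1)
)

# Per-year mean of each taxon column; na.rm = TRUE skips missing values
yearly <- aggregate(toy[-1], by = list(Year = toy$Year),
                    FUN = mean, na.rm = TRUE)
yearly

# Long format: one row per Year and taxon, ready for faceting
long <- data.frame(
  Year = rep(yearly$Year, times = ncol(yearly) - 1),
  stack(yearly[-1])
)
long
```

Whether skipping missing values is appropriate depends on why they are missing, so treat this as an option rather than a recommendation.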