Scholarly metadata in R
R-bloggers 2013-03-16
Scholarly metadata - the meta-information surrounding articles - can be super useful. Although metadata does not contain the full content of articles, it contains a lot of useful information, including title, authors, abstract, URL to the article, etc.
One of the largest sources of metadata is provided via the Open Archives Initiative Protocol for Metadata Harvesting, or OAI-PMH. Many publishers provide their metadata through their own endpoint and implement the standard OAI-PMH methods: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords, and ListSets. Many providers use OAI-PMH, including DataCite, Dryad, and PubMed.
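Under the hood, an OAI-PMH request is just an HTTP GET against the provider's base URL, with a verb parameter plus a few optional arguments. Here is a minimal sketch in base R of how such a URL is put together. The endpoint https://example.org/oai is a placeholder, not a real provider, and oai_url is a helper name I made up for illustration; it is not part of rmetadata.

```r
# The six standard OAI-PMH verbs
oai_verbs <- c("GetRecord", "Identify", "ListIdentifiers",
               "ListMetadataFormats", "ListRecords", "ListSets")

# Hypothetical helper: build a request URL from a base URL, a verb,
# and optional named arguments (identifier, metadataPrefix, from, until, set)
oai_url <- function(base_url, verb, ...) {
  verb <- match.arg(verb, oai_verbs)  # errors on anything that is not a known verb
  args <- c(list(verb = verb), list(...))
  # Percent-encode each value so characters such as ":" and "/" survive
  values <- vapply(args,
                   function(x) URLencode(as.character(x), reserved = TRUE),
                   character(1))
  paste0(base_url, "?", paste(names(args), values, sep = "=", collapse = "&"))
}

# Placeholder endpoint and identifier, for illustration only
oai_url("https://example.org/oai", "Identify")
oai_url("https://example.org/oai", "GetRecord",
        identifier = "oai:example.org:123", metadataPrefix = "oai_dc")
```

Fetching the resulting URL returns XML, which is what a wrapper package has to parse for you.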
Some data and article providers offer their metadata via their own APIs. For example, Nature Publishing Group provides their own metadata API here in a non-OAI-PMH format; you can get PLoS metadata through their search API; and the BHL (see below) provides their own custom metadata service.
In addition, CrossRef provides a number of metadata search services: metadata search and openurl.
What about the other publishers? (please tell me if I'm wrong about these three)
- Springer has a metadata API, but it is terrible, soooo...
- Elsevier, are you kidding? Well, they do have some sort of API service, but it's a pain in the ass.
- Wiley, no better than Elsevier.
Note that metadata can live in other places:
- Another package being developed by David Springate, rpubmed, can get PubMed metadata.
- Our wrapper to the Mendeley API, RMendeley, gets article metadata via Mendeley's database.
- Our wrapper to the Biodiversity Heritage Library API here gets their metadata.
No, you can't get metadata via Google Scholar - they don't allow scraping, and don't expose their data via an API.
I have discussed the rmetadata package in a previous blog post, but have since worked on the code a bit, and thought it deserved a new post.
You can see a tutorial for this package here, and contribute to the code here.
Install rmetadata
# install_github('rmetadata', 'ropensci') # uncomment to install
library(rmetadata)
Count OAI-PMH identifiers for a data provider.
# For DataCite.
count_identifiers("datacite")
  provider   count
1 datacite 1216193
Lookup article info via CrossRef with DOI and get a citation.
As Bibtex
print(crossref_citation("10.3998/3336451.0009.101"), style = "Bibtex")
@Article{,
  title = {In Google We Trust?},
  author = {Geoffrey Bilder},
  journal = {The Journal of Electronic Publishing},
  year = {2006},
  month = {01},
  volume = {9},
  doi = {10.3998/3336451.0009.101},
}
As regular text
print(crossref_citation("10.3998/3336451.0009.101"), style = "text")
Bilder G (2006). "In Google We Trust?" _The Journal of Electronic
Publishing_, *9*. <URL:>.
Search the CrossRef Metadata for DOIs using free form references.
Search with title, author, year, and journal
crossref_search_free(query = "Piwowar Sharing Detailed Research Data Is Associated with Increased Citation Rate PLOS one 2007")
                                                                                             text
1 Piwowar Sharing Detailed Research Data Is Associated with Increased Citation Rate PLOS one 2007
  match                   doi score
1  TRUE 10.1038/npre.2007.361 4.905
Get a DOI and get the citation using crossref_search
# Get a DOI for a paper
doi <- crossref_search_free(query = "Piwowar sharing data PLOS one")$doi
# Get the metadata
crossref_search(doi = doi)[, 1:3]
                           doi score normalizedScore
1 10.1371/journal.pone.0000308 18.19             100
Get a random set of DOIs through CrossRef.
# Default search gets 20 random DOIs
crossref_r()
 [1] "10.4028/"
 [2] "10.2139/ssrn.548523"
 [3] "10.1016/S0012-821X(02)00562-9"
 [4] "10.1093/rsq/13.2-3.167"
 [5] "10.5772/55055"
 [6] "10.1515/BC.1999.050"
 [7] "10.1016/S0020-7292(98)90160-6"
 [8] "10.1111/j.1439-0418.1985.tb02788.x"
 [9] "10.1089/aid.2012.0115"
[10] "10.1016/0002-9378(95)90155-8"
[11] "10.1001/jama.1949.02900490055028"
[12] "10.1051/jphyscol:1989172"
[13] "10.1016/s0301-2115(03)00298-7"
[14] "10.1007/BF02735292"
[15] "10.1016/0003-4916(65)90026-6"
[16] "10.4156/jdcta.vol5.issue5.12"
[17] "10.1007/s10904-009-9316-2"
[18] "10.1023/A:1021690001832"
[19] "10.1007/s12262-012-0724-0"
[20] "10.1007/bf02192860"

# Specify you want journal articles only
crossref_r(type = "journal_article")
 [1] "10.1016/j.jacc.2011.09.055"
 [2] "10.1002/dev.420170603"
 [3] "10.4315/0362-028X.JFP-10-403"
 [4] "10.1016/S0925-4927(98)00016-X"
 [5] "10.1111/j.1933-1592.2002.tb00141.x"
 [6] "10.1541/ieejfms.127.629"
 [7] "10.5539/enrr.v3n1p62"
 [8] "10.1016/S0960-9776(96)90038-7"
 [9] "10.1016/0925-9635(94)05240-9"
[10] "10.1016/s0929-693x(97)86846-7"
[11] "10.1002/(SICI)1096-9071(199601)48:1<53::AID-JMV9>3.0.CO;2-K"
[12] "10.1016/s0267-7261(01)00016-1"
[13] "10.1111/j.1748-0361.2003.tb00575.x"
[14] "10.1097/00005721-197701000-00011"
[15] "10.1007/s00894-009-0593-z"
[16] "10.1071/AR9830063"
[17] "10.1186/gb-2009-10-4-r39"
[18] "10.2165/00128415-201113540-00038"
[19] "10.1007/BF00522986"
[20] "10.1080/19407963.2011.539385"
Search the CrossRef Metadata API.
# Search for two different query terms
crossref_search(query = c("renear", "palmer"), rows = 4)[, 1:3]
                            doi score normalizedScore
1       10.1126/science.1157784 3.253             100
2  10.1002/meet.2009.1450460141 2.169              66
3 10.4242/BalisageVol3.Renear01 2.102              64
4 10.4242/BalisageVol5.Renear01 2.102              64

# Get results for a certain year
crossref_search(query = c("renear", "palmer"), year = 2010)[, 1:3]
                                  doi  score normalizedScore
1            10.1002/meet.14504701218 1.0509             100
2            10.1002/meet.14504701240 1.0509             100
3           10.5270/OceanObs09.cwp.68 1.0442              99
4               10.1353/mpq.2010.0003 0.6890              65
5                  10.1353/mpq.0.0041 0.6890              65
6                  10.1353/mpq.0.0044 0.6890              65
7                  10.1353/mpq.0.0057 0.6890              65
8                    10.1386/fm.1.1.2 0.6890              65
9                    10.1386/fm.1.2.2 0.6890              65
10                   10.1386/fm.1.3.2 0.6890              65
11       10.1097/ALN.0b013e3181f09404 0.6090              57
12      10.1016/j.urology.2010.02.033 0.6090              57
13              10.1353/ect.2010.0025 0.6090              57
14               10.1117/2.4201001.04 0.6090              57
15 10.1111/j.1835-9310.1977.tb01159.x 0.6090              57
16    10.4067/S0717-69962010000100001 0.6090              57
17    10.4067/S0717-69962010000200001 0.6090              57
18           10.2105/AJPH.2009.191098 0.6029              57
19              10.1353/mpq.2010.0004 0.5167              49
20                 10.1353/mpq.0.0048 0.5167              49
Get a short DOI from
# Get a short DOI, just the short DOI returned
short_doi(doi = "10.1371/journal.pone.0042793")
[1] "10/f2bfz9"

# Get a short DOI, all data returned
short_doi(doi = "10.1371/journal.pone.0042793", justshort = FALSE)
$DOI
[1] "10.1371/journal.pone.0042793"

$ShortDOI
[1] "10/f2bfz9"

$IsNew
[1] FALSE

Get a record from an OAI-PMH data provider.
# Single provider, one identifier
md_getrecord(provider = "pensoft", identifier = "10.3897/zookeys.1.10")
                                                                                                title
1 A new candidate for a Gondwanaland distribution in the Zodariidae (Araneae): Australutica in Africa
      creator date             type
1 Jocqué,Rudy 2008 Research Article

# Single provider, multiple identifiers
md_getrecord(provider = "pensoft", identifier = c("10.3897/zookeys.1.10", "10.3897/zookeys.4.57"))
                                                                                                   title
1    A new candidate for a Gondwanaland distribution in the Zodariidae (Araneae): Australutica in Africa
2 Studies of Tiger Beetles. CLXXVIII. A new Lophyra (Lophyra) from Somaliland (Coleoptera, Cicindelidae)
        creator date             type
1   Jocqué,Rudy 2008 Research Article
2 Cassola,Fabio 2008 Research Article
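Records from an OAI-PMH provider come back as XML, commonly in the oai_dc (Dublin Core) format, with one element per field and a repeated element for each additional author. Below is a rough base R sketch of pulling fields out of such a record. The XML is a hand-written sample with namespace declarations left out, dc_field is a hypothetical helper, and this is not how md_getrecord is implemented. For real harvesting you would want a proper XML parser rather than regular expressions.

```r
# Hand-written sample in the oai_dc style; not real harvested output
record <- paste0(
  "<oai_dc:dc>",
  "<dc:title>An example article</dc:title>",
  "<dc:creator>Smith,Ann</dc:creator>",
  "<dc:creator>Jones,Bob</dc:creator>",
  "<dc:date>2008</dc:date>",
  "</oai_dc:dc>"
)

# Hypothetical helper: return the text of every <dc:field> element
dc_field <- function(xml, field) {
  pattern <- sprintf("<dc:%s>(.*?)</dc:%s>", field, field)
  # Full matches, tags included; the lazy quantifier keeps repeated elements separate
  hits <- regmatches(xml, gregexpr(pattern, xml, perl = TRUE))[[1]]
  # Strip the opening and closing tags, leaving only the content
  gsub(sprintf("</?dc:%s>", field), "", hits)
}

dc_field(record, "title")
dc_field(record, "creator")
```

A field that appears more than once, like creator here, comes back as a vector with one element per occurrence, and a field that is absent comes back as an empty character vector.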
List available metadata formats from various providers.
# List metadata formats for a provider
md_listmetadataformats(provider = "dryad")
  metadataPrefix
1         oai_dc
2            rdf
3            ore
4           mets
  schema
1
  metadataNamespace
1

# List metadata formats for a specific identifier for a provider
md_listmetadataformats(provider = "pensoft", identifier = "10.3897/zookeys.1.10")
            identifier metadataPrefix
1 10.3897/zookeys.1.10         oai_dc
2 10.3897/zookeys.1.10           mods
  schema
1
  metadataNamespace
1
Some plotting - mean number of authors per paper
Okay, so this isn't a super useful visualization, but you can surely think of something better.
library(ggplot2)
library(ggthemes)
library(reshape)
library(plyr)  # provides ldply() and ddply()
# Harvest three months of Pensoft records
temp <- md_listrecords(provider = "pensoft", from = "2011-10-01", until = "2012-01-01")
temp2 <- ldply(temp)[, -1]
# Count authors per record by splitting the creator field on ";"
auths <- sapply(temp2$creator, function(x) length(strsplit(as.character(x), ";")[[1]]))
toplot <- data.frame(authors = auths, articletype = temp2$type)
# Mean number of authors for each article type
toplot_ <- ddply(toplot, .(articletype), summarise, authors = mean(authors))
toplot_$articletype <- reorder(toplot_$articletype, toplot_$authors)
ggplot(toplot_, aes(articletype, authors)) +
  theme_tufte(base_size = 16) +
  geom_bar(stat = "identity") +
  coord_flip()
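If you want to see what the author-counting step does without hitting the Pensoft endpoint, here is a small base R sketch on made-up records. The data frame is invented for illustration, and aggregate() stands in for ddply() so that nothing beyond base R is needed.

```r
# Made-up records: one row per article, authors separated by ";"
records <- data.frame(
  creator = c("Smith,A;Jones,B", "Lee,C", "Smith,A;Lee,C;Jones,B", "Jones,B"),
  type = c("Research Article", "Editorial", "Research Article", "Editorial"),
  stringsAsFactors = FALSE
)

# Count authors per article by splitting the creator string on ";"
records$authors <- lengths(strsplit(records$creator, ";", fixed = TRUE))

# Mean number of authors per article type (base R stand-in for ddply)
mean_authors <- aggregate(authors ~ type, data = records, FUN = mean)
mean_authors
```

The result has one row per article type, which is the same shape that gets passed to ggplot above.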
Get the .Rmd file used to create this post at my github account - or .md file.
Written in Markdown, with help from knitr and knitcitations.