rOpenSci | pubchunks: extract parts of scholarly XML articles
peter.suber's bookmarks 2018-10-23
"The goal of pubchunks is to fetch sections out of XML format scholarly articles. Users do not need to know about XML and all of its warts. They only need to know where their files or XML strings are and what sections they want of each article. Then the user can combine these sections and do whatever they wish downstream; for example, analysis of the text structure or a meta-analysis combining p-values or other data.
The other major format, and more common format, that articles come in is PDF. However, PDF has no structure other than perhaps separate pages, so it’s not really possible to easily extract specific sections of an article. Some publishers provide absolutely no XML versions (cough, Wiley) while others that do a good job of this are almost entirely paywalled (cough, Elsevier). There are some open access publishers that do provide XML (PLOS, Pensoft, Hindawi) - so you have the best of both worlds with those publishers...."