What do we know about DOIs - Crossref

peter.suber's bookmarks 2024-03-01


"Crossref holds metadata for approximately 150 million scholarly artifacts. These range from peer reviewed journal articles through to scholarly books through to scientific blog posts. In fact, amid such heterogeneity, the only singular factor that unites such items is that they have been assigned a document object identifier (DOI); a unique identification string that can be used to resolve to a resource pertaining to said metadata (often, but not always, a copy of the work identified by the metadata).

What, though, do we actually know about the state of persistence of these links? How many DOIs resolve correctly? How many landing pages, at the other end of the DOI resolution, contain the information that is supposed to be there, including the title and the DOI itself? How can we find out?

The first and seemingly most obvious way that we can obtain some of these data is by working through the most recent sample of DOIs and attempting to fetch metadata from each of them using a standard python script. This involves using the httpx library to attempt to resolve each of the DOIs to a resource, visiting that resource and seeing what the landing page yields.

Even this is not straightforward. Landing pages can be HTML resources or they can be PDF files, among other things. In the case of PDF files, to detect a run of text is not simple as a single line break can be enough to foil our search. Nonetheless, when using this strategy we find the following statistics: ..."
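The approach described in the excerpt can be sketched roughly as follows. This is a minimal illustration and not Crossref's actual script: the helper names (`normalize`, `contains_phrase`, `check_landing_page`), the use of the public `https://doi.org/` resolver, and the 10-second timeout are all assumptions made for the example. It collapses whitespace before matching, so that a stray line break in extracted text does not defeat the search, and it only handles responses that decode as text; PDF landing pages would need a separate text-extraction step.

```python
import re

# Assumption: DOIs are resolved through the public doi.org resolver.
DOI_RESOLVER = "https://doi.org/"


def normalize(text: str) -> str:
    """Collapse each run of whitespace (line breaks included) to one space, lowercased."""
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_phrase(haystack: str, phrase: str) -> bool:
    """Return True if `phrase` occurs in `haystack`, ignoring case and whitespace layout."""
    return normalize(phrase) in normalize(haystack)


def check_landing_page(doi: str, title: str, timeout: float = 10.0) -> dict:
    """Resolve a DOI and report whether the landing page mentions the title and the DOI.

    Hypothetical helper for illustration only; it treats the response body as text.
    """
    # Imported here so the text helpers above work without the httpx dependency.
    import httpx

    response = httpx.get(DOI_RESOLVER + doi, follow_redirects=True, timeout=timeout)
    body = response.text
    return {
        "status": response.status_code,
        "final_url": str(response.url),
        "has_title": contains_phrase(body, title),
        "has_doi": contains_phrase(body, doi),
    }
```

Calling `check_landing_page` with a DOI and its registered title returns the final status code, the URL reached after redirects, and two booleans for the title and DOI checks. A real survey would also need to handle timeouts, bot-blocking responses, and non-HTML content types.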



From feeds:

Open Access Tracking Project (OATP) » peter.suber's bookmarks


oa.new oa.dois oa.pids oa.crossref oa.scholcomm

Date tagged:

03/01/2024, 08:53

Date published:

03/01/2024, 03:53