Archivists Create a Searchable Index of 107 Million Science Articles
peter.suber's bookmarks 2021-10-14
"The General Index is here to serve as your map to human knowledge. Pulled from 107,233,728 journal articles, The General Index is a searchable collection of keywords and short sentences from published papers that can serve as a map to the paywalled domains of scientific knowledge.
In full, The General Index is a massive 38-terabyte archive of searchable terms. Compressed, it comes to 8.5 terabytes. It can be pulled directly from archive.org, which can be a difficult and lengthy process. People on the /r/DataHoarder subreddit have uploaded the data to a remote server and are spreading it across BitTorrent. You can help by grabbing a seed here.
The General Index does not contain the entirety of the journal articles it references, only the keywords and n-grams (short runs of consecutive words containing a keyword) that make tracking down a specific article easier. “This is an early release of the general index, a work in progress,” Carl Malamud, the founder of Public.Resource.org and co-creator of the General Index, said in a video about the archive. “In some cases text extraction failed, sometimes metadata is not available or is perhaps incorrect. While the underlying corpus is large, it is not complete and it is not up to date.”..."
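To make the idea of n-grams concrete, here is a minimal Python sketch of how short word runs could be pulled from a piece of text and then filtered by keyword. It is an illustration only: the tokenizer, the function names (extract_ngrams, ngrams_with_keyword), the sample sentence, and the window size are all assumptions, not the General Index's actual extraction pipeline.

```python
import re
from collections import Counter


def extract_ngrams(text, max_n=3):
    """Count every run of 1..max_n consecutive words in text."""
    # Hypothetical tokenizer: lowercase, keep only alphanumeric runs.
    # Punctuation is dropped, so n-grams may span sentence boundaries.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def ngrams_with_keyword(counts, keyword):
    """Return, in sorted order, the n-grams containing keyword as a whole word."""
    return sorted(g for g in counts if keyword in g.split())


# Made-up sample text, limited to unigrams and bigrams for brevity.
sample = "The General Index is a searchable map. The index is large."
counts = extract_ngrams(sample, max_n=2)

print(ngrams_with_keyword(counts, "index"))
print(counts["index is"])
```

An index built this way stores only the short phrases and their counts, not the article text itself, which is why it can point a reader toward a paper without reproducing it.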