Snijder (2022) OK Computer, what are these books about? – An experiment in large-scale classification of open access books | https://osf.io/preprints/socarxiv/xdhuq/

flavoursofopenscience's bookmarks 2022-03-15

Summary:

Snijder, R. (2022, March 11). OK Computer, what are these books about? – An experiment in large-scale classification of open access books. https://doi.org/10.31235/osf.io/xdhuq

Introduction: Can we automatically classify a large collection of open access books? This paper describes an experiment using the entity-fishing algorithm, which scans texts for terms that can be linked to Wikipedia pages. Based on the algorithm's results, new keywords are added to the book descriptions, plus a list of relevant Wikipedia pages.

Description: In the OAPEN Library, the full text of 4,125 books and chapters in English and in German was analysed by the algorithm, resulting in a data set of 25 million records. The entity-fishing algorithm is not always aware of the context, and the language of the books is another factor. Instead of blindly picking the most frequent Wikipedia pages, the results were filtered using a confidence score, plus a manual check. This brought the number of possible entities down from 25 million to slightly over 22,400 – a reduction of 99.9%.

Evaluation: The goal of the experiment was to find only the most suitable Wikipedia pages to describe the books and chapters, and the results were evaluated against that goal. Fewer than 5% of the keywords were rejected. Of the keywords, 81% were newly added rather than already present in the book descriptions.

Result: A large number of document descriptions in the OAPEN Library have been enriched, and the procedure for automatically selecting the entities is now available. To run an experiment is to learn, and we have learned that it is possible – with some human help – to let a computer find out what an open access book is about.
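
The filtering step mentioned in the summary (keeping only entities above a confidence score instead of the most frequent ones) can be illustrated with a minimal Python sketch. This is not the paper's actual procedure: the record layout (`book_id`, `wikipedia_title`, `confidence`), the threshold of 0.8, and the cap of five pages per book are all assumptions chosen for illustration, and the manual check described in the summary is not modelled. The last function only reproduces the reduction figure quoted above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityRecord:
    book_id: str          # hypothetical identifier of a book or chapter
    wikipedia_title: str  # Wikipedia page a term in the text was linked to
    confidence: float     # assumed score between 0 and 1


def select_candidates(records, min_confidence=0.8, max_per_book=5):
    """Keep the highest-confidence distinct Wikipedia pages for each book.

    Both parameters are illustrative assumptions, not values from the paper.
    """
    best = defaultdict(dict)  # book_id -> {title: best confidence seen}
    for rec in records:
        if rec.confidence < min_confidence:
            continue  # drop low-confidence links regardless of frequency
        seen = best[rec.book_id]
        if rec.confidence > seen.get(rec.wikipedia_title, -1.0):
            seen[rec.wikipedia_title] = rec.confidence
    selected = {}
    for book_id, titles in best.items():
        # Sort by descending confidence, then by title for a stable order.
        ranked = sorted(titles.items(), key=lambda kv: (-kv[1], kv[0]))
        selected[book_id] = [title for title, _ in ranked[:max_per_book]]
    return selected


def reduction_percentage(before, after):
    """Percentage by which a count shrank from `before` to `after`."""
    return 100.0 * (1 - after / before)


if __name__ == "__main__":
    sample = [
        EntityRecord("book-1", "Open access", 0.95),
        EntityRecord("book-1", "Open access", 0.60),
        EntityRecord("book-1", "Radiohead", 0.30),
        EntityRecord("book-2", "Metadata", 0.85),
    ]
    print(select_candidates(sample))
    print(round(reduction_percentage(25_000_000, 22_400), 1))
```

With the figures from the summary, going from 25 million records to about 22,400 entities gives a reduction of roughly 99.9%, which matches the stated value.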

Link:

https://osf.io/preprints/socarxiv/xdhuq/

From feeds:

[IOI] Open Infrastructure Tracking Project » Items tagged with oa.oapen in Open Access Tracking Project (OATP)
Open Access Tracking Project (OATP) » flavoursofopenscience's bookmarks

Tags:

oa.new oa.data oa.wikipedia oa.oapen oa.metadata oa.cataloguing oa.books

Date tagged:

03/15/2022, 15:44

Date published:

03/15/2022, 11:44