Wikidata to split as sheer volume of information overloads infrastructure | Wikipedia Signpost

flavoursofopenscience's bookmarks 2024-05-17

Summary:

The Wikimedia Foundation will soon split parts of the WikiCite dataset off from the main Wikidata dataset. Both data collections will remain available through the Wikidata Query Service, but queries will return content from the main graph by default, and users will have to take an extra step to request WikiCite content. This is the start of query federation for Wikidata content, and is a consequence of Wikidata holding so much content that the servers hosting the Wikidata Query Service are under strain.
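As a rough illustration of that extra step, the sketch below builds a query that would run against the main graph and reach into the split-off graph through the standard SPARQL 1.1 `SERVICE` keyword. The endpoint URL, the helper name `build_federated_query`, and the choice of properties (`P50` for author, `P1476` for title) are assumptions made for this example; the summary does not specify how the federated service will be addressed.

```python
# Assumed endpoint for the split-off scholarly graph. This URL is an
# illustration only and is not taken from the summary.
SCHOLARLY_ENDPOINT = "https://query-scholarly.wikidata.org/sparql"


def build_federated_query(author_qid: str,
                          endpoint: str = SCHOLARLY_ENDPOINT,
                          limit: int = 10) -> str:
    """Return SPARQL text that asks a second endpoint for articles by an author.

    Hypothetical helper: it only builds the query string and sends nothing.
    """
    # Accept only identifiers of the form Q<digits>, e.g. "Q42".
    if not (author_qid.startswith("Q") and author_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {author_qid!r}")
    return (
        "SELECT ?article ?title WHERE {\n"
        # SERVICE delegates the enclosed pattern to the other endpoint.
        f"  SERVICE <{endpoint}> {{\n"
        f"    ?article wdt:P50 wd:{author_qid} ;\n"
        "             wdt:P1476 ?title .\n"
        "  }\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(build_federated_query("Q42"))
```

A query without the `SERVICE` block would see only the main graph, which is the default behaviour described above.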

I support this as a WikiCite editor, because WikiCite is consuming considerable resources, and the split preserves the content by reducing its accessibility. This split could also be the start of dedicated support for Wikimedia citation data products.

I am wary of the split, because it buys only about three more years to find another solution, and we have already been seeking one since 2018. The complete scholarly citation corpus of ~300 million citations is not a large dataset by contemporary standards, but our Blazegraph backend already strains to hold 40 million. Even after a split, Wikidata will fill with content again. Fear of the split has been slowing and deterring Wikidata content creation for years, and we have no long-term plan for repeatedly splitting and federating Wikibase instances.

Link:

https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2024-05-16/Op-Ed

From feeds:

Open Access Tracking Project (OATP) » flavoursofopenscience's bookmarks

Tags:

oa.new oa.wikidata oa.data oa.rdm oa.infrastructure oa.wikicite oa.bibliometrics oa.metadata oa.wikimedia_foundation

Date tagged:

05/17/2024, 09:00

Date published:

05/17/2024, 05:00