Bridging the gap between real world repositories and scalable preservation environments - Springer

abernard102@gmail.com 2015-08-18

Summary:

[Abstract] Integrating large-scale processing environments, such as Hadoop, with traditional repository systems, such as Fedora Commons 3, has long proved to be a daunting task. In this paper, we show how this integration can be achieved using software developed in the Scalable Preservation Environments (SCAPE) project, and also how it can be achieved using a local, more direct implementation at the Danish State and University Library, inspired by the SCAPE project. Both allow full use of the Hadoop system for massively distributed processing without causing excessive load on the repository. We present a proof-of-concept SCAPE integration and an in-production local integration based on repository systems at the Danish State and University Library and the Hadoop execution environment. Both use data from the Newspaper Digitisation Project, a collection that will grow to more than 32 million JP2 images. The use case for the SCAPE integration is to perform feature extraction and validation of the JP2 images. The validation is done against an institutional preservation policy expressed in the machine-readable SCAPE Control Policy vocabulary. The feature extraction is done using the Jpylyzer tool. We perform an experiment with sets of JP2 images of various sizes to test the scalability and correctness of the solution. The first use case considered from the local Danish State and University Library integration is also feature extraction and validation of the JP2 images, this time using Jpylyzer and Schematron requirements translated by hand from the project specification. We further look at two other use cases: generation of histograms of the tonal distributions of the images, and generation of dissemination copies. We discuss the challenges and benefits of the two integration approaches when having to perform preservation actions on massive collections stored in traditional digital repositories.

Link:

http://link.springer.com/article/10.1007%2Fs00799-015-0152-4

From feeds:

Open Access Tracking Project (OATP) » abernard102@gmail.com

Tags:

oa.new oa.libraries oa.librarians oa.ir oa.green oa.tools oa.floss oa.infrastructure oa.preservation oa.paywalled oa.repositories

Date tagged:

08/18/2015, 06:51

Date published:

08/18/2015, 02:51