Improving data quality in large-scale repositories through conflict resolution | SpringerLink

peter.suber's bookmarks 2021-12-06


Abstract:  Digital repositories rely on technical metadata to manage their objects. The output of characterization tools is aggregated and analyzed through content profiling. The accuracy and correctness of characterization tools vary; they frequently produce contradictory outputs, resulting in metadata conflicts. These conflicts limit scalable preservation risk assessment and repository management. This article presents and evaluates a rule-based approach to improving data quality in this scenario through expert-conducted conflict resolution. We characterize the data quality challenges and present a method for developing conflict resolution rules to improve data quality. We evaluate the method and the resulting data quality improvements in an experiment on a publicly available document collection. The results demonstrate that our approach enables the effective resolution of conflicts by producing rules that reduce conflicts in the data set from 17% to 3%. This replicable method presents a significant improvement in content profiling technology for digital repositories, since the enhanced data quality can improve risk assessment and preservation management in digital repository systems.



From feeds:

Open Access Tracking Project (OATP) » peter.suber's bookmarks

Tags: oa.repositories oa.rpd oa.metadata oa.quality oa.risks oa.preservation

Date tagged:

12/06/2021, 16:27

Date published:

12/06/2021, 11:27