Record linkage in public health datasets: a practical experience in a fast in-process analytical database
Rev Bras Epidemiol. 2025 Nov 28;28:e250053. doi: 10.1590/1980-549720250053. eCollection 2025.
ABSTRACT
OBJECTIVE: This study evaluates the accuracy of a mixed-approach algorithm, implemented in DuckDB, for linking records from the Mortality Information System (SIM) and the Influenza Epidemiological Surveillance Information System (SIVEP-Gripe).
METHODS: The proposed algorithm was compared with a previously validated algorithm under different prevalence scenarios. It uses a hybrid deterministic-probabilistic approach with string similarity metrics such as Jaro and Jaro-Winkler. The study highlights important advantages, including superior processing speed and scalability, while maintaining high sensitivity, specificity and predictive values.
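To make the hybrid approach concrete, the following is a minimal sketch in plain Python, not the authors' DuckDB implementation. It assumes a deterministic pass (exact match on a normalized name within the same birth date) followed by a probabilistic pass that accepts the best Jaro-Winkler score above a cutoff. The field names (`id`, `name`, `birth_date`), the blocking on birth date, the 0.92 threshold and the sample records are all hypothetical. DuckDB itself exposes `jaro_similarity` and `jaro_winkler_similarity` as SQL functions, so in practice the scoring step would presumably be pushed into the database rather than computed in Python.

```python
import unicodedata


def normalize(name: str) -> str:
    """Upper-case, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.upper().split())


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: boosts the Jaro score for a shared prefix (max 4 chars)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def link(records_a, records_b, threshold=0.92):
    """Return (id_a, id_b, rule, score) tuples; threshold is an assumed cutoff."""
    links = []
    for a in records_a:
        name_a = normalize(a["name"])
        best = None
        for b in records_b:
            if a["birth_date"] != b["birth_date"]:
                continue  # blocking: only compare records sharing a birth date
            name_b = normalize(b["name"])
            if name_a == name_b:
                best = (b["id"], "deterministic", 1.0)
                break
            score = jaro_winkler(name_a, name_b)
            if score >= threshold and (best is None or score > best[2]):
                best = (b["id"], "probabilistic", score)
        if best is not None:
            links.append((a["id"], *best))
    return links


# Hypothetical sample records, for illustration only.
sim = [
    {"id": "S1", "name": "Maria da Conceição Souza", "birth_date": "1950-03-12"},
    {"id": "S2", "name": "Jose Almeida", "birth_date": "1948-07-01"},
    {"id": "S3", "name": "Ana Lima", "birth_date": "1960-01-01"},
]
sivep = [
    {"id": "G1", "name": "MARIA DA CONCEICAO SOUZA", "birth_date": "1950-03-12"},
    {"id": "G2", "name": "Jose Almieda", "birth_date": "1948-07-01"},
    {"id": "G3", "name": "Pedro Rocha", "birth_date": "1960-01-01"},
]
links = link(sim, sivep)
```

In this sketch the first pair links deterministically after accent and case normalization, the second links probabilistically despite a transposed pair of letters, and the third is rejected because its name similarity falls below the assumed threshold.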
RESULTS: The DuckDB-based solution processed datasets significantly faster, with execution times up to one hundred times shorter, making it particularly suitable for large-scale, real-time applications.
CONCLUSIONS: This study underscores the potential of DuckDB as a high-performance analytical database for efficiently managing complex data integration tasks and highlights its suitability for resource-limited environments in public health, where timely and accurate record linkage is often essential.
PMID:41337538 | PMC:PMC12667511 | DOI:10.1590/1980-549720250053