Enabling disaggregation of Asian American subgroups: a dataset of Wikidata names for disparity estimation
Sci Data. 2025 Apr 5;12(1):580. doi: 10.1038/s41597-025-04753-y.
ABSTRACT
Decades of research and advocacy have underscored the imperative of surfacing racial disparities as the first step towards mitigating them, including among subgroups historically bundled into aggregated categories. Recent U.S. federal regulations have required increasingly disaggregated race reporting, but major implementation barriers mean that, in practice, reported race data remain inadequate. While imputation methods have enabled disparity assessments in many research and policy settings lacking reported race, the leading name-based algorithms cannot recover disaggregated categories, because administrative sources lack the disaggregated data needed to inform algorithm design. Leveraging a Wikidata sample of over 300,000 individuals from six Asian countries, we extract frequencies of 25,876 first names and 18,703 surnames, which can be used as proxies for U.S. name-race distributions among six major Asian subgroups: Asian Indian, Chinese, Filipino, Japanese, Korean, and Vietnamese. We show that these data, when combined with public geography-race distributions to predict subgroup membership, outperform existing deterministic name lists in key prediction settings and enable critical Asian disparity assessments.
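As an illustration of how name frequencies might be combined with geography-race distributions to predict subgroup membership, the following is a minimal sketch and not the authors' implementation. It assumes a naive Bayes combination in which name and geography are conditionally independent given subgroup. The function name `subgroup_posterior` and all numeric values are hypothetical placeholders, not figures from the dataset.

```python
from typing import Dict, Optional

SUBGROUPS = ["Asian Indian", "Chinese", "Filipino",
             "Japanese", "Korean", "Vietnamese"]


def subgroup_posterior(
    geo_prior: Dict[str, float],
    surname_lik: Dict[str, float],
    first_lik: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Return P(subgroup | geography, name) under a naive Bayes assumption.

    geo_prior:   P(subgroup | geography), e.g. from public census tables.
    surname_lik: P(surname | subgroup), e.g. from name frequencies.
    first_lik:   optional P(first name | subgroup).
    """
    scores = {}
    for group in SUBGROUPS:
        score = geo_prior.get(group, 0.0) * surname_lik.get(group, 0.0)
        if first_lik is not None:
            score *= first_lik.get(group, 0.0)
        scores[group] = score

    total = sum(scores.values())
    if total == 0.0:
        # Name unseen in every subgroup: fall back to the geographic prior.
        prior_total = sum(geo_prior.get(g, 0.0) for g in SUBGROUPS)
        if prior_total == 0.0:
            raise ValueError("geo_prior has no mass on any subgroup")
        return {g: geo_prior.get(g, 0.0) / prior_total for g in SUBGROUPS}
    return {g: s / total for g, s in scores.items()}


# Hypothetical inputs for illustration only (not values from the dataset).
geo_prior = {"Asian Indian": 0.20, "Chinese": 0.30, "Filipino": 0.15,
             "Japanese": 0.05, "Korean": 0.10, "Vietnamese": 0.20}
surname_lik = {"Asian Indian": 0.0001, "Chinese": 0.001, "Filipino": 0.0001,
               "Japanese": 0.0001, "Korean": 0.0001, "Vietnamese": 0.3}

post = subgroup_posterior(geo_prior, surname_lik)
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

Unlike a deterministic name list, which assigns each name to a single subgroup, this kind of probabilistic combination lets the local population mix shift the prediction for names that are shared across subgroups.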
PMID:40188111 | PMC:PMC11972315 | DOI:10.1038/s41597-025-04753-y