ASAS-NANP SYMPOSIUM: MATHEMATICAL MODELING IN ANIMAL NUTRITION: Synthetic Database Generation for Non-Normal Multivariate Distributions: A Rank-Based Method with Application to Ruminant Methane Emissions
J Anim Sci. 2025 May 4:skaf136. doi: 10.1093/jas/skaf136. Online ahead of print.
ABSTRACT
This study addresses the challenge of limited data availability in animal science, particularly in modeling complex biological processes such as methane emissions from ruminants. We propose a novel rank-based method for generating synthetic databases with correlated non-normal multivariate distributions, with the aim of enhancing the accuracy and reliability of predictive modeling tools. Our rank-based approach involves a four-step process: (1) fitting distributions to variables using normal or best-fit non-normal distributions, (2) generating synthetic databases, (3) preserving relationships among variables using Spearman correlations, and (4) cleaning datasets to ensure biological plausibility. We compare this method with copula-based approaches for maintaining a pre-established correlation structure. The rank-based method preserved the original distribution moments (mean, variance, skewness, kurtosis) and correlation structures better than the copula-based methods. We generated two synthetic databases (one with normal and one with non-normal distributions) and applied random forest (RF) and multiple linear model (LM) regression analyses. RF regression outperformed LM in predicting methane emissions, showing higher R² values (0.927 vs. 0.622) and lower standard errors. However, cross-testing revealed that RF regressions are highly specific to distribution type and underperform when applied to data with a different distribution. In contrast, LM regressions were robust across distribution types. Our findings highlight the importance of understanding the distributional assumptions of regression techniques when generating synthetic databases. The study also underscores the potential of synthetic data for augmenting limited samples, addressing class imbalances, and simulating rare scenarios.
While our method effectively preserves descriptive statistical properties, we acknowledge the possibility of introducing artificial (unknown) relationships within subsets of the synthetic database. This research provides a practical solution for creating realistic, statistically sound datasets when original data are scarce or sensitive. Its application to predicting methane emissions demonstrates the potential to enhance modeling accuracy in animal science. Future research directions include integrating this approach with deep learning, exploring real-world applications, and developing adaptive machine-learning models for diverse data distributions.
PMID:40319357 | DOI:10.1093/jas/skaf136
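ILLUSTRATIVE SKETCH
The abstract does not give the algorithmic details of the four-step process, so the following is only a minimal sketch of one common way to do rank-based reordering (in the spirit of the Iman-Conover procedure), not necessarily the authors' implementation. The variable names (dmi, ch4), the distribution parameters, the plausibility bounds, and the use of correlated normal scores to carry the dependence are all assumptions made for illustration.

```python
import math
import random


def ranks(values):
    """Return 0-based ranks (continuous draws, so ties are not expected)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mean = (len(x) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # same for ry when there are no ties
    return cov / var


def reorder_to_ranks(sample, reference):
    """Permute `sample` so its rank order matches that of `reference`."""
    sorted_sample = sorted(sample)
    return [sorted_sample[r] for r in ranks(reference)]


def synthesize(n, rho, seed=0):
    """Return (dmi, ch4) rows with skewed/normal marginals and rank dependence."""
    rng = random.Random(seed)
    # Steps 1-2: draw independent values from the "fitted" marginals.
    # Hypothetical parameters: a skewed intake variable and a normal emission one.
    dmi = [rng.lognormvariate(math.log(18.0), 0.25) for _ in range(n)]
    ch4 = [rng.gauss(350.0, 60.0) for _ in range(n)]
    # Step 3: correlated normal scores carry the target dependence. `rho` is
    # the Pearson correlation of the scores; the resulting Spearman
    # correlation is close to it but slightly smaller.
    z1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0) for a in z1]
    dmi = reorder_to_ranks(dmi, z1)
    ch4 = reorder_to_ranks(ch4, z2)
    # Step 4: drop rows outside (assumed) biologically plausible bounds.
    return [(d, c) for d, c in zip(dmi, ch4) if 5.0 < d < 40.0 and c > 0.0]


rows = synthesize(2000, rho=0.7)
print(len(rows), round(spearman([r[0] for r in rows], [r[1] for r in rows]), 2))
```

Because the reordering step only permutes the values drawn for each variable, the marginal moments (mean, variance, skewness, kurtosis) of each column are unchanged until the cleaning step removes rows. This is one plausible reason a rank-based approach can preserve moments well, although the abstract itself does not state the mechanism.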