Abstract
This paper proposes a novel machine learning procedure for genome-wide association studies (GWAS), named LightGWAS. Built on the LightGBM framework, it is a single, resilient, autonomous and scalable solution that addresses common limitations of GWAS implementations found in the literature. These include reliance on extensive manual quality control steps and the need for a specific GWAS method for each dataset morphology and size. In this research, LightGWAS is compared against PLINK2, one of the current state-of-the-art GWAS implementations, which is based on the general linear model with support for Firth regularisation. The mean differences in standard classification metrics, obtained from quantitative empirical tests using k-fold cross-validation, indicate that LightGWAS outperforms PLINK2 on balanced, imbalanced, and highly imbalanced genomic datasets. Paired difference tests indicate that the results from the experiments with imbalanced datasets are statistically significant. This article contributes to the body of knowledge by presenting a potentially more efficient GWAS procedure based on nonparametric approaches. LightGWAS ensures adaptability with higher precision in the discovery of causal single-nucleotide polymorphisms, thanks to the leaf-wise tree growth algorithm offered by LightGBM, a state-of-the-art gradient boosting decision tree framework. Control of false positives and statistical power are addressed automatically by the model's training process, which significantly reduces human dependency during study design.
Original language | English |
---|---|
Pages (from-to) | 25-36 |
Number of pages | 12 |
Journal | CEUR Workshop Proceedings |
Volume | 2771 |
Publication status | Published - 2020 |
Event | 28th Irish Conference on Artificial Intelligence and Cognitive Science, AICS 2020 - Dublin, Ireland (7 Dec 2020 → 8 Dec 2020) |
Keywords
- Genome-wide association study
- LightGBM
- LightGWAS