Evaluation of Machine Learning and Deep Learning Methods' Performance on Different Sample Size Scenarios for Genome-wide Association Studies

Turkiye Klinikleri Journal of Biostatistics, Vol 2020: pp. 1-7

Objective: The use of machine learning (ML) and deep learning (DL) methods in genome-wide association studies (GWAS) is becoming increasingly common. These methods help confirm the results obtained with the classical GWAS methodology. The aim of this study was therefore to compare the most commonly used ML and DL methods across different sample sizes and to identify the most accurate method for patient-control classification.

Method: For the ML and DL methods, data sets were generated with the same prevalence and minor allele frequency (MAF), different sample sizes, and equal numbers of patients and controls. For every data set and every method, 10-fold cross-validation was applied, and the data were split into a 70% training set and a 30% test set to assess the validity and predictive power of the models. PLINK was used for the simulations, the R programming language for the ML methods, and the Python programming language for the DL methods, all run on Microsoft Azure GPU machines.
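
The evaluation protocol described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the study used PLINK for simulation and R and Python for the models, whereas here the genotypes are drawn under an assumed Hardy-Weinberg model with no disease effect, and the number of SNPs (50), the MAF value (0.2), and all function names are hypothetical. The classifiers themselves are omitted; the sketch only shows a balanced patient-control design, a stratified 70/30 split, and 10-fold cross-validation on the training part.

```python
import random

def simulate_genotypes(n_samples, n_snps, maf, rng):
    """Draw minor-allele counts (0, 1 or 2) per SNP.

    Assumption: Hardy-Weinberg equilibrium and no genotype-phenotype effect;
    this is only a stand-in for the PLINK simulation used in the study.
    """
    return [[(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
            for _ in range(n_samples)]

def balanced_labels(n_samples):
    """Equal numbers of controls (0) and patients (1)."""
    half = n_samples // 2
    return [0] * half + [1] * (n_samples - half)

def stratified_split(labels, train_frac, rng):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train += idx[:cut]
        test += idx[cut:]
    return train, test

def k_folds(indices, k, rng):
    """Yield (fit_idx, validation_idx) pairs for k-fold cross-validation."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]

def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
for n in (200, 600):    # the two sample sizes reported in the abstract
    X = simulate_genotypes(n, n_snps=50, maf=0.2, rng=rng)
    y = balanced_labels(n)
    train_idx, test_idx = stratified_split(y, 0.7, rng)
    folds = list(k_folds(train_idx, 10, rng))
    print(n, len(train_idx), len(test_idx), len(folds))
```

In such a setup, each candidate model would be fitted on the `fit` indices of every fold, checked on the held-out fold, and finally scored once with `accuracy` on the 30% test set.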

Results: At a sample size of N = 200, the best-performing ML methods were Random Forest (RF) and CART, each with an accuracy (acc.) of 0.73, whereas DL reached 0.60 acc. At N = 600, the best-performing ML method was the support vector machine (SVM; DVM in the Turkish abbreviation) with 0.78 acc., whereas DL reached 0.64 acc.

Conclusion: In recent years, the support vector machine (SVM; DVM in the Turkish abbreviation) has been used for its high prediction accuracy, especially in data sets where the independent variables are strongly correlated. In this study, the highest result was obtained with this method regardless of sample size. This demonstrates the success of nonlinear ML methods, which are frequently encountered in genetic data.