Exploring the consistency of the quality scores with machine learning for next-generation sequencing experiments (2018)

Min Oh; Erdal Cosgun

American Society of Human Genetics 2018 | October 2018

Organized by American Society of Human Genetics

PgmNr 1480/T

Next-Generation Sequencing (NGS) enables massively parallel processing at lower cost than other sequencing technologies. In downstream analysis of NGS data, one of the major concerns is the reliability of variant calls. Although researchers can use the raw quality scores of variant calling, they must begin further analysis without any prior evaluation of those scores. Here, we present a machine learning approach for estimating quality scores of variant calls derived from the GATK best-practice pipeline powered by the Microsoft Genomics service. Using three machine learning algorithms, Multivariate Linear Regression (MLR), Random Forest Regression (RFR), and Neural Network Regression (NNR), we trained data-driven predictive models on variant call format (VCF) files, which contain the values produced by the GATK annotation modules for each variant call. We analyzed correlations between the quality score and these annotations to identify informative annotations, which were used as features to predict variant quality scores. Annotations that are simple statistics or scaled versions of another annotation were excluded from the features. To test the predictive models, we simulated twenty-four sets of paired-end Illumina sequencing reads at 30x coverage from twenty-four artificial human genomes, generated by perturbing a human reference genome with a wide range of injected variants, including SNPs, small indels, and large structural variants. In addition, twenty-four sets of human genome sequencing reads, produced by Illumina paired-end sequencing with at least 30x coverage, were obtained from the Sequence Read Archive (SRA). Using the Microsoft Genomics service, VCFs were derived from both the simulated and the real sequencing reads. We split each VCF dataset into training and test data, trained the regression models on the training data, and evaluated prediction performance on the test data.
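
The correlation step described above can be illustrated with a minimal Python sketch. It is not the authors' code: it assumes standard tab-separated VCF data lines, treats only scalar numeric INFO entries as candidate annotations, and uses the Pearson coefficient, which the abstract does not specify. The function names and the three toy records (with the GATK-style annotations DP, QD, and FS) are invented for illustration.

```python
import math

def parse_vcf_records(lines):
    """Yield (QUAL, {annotation: value}) for each VCF data line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        qual_text, info_text = fields[5], fields[7]  # QUAL and INFO columns
        if qual_text == ".":
            continue  # no quality score recorded for this call
        annotations = {}
        for entry in info_text.split(";"):
            key, sep, value = entry.partition("=")
            if not sep:
                continue  # flag annotation without a value
            try:
                annotations[key] = float(value)
            except ValueError:
                continue  # skip non-scalar values such as "1,2"
        yield float(qual_text), annotations

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def annotation_correlations(lines):
    """Correlate each annotation present in every record with QUAL."""
    records = list(parse_vcf_records(lines))
    if not records:
        return {}
    shared = set.intersection(*(set(ann) for _, ann in records))
    quals = [qual for qual, _ in records]
    return {key: pearson([ann[key] for _, ann in records], quals)
            for key in sorted(shared)}

# Toy records, invented for illustration only (not data from the study).
EXAMPLE_VCF = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50.0\tPASS\tDP=10;QD=5.0;FS=1.0",
    "chr1\t200\t.\tC\tT\t100.0\tPASS\tDP=20;QD=5.0;FS=3.0",
    "chr1\t300\t.\tG\tA\t150.0\tPASS\tDP=30;QD=5.0;FS=2.0",
]

correlations = annotation_correlations(EXAMPLE_VCF)
# Rank annotations by absolute correlation with the quality score.
ranked = sorted(correlations, key=lambda k: abs(correlations[k]), reverse=True)
```

Ranking by absolute correlation is only one possible way to pick informative annotations; the criterion actually used in the study is not stated in the abstract.
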
We observed that the prediction models trained with RFR outperformed those of the other algorithms on both simulated and real data. The quality scores of variant calls were highly predictable from informative features of the GATK annotation modules in the simulated human genome VCF data (R²: 96.7%, 94.4%, and 89.8% for RFR, MLR, and NNR, respectively). Interestingly, the proposed data-driven models remained robust on the real human genome VCF data (R²: 97.8% and 96.5% for RFR and MLR, respectively).
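
The R² values above are coefficients of determination computed on held-out test data. The sketch below shows how such a score can be obtained, under loud assumptions: a single-feature ordinary least-squares fit stands in for the MLR, RFR, and NNR models, the split rule (every fourth row held out) is hypothetical because the abstract does not give the split ratio, and the (annotation value, quality score) pairs are invented toy numbers.

```python
def split_train_test(rows, test_every=4):
    """Deterministic split: every `test_every`-th row is held out (hypothetical rule)."""
    train, test = [], []
    for index, row in enumerate(rows):
        (test if index % test_every == test_every - 1 else train).append(row)
    return train, test

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def evaluate(rows):
    """Fit on the training rows and score R² on the held-out rows."""
    train, test = split_train_test(rows)
    slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
    predictions = [slope * x + intercept for x, _ in test]
    return r_squared([y for _, y in test], predictions)

# Toy (annotation value, quality score) pairs, invented for illustration.
TOY_ROWS = [(1, 2), (2, 4), (3, 6), (4, 9), (5, 10), (6, 12), (7, 14), (8, 15)]

test_r2 = evaluate(TOY_ROWS)
print(f"held-out R² = {test_r2:.1%}")
```

Multiplying R² by 100 gives the percentage form used in the abstract.
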