GenomeMLModels

Introduction

This project aims to produce usable machine learning models for clinical datasets. Clinical datasets accumulated over the past decade, across several generations of bioinformatics improvements, may hold findings that were missed at the time. Sifting through years of data is difficult because data structures have evolved and new reference databases have appeared. Meanwhile, faster and cheaper sequencing generates ever larger amounts of data for clinical analysis.

The project incorporates the Hail (https://hail.is/) data handling layer, which takes VCF/gVCF input and coerces datasets of different generations into tabular data: https://github.com/OligoGeneticDiseases/gen-toolbox. An example model will be trained on the Illumina TruSight Cancer(TM) and TruSight(TM) Hereditary Cancer panel VCFs. The dataset contains known pathogenic monogenic variants causative for hereditary cancer types (e.g. breast cancer). The final model is expected to annotate variants in new VCFs for potential pathogenicity based on an input phenotype (e.g. hereditary breast cancer). A decision boundary will be selected to flag single high-probability variants. This tool would make it possible to quickly narrow down a large number of variants, or even to tag batches of VCF files for potential monogenic variants. This project serves as a proof of concept for machine learning on genomic datasets using the aforementioned libraries.
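The decision boundary and batch tagging described in the introduction could look roughly like the sketch below. It assumes the model has already produced a pathogenicity probability per variant. The function names, the dictionary layout, the file names and the 0.9 threshold are all hypothetical placeholders, not part of this project.

```python
# Hypothetical threshold; a real decision boundary would be chosen from validation data.
DECISION_BOUNDARY = 0.9


def flag_variants(scored_variants, boundary=DECISION_BOUNDARY):
    """Keep variants whose predicted pathogenicity probability reaches the boundary."""
    return [v for v in scored_variants if v["probability"] >= boundary]


def tag_vcfs(scored_by_file, boundary=DECISION_BOUNDARY):
    """Map each VCF path to True if it holds at least one flagged variant."""
    return {
        path: bool(flag_variants(variants, boundary))
        for path, variants in scored_by_file.items()
    }


# Made-up scores for illustration only.
batch = {
    "sample_a.vcf": [
        {"id": "variant_1", "probability": 0.97},
        {"id": "variant_2", "probability": 0.12},
    ],
    "sample_b.vcf": [
        {"id": "variant_3", "probability": 0.40},
    ],
}

tags = tag_vcfs(batch)
```

Here `tags` marks which files in the batch deserve a closer look, while `flag_variants` returns the individual candidates within one file.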

Data structure

Most VCFs of different generations can be re-annotated with up-to-date frequency data using VEP. This project expects all VCFs or gVCFs to contain the genotype caller's allele data and VEP annotations for IMPACT, MAX_AF, HGNC_ID and (optionally) HPO phenotype terms. All columns are converted into features available to the model.
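As a rough illustration of how the VEP annotations above could become model features, the sketch below parses a VEP `CSQ` INFO entry in plain Python. It relies on VEP's VCF convention of listing the pipe-separated sub-field names after `Format:` in the `##INFO=<ID=CSQ,...>` header line, with one comma-separated entry per transcript. The helper names, the ordinal encoding of IMPACT and the choice to treat a missing MAX_AF as 0.0 are assumptions made for this example; the project itself is expected to do this step through Hail.

```python
# Ordinal encoding of VEP IMPACT levels (the ordering is an assumption of this sketch).
IMPACT_RANK = {"MODIFIER": 0, "LOW": 1, "MODERATE": 2, "HIGH": 3}


def csq_fields(header_line):
    """Extract CSQ sub-field names from the VEP ##INFO=<ID=CSQ,...> header line."""
    fmt = header_line.split("Format: ", 1)[1]
    return fmt.rstrip('">\n ').split("|")


def variant_features(info, fields):
    """Turn one VCF INFO string into one feature dict per annotated transcript."""
    csq = next(kv[4:] for kv in info.split(";") if kv.startswith("CSQ="))
    rows = []
    for entry in csq.split(","):
        ann = dict(zip(fields, entry.split("|")))
        rows.append({
            "impact": IMPACT_RANK.get(ann.get("IMPACT", ""), 0),
            # Assumption: an empty MAX_AF means "not observed", encoded as 0.0.
            "max_af": float(ann["MAX_AF"]) if ann.get("MAX_AF") else 0.0,
            "hgnc_id": ann.get("HGNC_ID", ""),
        })
    return rows


# Illustrative header and record, not taken from the project's data.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
          'from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|HGNC_ID|MAX_AF">')
info = ("DP=88;CSQ=A|stop_gained|HIGH|BRCA1|HGNC:1100|,"
        "A|intron_variant|MODIFIER|BRCA1|HGNC:1100|0.0002")

fields = csq_fields(header)
rows = variant_features(info, fields)
```

Each dictionary in `rows` is one candidate training row. A string column such as `hgnc_id` would still need a categorical encoding before most tree libraries can use it.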

Boosted decision tree

A boosted decision tree will be used as the base model for simple tabular data containing both string and float values, built with available open source machine learning frameworks.
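To make the idea of boosting concrete, here is a minimal pure-Python sketch of gradient boosting with depth-1 trees (stumps) for a binary pathogenic/benign label. It is not the project's model: a real implementation would use one of the open source frameworks, and string columns would first need encoding. The feature layout `[impact_rank, max_af]`, the toy rows, the number of rounds and the learning rate are all invented for illustration, and the toy data is deliberately separable on a single feature.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the one-feature threshold split with the lowest squared error on residuals."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:] if best else None


def fit_boosted_stumps(X, y, n_rounds=20, learning_rate=0.3):
    """Gradient boosting for log-loss: each stump fits the residual y - p."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:  # no feature can be split any further
            break
        j, t, lv, rv = stump
        lv, rv = learning_rate * lv, learning_rate * rv  # shrink each step
        stumps.append((j, t, lv, rv))
        scores = [s + (lv if row[j] <= t else rv) for s, row in zip(scores, X)]
    return stumps


def predict_proba(stumps, row):
    """Sum the stump outputs and map the score to a probability."""
    return sigmoid(sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps))


# Toy rows: [impact_rank, max_af]; label 1 = pathogenic. Invented for illustration.
X = [[3, 0.0], [3, 0.0001], [2, 0.00005], [0, 0.2], [1, 0.05], [0, 0.1]]
y = [1, 1, 1, 0, 0, 0]

model = fit_boosted_stumps(X, y)
p_rare_high_impact = predict_proba(model, [3, 0.00002])
p_common_low_impact = predict_proba(model, [0, 0.15])
```

The maximum depth, learning rate and number of rounds in this sketch correspond to the kind of boosting parameters the notebook configuration is meant to expose.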

Expected project structure

A Jupyter Notebook entry point replacing main
- Config parameters
  - Maximum depth
  - Features
  - Learning rate and other boosting params
  - Dataset sizes
- The notebook should be divided into sections: data input, ML tree and loss function, output layer
  - Tabular output of VCFs and potential pathogenic variants
- Power calculations section
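
The config parameters listed above could be gathered in a single object at the top of the notebook. The sketch below is one possible shape using a standard-library dataclass. The class name, the field names and every default value are hypothetical and would need to be replaced by the project's actual choices.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical notebook configuration; all defaults are placeholders."""
    max_depth: int = 4
    features: tuple = ("IMPACT", "MAX_AF", "HGNC_ID")
    learning_rate: float = 0.1
    n_estimators: int = 200
    train_fraction: float = 0.8  # the remainder is held out for testing

    def __post_init__(self):
        # Fail early on values that no boosting library would accept.
        if self.max_depth < 1:
            raise ValueError("max_depth must be at least 1")
        if not 0.0 < self.learning_rate <= 1.0:
            raise ValueError("learning_rate must be in (0, 1]")
        if not 0.0 < self.train_fraction < 1.0:
            raise ValueError("train_fraction must be in (0, 1)")


config = ModelConfig(max_depth=6)
config_dict = asdict(config)  # convenient for logging next to the trained model
```

Keeping the configuration immutable (`frozen=True`) means a notebook cell cannot silently change a parameter after training has started.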

src/calcs_module
tests/
trained_model_published
