Accurate measurement of fetal fraction is important for ensuring reliable results in noninvasive prenatal testing. However, measuring fetal fraction can require a large amount of data and incur additional costs. This study therefore proposes an alternative method for measuring fetal fraction with a limited sample size and a low number of sequencing reads. Adaptive machine learning algorithms customized to each laboratory’s environment were used to measure fetal fraction. Pregnant women carrying female fetuses were used for testing, to exclude the bias caused by training on data from women carrying male fetuses. The accuracy of fetal DNA fraction prediction improved as the training sample size increased. When trained with 1,000 samples (males) and tested with 45 samples (females), the optimal bin sizes for the read count and read size features were 300 kb and 800 kb, respectively. Compared with the 50 kb bin used by SeqFF at 4,000–5,000 training samples, the new 300 kb bin gives a correlation that is approximately 3–5% higher. We propose an effective, tailored method for measuring fetal fraction that individual laboratories can use with limited sample collection and relatively low-coverage sequencing data.
For more details, please see our paper.
The bin info files in the RC and RL folders are named like rc_bin*... and rl_bininfo*... and have no header row. For example:
Sample1.Fastq.sam.bam.sort.bam.rmdup.bam.sam.rl,19.05270566
Sample2.Fastq.sam.bam.sort.bam.rmdup.bam.sam.rl,17.65618359
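As a minimal sketch of how such a header-less, comma-separated file could be read, assuming each entry is a file name followed by one numeric value on its own line. The helper name `read_headerless_pairs` and the interpretation of the two columns are illustrative assumptions, not part of the repository.

```python
import csv
import io


def read_headerless_pairs(handle):
    """Read 'name,value' lines that have no header row into a dict."""
    pairs = {}
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        name = row[0].strip()
        pairs[name] = float(row[1])  # second column is assumed to be numeric
    return pairs


# Example lines copied from the README text above
example = io.StringIO(
    "Sample1.Fastq.sam.bam.sort.bam.rmdup.bam.sam.rl,19.05270566\n"
    "Sample2.Fastq.sam.bam.sort.bam.rmdup.bam.sam.rl,17.65618359\n"
)
values = read_headerless_pairs(example)
for name, value in values.items():
    print(name, value)
```

For a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`.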
Read Count (RC) and Read Length (RL) files must be in the .rc and .rl formats, with header rows as follows.
"BIN","CHR","END","COUNT","GC"
chr1_0,chr1,300000,583,0.430783082518
chr1_1,chr1,600000,474,0.444418530072
"BIN","CHR","END","RRL"
chr1_0,chr1,800000,0.255024255024
chr1_1,chr1,1600000,0.262870514821
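Because format problems are a likely source of errors, a quick header check before training may help. The following is a minimal sketch, assuming the header rows shown above are the only accepted ones; `EXPECTED_HEADERS` and `load_bins` are hypothetical names introduced here for illustration.

```python
import csv
import io

# Expected header rows, taken from the examples above
EXPECTED_HEADERS = {
    ".rc": ["BIN", "CHR", "END", "COUNT", "GC"],
    ".rl": ["BIN", "CHR", "END", "RRL"],
}


def load_bins(handle, extension):
    """Return the data rows as dicts, or raise ValueError if the header is wrong."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_HEADERS[extension]:
        raise ValueError(
            f"unexpected header {reader.fieldnames!r} for a {extension} file"
        )
    return list(reader)


rc_text = (
    '"BIN","CHR","END","COUNT","GC"\n'
    "chr1_0,chr1,300000,583,0.430783082518\n"
    "chr1_1,chr1,600000,474,0.444418530072\n"
)
rl_text = (
    '"BIN","CHR","END","RRL"\n'
    "chr1_0,chr1,800000,0.255024255024\n"
    "chr1_1,chr1,1600000,0.262870514821\n"
)

rc_rows = load_bins(io.StringIO(rc_text), ".rc")
rl_rows = load_bins(io.StringIO(rl_text), ".rl")
print(len(rc_rows), "rc bins;", len(rl_rows), "rl bins")
```

All values are returned as strings; convert `COUNT`, `GC`, and `RRL` to numbers as needed.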
- pandas
- numpy
- scikit
To install a Python library, run pip install <library name>, e.g. pip install pandas.
- doParallel
- glmnet
- Matrix
- MASS
- methods
install.packages(c('Matrix', 'glmnet', 'MASS', 'foreach', 'doParallel'))
If any error arises, please check the file formats specified above, the installed packages, and the paths to Python and R on your system.
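A small check like the one below can narrow down installation problems. It is only a sketch: the import name `sklearn` (for scikit-learn) and the executable name `Rscript` are assumptions about what the scripts need, and `check_environment` is a hypothetical helper.

```python
import importlib.util
import shutil


def check_environment(modules, executables):
    """Report which Python modules are importable and which executables are on PATH."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        report[name] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH
        report[exe] = shutil.which(exe) is not None
    return report


# Assumed names: "sklearn" is the import name of scikit-learn,
# "Rscript" is R's command-line script runner.
status = check_environment(["pandas", "numpy", "sklearn"], ["Rscript"])
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Anything reported as MISSING should be installed or added to the PATH before running the scripts.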
Keep all sam files inside the sam folder, e.g. TheragenGenomecare/sam/.
Run the Python script:
python bam_rl_read.py
This step converts the sam files to Read Count (rc) and Read Length (rl) format files. It may take a long time, depending on the size of the input data.
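To illustrate the idea behind the read count conversion, here is a minimal sketch that counts reads per fixed-size bin from SAM text. It is not the logic of bam_rl_read.py: it assumes each read is assigned to a bin by its leftmost mapped position (SAM column 4, 1-based), and it does not compute the GC or RRL columns. The function names are hypothetical.

```python
from collections import Counter


def count_reads_per_bin(sam_lines, bin_size=300_000):
    """Count mapped reads per (chromosome, bin index) from SAM-formatted lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        chrom, pos = fields[2], int(fields[3])  # RNAME and 1-based POS
        if chrom == "*" or pos == 0:  # unmapped read
            continue
        counts[(chrom, (pos - 1) // bin_size)] += 1
    return counts


def format_rows(counts, bin_size=300_000):
    """Format counts as BIN,CHR,END,COUNT rows (no GC column in this sketch)."""
    return [
        f"{chrom}_{idx},{chrom},{(idx + 1) * bin_size},{n}"
        for (chrom, idx), n in sorted(counts.items())
    ]


# Three made-up reads: two fall in the first 300 kb bin, one in the second
sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t300000\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t300001\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
bin_counts = count_reads_per_bin(sam)
rows = format_rows(bin_counts)
print("\n".join(rows))
```

The `bin_size` default of 300,000 matches the 300 kb read count bin mentioned above; the read length feature would use a different bin size.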
After the rc and rl files are ready, place all of them in the training and testing folders, together with the corresponding bininfo files.
To train, run
python GenomomFF_training.py
in a terminal, from the folder where GenomomFF_training.py is located.
For 1,000 sets of data, training took around 4 minutes on our system. When GenomomFF_training.py finishes successfully, it creates the rc and rl parameter files inside the training folder; these files are then used for testing.
To test, run
python GenomomFF_testing.py
When it finishes, a csv file with the correlation values is saved inside the testing folder.
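The README does not state which correlation measure is reported. As an illustration only, the sketch below computes a Pearson correlation between predicted and reference fetal fraction values; the choice of Pearson, the function name `pearson_r`, and the numbers are all assumptions made for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration, not results from the study
reference = [8.0, 10.5, 12.0, 15.5, 19.0]
predicted = [8.4, 10.1, 12.9, 14.8, 18.2]
print(f"r = {pearson_r(reference, predicted):.3f}")
```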