Quality Control

Methods and scripts used for checking the quality of datasets at various stages of the project are documented here. Some methods are for the raw data inspection, while others are to check the source/line integrity, as well as the final data results quality.

Quality inspection was done for:

Illumina short reads generated for each NAM were inspected to make sure that sequencing was done on the right NAM line before proceeding with the PacBio sequencing and assembly. The data was compared with the HapMap data to make sure the newly sequenced line clusters with the correct accession within the HapMap data. RawSequence obtained from BaseSpace were already inspected for the sequencing quality, so it was skipped.
PacBio data generated from multiple sequencing facility was tested using SequelTools which provides various quality metrics for inspecting and accessing the data quality.
RNA-seq data downloaded from BaseSpace (which was already tested for quality, demultiplexed and trimmed of any adapter sequences) and were tested by first mapping against V4 B73 (to check the mapping percent), counts were generated against genic features provided by the V4 annotation, and count based clustering using DESeq2 was performed to test correct clustering of various tissues within the NAM genome. The RNAseq data was further tested to verify its source (accession and tissue) by calling variants using GATK recommended workflow.
Assembly QC Finished assembled sequences (contigs/scaffolds/pseudomolecule) were used as input sequences in FASTA format to the GenomeQC pipeline to calculates different quality metrics for assessing the contiguity and gene space completness of the NAM genome assemblies. For assessing gene space completeness, Embryophyta odb9 (lineage specific profile) and maize (AUGUSTUS gene prediction species profile) were selected as the input parameters for calculating % BUSCO completeness.

Illumina QC:

A clean data set containing HapMap2 reference maize SNPs was prepared
- HapMap2 maize SNP reference data was downloaded from the panzea website in the condensed HapMap format
- Monomorphic loci were removed using a custom Python scripts
- The file was converted from the condensed HapMap format to the standard Hapmap format using a custom Python script
- The file was then filtered so as to remove all SNPs other than those associated with relevant NAM founders and TIL11
A clean SNP data set was generated using raw Illumina reads from one or more newly sequenced NAM founders (modified from Li's scripts: 1, 2, 3
- read mapping was done with BWA in a bash pipeline
- SNP calling was done with GATK producing a vcf file
The HapMap reference SNPs and newly generated NAM founder SNPs were merged producing a SNP file in standard HapMap format
Percent Heterozygosity was calculated from merged SNP data set with the script
SNPphylo was used to build a phylogenec tree from the merged SNP data set
- The SNP file was randomly subsampled such that the SNPphylo input contains no more than 50,000 SNPS ,which is a requirement of SNPphylo, see
- SNPphylo was used to generate the phylogenetic tree which was visualized in FigTree

Scripts used for this parts are in the scripts directory.

PacBio QC:

SequelTools was run on PacBio data to collect QC metrics. SequelTools was run as described in the manual.

# for each NAM, collect the names of scraps and subread bam files
find $(pwd) -name "*subreads.bam"  > subreads.txt
find $(pwd) -name "*scraps.bam"  > scraps.txt
# Run SequeTools with default options
bash SequelTools.sh -t Q -u subreads.txt -c scraps.txt

The full Excel Sheet with summary stats for each SMRTcell is available here.

Summary Stats

NAM	Gsize (Gb)	Facility	SMRT cells	Mean SR read length	Median SR read length	Mean N50	Depth of coverage (SR)
CML277	2.337	Arizona	27	8840.81	6137.0	15682.2	70.7
KY21	2.286	OSU	21	11824.67	9602.0	18338.8	68.7
Tzi8	2.346	Arizona	20	11315.10	8717.0	18732.2	66.4
CML247	2.351	OSU	22	10721.71	8001.0	17945.6	68.3
IL14H	2.102	UGA	26	10795.13	8094.0	18152.4	85.5
Oh7b	2.203	BYU	26	12302.99	10963.0	18828.7	70.6
HP301	2.208	UGA	22	12721.05	10033.5	20426.8	73.8
M162W	2.238	OSU	22	13293.20	11109.5	20897.4	76.2
Ki3	2.348	UGA	22	12870.39	11029.5	19749.3	63.5
CML228	2.364	BYU	28	11296.39	9103.0	17818.4	68.3
NC358	2.272	Arizona	21	13711.53	11020.0	21816.6	71.2
CML103	2.309	UGA	23	11583.03	9071.0	18756.8	72.2
CML52	2.332	BYU	26	11234.27	9013.0	17672.0	69.1
OH43	2.222	OSU	17	13953.02	11157.0	22216.4	69.9
CML069	2.498	BYU	22	13392.14	9816.0	22256.7	64.4
CML322	2.305	OSU	18	13939.05	10760.0	22004.8	65.6
M37W	2.330	UGA	22	15110.58	12032.0	24461.1	65.9
MO18W	2.328	OSU	22	13742.10	10022.0	23843.7	85.2
MS71	2.192	Arizona	22	15387.96	11909.5	25324.4	76.1
P39	2.091	BYU	19	16060.80	11951.0	26507.0	81.3
CML333	2.401	BYU	21	15666.66	11660.0	26361.6	63.2
B73AB10	2.300	UGA	22	15965.76	13083.0	26462.7	62.7
Tx303	2.346	Arizona	20	16180.25	11968.0	27810.9	71.1
B73	2.300	UGA	18	17246.77	12437.5	29671.6	82.9
KI11	2.381	UGA	18	16579.53	11814.5	29169.9	65.9
B97	2.284	Arizona	23	16339.04	13496.0	29428.0	71.3
NC350	2.394	UGA	19	18654.96	14166.0	31717.6	65.9

RNA-Seq QC:

General Data Quality

RNA-seq data downloaded from BaseSpace were mapped to B73.v4 genome using STAR (2-pass mapping), counts were gathered using the v4 annotation in gff3 format using featureCounts, followed by clustering using DESeq2. Methods are as follows:

Steps:

soft-link all the input-files creating directory for each NAM: soft-link-rnaseq-files.sh
Download B73.v4 and index the genome using STAR: runSTARmap_index.sh
Create commands for each NAM directory: star-cmds-gen.sh, this will generate commands for
- Map raw reads to the indexed genome using STAR (1st pass): runSTARmap_round1.sh
  - Aggregate splice sites to generate a file for reliable splice sites found in the genome: sjCollapseSamples.awk
  - Map raw reads to the indexed genome again, but using the splice sites to recalibrate alignment and generate a bam file: runSTARmap_round2.sh
  - Run the featureCoutns from subread package to get counts for each gene present in the annotation: runSubreads.sh. It also runs multiqc for gathering quality metrics.
  - Clean the counts file and generate R script process it with DESeq2 package using: cleanSubreadCounts.sh. The base R-script is: DESeq2.R, which needs to be run to create plots and clustering data.
  - Gather summary stats using scripts tabulator.sh and getCoverageStats.sh
The plots and tables were imported to Excel and PowerPoint to evaluating the quality. The files are available here: RNA_seq_v5.2.pptx and RNAseq_inventory_V5.2.xlsx.

Accession and Tissue integrity

Each RNAseq library was mapped to B73.v4 and SNPs were called using GATK. The SNPs were then clustered to generate a phylogenetic tree, compared to the NAM founder phylogenetic tree (from HapMap2) to detect any sample/tissue/accession mislabeling or switching. The methods/scripts are listed below:

The bam file generated from the previous step was processed using GATK and Picard toolkit using the script process-rnaseq-bams.sh. Briefly, this scripts adds read-group information, marks duplicate reads in the mapped bam file, and splits N cigar string in mapped reads.
Run GATK HaplotypeCaller on the processed bam files. To speed it up, the commands were generated using intervals, processing 3Mb regions at a time, simultaneously, running them on the cluster. The script used: gatkcmds-gen.sh. The jobs script wrapper makeSLURMp.py was used.
The VCF files for each region were gathered and filtered to generate a single VCF file.
SNPphylo was used to contruct phylogenetic tree and the tree was inspected to check if the tissue samples from each accession clustered correctly. Any irregularities were flagged and removed from the final analyses (samples that could be easily obtained were re-sequenced). Script snp-phylo-run.sh.

Assembly QC

sh pipeline.sh genome.fasta output_assembly_stats output_buscometrics embryophyta_odb9 maize 32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Quality Control

Illumina QC:

PacBio QC:

Summary Stats

RNA-Seq QC:

General Data Quality

Accession and Tissue integrity

Assembly QC

Files

README.md

Latest commit

History

README.md

File metadata and controls

Quality Control

Illumina QC:

PacBio QC:

Summary Stats

RNA-Seq QC:

General Data Quality

Accession and Tissue integrity

Assembly QC