Sample label check using Radseq data

Our whole genome sequencing libraries were generated from DNA that was also used as part of a study that used RAD sequencing (provided by DaRT) to examine population structure across a broader geographical region in WA. DNA was extracted by DaRT and returned on 96 well plates to us prior to sending to QB3 (California Institute for Quantitative Bioscience, University of California, Berkeley) for whole genome sequencing.

Since this process involved many steps at which a mistake in sample identity could have been introduced we performed a relatedness check on our data against all corresponding samples that were also sequenced with DaRT. All raw reads from DaRT were mapped against the genome and use to call variants with freebayes. bcftools was then used to merge results from WGS and DaRT sequencing into a single vcf file as follows;

bcftools merge  Adigi.v2.filtered.vcf.gz allrad_filt.vcf.gz | 
    bcftools view -i 'F_MISSING<0.1' > allrad_merged.vcf

We then used the relatedness2 function in vcftools to calculate a pairwise relatedness statistic between all pairs of samples in the file

vcftools --vcf allrad_merged.vcf --relatedness2

A histogram of relatedness values shows that the majority of pairs are unrelated (PHI close to or less than 0). The small peak close to 0.5 represents samples with the maximum possible relatedness value for diploids (the expected value if the samples are the same, or for clones). This peak captures all the pairs where a sample is paired with itself (is_same=TRUE). We can therefore use PHI to determine the matches between RAD and WGS samples and potentially trace any mislabelled samples.

sample_code1	sample_code2	individual_1	individual_2	RELATEDNESS_PHI
BR_5_124_S124_L004	1439309	BR_5_124	BR_5_121	0.486267
1439312	1439309	BR_5_124	BR_5_121	0.495394
1439309	BR_5_124_S124_L004	BR_5_121	BR_5_124	0.486267
1439309	1439312	BR_5_121	BR_5_124	0.495394

Our sample labelled BR_5_121_S125_L004 matches sample with the ID 1439306 which matches the RAD-seq label of RS3_S_252. This sample is the same sample that we identified as having south offshore ancestry in our population structure analysis. By matching it with the RAD-seq data here we can see that it was mislabelled as BR_5_121_S125_L004 and based on its original RAD-seq label it is in fact a sample from Rowley Shoals (south offshore).

## # A tibble: 3 × 7
##   INDV1              INDV2              N_AaAa N_AAaa N1_Aa N2_Aa RELATEDNESS_PHI
##   <chr>              <chr>               <dbl>  <dbl> <dbl> <dbl>           <dbl>
## 1 BR_5_121_S125_L004 BR_5_121_S125_L004    540      0   540   540           0.5  
## 2 BR_5_121_S125_L004 1439306               519      0   540   522           0.489
## 3 1439306            BR_5_121_S125_L004    519      0   522   540           0.489

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

18.radseq_check.md

18.radseq_check.md

Sample label check using Radseq data

Files

18.radseq_check.md

Latest commit

History

18.radseq_check.md

File metadata and controls

Sample label check using Radseq data