Our whole genome sequencing libraries were generated from DNA that was also used as part of a study that used RAD sequencing (provided by DaRT) to examine population structure across a broader geographical region in WA. DNA was extracted by DaRT and returned on 96 well plates to us prior to sending to QB3 (California Institute for Quantitative Bioscience, University of California, Berkeley) for whole genome sequencing.
Since this process involved many steps at which a mistake in sample
identity could have been introduced we performed a relatedness check on
our data against all corresponding samples that were also sequenced with
DaRT. All raw reads from DaRT were mapped against the genome and use to
call variants with freebayes. bcftools
was then used to merge results
from WGS and DaRT sequencing into a single vcf file as follows;
bcftools merge Adigi.v2.filtered.vcf.gz allrad_filt.vcf.gz |
bcftools view -i 'F_MISSING<0.1' > allrad_merged.vcf
We then used the relatedness2 function in vcftools
to calculate a
pairwise relatedness statistic between all pairs of samples in the file
vcftools --vcf allrad_merged.vcf --relatedness2
A histogram of relatedness values shows that the majority of pairs are unrelated (PHI close to or less than 0). The small peak close to 0.5 represents samples with the maximum possible relatedness value for diploids (the expected value if the samples are the same, or for clones). This peak captures all the pairs where a sample is paired with itself (is_same=TRUE). We can therefore use PHI to determine the matches between RAD and WGS samples and potentially trace any mislabelled samples.
sample_code1 | sample_code2 | individual_1 | individual_2 | RELATEDNESS_PHI |
---|---|---|---|---|
BR_5_124_S124_L004 | 1439309 | BR_5_124 | BR_5_121 | 0.486267 |
1439312 | 1439309 | BR_5_124 | BR_5_121 | 0.495394 |
1439309 | BR_5_124_S124_L004 | BR_5_121 | BR_5_124 | 0.486267 |
1439309 | 1439312 | BR_5_121 | BR_5_124 | 0.495394 |
Our sample labelled BR_5_121_S125_L004
matches sample with the ID
1439306 which matches the RAD-seq label of RS3_S_252. This sample is
the same sample that we identified as having south offshore ancestry in
our population structure analysis. By matching it with the RAD-seq data
here we can see that it was mislabelled as BR_5_121_S125_L004
and
based on its original RAD-seq label it is in fact a sample from Rowley
Shoals (south offshore).
## # A tibble: 3 × 7
## INDV1 INDV2 N_AaAa N_AAaa N1_Aa N2_Aa RELATEDNESS_PHI
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BR_5_121_S125_L004 BR_5_121_S125_L004 540 0 540 540 0.5
## 2 BR_5_121_S125_L004 1439306 519 0 540 522 0.489
## 3 1439306 BR_5_121_S125_L004 519 0 522 540 0.489