Request an interactive node, connected via sleep
or tmux
so that you don't lose progress when disconnected.
Following programs are required (can be installed via conda
, see instructions below):
bwa
samtools
bioawk
pigz
read1="/ptmp/arnstrm/hic_mapping/AP8220001_R1.fastq.gz"
read2="/ptmp/arnstrm/hic_mapping/AP8220001_R2.fastq.gz"
refSeq="/ptmp/arnstrm/hic_mapping/tdacts_asm_hap1.fasta"
projectDir="/ptmp/arnstrm/hic_mapping/tdacts_asm_hap1"
threads=${SLURM_CPUS_ON_NODE}
splitthreads=$(expr ${threads} / 2)
memory=$(free -g | awk 'NR==2{print $4"G"}')
juicer_version=1.6.2
ext=$(basename ${projectDir})
site="DpnII"
ligation="GATCGATC"
groupname="a$(date +%s)"
justexact=0
genomeID="${ext}"
baseoutname=$(basename ${read1} |cut -f1 -d "_")
You can use other options for site
and ligation
depending on your experiment. The official Juicer supports following options:
Site | Ligation |
---|---|
HindIII | "AAGCTAGCTT" |
MseI | "TTATAA" |
DpnII | "GATCGATC" |
MboI | "GATCGATC" |
NcoI | "CCATGCATGG" |
Arima | "'(GAATAATC|GAATACTC|GAATAGTC|GAATATTC|GAATGATC|GACTAATC|GACTACTC|GACTAGTC|GACTATTC|GACTGATC|GAGTAATC|GAGTACTC|GAGTAGTC|GAGTATTC|GAGTGATC|GATCAATC|GATCACTC|GATCAGTC|GATCATTC|GATCGATC|GATTAATC|GATTACTC|GATTAGTC|GATTATTC|GATTGATC)'" |
none | "XXXX" |
This is main Juicer version 1.6.
cd ${projectDir}
git clone git@github.com:HuffordLab/Juicer_PanAnd.git
mv Juicer_PanAnd juicerdir
juiceDir="${projectDir}/juicerdir"
cd ${juiceDir}
conda env create -f juicer.yml
conda activate juicer
mkdir -p ${projectDir}/fastq
mkdir -p ${projectDir}/references
mkdir -p ${projectDir}/aligned
mkdir -p ${projectDir}/splits
mkdir -p ${projectDir}/debug
mkdir -p ${projectDir}/restriction_sites
mkdir -p ${projectDir}/HIC_tmp
outputdir="${projectDir}/aligned"
splitdir="${projectDir}/splits"
debugdir="${projectDir}/debug"
tmpdir="${projectDir}/HIC_tmp"
ln -s ${read1} ${projectDir}/fastq/
ln -s ${read2} ${projectDir}/fastq/
ln -s ${refSeq} ${projectDir}/references/
Juicer requries genome index (BWA), restriction sites, and scaffold sizes for the input genome. While the scaffold file size generation is fast, the other 2 files need considerable time depending on your genome size.
# chrom sizes file
bioawk -c fastx '{print $name"\t"length($seq)}' ${refSeq} > ${projectDir}/chrom.sizes
genomePath=${projectDir}/chrom.sizes
# restriction sites for the genome
python ${juiceDir}/scripts/generate_site_positions.py DpnII ${refSeq%.*} ${refSeq}
mv ${refSeq%.*}_${site}.txt ${projectDir}/restriction_sites/
site_file="${projectDir}/restriction_sites/$(basename ${refSeq%.*})_${site}.txt"
# bwa genome index
cd ${projectDir}/references
bwa index ${refSeq} && cd ${projectDir}
refSeq="${projectDir}/references/$(basename ${refSeq})"
Some information regarding the versions of programs is needed for the later summary stats files. Juicer captures the command you ran and saves it in the file along with the versions of various programs. We are going to simulate the same behaviour using the below commands. This is optional and is not needed for the generating the output file.
# creating header file
headfile="${projectDir}/debug/headers"
echo -en "Juicer version $juicer_version;" > ${headfile}
echo -en "BWA: $(bwa 2>&1 | awk '$1 ~/Version/ {print $NF}')" >> ${headfile}
echo -en "$threads threads;\n${memory} memory" >> ${headfile}
java --version | head -n 1 >> ${headfile}
${juiceDir}/scripts/juicer_tools -V | head -n 1 >> ${headfile}
cat >> ${headfile} <<- EOF
juicer.sh \
-d ${projectDir} \
-s ${site} \
-p ${genomePath} \
-y ${site_file} \
-z ${refSeq} \
-D ${juicerDir} \
-b ${ligation} \
-t ${threads}
EOF
# countLigations
num1=$(paste <(bioawk -c fastx '{print $seq}' ${read1}) <(bioawk -c fastx '{print $seq}' ${read2}) | grep -cE ${ligation})
num2=$(pigz -d -p ${threads} -c ${read1} | wc -l | awk '{print $1}')
echo -en "${num1} " > ${splitdir}/${ext}_${baseoutname}_norm.txt.res.txt
echo -e "${num2}" > ${splitdir}/${ext}_${baseoutname}_linecount.txt
This probably takes a long time. Depening on how many CPUs you have and the memory. If possible, run this separately as a slurm job.
bwa mem -SP5M -t ${threads} ${refSeq} ${read1} ${read2} > ${splitdir}/${ext}_${baseoutname}.sam
For a 3Gb genome, with 40Gb x 2 compressed fastq files (517,952,838 x 2 reads), it took about 39,004.633 seconds walltime (10.8 hrs) or 2,463,509.733 seconds CPU time (684.31 hrs)
The machine:
Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
500Gb RAM, 64 CPUs
This step will also take some time to finish as it is single threaded job and mostly disk I/O intensive.
awk -v "fname1"=${splitdir}/${ext}_${baseoutname}_norm.txt -v "fname2"=${splitdir}/${ext}_${baseoutname}_abnorm.sam -v "fname3"=${splitdir}/${ext}_${baseoutname}_unmapped.sam -f ${juiceDir}/scripts/chimeric_blacklist.awk ${splitdir}/${ext}_${baseoutname}.sam
perl ${juiceDir}/scripts/fragment.pl ${splitdir}/${ext}_${baseoutname}_norm.txt ${splitdir}/${ext}_${baseoutname}.frag.txt ${site_file}
This will sort by chr, frag, strand, and position (in that order), and it will take some time to finish (depending on your systemps resource availablity)
sort --parallel=${threads} \
--buffer-size=${memory} \
--temporary-directory=${tmpdir} \
--key=2,2d --key=6,6d --key=4,4n --key=8,8n --key=1,1n --key=5,5n --key=3,3n ${splitdir}/${ext}_${baseoutname}.frag.txt > ${splitdir}/${ext}_${baseoutname}.sort.txt
cp ${splitdir}/${ext}_${baseoutname}.sort.txt ${outputdir}/merged_sort.txt
We break this in to two parts. First we run the modified split_rmdups.awk
to split the merged_sort.txt
file, and rather than submitting them as individual slurm
job, we run them sequentially (does not take much time).
# generate commands and write cmds.sh file
awk -v dir=${projectDir} -v groupname=${groupname} -v debugdir=${debugdir} -v juicedir=${juiceDir} -v site=${site} -v genomeID=${genomeID} -v genomePath=${genomePath} -v justexact=0 -f ${juiceDir}/scripts/split_rmdups.awk ${outputdir}/merged_sort.txt > cmds.sh
# change permissions
chmod +x cmds.sh
# run
./cmds.sh
If your goal is using the HiC data for geneome assembly, you can stop here and use merged_nodups.txt
file with 3d-DNA
(scaffolding). However, you may want to check your data for few key metrics to ensure you have HiC data suffcient for scaffolding.
The page scaffolding_3dDNA.md details how to run 3D-DNA for scaffolding a draft (contig) assembly.
cat ${headfile} > ${outputdir}/inter.txt
cat ${splitdir}/*.res.txt |\
awk -f ${juiceDir}/scripts/stats_sub.awk >> ${outputdir}/inter.txt
${juiceDir}/scripts/juicer_tools LibraryComplexity ${outputdir} inter.txt >> ${outputdir}/inter.txt
cp ${outputdir}/inter.txt ${outputdir}/inter_30.txt
perl ${juiceDir}/scripts/statistics.pl \
-s ${site_file} \
-l ${ligation} \
-o ${outputdir}/inter.txt \
-q 1 ${outputdir}/merged_nodups.txt
perl ${juiceDir}/scripts/statistics.pl \
-s ${site_file} \
-l ${ligation} \
-o ${outputdir}/inter_30.txt \
-q 30 ${outputdir}/merged_nodups.txt
${juiceDir}/scripts/juicer_tools pre \
-f ${site_file} \
-s ${outputdir}/inter.txt \
-g ${outputdir}/inter_hists.m \
-q 1 ${outputdir}/merged_nodups.txt ${outputdir}/inter.hic ${genomePath}
${juiceDir}/scripts/juicer_tools pre \
-f ${site_file} \
-s ${outputdir}/inter_30.txt \
-g ${outputdir}/inter_30_hists.m \
-q 30 ${outputdir}/merged_nodups.txt ${outputdir}/inter_30.hic ${genomePath}
Some steps that are in Juicer scripts, if you need them to be run. For genome assemblies, we just need the merged_nodups.txt
file.
#collect collisions and dedup
cat $splitdir/*_abnorm.sam > $outputdir/abnormal.sam
cat $splitdir/*_unmapped.sam > $outputdir/unmapped.sam
awk -f ${juiceDir}/scripts/collisions.awk ${outputdir}/abnormal.sam > ${outputdir}/collisions.txt
gawk -v fname=${outputdir}/collisions.txt -f ${juiceDir}/scripts/collisions_dedup_rearrange_cols.awk ${outputdir}/collisions.txt |\
sort -k3,3n -k4,4n -k10,10n -k11,11n -k17,17n -k18,18n -k24,24n -k25,25n -k31,31n -k32,32n |\
awk -v name=${outputdir} -f ${juiceDir}/scripts/collisions_dups.awk
# annotating loops
${juiceDir}/scripts/juicer_hiccups.sh \
-j ${juiceDir}/scripts/juicer_tools \
-i $outputdir/inter_30.hic \
-m ${juiceDir}/references/motif \
-g $genomeID
# annotate contact domains
${juiceDir}/scripts/juicer_arrowhead.sh \
-j ${juiceDir}/scripts/juicer_tools \
-i ${outputdir}/inter_30.hic