The DualRNASeq Analysis Pipeline is a robust bioinformatics workflow for analyzing host-pathogen dual RNA sequencing (DualRNASeq) data. It processes paired-end RNA-Seq reads to characterize the transcriptional responses of both host and pathogen during infection, covering the essential steps of preprocessing, alignment, contamination screening, quality control, and gene quantification. By integrating host and pathogen data in a single workflow, the pipeline enables simultaneous analysis of both transcriptional responses; unlike traditional RNA-Seq pipelines, it also includes interaction-specific steps such as metagenomic classification (Kraken2) and pangenome-based expression analysis for bacterial strains.
With a modular structure, the pipeline ensures flexibility and scalability, allowing users to adapt it to their specific projects. Each module is implemented as a separate Snakemake rule file for ease of maintenance and debugging, and all configuration is centralized in a single `config.yaml` file.
This pipeline has been effectively utilized in a serotype 3 *Streptococcus pneumoniae* (SPN) project aimed at unraveling the complexities of host-pathogen interactions during infection. By analyzing dual RNA-Seq data from mice infected with clade I and clade II strains of serotype 3 SPN, the study identified distinct transcriptional responses in both the host and pathogen. The investigation revealed critical insights into the expression of accessory genes, patterns of transcriptional dysregulation, and clade-specific virulence factors, providing a deeper and more nuanced understanding of SPN pathogenicity.
- Flexible Input Handling: Automatically detects paired-end FASTQ files in the data directory (see the example layout after this list).
- Data Preprocessing: Trims low-quality reads and adapters using `fastp`.
- Alignment: Supports alignment to both the host reference genome and pathogen pangenomes using `STAR` and `Bowtie2`.
- Contamination Screening: Screens reads left unmapped against the host for bacterial content using `Kraken2` and `BLAST`.
- Quality Control: Performs quality checks using `FastQC` and `Qualimap`, with summary reports generated by `MultiQC`.
- Gene Quantification: Quantifies gene expression levels using `htseq-count` for both host and pathogen.
- Benchmarking: Tracks runtime and resource usage for each pipeline rule, aiding in optimization.
- Comprehensive Reporting: Combines all QC and analysis outputs into a single MultiQC report.
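As an illustration, a typical `data/` layout is shown below; the `_R1`/`_R2` naming is an assumption for this sketch, so match it to the paired-end pattern your files actually use:

```
data/
├── sampleA_R1.fastq.gz
├── sampleA_R2.fastq.gz
├── sampleB_R1.fastq.gz
└── sampleB_R2.fastq.gz
```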
- Snakemake
- Conda (for environment management)
git clone https://github.com/J22160/DualRNASeq_Pipeline.git
cd DualRNASeq_Pipeline
Execute the setup script to create a Conda environment and install Snakemake and its dependencies:
chmod +x setup.sh
bash setup.sh
The script performs the following:
- Configures a Conda environment for Snakemake.
- Installs Snakemake and its dependencies into the environment.
To ensure accurate bacterial pathogen screening, download the Kraken2 database using the following commands:
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz
tar xzf k2_standard_08gb_20240605.tar.gz
rm k2_standard_08gb_20240605.tar.gz # Remove the compressed file after extraction
- Source: This database is provided by the Kraken2 team and is hosted on Amazon S3.
- Purpose: The Kraken2 database enables efficient and accurate taxonomic classification of metagenomic reads, supporting bacterial pathogen screening. This is the 8 GB capped build of the standard Kraken2 collection (RefSeq archaea, bacteria, viral, plasmid, human, and UniVec_Core) and integrates directly with the pipeline.
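As a quick sanity check after extraction (assuming `kraken2` is installed and on your `PATH`), `kraken2-inspect` can list the database's taxonomic contents:

```bash
# Print the first entries of the database's taxonomic report;
# an immediate error here usually means the path or extraction is wrong.
kraken2-inspect --db /path/to/extracted/kraken2/database | head
```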
After extracting the database, update the `kraken2_db` path in `config/config.yaml` to point to the directory where the database was extracted. For example:
kraken2_db: /path/to/extracted/kraken2/database
Edit the `config/config.yaml` file to specify paths to input data and reference files, along with parameters for sequence alignment and gene quantification.
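As a hedged illustration only (every key except `kraken2_db` below is a hypothetical placeholder; the authoritative key names are those already present in `config/config.yaml`):

```yaml
# Hypothetical config entries -- check config/config.yaml for the real keys.
data_dir: data/                        # folder scanned for paired-end FASTQ files
host_genome: references/host.fa        # host reference genome (assumed key)
host_annotation: references/host.gtf   # host annotation for htseq-count (assumed key)
pathogen_pangenome: references/spn.fa  # pathogen pangenome for Bowtie2 (assumed key)
kraken2_db: /path/to/extracted/kraken2/database
```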
The DualRNASeq pipeline can be executed on a variety of platforms, including local machines, high-performance computing (HPC) environments, and cloud-based platforms. Below are detailed instructions for running the pipeline in different scenarios, as well as the steps for performing a dry run to validate the workflow.
Before executing the pipeline, it is recommended to perform a dry run to verify the workflow without executing any tasks. This ensures that all dependencies, file paths, and configurations are correct.
Run the following command:
snakemake --use-conda --cores <number_of_cores> -n
Why is a dry run useful?
- Validates the workflow configuration and rules.
- Detects missing input files or incorrect paths.
- Prevents unnecessary execution errors, saving time and resources.
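To also see the exact shell command each rule would execute, add Snakemake's `-p` (`--printshellcmds`) flag to the dry run:

```bash
# Dry run (-n) that additionally prints each rule's shell command (-p)
snakemake --use-conda --cores 8 -n -p
```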
To execute the pipeline on a local machine, use the following command:
snakemake --use-conda --cores <number_of_cores>
Replace `<number_of_cores>` with the number of CPU cores you wish to allocate.

Alternatively, with Snakemake ≥ 8 you can deploy software through both Conda and Apptainer containers via the `--software-deployment-method` (`--sdm`) flag, which supersedes `--use-conda`:

snakemake --cores <number_of_cores> --sdm conda apptainer
For HPC environments, submit the pipeline as a job using a workload manager like Slurm. Here is an example Slurm submission script:
#!/bin/bash
#SBATCH --job-name=dualrnaseq
#SBATCH --output=logs/dualrnaseq_%j.log
#SBATCH --error=logs/dualrnaseq_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=24:00:00
#SBATCH --partition=compute
module load conda
conda activate dualrnaseq_env
# Run Snakemake directly inside the job allocation (salloc is not needed in a batch script)
snakemake --use-conda --cores "$SLURM_CPUS_PER_TASK"
Create the `logs/` directory before submitting (Slurm writes the job's log files there and will not create it), then submit the script using:
sbatch <script_name>.sh
On a cloud-based platform, such as AWS or Google Cloud, ensure that a virtual machine or instance is configured with the required resources (e.g., CPU, memory, disk space). Install Conda and the pipeline dependencies, then execute:
snakemake --use-conda --cores <number_of_cores>
Alternatively, use a managed HPC service like AWS Batch or Google Cloud Batch with containerized workflows.
- Resume from an Interrupted Run: Use the `--rerun-incomplete` flag to resume from where the pipeline stopped:

  snakemake --use-conda --cores <number_of_cores> --rerun-incomplete

- Cluster Execution: Use the `--cluster` flag to submit jobs dynamically to a cluster; note that Snakemake ≥ 8 replaces `--cluster` with executor plugins (e.g., `--executor slurm` from the `snakemake-executor-plugin-slurm` package):

  snakemake --use-conda --cores <number_of_cores> --cluster "sbatch --partition=compute --time=24:00:00"

- Generate a DAG Visualization: Create a Directed Acyclic Graph (DAG) of the pipeline for visualization:

  snakemake --dag | dot -Tpng > dag.png
These flexible options ensure compatibility across different computing environments, allowing seamless execution of the pipeline in local, HPC, or cloud-based setups.
Each rule in the pipeline incorporates Snakemake's `benchmark` feature, allowing runtime and resource usage to be tracked in `.txt` files stored in the `output/benchmarks/` directory. These files provide critical information for performance tuning and optimization.
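As a minimal sketch of what this looks like inside a rule (the rule name, file paths, and command below are hypothetical, not copied from this pipeline), the `benchmark` directive writes a tab-separated table including wall-clock time (`s`) and peak memory (`max_rss`):

```python
# Hypothetical Snakemake rule illustrating the benchmark directive.
rule fastp_trim:
    input:
        r1="data/{sample}_R1.fastq.gz",
        r2="data/{sample}_R2.fastq.gz",
    output:
        r1="output/trimmed/{sample}_R1.fastq.gz",
        r2="output/trimmed/{sample}_R2.fastq.gz",
    benchmark:
        # Runtime and resource usage land here as a tab-separated .txt file
        "output/benchmarks/fastp_{sample}.txt"
    shell:
        "fastp -i {input.r1} -I {input.r2} -o {output.r1} -O {output.r2}"
```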
Memory usage can vary depending on the size of your input data. To optimize performance:
- Edit the `config/config.yaml` file and adjust the `mem_mb` parameter for specific rules (see the sketch after this list).
- Use benchmarking results to identify rules requiring additional memory or compute resources.
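For illustration, per-rule memory settings might look like the snippet below (the key layout is an assumption, not this pipeline's actual schema). Snakemake can also be given a total memory budget at the command line with the real `--resources` flag, e.g. `snakemake --use-conda --cores 16 --resources mem_mb=64000`:

```yaml
# Hypothetical per-rule memory settings in config/config.yaml
star:
  mem_mb: 32000   # host genome alignment is usually the most memory-hungry step
bowtie2:
  mem_mb: 8000
```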
project_root/
├── config/ # Configuration files
│ └── config.yaml # Centralized configuration file
├── data/ # Input data (raw FASTQ files)
├── references/ # Reference genome and annotation files
├── rules/ # Individual Snakefiles for pipeline steps
│ ├── get_data.smk
│ ├── preprocessing.smk
│ ├── alignment.smk
│ ├── alignment_qc.smk
│ ├── kraken_screening.smk
│ ├── gene_quantification.smk
│ └── multiqc.smk
├── environments/ # Conda environments for reproducibility
│ ├── fastqc.yaml
│ ├── fastp.yaml
│ ├── star.yaml
│ ├── bowtie2.yaml
│ ├── multiqc.yaml
│ ├── kraken2.yaml
│ ├── htseq-count.yaml
│ └── qualimap.yaml
│
└── Snakefile # Main pipeline file
For questions or support, please contact:
Jash Trivedi
- Email: jashtrivedi221@gmail.com
- GitHub: J22160
- LinkedIn: Jash Trivedi
This pipeline incorporates multiple open-source tools and is made possible by the contributions of the bioinformatics community. Special thanks to the developers of Snakemake and the individual tools used in this workflow.