The 16S rRNA gene encodes the RNA component of the small (30S) ribosomal subunit in bacteria and archaea; "16S" refers to the molecule's sedimentation coefficient (Dependency Map of Proteins in the Small Ribosomal Subunit). The gene is essential: its product is required for initiating protein synthesis and for stabilizing correct codon-anticodon pairing in the A site of the ribosome during mRNA translation (The distribution, diversity, and importance of 16S rRNA gene introns in the order Thermoproteales). It is often called "the molecular fossil" of bacteria because it is both highly conserved and specific, which makes it the most widely used marker gene for genus and species identification and for taxonomic classification (16S/18S/ITS Amplicon Sequencing). The gene is about 1,500 bp long and consists of conserved and variable regions. The conserved regions are shared across bacteria, while the variable regions differ between taxa and therefore carry the information that distinguishes genera and species (16S rRNA, One of the Most Important rRNAs).
The gene is widely used in microbiome analysis. Analysis pipelines, including those built on QIIME2 and DADA2, have been developed and improved over the years. Our internship project required us to review some of the existing pipelines and determine which ones perform best. We were also to extend the workflows by filling any gaps we found in the functionality of a good pipeline.
- To review existing microbiome workflows, identify the best ones, and extend them where there are gaps, especially to make them useful for insect and pathogen data.
- Testing workflows
- We tested the following pipelines according to the functionality described in their documentation:
- We used different datasets when running the pipelines.
- Identifying gaps
- While testing the pipelines, we identified gaps using the following criteria:
- How easy are they to set up and use? Do they provide accessible documentation and tutorials?
- Are they fast and easily scalable based on available computer resources?
- Can they scale to the cloud?
- Can they be used on a variety of data, including insect and pathogen microbiomes?
- Are they implemented using the latest specifications and versions of the tools? For example, does the pipeline use Nextflow DSL2 syntax and Docker or Singularity containers?
- Are they well and regularly maintained? When were they updated last?
- We were also able to find gaps from the errors we came across.
- Extending workflows
- We worked around the errors we got and extensively tested the final extended workflows using different sets of data.
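One of the criteria above asks whether a pipeline uses Nextflow DSL2 syntax. A minimal sketch of how that could be checked in a local clone is shown below. The helper name `declares_dsl2` and the file names are hypothetical, and the check is only a heuristic: it looks for an explicit `nextflow.enable.dsl = 2` declaration, which recent Nextflow releases no longer require because DSL2 is their default.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a Nextflow script explicitly enables DSL2.
# Heuristic only: a missing declaration does not prove the script is DSL1.
declares_dsl2() {
    grep -Eq '^[[:space:]]*nextflow\.enable\.dsl[[:space:]]*=[[:space:]]*2' "$1"
}

# Demonstration on two throwaway files (illustrative names and content).
tmp=$(mktemp -d)
printf 'nextflow.enable.dsl = 2\nworkflow { }\n' > "$tmp/new_main.nf"
printf 'process foo { }\n' > "$tmp/old_main.nf"

for f in "$tmp/new_main.nf" "$tmp/old_main.nf"; do
    if declares_dsl2 "$f"; then
        echo "$(basename "$f"): declares DSL2"
    else
        echo "$(basename "$f"): no explicit DSL2 declaration"
    fi
done
rm -rf "$tmp"
```

In practice the function would be pointed at a pipeline's `main.nf`; a negative result is a prompt to read the script, not a verdict.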
| WEEK | ACTIVITY |
|---|---|
| WEEK 1 | Obtaining test datasets |
| | Assessing workflow performance |
| WEEK 2 | Running and troubleshooting MBBU/16S-Accreditation, H3ABioNet-SOPs, H3ABioNet-TADA, H3ABioNet-16S, nf-core/ampliseq |
| WEEK 3 | Testing nf-core/ampliseq using stingless-bee data |
| | Running Yosef's DADA2 pipeline and MBBU/16S-Accreditation DADA2 pipeline |
| WEEK 4 | Obtaining more test data |
| | Testing the nf-core pipeline using different datasets (ITS data and 18S data) |
| | Solving week 3 errors and running Yosef's DADA2 pipeline, MBBU/16S-Accreditation DADA2 pipeline |
| WEEK 5 | Solving week 4 errors and running Yosef's DADA2 pipeline, MBBU/16S-Accreditation DADA2 pipeline |
| | Testing the MBBU/16S-Accreditation DADA2 pipeline using different datasets (stingless bee, dog, and nf-core/ampliseq data) |
| | Testing the nf-core pipeline using different datasets (ITS, PacBio, 18S, stingless bee microbiome, and IonTorrent data) |
| | Creating a test dataset, a test config and including flags for MBBU/16S-Accreditation QIIME2 pipeline |
| | Writing new documentation for MBBU/16S-Accreditation |
| WEEK 6 | Running Yosef's pipeline |
| | Viewing nf-core/ampliseq functional analysis results using STAMP |
| WEEK 7 | Report writing |
- Some of the criteria are scored on the following scale:
- 1 - Very good
- 2 - Good
- 3 - Fairly good
- 4 - Bad
- 5 - Very bad
- Note that MBBU-16S_Accreditation-QIIME2 and MBBU-16S_Accreditation-DADA2 are in the same GitHub repository.
Criteria | nf-core | H3ABioNet-SOPs | H3ABioNet-TADA | MBBU-16S_Accreditation-QIIME2 | MBBU-16S_Accreditation-DADA2 |
---|---|---|---|---|---|
Sequences | 16S, 18S, ITS | 16S | 16S, ITS | 16S | 16S |
Tools and Databases | Tools, Databases | Tools, Databases | Not well defined | Tools, Databases | Tools, Databases |
QIIME2 | Yes | Yes | No | Yes | No |
DADA2 | No | No | Yes | No | Yes |
Practice Dataset and Metadata | Available | Available | N/A | N/A | N/A |
Versions | [9 Releases](https://github.com/nf-core/ampliseq/releases) | N/A (It is an SOP) | 1 Release | 1 Release | 1 Release |
Command Arguments | Available | N/A (It is an SOP) | Available | N/A | N/A |
Results | Available | N/A (It is an SOP) | N/A | Available | N/A (While running this R-script, some results are saved: script) |
Documentation | 1 | 2 for the guidelines provided; 4 for navigating to the SOP | 3 | 4 | 4 |
Contributions and Support | Guidelines and Slack, 17 contributors, 75 stars, 50 forks | N/A | 7 contributors, 10 stars, 11 forks | 9 contributors, 1 star, 2 forks | 9 contributors, 1 star, 2 forks |
Running on cloud | Yes (results from AWS cloud) | N/A (It is an SOP) | Yes (AWS configs) | N/A | N/A |
Languages | Codes | N/A (It is an SOP) | Codes | Codes | Codes |
Issues | 366 | N/A (It is an SOP) | 32 | 10 | 10 |
Last updated | October 2021 | February 2019 | September 2021 | April 2021 | April 2021 |
- There are questions on operation, run time, and output analysis that could serve as criteria when reviewing workflows.
- It would be a good SOP to guide one in creating a QIIME2 pipeline from scratch.
- This pipeline performs Targeted Amplicon Diversity Analysis (TADA) using DADA2 and is implemented in Nextflow.
- The typical command for running the pipeline is:
nextflow run uct-cbio/16S-rDNA-dada2-pipeline --reads '*_R{1,2}.fastq.gz' --trimFor 24 --trimRev 25 --reference 'gg_13_8_train_set_97.fa.gz' -profile uct_hex
- It outputs results mostly in RDS format.
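The `--reads '*_R{1,2}.fastq.gz'` pattern in the command above expects each sample to have both an `_R1` and an `_R2` file. The sketch below shows one way to check this before launching a run. The helper name `find_unpaired` and the sample names are hypothetical, and it assumes files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`.

```shell
#!/bin/sh
# Hypothetical helper: print samples whose R1 file has no matching R2 file.
# Returns 0 when every R1 has a mate, 1 otherwise.
find_unpaired() {
    dir=$1
    status=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                 # the glob matched nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"      # expected mate file
        if [ ! -e "$r2" ]; then
            basename "$r1" _R1.fastq.gz          # print the sample name
            status=1
        fi
    done
    return $status
}

# Demonstration with empty placeholder files (illustrative sample names).
tmp=$(mktemp -d)
touch "$tmp/beeA_R1.fastq.gz" "$tmp/beeA_R2.fastq.gz" "$tmp/beeB_R1.fastq.gz"
find_unpaired "$tmp" || echo "fix the samples listed above before running the pipeline"
rm -rf "$tmp"
```

The check only looks at file names, so it will not detect mates that exist but are truncated or empty.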
- It supports paired-end Illumina or single-end Illumina, PacBio and IonTorrent data.
- Analysis of 16S rRNA gene amplicons sequenced paired-end with Illumina is the default analysis.
- It runs with Conda, Docker, Podman, Shifter, Charliecloud or Singularity.
- Command for running the pipeline:
nextflow run nf-core/ampliseq -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input "path/to/data" --FW_primer "forward-primer-sequence" --RV_primer "reverse-primer-sequence" --metadata "path/to/metadata/file"
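To make the placeholders in the command above concrete, the sketch below assembles and prints the command without running it. The helper name `build_ampliseq_cmd`, the paths, and the primer pair are illustrative assumptions (the sequences shown are the commonly used 515F/806R primers for the 16S V4 region, not necessarily the ones used in this project); only the flags that appear in the command above are used.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) an nf-core/ampliseq command line.
# Arguments: profile, input path, forward primer, reverse primer, metadata file.
build_ampliseq_cmd() {
    printf 'nextflow run nf-core/ampliseq -profile %s --input "%s" --FW_primer "%s" --RV_primer "%s" --metadata "%s"\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example values are assumptions for illustration only.
build_ampliseq_cmd docker data/ GTGYCAGCMGCCGCGGTAA GGACTACNVGGGTWTCTAAT metadata.tsv
```

Printing the command first makes it easy to review the primers and paths before committing to a long run.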
- We had difficulties running this pipeline and, because it lacked an easy set-up, we dropped it.
This workflow comparison is only for the pipelines that we tested and found to be running without difficulties or with minimal difficulties.
Criteria | nf-core/ampliseq | TADA | MBBU Accreditation QIIME2 | MBBU Accreditation DADA2 | Yosef's pipeline |
---|---|---|---|---|---|
Runtime | 21 minutes | 46 minutes | # | # | # |
Setup | Easy, 1 command | Easy, 1 command | Hard, edit configs | Hard | Easy |
Documentation | Well documented | Well documented | Lacks setup instructions | Not well documented | Not yet documented |
Gaps | None | Test data, visualization | Test config, functional analysis | Not automated | Not available on GitHub and not automated |
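The text does not say how the runtimes in the table were measured. As one possible approach, the sketch below wraps any command and reports its wall-clock time. The helper name `time_run` is hypothetical, and `sleep 1` stands in for a pipeline command.

```shell
#!/bin/sh
# Hypothetical helper: run a command and report its wall-clock time on stderr.
# The wrapped command's exit code is passed through unchanged.
time_run() {
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    echo "elapsed: $(( end - start )) s (exit code $rc)" >&2
    return $rc
}

# Stand-in for a pipeline run; replace with the actual nextflow command.
time_run sleep 1
```

Nextflow can also record timing itself through its `-with-report` and `-with-trace` options, which give a per-process breakdown rather than a single total.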
We had many holidays and a long December break, which reduced the time available to work on our objectives.
There was no internet access for some time after the December holiday, which slowed our progress.
Some processes would not run because of insufficient disk space on the server.
The reads kept being filtered out and, with the internship period quickly coming to an end, we could not solve this. Also, the main objective was to find a good pipeline for running 16S data.
We would recommend doing the following:
- Nextflow automation of Yosef's pipeline
- Adding PICRUSt to Yosef's pipeline for functional analysis