Project Report

Introduction

The 16S rRNA gene encodes the RNA component of the small (30S) ribosomal subunit in bacteria and archaea. The "16S" refers to its sedimentation coefficient, measured in Svedberg units (Dependency Map of Proteins in the Small Ribosomal Subunit). The gene is essential: its product is required for initiating protein synthesis and for stabilizing correct codon-anticodon pairing in the A site of the ribosome during mRNA translation (The distribution, diversity, and importance of 16S rRNA gene introns in the order Thermoproteales). It is often called "the molecular fossil" of bacteria because it is highly conserved yet carries taxon-specific differences. This makes it the most widely used marker gene for genus and species identification and for taxonomic classification (16S/18S/ITS Amplicon Sequencing). The gene is about 1,500 bp long and consists of conserved and variable regions. The conserved regions are shared across bacteria, while the variable regions differ between taxa and therefore provide genus- and species-specific information (16S rRNA, One of the Most Important rRNAs).

The gene is widely used in microbiome analysis. Analysis pipelines built around tools such as QIIME2 and DADA2 have been developed and improved over the years. Our internship project required us to review some of the existing pipelines and determine which ones perform best. We were also to extend the workflows wherever we found gaps in the functionality expected of a good pipeline.

Objectives

  • To review existing microbiome workflows, identify the best ones, and extend them where there are gaps, especially to make them useful for insect and pathogen data.

Methods

  1. Testing workflows
  2. Identifying gaps
  • While testing the pipelines, we identified gaps using the following criteria:
    • How easy are they to set up and use? Do they provide accessible documentation and tutorials?
    • Are they fast and easily scalable based on available computing resources?
    • Can they scale to the cloud?
    • Can they be used on a variety of data, including insect and pathogen microbiomes?
    • Are they implemented using the latest specifications and versions of the tools? For example, does the pipeline use Nextflow DSL2 syntax and Docker or Singularity containers?
    • Are they well and regularly maintained? When were they last updated?
  • We also found gaps through the errors we came across.
  3. Extending workflows
  • We worked around the errors we encountered and extensively tested the final extended workflows using different datasets.

Results

Summary of work done

| WEEK | ACTIVITY |
|---|---|
| WEEK 1 | Obtaining test datasets |
| | Assessing workflow performance |
| WEEK 2 | Running and troubleshooting MBBU/16S-Accreditation, H3ABioNet-SOPs, H3ABioNet-TADA, H3ABioNet-16S, nf-core/ampliseq |
| WEEK 3 | Testing nf-core/ampliseq using stingless-bee data |
| | Running Yosef's DADA2 pipeline and MBBU/16S-Accreditation DADA2 pipeline |
| WEEK 4 | Obtaining more test data |
| | Testing the nf-core pipeline using different datasets (ITS data and 18S data) |
| | Solving week 3 errors and running Yosef's DADA2 pipeline, MBBU/16S-Accreditation DADA2 pipeline |
| WEEK 5 | Solving week 4 errors and running Yosef's DADA2 pipeline, MBBU/16S-Accreditation DADA2 pipeline |
| | Testing the MBBU/16S-Accreditation DADA2 pipeline using different datasets (stingless bee, dog, and nf-core/ampliseq data) |
| | Testing the nf-core pipeline using different datasets (ITS, PacBio, 18S, stingless bee microbiome, and IonTorrent data) |
| | Creating a test dataset, a test config and including flags for MBBU/16S-Accreditation QIIME2 pipeline |
| | Writing new documentation for MBBU/16S-Accreditation |
| WEEK 6 | Running Yosef's pipeline |
| | Viewing nf-core/ampliseq functional analysis results using STAMP |
| WEEK 7 | Report writing |

Pipelines' Functionalities

  • Some of the criteria are scored on the following scale:
    • 1 - Very good
    • 2 - Good
    • 3 - Fairly good
    • 4 - Bad
    • 5 - Very bad
  • Note that MBBU-16S_Accreditation-QIIME2 and MBBU-16S_Accreditation-DADA2 are in the same GitHub repository.
| Criteria | nf-core/ampliseq | H3ABioNet-SOPs | H3ABioNet-TADA | MBBU-16S_Accreditation-QIIME2 | MBBU-16S_Accreditation-DADA2 |
|---|---|---|---|---|---|
| Sequences | 16S, 18S, ITS | 16S | 16S, ITS | 16S | 16S |
| Tools and Databases | Tools, Databases | Tools, Databases | Not well defined | Tools, Databases | Tools, Databases |
| QIIME2 | Yes | Yes | No | Yes | No |
| DADA2 | No | No | Yes | No | Yes |
| Practice Dataset and Metadata | Available | Available | N/A | N/A | N/A |
| Versions | [9 Releases](https://github.com/nf-core/ampliseq/releases) | N/A (It is an SOP) | 1 Release | 1 Release | 1 Release |
| Command Arguments | Available | N/A (It is an SOP) | Available | N/A | N/A |
| Results | Available | N/A (It is an SOP) | N/A | Available | N/A (While running this R-script, some results are saved: script) |
| Documentation | 1 | For the guidelines provided (2), Navigating to the SOP (4) | 3 | 4 | 4 |
| Contributions and Support | Guidelines and Slack, 17 contributors, 75 stars, 50 forks | N/A | 7 contributors, 10 stars, 11 forks | 9 contributors, 1 star, 2 forks | 9 contributors, 1 star, 2 forks |
| Running on cloud | Yes (Results from AWS Cloud) | N/A (It is an SOP) | Yes (AWS configs) | N/A | N/A |
| Languages | Codes | N/A (It is an SOP) | Codes | Codes | Codes |
| Issues | 366 | N/A (It is an SOP) | 32 | 10 | 10 |
| Last updated | October 2021 | February 2019 | September 2021 | April 2021 | April 2021 |
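
The 1 to 5 scale used in the Documentation row can be written down as a small lookup, which makes it easy to sort pipelines by score. This is only a minimal Python sketch: the variable and function names are illustrative, and the H3ABioNet-SOPs entry uses its "guidelines" score (2) rather than its navigation score (4), an assumption made here to keep one number per pipeline.

```python
# Rating scale from the report: lower is better.
RATING_LABELS = {
    1: "Very good",
    2: "Good",
    3: "Fairly good",
    4: "Bad",
    5: "Very bad",
}

# Documentation scores copied from the table above.
# Assumption: H3ABioNet-SOPs is represented by its guidelines score (2),
# not by its navigation score (4).
DOCUMENTATION_SCORES = {
    "nf-core/ampliseq": 1,
    "H3ABioNet-SOPs": 2,
    "H3ABioNet-TADA": 3,
    "MBBU-16S_Accreditation-QIIME2": 4,
    "MBBU-16S_Accreditation-DADA2": 4,
}


def ranked_by_documentation(scores):
    """Return (pipeline, score, label) tuples, best score first."""
    ordered = sorted(scores.items(), key=lambda item: item[1])
    return [(name, score, RATING_LABELS[score]) for name, score in ordered]


if __name__ == "__main__":
    for name, score, label in ranked_by_documentation(DOCUMENTATION_SCORES):
        print(f"{name}: {score} ({label})")
```

Keeping the labels separate from the scores means the same lookup could be reused for any other criterion scored on this scale.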

More information on individual pipelines

H3ABionet-SOPs/16s-rRNA-1-0.html

  • The SOP includes questions on operation, run time, and output analysis that could serve as criteria when reviewing workflows.
  • It would be a good guide for creating a QIIME2 pipeline from scratch.

h3abionet/TADA

  • This pipeline performs Targeted Amplicon Diversity Analysis (TADA) using DADA2 and is implemented in Nextflow.
  • The typical command for running the pipeline is:
    nextflow run uct-cbio/16S-rDNA-dada2-pipeline --reads '*_R{1,2}.fastq.gz' --trimFor 24 --trimRev 25 --reference 'gg_13_8_train_set_97.fa.gz' -profile uct_hex
  • It outputs results mostly in RDS format.
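
The `--reads '*_R{1,2}.fastq.gz'` pattern in the command above matches forward (`_R1`) and reverse (`_R2`) files that share a sample prefix. The Python sketch below is not part of TADA. It only illustrates that pairing logic on a list of file names, and the function name and example file names are hypothetical.

```python
import re
from collections import defaultdict

# Matches names such as "sampleA_R1.fastq.gz" and captures the sample ID
# and the read number (1 = forward, 2 = reverse).
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def pair_reads(filenames):
    """Group file names into {sample: (R1, R2)}, keeping complete pairs only."""
    grouped = defaultdict(dict)
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:
            grouped[match["sample"]][match["read"]] = name
    return {
        sample: (reads["1"], reads["2"])
        for sample, reads in sorted(grouped.items())
        if "1" in reads and "2" in reads
    }


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = [
        "beeA_R1.fastq.gz",
        "beeA_R2.fastq.gz",
        "beeB_R1.fastq.gz",  # reverse read missing, so this sample is skipped
    ]
    print(pair_reads(files))
```

Running a check like this on a data directory before launching the pipeline can reveal samples with a missing mate or files that do not follow the expected naming.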

nf-core/ampliseq pipeline

  • It supports paired-end Illumina or single-end Illumina, PacBio and IonTorrent data.
  • Analysis of 16S rRNA gene amplicons sequenced paired-end with Illumina is the default analysis.
  • It runs with Conda, Docker, Podman, Shifter, Charliecloud or Singularity.
  • Command for running the pipeline:
    nextflow run nf-core/ampliseq -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input "path/to/data" --FW_primer "forward-primer-sequence" --RV_primer "reverse-primer-sequence" --metadata "Path/to/metadata/file"
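
To make the placeholders in this command concrete, the Python sketch below assembles the same command line from variables and prints it without running anything. Only the flags already shown in the command are used. The function name, the example paths, and the primer sequences (the commonly used 515F/806R pair, chosen purely for illustration) are assumptions.

```python
import shlex


def build_ampliseq_command(profile, input_path, fw_primer, rv_primer, metadata):
    """Return the nf-core/ampliseq command as a list of arguments."""
    return [
        "nextflow", "run", "nf-core/ampliseq",
        "-profile", profile,
        "--input", input_path,
        "--FW_primer", fw_primer,
        "--RV_primer", rv_primer,
        "--metadata", metadata,
    ]


if __name__ == "__main__":
    # All values below are illustrative placeholders.
    command = build_ampliseq_command(
        profile="docker",
        input_path="path/to/data",
        fw_primer="GTGYCAGCMGCCGCGGTAA",
        rv_primer="GGACTACNVGGGTWTCTAAT",
        metadata="path/to/metadata.tsv",
    )
    # shlex.join quotes arguments where needed so the line can be pasted into a shell.
    print(shlex.join(command))
```

Building the command as a list avoids quoting mistakes such as an unbalanced `"` at the end of the line.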
    

Image of how it runs and output expected

MBBU 16S-Accreditation

DADA2 MBBU 16S-Accreditation pipeline

  • Steps in the pipeline: DADA2 MBBU 16S-Accreditation pipeline

QIIME MBBU 16S-Accreditation Pipeline

  • It is summarized as follows: [image]

H3ABioNet-16S

  • We had difficulty running this pipeline and, because it lacked an easy set-up, we dropped it from further testing.

Workflow Comparison

This comparison covers only the pipelines that we tested and that ran with no or only minimal difficulties.

| Criteria | nf-core/ampliseq | TADA | MBBU Accreditation QIIME2 | MBBU Accreditation DADA2 | Yosef's pipeline |
|---|---|---|---|---|---|
| Runtime | 21 minutes | 46 minutes | # | # | # |
| Setup | Easy, 1 command | Easy, 1 command | Hard, edit configs | Hard | Easy |
| Documentation | Well documented | Well documented | Lacks setup instructions | Not well documented | Not yet documented |
| Gaps | None | Test data, visualization | Test config, functional analysis | Not automated | Not available on GitHub and not automated |

Challenges

Time

We had many holidays and a long December break, which reduced the time available to work on our objectives.

Internet

There was no internet access for some time after the December holiday, which slowed the progress of our work.

Server space

Some processes would not run because there was insufficient disk space on the server.

Running the nf-core/ampliseq pipeline using 18S data

The reads kept being filtered out and, with the internship period coming to an end, we could not resolve this. In addition, the main objective was to find a good pipeline for 16S data.
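
The cause of the filtering was not established. One check that could help narrow it down is counting how many reads actually begin with the expected forward primer, since reads without a recognisable primer can be discarded during primer trimming. The Python sketch below illustrates that check and was not run as part of this project. The function names, the example reads, and the primer shown are placeholders, and it assumes the primer sits at the very start of each read.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(f"[{IUPAC[base]}]" for base in primer.upper()))


def fraction_with_primer(reads, primer):
    """Return the fraction of reads that start with the primer."""
    if not reads:
        return 0.0
    pattern = primer_to_regex(primer)
    # re.match only succeeds at the beginning of the read.
    hits = sum(1 for read in reads if pattern.match(read.upper()))
    return hits / len(reads)


if __name__ == "__main__":
    # Placeholder reads and primer, for illustration only.
    example_reads = ["GTGCCAGCAGCCGCGGTAATTCC", "ACGTACGTACGTACGTACGT"]
    print(fraction_with_primer(example_reads, "GTGYCAGCMGCCGCGGTAA"))
```

A very low fraction would suggest that the primers passed to the pipeline do not match the ones used to generate the data.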

Recommendations

We would recommend doing the following:

  • Nextflow automation of Yosef's pipeline
  • Adding PICRUSt to Yosef's pipeline for functional analysis