Project Report

Introduction

The 16S rRNA gene encodes the RNA component of the small (30S) ribosomal subunit in bacteria and archaea. "16S" refers to the sedimentation coefficient of this RNA (Dependency Map of Proteins in the Small Ribosomal Subunit). The gene is essential: its product is required for initiating protein synthesis and for stabilizing correct codon-anticodon pairing in the A site of the ribosome during mRNA translation (The distribution, diversity, and importance of 16S rRNA gene introns in the order Thermoproteales). It is called "the molecular fossil" of bacteria because it is highly conserved yet specific. This makes it the most widely used marker gene for genus and species identification and for taxonomy (16S/18S/ITS Amplicon Sequencing). The gene is about 1500 bp long and is composed of conserved and variable regions. The conserved regions are shared across bacteria, while the variable regions differ among them and therefore provide genus- and species-specific information (16S rRNA, One of the Most Important rRNAs).

The gene is widely used in microbiome analysis. Analysis pipelines have been developed and improved over the years, including pipelines based on QIIME2 and DADA2. Our internship project required us to review some of the existing pipelines and conclude which ones are best. We were also to extend the workflows by filling any gaps found in the functionality of an otherwise good pipeline.

Objectives

To review existing microbiome workflows, identify the best-performing ones, and extend them where there are gaps, especially to make them useful for insect and pathogen data.

Methods

Testing workflows

We tested each pipeline against the functionality described in its documentation.
We used different datasets in running the pipelines.

Identifying gaps

While testing the pipelines, we identified gaps using the following criteria:
- How easy are they to set up and use? Do they provide accessible documentation and tutorials?
- Are they fast and easily scalable based on available computer resources?
- Can they scale to the cloud?
- Can they be used on a variety of data, including insect and pathogen microbiome data?
- Do they follow the latest specifications and tool versions? For example, does the pipeline use Nextflow DSL2 syntax and Docker or Singularity containers?
- Are they well and regularly maintained? When were they updated last?
We were also able to find gaps from the errors we came across.
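
Some of these criteria can be checked mechanically on a local clone of a pipeline. The following is an illustrative sketch rather than the exact procedure used in this review: the helper names are hypothetical, and it assumes that the pipeline's .nf scripts and nextflow.config sit at the top level of the clone. Newer Nextflow releases use DSL2 by default, so a missing declaration does not prove that a pipeline is written in DSL1.

```shell
#!/usr/bin/env bash
# Heuristic checks on a local clone of a Nextflow pipeline.
# All function names here are hypothetical helpers, not part of any tool.

# Succeeds if a top-level .nf file explicitly enables DSL2.
declares_dsl2() {
    grep -Eqs 'nextflow\.enable\.dsl[[:space:]]*=[[:space:]]*2' "$1"/*.nf
}

# Succeeds if nextflow.config mentions Docker or Singularity.
mentions_containers() {
    grep -Eqis 'docker|singularity' "$1/nextflow.config"
}

# Prints the date of the most recent commit (needs a git clone).
last_commit_date() {
    git -C "$1" log -1 --format=%cd --date=short
}

# Prints a short summary for the clone at the given path.
report() {
    local dir="$1"
    declares_dsl2 "$dir" && echo "DSL2 declared: yes" || echo "DSL2 declared: no"
    mentions_containers "$dir" && echo "Containers mentioned: yes" || echo "Containers mentioned: no"
    echo "Last commit: $(last_commit_date "$dir" 2>/dev/null || echo unknown)"
}

# Usage: pass the path to a clone, or run from inside one.
report "${1:-.}"
```

A "no" from either check is only a prompt to read the pipeline's documentation more closely, not a verdict on the pipeline.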

Extending workflows

We worked around the errors we got and extensively tested the final extended workflows using different sets of data.

Results

The new MBBU/16S-Accreditation DADA2 pipeline
Functional analysis using nf-core/ampliseq

Summary of work done

WEEK	ACTIVITY
WEEK 1	Obtaining test datasets
	Assessing workflow performance
WEEK 2	Running and troubleshooting MBBU/16S-Accreditation, H3ABioNet-SOPs, H3ABioNet-TADA, H3ABioNet-16S, nf-core/ampliseq
WEEK 3	Testing the nf-core/ampliseq pipeline using stingless-bee data
	Running Yosef's DADA2 pipeline and MBBU/16S-Accreditation DADA2 pipeline
WEEK 4	Obtaining more test data
	Testing the nf-core pipeline using different datasets (ITS data and 18S data)
	Solving week 3 errors and running Yosef's DADA2 pipeline, MBBU/16S-Accreditation DADA2 pipeline
WEEK 5	Solving week 4 errors and running Yosef's DADA2 pipeline, MBBU/16S-Accreditation DADA2 pipeline
	Testing the MBBU/16S-Accreditation DADA2 pipeline using different datasets (stingless bee, dog, and nf-core/ampliseq data)
	Testing the nf-core pipeline using different datasets (ITS, PacBio, 18S, stingless bee microbiome, and IonTorrent data)
	Creating a test dataset, a test config and including flags for MBBU/16S-Accreditation QIIME2 pipeline
	Writing new documentation for MBBU/16S-Accreditation
WEEK 6	Running Yosef's pipeline
	Viewing nf-core/ampliseq functional analysis results using STAMP
WEEK 7	Report writing

Pipelines' Functionalities

Some of the criteria are rated on the following scale:
- 1 - Very good
- 2 - Good
- 3 - Fairly good
- 4 - Bad
- 5 - Very bad
It is important to note that MBBU-16S_Accreditation-QIIME2 and MBBU-16S_Accreditation-DADA2 are in the same GitHub repository.

Criteria	nf-core/ampliseq	H3ABioNet-SOPs	H3ABioNet-TADA	MBBU-16S_Accreditation-QIIME2	MBBU-16S_Accreditation-DADA2
Sequences	16S, 18S, ITS	16S	16S, ITS	16S	16S
Tools and Databases	Tools, Databases	Tools, Databases	Not well defined	Tools, Databases	Tools, Databases
QIIME2	Yes	Yes	No	Yes	No
DADA2	No	No	Yes	No	Yes
Practice Dataset and Metadata	Available	Available	N/A	N/A	N/A
Versions	9 Releases (https://github.com/nf-core/ampliseq/releases)	N/A (It is an SOP)	1 Release	1 Release	1 Release
Command Arguments	Available	N/A (It is an SOP)	Available	N/A	N/A
Results	Available	N/A (It is an SOP)	N/A	Available	N/A (While running this R script, some results are saved: script)
Documentation	1	2 (guidelines provided), 4 (navigating to the SOP)	3	4	4
Contributions and Support	Guidelines and Slack, 17 contributors, 75 stars, 50 forks	N/A	7 contributors, 10 stars, 11 forks	9 contributors, 1 star, 2 forks	9 contributors, 1 star, 2 forks
Running on cloud	Yes (results from AWS Cloud)	N/A (It is an SOP)	Yes (AWS configs)	N/A	N/A
Languages	Codes	N/A (It is an SOP)	Codes	Codes	Codes
Issues	366	N/A (It is an SOP)	32	10	10
Last updated	October 2021	February 2019	September 2021	April 2021	April 2021

More information on individual pipelines

H3ABionet-SOPs/16s-rRNA-1-0.html

The SOP includes questions on operation, run time and output analysis that could serve as criteria when reviewing workflows.
It would be a good SOP to guide one in creating a QIIME2 pipeline from scratch.

h3abionet/TADA

This pipeline performs Targeted Amplicon Diversity Analysis (TADA) using DADA2 and is implemented in Nextflow.
The typical command for running the pipeline is:

nextflow run uct-cbio/16S-rDNA-dada2-pipeline --reads '*_R{1,2}.fastq.gz' --trimFor 24 --trimRev 25 --reference 'gg_13_8_train_set_97.fa.gz' -profile uct_hex

It outputs results mostly in RDS format.

nf-core/ampliseq pipeline

It supports paired-end Illumina or single-end Illumina, PacBio and IonTorrent data.
The default analysis is for 16S rRNA gene amplicons sequenced paired-end with Illumina.
It runs with Conda, Docker, Podman, Shifter, Charliecloud or Singularity.

Command for running the pipeline:

nextflow run nf-core/ampliseq -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input "path/to/data" --FW_primer "forward-primer-sequence" --RV_primer "reverse-primer-sequence" --metadata "path/to/metadata/file"
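
When the same command is run on several datasets, it can help to assemble it from variables and print it for review before launching. The sketch below only prints the command and does not run Nextflow. The function name is hypothetical and every value in the example call is a placeholder; the flags are the ones shown in the command above.

```shell
#!/usr/bin/env bash
# Assemble the nf-core/ampliseq command from its parts and print it.
# build_ampliseq_cmd is a hypothetical helper; nothing is executed here.

build_ampliseq_cmd() {
    local profile="$1" input="$2" fw_primer="$3" rv_primer="$4" metadata="$5"
    printf 'nextflow run nf-core/ampliseq -profile %s --input "%s" --FW_primer "%s" --RV_primer "%s" --metadata "%s"\n' \
        "$profile" "$input" "$fw_primer" "$rv_primer" "$metadata"
}

# Example call with placeholder values; replace them with real paths and primers.
build_ampliseq_cmd docker "path/to/data" "FORWARD_PRIMER" "REVERSE_PRIMER" "path/to/metadata.tsv"
```

Once the printed command looks right, it can be copied into the terminal or a job script for the chosen dataset.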

MBBU 16S-Accreditation

DADA2 MBBU 16S-Accreditation pipeline

Steps in the pipeline:

QIIME MBBU 16S-Accreditation Pipeline

It is summarized as follows:

h3abionet16S

We had difficulties running this pipeline and, because it lacks an easy setup, we dropped it from further testing.

Workflow Comparison

This comparison covers only the pipelines that we tested and found to run with minimal or no difficulty.

Criteria	nf-core/ampliseq	TADA	MBBU Accreditation QIIME2	MBBU Accreditation DADA2	Yosef's pipeline
Runtime	21 minutes	46 minutes	#	#	#
Setup	Easy, 1 command	Easy, 1 command	Hard, edit configs	Hard	Easy
Documentation	Well documented	Well documented	Lacks setup instructions	Not well documented	Not yet documented
Gaps	None	Test data, visualization	Test config, functional analysis	Not automated	Not available on GitHub, not automated

Challenges

Time

We had many holidays and a long December break, which reduced the time available to work on our objectives.

Internet

There was no internet access for some time after the December holiday, which slowed the progress of our work.

Server space

Some processes would not run because of limited storage space on the server.

Running the nf-core/ampliseq pipeline using 18S data

The reads kept being filtered out and, with the internship period coming to an end, we could not resolve this. In addition, the main objective was to find a good pipeline for 16S data.

Recommendations

We would recommend doing the following:

Nextflow automation of Yosef's pipeline
Adding PICRUSt to Yosef's pipeline for functional analysis
