V. Developer instructions
PEWO mixes: a) python packages, b) snakemake rules and c) some accessory scripts in other languages.
All these components are to be placed in specific directories. Directories important to the addition of a new placement software are highlighted by * .
Current PEWO source tree:
./PEWO_workflow
|
|_ demos pre-configured demonstration workflows, used in tutorials
|
|_ envs conda environment definitions
|
|_ rules snakemake rules
| |_alignment rules calling alignment software
| |_op rules of other operations (pruning, stats, plots...)
| |_placement * rules calling placement software
| |_utils * rules input/output building functions
|
|_ pewo * pure python sub-packages developed for PEWO
| |_ alignment alignment functions
| |_ io input files functions
| |_ likelihood likelihood procedure functions
|
|_ scripts non-python scripts used in PEWO
| |_ R R scripts, (results plots)
| |_ java java sources, (tree manipulation)
|
|_ dataset benchmark datasets (aggregated from placement papers)
- New PEWO functionalities are to be coded in python 3.
- snakemake rules will be placed in rules/ and python 3 packages in pewo/ .
- The scripts/ directory is intended to rapidly import existing pieces of code that were initially written in other languages. For instance, tree pruning functions were initially coded in java (repository PEWO_java) and are currently compiled and installed in scripts/java/ . With time, these functions will ultimately be rewritten in pure python and moved to a pewo/ sub-package.
- Adding new dependencies in PEWO is done via the conda and pip package managers. Their repositories are becoming a standard in many applications where reproducibility is key. Any useful bioinformatic tool will rapidly end up there.
- Rules based on binaries or sources requiring Python 2 can be executed in PEWO. To do so, do not write any python 2 code, but use python 2 based environments in the corresponding snakemake rules (see snakemake integrated package management). The corresponding environment definitions will be stored in envs/ .
- PEWO is not intended to be a dataset repository; the demos/ and datasets/ directories are there mostly for demonstration purposes and for sharing some of the datasets used in our manuscripts. However, if you build new workflows targeting new phylogenetic placement applications, or even new demos for those applications, you are welcome to add new datasets (PEWO contains mostly examples related to taxonomic markers and short viral genomes).
Early development stages
Before reaching the stage of published software, you are likely to use hardcoded paths to your binary. Before launching a PEWO workflow, load the PEWO environment, then add the directory containing your binary to PATH. For instance, the APPLES main script can be targeted with:
conda activate PEWO
export PATH=$PATH:/path/to/apples   # the directory that contains run_apples.py
Via software repositories
PEWO aims, among other things, to facilitate benchmarking with different genetic markers and testing of the latest phylogenetic placement solutions. Combining PEWO with software repositories removes the hurdle of setting up different software prior to the analyses themselves.
PEWO dependencies are controlled via conda and pip, with several advantages:
- once set via a conda environment, future users will not need to care about the installation of the phylogenetic placement software; it will be automated at PEWO installation.
- by default, the most up-to-date version will always be installed in PEWO.
- conversely, you can manually set a specific phylogenetic placement software version.
Conda example
If your software is available on a conda channel, just add the corresponding channel and the package name to the PEWO environment definition, which is located in envs/environment.yaml . For instance, adding RAPPAS required adding the following lines (APPLES is currently not available on any conda channel):
name: PEWO
channels:
- bioconda
- [...]
dependencies:
- rappas
- [...]
pip example
APPLES can be installed as a pip package. Conda allows pip environment encapsulation, so the package will be indirectly installed in the conda environment. For instance, adding APPLES requires:
name: PEWO
channels:
- [...]
dependencies:
- [...]
- pip:
  - apples
- [...]
Software name, software parameters names and directory names are controlled throughout PEWO sources. Before building a snakemake rule for your placement software, you need to determine several names that:
- follow the naming convention described in the table below
- are not already reserved for other placement software already supported in PEWO
After addition to PEWO, these names will be reserved, used throughout the workflows and associated to your software.
category | rules | regexp | allowed | forbidden | already reserved in PEWO |
---|---|---|---|---|---|
software | lowercase, not numerical, no special characters | [a-z]+ | mysoft | my_soft, Mysoft, mysoft1, my-soft, ... | epa, epang, pplacer, rappas, apples, appspam |
software parameter | lowercase, not numerical, no special characters | [a-z]+ | size, wx, l, p, set, ... | Size, w1, l-p, p-34, ... | g, h, bigg, ms, sb, mp, k, o, red, ar, m, c, mode, w, pattern |
Note that only software parameters having an impact on placement results deserve to be reserved. Parameters related, for instance, to verbosity, input/output formats or threads have no such impact, and their combinations should not be tested in PEWO workflows.
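The naming convention of the table can be checked mechanically before you settle on a name. The following is a minimal sketch, not part of PEWO: the function name check_name and the two sets are hypothetical, and the reserved names are simply copied from the table above.

```python
import re

# Pattern from the naming table: one or more lowercase letters, nothing else.
NAME_PATTERN = re.compile(r"[a-z]+")

# Names listed as "already reserved in PEWO" in the table (copied by hand).
RESERVED_SOFTWARE = {"epa", "epang", "pplacer", "rappas", "apples", "appspam"}
RESERVED_PARAMETERS = {"g", "h", "bigg", "ms", "sb", "mp", "k", "o",
                       "red", "ar", "m", "c", "mode", "w", "pattern"}


def check_name(name: str, reserved: set) -> str:
    """Hypothetical helper: return 'ok', 'invalid' or 'reserved' for a candidate name."""
    if NAME_PATTERN.fullmatch(name) is None:
        return "invalid"   # breaks the [a-z]+ convention
    if name in reserved:
        return "reserved"  # already associated to a supported software
    return "ok"


for candidate in ["mysoft", "my_soft", "Mysoft", "mysoft1", "rappas"]:
    print(candidate, "->", check_name(candidate, RESERVED_SOFTWARE))
```

A "reserved" parameter name is not necessarily forbidden: as the EPA-ng example shows, a parameter with exactly the same meaning as an existing one can reuse its name.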
Some examples:
- EPA: An alignment-based method. While the alignment method will impact the placement, let's focus on the placement itself.
  The -g command-line option defines the proportion of branches that will be fully optimized, and it has an impact on placement results. We reserved the name "epa" and the parameter "g".
- EPA-ng: An alignment-based method. While the alignment method has an impact on placement, let's focus on the placement itself.
  It offers 4 modes, 3 of them being heuristics to accelerate placement. We reserved the name "epang" and the parameter "h" (heuristic). Then:
  - heuristic 1 is a re-implementation of the EPA approach, controlled by command-line option -g . As this parameter is exactly the same as in EPA, we do not need to reserve a novel parameter name. "g" is already reserved.
  - heuristic 2 is a novel, faster alternative, search depth being controlled by option -G . We reserve parameter name "bigg" (read "big G").
  - heuristic 3 is a re-implementation of the pplacer approach; however, its parameters cannot be tuned via command-line. No parameter name is reserved.
  - heuristic 4 is, in fact, "no heuristics" (heuristics deactivated). No parameter name to reserve.
  With this approach, we can test the impact of -g and -G parameter values (in heuristics 1 and 2 respectively), but also produce statistics comparing EPA-ng to itself across its different heuristics.
- RAPPAS: An alignment-free, phylo-kmer based method, working in two phases: database (DB) construction and placement itself.
  We reserve the name "rappas", but several steps of its algorithm have an impact on its placements:
  - reference alignment columns filtering (has an impact because this is a k-mer based approach): controlled by option --ratio-reduction. We reserved the parameter "red".
  - which software is used for ancestral reconstruction: controlled by option -b. To be more explicit, we associated and reserved the parameter name "ar". Note that, in theory, Phyml, RAxML and PaML should produce similar ancestral reconstructions, as they are supposed to exploit the same algorithms. Still, some minor differences in implementation and heuristics may impact the results.
  - parameters related to phylo-kmer database construction: options -k and -o . We reserved the parameter names "k" and "o".
At the end, we reserved the following set of "software name" | "parameter names" :
- epa | g
- epang | h, g, bigg
- rappas | red, ar, k, o
Adding APPLES
APPLES provides two parameters which are important for placements:
- weighted least squares method, option -m , with possible values being "OLS", "FM" & "BE".
- placement criterion, option -c , with possible values being "MLSE", "ME" & "HYBRID".
Consequently, we decide to reserve the following names:
- apples | m, c
PEWO loads workflow configuration from a single file, config.yaml , located at the root of the PEWO package. Following the model of the file already available in PEWO, add a section for your software. This section should briefly describe the software, which parameters impact placement results and which values should be associated to these parameters.
Some rules:
- The config.yaml file is intended to be readable by anyone, not only bioinformaticians or computer scientists! Do not forget to cite software!
- Parameter names in config.yaml do not necessarily match the parameter names reserved previously (they often don't). By all means, they should be human-readable and intuitive. The association between these "human-readable" versions and the "reserved parameter names" defined in the previous section will be set in the next section.
- Parameter values have to match the [A-Z0-9.]+ pattern, meaning one or more characters taken only from uppercase letters, digits and the dot ( . ). In practice, this means that values are either a string, an int or a float, as you would write them in a command-line.
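The value pattern can be checked with a few lines of Python before launching a workflow. This is a minimal sketch under assumptions: invalid_values is a hypothetical helper that is not part of PEWO, and the section is given as a plain dict, as it would look once config.yaml has been loaded.

```python
import re

# Pattern stated in the rule above: uppercase letters, digits and dots only.
VALUE_PATTERN = re.compile(r"[A-Z0-9.]+")


def invalid_values(section: dict) -> list:
    """Hypothetical helper: list the (parameter, value) pairs breaking the pattern."""
    bad = []
    for param, values in section.items():
        for value in values:
            # Numbers are compared in their written form, e.g. 0.01 -> "0.01".
            if VALUE_PATTERN.fullmatch(str(value)) is None:
                bad.append((param, value))
    return bad


# A section with one deliberate mistake (lowercase "be").
section = {"methods": ["OLS", "be"], "criteria": ["MLSE", "ME", "HYBRID"]}
print(invalid_values(section))
```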
** Example **
Short version:
config_epa:
  #float in ]0,1]
  G: [0.01]
[...]
config_apples:
  #methods available via option -m : [OLS,FM,BE]
  methods: [OLS,BE]
  #criteria available via option -c : [MLSE,ME,HYBRID]
  criteria: [MLSE,ME,HYBRID]
Final, explicit version:
config_epa:
  #EPA is alignment-based and uses a ML evaluation of the placement.
  #it uses a 2-step heuristic:
  # 1) rapid ML evaluation after insertion in the midpoint of each branch
  # 2) full optimization for top scoring branch selected at step 1.
  #(Berger et al, 2011 ; doi: 10.1093/sysbio/syr010)
  #proportion of top scoring branch for which full optimization is computed
  #float in ]0,1]
  G: [0.01]
[...]
config_apples:
  #apples placements are based on distance computations between the query and the reference tree
  #it allows different "methods" to compute these distances and different "criteria" to select the best placement.
  #(Balaban et al, 2019 ; doi: 10.1093/sysbio/syz063)
  #List of weighted least squares methods to test.
  #Possible values are:
  # OLS: k=0 ordinary least square (Cavalli-Sforza and Edwards 1967)
  # FM : k=2 (Fitch and Margoliash, 1967)
  # BE : k=1 (Beyer et al., 1974)
  #methods: ["OLS","FM","BE"]
  #!warning, be sure to set methods VALUES as UPPER CASE
  methods: [OLS,BE]
  #List of placement criteria to test.
  #Possible values are:
  # MLSE: Least Squares Phylogenetic Placement
  # ME : Minimum Evolution
  # HYBRID : MLSE then ME
  #criteria: ["MLSE","ME","HYBRID"]
  #!warning, be sure to set criteria VALUES as UPPER CASE
  criteria: [MLSE,ME,HYBRID]
Here we explicitly write that "apples" placements depend on "methods" and "criteria", and that each of them accepts a precise set of values. Future users can get an idea of which parameters can be tested just by reading the config.yaml file.
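To get a feeling for how many runs such a section implies, the value lists can be expanded into explicit combinations. The sketch below assumes that every listed value of one parameter is combined with every value of the others (a Cartesian product); parameter_combinations is a hypothetical helper, and the dict mirrors the short config_apples example rather than being read from a real file.

```python
from itertools import product

# Mirrors the short config_apples example (written as a dict, not loaded from YAML).
config = {
    "config_apples": {
        "methods": ["OLS", "BE"],
        "criteria": ["MLSE", "ME", "HYBRID"],
    }
}


def parameter_combinations(section: dict) -> list:
    """Hypothetical helper: one dict per combination of the listed values."""
    names = list(section)
    value_lists = [section[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combos = parameter_combinations(config["config_apples"])
print(len(combos), "combinations")
for combo in combos:
    print(combo)
```

With 2 methods and 3 criteria this yields 6 combinations, which is worth keeping in mind when choosing how many values to list per parameter.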
Now that names are reserved for your placement software and the config file sets which values can be tested, it is time to register your software in PEWO templates and build a Snakemake rule automating the launch of APPLES in the different PEWO workflows.
pewo/templates.py provides general functions to generate the output file names produced by these workflows. To incorporate your software in PEWO, update the following functions.
Add your software to pewo/software.py (respect the UPPERCASE = "lowercase" approach):
class PlacementSoftware(Enum):
    EPA = "epa"
    EPANG = "epang"
    [...]
    APPLES = "apples"
You then need to update a few functions in pewo/templates.py .
First, we need to describe the naming convention for new files produced by the software. Each run of the software normally takes a query and a set of parameters as input, and produces an output file. In PEWO, we use the following naming convention for output files:
SOFTWARE/CommonArg1/CommonArg2{value}/.../CommonArg1_CommonArg2{value}_..._SpecificArg1{value}_SpecificArg2{value}_SOFTWARE.EXTENSION
get_output_template_args describes the software arguments used to form a run of the software, and how to retrieve them from the human-readable config representation. By default, every run takes common arguments (referred to as CommonArgX in the example above, also see get_common_template_args ) and software-specific arguments (referred to as SpecificArgX in the example). Add specific arguments related to your software, like in the example below:
def get_output_template_args(config: Dict, software: PlacementSoftware, **kwargs) -> Dict[str, Any]:
    if software == PlacementSoftware.EPA:
        # ...
    elif software == PlacementSoftware.APPLES:
        template_args["meth"] = config["config_apples"]["methods"]
        template_args["crit"] = config["config_apples"]["criteria"]
        # Add other arguments here if needed. Note that you can create a new name convention
        # for the arguments here, or just use the same one used in the config:
        #
        # template_args.update(config["config_apples"])
Next, you need to update get_queryname_template , which generates a name for every unique combination of software parameters:
def get_queryname_template(config: Dict, software: PlacementSoftware, **kwargs) -> str:
    # ...
    if software == PlacementSoftware.EPA:
        # ...
    elif software == PlacementSoftware.APPLES:
        return get_common_queryname_template(config) + "_meth{meth}_crit{crit}"
In this example, meth and crit are the keys of the dictionary described in get_output_template_args .
We also need to describe the output directory name by updating get_experiment_dir_template :
def get_experiment_dir_template(config: Dict, software: PlacementSoftware, **kwargs) -> str:
    # ...
    if software == PlacementSoftware.EPA:
        # ...
    elif software == PlacementSoftware.APPLES:
        # corresponds to 'workdir/APPLES/{pruning}/meth{meth}_crit{crit}/'
        return os.path.join(software_dir, input_set_dir_template, "meth{meth}_crit{crit}")
After updating this, you will be able to use get_output_template to generate output file names for your rule (see the next section).
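To see what these templates resolve to, here is a standalone sketch that fills them for one parameter combination. It does not call any PEWO function: the template strings are modelled on the examples above, while the "{pruning}" common prefix, the "workdir" and "APPLES" directory names and the jplace_path helper are assumptions made for illustration. In a real workflow, Snakemake fills these placeholders through wildcards rather than through str.format.

```python
import os

# Hypothetical stand-in for what get_common_queryname_template(config) returns.
common_queryname = "{pruning}"

# Templates modelled on the APPLES examples above.
queryname_template = common_queryname + "_meth{meth}_crit{crit}"
dir_template = os.path.join("workdir", "APPLES", "{pruning}", "meth{meth}_crit{crit}")


def jplace_path(pruning: str, meth: str, crit: str) -> str:
    """Hypothetical helper: build DIRECTORY/QUERYNAME_SOFTWARE.EXTENSION for one run."""
    values = {"pruning": pruning, "meth": meth, "crit": crit}
    directory = dir_template.format(**values)
    # The file name ends with the software name and the extension,
    # following the naming convention described earlier.
    filename = queryname_template.format(**values) + "_apples.jplace"
    return os.path.join(directory, filename)


print(jplace_path("0", "OLS", "MLSE"))
```

Keeping the parameter values in both the directory name and the file name means that each combination gets its own output file, so results of different runs never overwrite each other.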
Most of your work is there: building a functional snakemake rule.
A good approach is to start from one of the existing rules found in rules/placement . Just copy-paste a rule from an already supported software and adapt it to yours.
Create the placement rule itself. The way you design the rule will mostly depend on the algorithm behind your placement software. The following inputs and outputs are managed at PEWO level (see @results_structure):
- input alignment
- input tree
- input queries
- input alignment+queries (via hmmalign)
- output jplace
You may need more than one rule, in particular if there are intermediary steps, as in the case of RAPPAS, which requires launching an external ancestral reconstruction software and building a database.
APPLES, however, is a simple case: a single command-line takes a) as input, the alignment containing both the reference alignment and the queries aligned to it, and b) as output, the jplace result.
The rule follows this template:
from pewo.software import PlacementSoftware
from pewo.templates import get_output_template, get_log_template

rule placement_apples:
    input:
        # a function or list of files (the names below are placeholders)
        # r and q are the reference and query alignments read by options -s and -q
        r = path_to_reference_alignment,
        q = path_to_query_alignment,
        t = path_to_reference_tree
    output:
        # generate .jplace output file name based on the input parameters
        jplace = get_output_template(config, PlacementSoftware.APPLES, "jplace")
    log:
        # generate a log file name based on the input parameters
        get_log_template(config, PlacementSoftware.APPLES)
    version: "1.0"
    params:
        # any additional parameters if needed
        # param1 = ...
        # param2 = ...
    shell:
        # the command-line itself, generalised using previous fields
        """
        run_apples.py -s {input.r} -q {input.q} -t {input.t} -T 1 -m {wildcards.meth} -c {wildcards.crit} -o {output.jplace} &> {log}
        """
You may also want to modify a few lines of R code so that your software gets included in the plots generated at the end of each PEWO procedure.
Currently, there is one R script per PEWO procedure:
- scripts/R/eval_accuracy_plots.R : generates PAC procedure plots
- scripts/R/eval_likelihood_plots.R : generates LAC procedure plots
- scripts/R/eval_accuracy_plots.R : generates RES procedure plots
For PAC and LAC, just add two lines in the header of the R script:
pplacer<-c("ms","sb","mp")
rappas<-c("k","o","red","ar")
[...]
apples<-c("meth","crit") # <-- added line
soft_params<-list(
epa=epa,
pplacer=pplacer,
rappas=rappas,
[...]
apples=apples # <-- added line
)
For RES (resource evaluation), you need to consider a supplementary element: which analysis steps are required to actually produce placements. While each step (alignment, database construction, placement itself) is measured and plotted independently, this script also generates plots combining all steps.
pplacer<-c("ms","sb","mp")
rappas<-c("k","o","red","ar")
[...]
apples<-c("meth","crit") # <-- added line
soft_params<-list(
"hmm-align"=hmmbuild,
"epa-placement"=epa,
"pplacer-placement"=pplacer,
"rappas-dbbuild"=rappasdbbuild,
"rappas-placement"=rappasplacement,
[...]
"apples-placement"=apples # <-- added line
)
[...]
analyses["pplacer"]<-c("hmmer-align", "pplacer-placement")
analyses["rappas"]<-c("ansrec", "rappas-dbbuild","rappas-placement")
analyses["apples"]<-c("hmmer-align", "apples-placement") # <-- added line
Once you are satisfied with your changes, you may want to push them to the main repository of PEWO so that the whole community can start to benchmark your tool. To do so:
- add your tool in the Circle CI (continuous integration) configuration. To do so, just add your tool in the config files travis/tests/1_travis_accuracy_test/config.yaml and travis/tests/2_travis_likelihood_test/config.yaml .
  These config files are identical to normal PEWO config files, just copy-paste what you already wrote previously (see "3) Configuration file" ).
test_soft: [rappas, epa, epang, pplacer, apples] #<--added ', apples' here
[...]
# added 'config_apples: [...]' block
config_apples:
  methods: [OLS]
  criteria: [MLSE]
  Please restrict the analysis to a single parameter combination. The Circle CI jobs have a time limit and will stop after 1 hour. In this example we restricted the test to a single method and a single criterion.
- create a pull request to the parent PEWO repository; this will automatically launch minimal tests to verify that your changes did not introduce critical bugs in PEWO.
Building a new workflow is a more complex task. It requires some understanding of Snakemake as well as of the python library backing PEWO pipelines.
If you are motivated to build new procedures, feel free to try.
Also, you are welcome to contact the authors for support or for new cooperation.