V. Developer instructions

PEWO package directories

PEWO mixes: a) python packages, b) snakemake rules and c) some accessory scripts written in other languages.

All these components are to be placed in specific directories. Directories that matter when adding a new placement software are highlighted by * .

Current PEWO source tree:

./PEWO_workflow
  |
  |_ demos              pre-configured demonstration workflows, used in tutorials
  |
  |_ envs               conda environment definitions
  |
  |_ rules              snakemake rules
  |  |_alignment        rules calling alignment software
  |  |_op               rules of other operations (pruning, stats, plots...)
  |  |_placement     *  rules calling placement software
  |  |_utils         *  rules input/output building functions
  |
  |_ pewo            *  pure python sub-packages developed for PEWO
  |  |_ alignment       alignment functions
  |  |_ io              input files functions
  |  |_ likelihood      likelihood procedure functions
  |
  |_ scripts            non-python scripts used in PEWO
  |  |_ R               R scripts, (results plots)
  |  |_ java            java sources, (tree manipulation)
  |
  |_ dataset            benchmark datasets (aggregated from placement papers)

Contribution rules:

New PEWO functionalities are to be coded in python 3.
snakemake rules are placed in rules/ and python 3 packages in pewo/.
The scripts/ directory is intended for rapidly importing existing pieces of code that were initially written in other languages. For instance, tree pruning functions were initially coded in java (repository PEWO_java) and are currently compiled and installed in scripts/java/. With time, these functions will be rewritten in pure python and moved to pewo/.
New dependencies are added to PEWO via the conda and pip package managers. These repositories are becoming a standard in many applications where reproducibility is key, and most useful bioinformatics tools rapidly end up there.
Rules based on binaries or sources requiring Python 2 can be executed in PEWO. To do so, do not write any python 2 code, but use python 2 based environments in the corresponding snakemake rules (see snakemake integrated package management). The corresponding environment definitions are stored in envs/.
PEWO is not intended to be a dataset repository: the demos/ and datasets/ directories are there mostly for demonstration purposes and for sharing some of the datasets used in our manuscripts. However, if you build new workflows targeting new phylogenetic placement applications, or even new demos for those applications, you are welcome to add new datasets (PEWO currently contains mostly examples related to taxonomic markers and short viral genomes).

Using your own placement software in PEWO

1) PEWO environment extension

** Early development stages **

Before reaching the stage of published software, you are likely to use hardcoded paths to your binary. Before launching a PEWO workflow, load the PEWO environment, then add the directory containing your binary to PATH (PATH entries are directories, not files). For instance, the APPLES main script run_apples.py can be made reachable with:

conda activate PEWO
export PATH=$PATH:/path/to/directory_containing_run_apples
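
Before launching a workflow, it can help to confirm that the script is actually reachable through PATH. The following is a minimal sketch using only the Python standard library; the helper name find_on_path is hypothetical and not part of PEWO.

```python
import shutil
from typing import Optional


def find_on_path(executable: str) -> Optional[str]:
    """Return the full path of `executable` if it is reachable via PATH, else None.

    Hypothetical helper, not part of PEWO. shutil.which only reports files
    that are marked as executable.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_on_path("run_apples.py")
    if location is None:
        print("run_apples.py not found: check the PATH export")
    else:
        print(f"run_apples.py found at {location}")
```

If the script is reported as missing although its directory is in PATH, a likely cause is that the file is not marked as executable.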

** Via software repositories **

One purpose of PEWO is to facilitate benchmarking with different genetic markers and testing of the latest phylogenetic placement solutions. Combining PEWO with software repositories removes the hurdle of setting up each software prior to the analyses themselves.

PEWO dependencies are controlled via conda and pip, with several advantages:

once set via a conda environment, future users will not need to care about the installation of the phylogenetic placement software: it will be automated at PEWO installation.
by default, the most up-to-date version is always installed in PEWO.
conversely, you can manually set a specific version of the phylogenetic placement software.

Conda example

If your software is available on a conda channel, just add the corresponding channel and the package name to the PEWO environment definition, which is located in envs/environment.yaml . For instance, adding RAPPAS required adding the following lines (APPLES is currently not available on any conda channel):

name: PEWO
channels:
  - bioconda
  - [...]
dependencies:
  - rappas
  - [...]

pip example

APPLES can be installed as a pip package. Conda allows pip environment encapsulation, so the package will be indirectly installed in the conda environment. For instance, adding APPLES requires:

name: PEWO
channels:
  - [...]
dependencies:
  - [...]
  - pip:
    - apples
    - [...]

2) PEWO vocabulary extension

Software names, software parameter names and directory names are controlled throughout the PEWO sources. Before building a snakemake rule for your placement software, you need to choose several names that:

follow the naming convention described in the table below
are not already reserved for placement software currently supported in PEWO

After addition to PEWO, these names will be reserved, used throughout the workflows and associated with your software.

category	rules	regexp	allowed	forbidden	already reserved in PEWO
software	lowercase, not numerical, no special characters	[a-z]+	mysoft	my_soft, Mysoft, mysoft1, my-soft, ...	epa, epang, pplacer, rappas, apples
software parameter	lowercase, not numerical, no special characters	[a-z]+	size, wx, l, p, set, ...	Size, w1, l-p, p-34, ...	g, h, bigg, ms, sb, mp, k, o, red, ar, m, c

Note that only software parameters that have an impact on placement results deserve to be reserved. Parameters related, for instance, to verbosity, input/output formats or threads have no such impact, and their combinations should not be tested in PEWO workflows.
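
The naming convention and the reservation check can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the reserved sets are copied from the table above rather than read from the PEWO sources.

```python
import re

# Names listed as "already reserved in PEWO" in the table above.
RESERVED_SOFTWARE = {"epa", "epang", "pplacer", "rappas", "apples"}
RESERVED_PARAMETERS = {"g", "h", "bigg", "ms", "sb", "mp", "k", "o", "red", "ar", "m", "c"}

# Same regular expression as in the "regexp" column.
NAME_PATTERN = re.compile(r"[a-z]+")


def follows_convention(name: str) -> bool:
    """True if the whole name is made of lowercase letters only."""
    return NAME_PATTERN.fullmatch(name) is not None


def is_available(name: str, reserved: set) -> bool:
    """True if the name follows the convention and is not reserved yet."""
    return follows_convention(name) and name not in reserved


if __name__ == "__main__":
    for candidate in ["mysoft", "my_soft", "Mysoft", "mysoft1", "rappas"]:
        print(candidate, is_available(candidate, RESERVED_SOFTWARE))
```

fullmatch is used rather than match so that a name such as mysoft1 is rejected as a whole instead of being accepted for its valid prefix.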

Some examples:

EPA: An alignment-based method. While the alignment method will impact the placement, let's focus on the placement itself.

The -g command-line option defines the proportion of branches that will be fully optimized, and therefore has an impact on placement results. We reserved the name "epa" and the parameter "g".

EPA-ng: An alignment-based method. While the alignment method has an impact on placement, let's focus on the placement itself.

It offers 4 modes to accelerate placement via different heuristics. We reserved the name "epang" and the parameter "h" (heuristic). Then:

heuristic 1 is a re-implementation of the EPA approach, controlled by command line option -g . As this parameter is exactly the same as in EPA, we do not need to reserve a new parameter name: "g" is already reserved.
heuristic 2 is a novel, faster alternative, with search depth controlled by option -G . We reserve the parameter name "bigg" (read "big G").
heuristic 3 is a re-implementation of the pplacer approach; however, its parameters cannot be tuned via command-line. No parameter name to reserve.
heuristic 4 is, in fact, "no heuristics" (heuristics deactivated). No parameter name to reserve.

Attributing a parameter to the choice of heuristic makes it possible to test the impact of -g and -G values (in heuristics 1 and 2 respectively), but also to produce statistics comparing EPA-ng to itself across its different heuristics.
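
To make the role of the "h" parameter concrete, the sketch below enumerates the EPA-ng runs implied by the four heuristics. It is an illustration only: the function name and the example values are hypothetical, and PEWO itself may organise these combinations differently.

```python
from typing import Dict, List, Optional, Tuple

# Reserved parameter tuned in each EPA-ng heuristic, as described above.
# None means that the heuristic has no tunable parameter to reserve.
HEURISTIC_PARAMETER: Dict[int, Optional[str]] = {1: "g", 2: "bigg", 3: None, 4: None}

Run = Tuple[int, Optional[str], Optional[float]]


def enumerate_runs(values: Dict[str, List[float]]) -> List[Run]:
    """List (heuristic, parameter name, parameter value) for every run to test."""
    runs: List[Run] = []
    for heuristic, parameter in HEURISTIC_PARAMETER.items():
        if parameter is None:
            # Heuristics 3 and 4: a single run, nothing to tune.
            runs.append((heuristic, None, None))
        else:
            # Heuristics 1 and 2: one run per tested value.
            for value in values.get(parameter, []):
                runs.append((heuristic, parameter, value))
    return runs


if __name__ == "__main__":
    # Hypothetical example values for g and bigg.
    for run in enumerate_runs({"g": [0.01, 0.1], "bigg": [0.01]}):
        print(run)
```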

RAPPAS: An alignment-free, phylo-kmer based method, working in two phases: database (DB) construction and placement itself.

We reserve the name "rappas", but several steps of its algorithm have an impact on its placements:

reference alignment column filtering (which has more impact on a k-mer based approach): controlled by option --ratio-reduction. We reserved the parameter "red".
which software is used for ancestral reconstruction: controlled by option -b. To be more explicit, we associated it with and reserved the parameter name "ar".
parameters related to the phylo-kmer database: options -k and -o . We reserved the parameter names "k" and "o".

In the end, we reserved the following sets of "software name" | "parameter names" :

epa | g
epang | h, g, bigg
rappas | red, ar, k, o

** Adding APPLES **

APPLES provides two parameters which are important for placements:

weighted least squares method, option -m , with possible values being "OLS", "FM" & "BE".
placement criterion, option -c , with possible values being "MLSE", "ME" & "HYBRID".

Consequently, we decide to reserve the following names:

apples | m, c

3) Configuration file

PEWO loads workflow configuration from a single file config.yaml located at the root of the PEWO package. Following the model of the file already available in PEWO, add a section for your software. This section should briefly describe the software, which parameters impact placement results and which values should be associated to these parameters.

Some rules:

The config.yaml file is intended to be readable by anyone, not only bioinformaticians or computer scientists! Do not forget to cite the software!
Parameter names in config.yaml do not necessarily match the parameter names reserved previously (they often don't). By all means, they should be human-readable and intuitive. The association between these "human-readable" versions and the "reserved parameter names" defined in the previous section is set in the next section.
Parameter values have to match the [A-Z0-9.]+ pattern: at least one character, using only upper case letters, digits and . as the only special character. In practice, this means that values are either a string, an int or a float, as you would write them in a command-line.
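
The value pattern can be checked with a short Python sketch. The function name is hypothetical and the check is only an illustration of the rule above, not code taken from PEWO.

```python
import re
from typing import Union

# Pattern that parameter values must match, as stated above.
VALUE_PATTERN = re.compile(r"[A-Z0-9.]+")


def is_valid_value(value: Union[str, int, float]) -> bool:
    """True if the value, written as text, only uses A-Z, 0-9 and '.'."""
    return VALUE_PATTERN.fullmatch(str(value)) is not None


if __name__ == "__main__":
    for value in ["OLS", 0.01, 8, "ols", "ML-SE"]:
        print(repr(value), is_valid_value(value))
```

Note that this sketch relies on Python's text form of numbers, so a very small float that Python prints in scientific notation (with a lowercase e and a minus sign) would be rejected.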

** Example **

Short version:

config_epa:
  #float in ]0,1]
  G: [0.01]

[...]

config_apples:
  #methods available via option -m : [OLS,FM,BE]
  methods: [OLS,BE]
  #criteria available via option -c : [MLSE,ME,HYBRID]
  criteria: [MLSE,ME,HYBRID]

Final, explicit version:

config_epa:

  #EPA is alignment-based and uses a ML evaluation of the placement.
  #it uses a 2-step heuristic:
  # 1) rapid ML evaluation after insertion in the midpoint of each branch
  # 2) full optimization for top scoring branch selected at step 1.
  #(Berger et al, 2011 ; doi: 10.1093/sysbio/syr010)

  #proportion of top scoring branch for which full optimization is computed
  #float in ]0,1]
  G: [0.01]

  [...]

config_apples:

  #apples placements are based on distance computations between the query and the reference tree
  #it allows different "methods" to compute these distances and different "criteria" to select the best placement.
  #(Balaban et al, 2019 ; doi: 10.1093/sysbio/syz063)

  #List of weighted least squares methods to test.
  #Possible values are:
  # OLS: k=0 ordinary least square (Cavalli-Sforza and Edwards 1967)
  # FM : k=2 (Fitch and Margoliash, 1967)
  # BE : k=1 (Beyer et al., 1974)
  #methods: ["OLS","FM","BE"]
  #!warning, be sure to set methods VALUES as UPPER CASE
  methods: [OLS,BE]

  #List of placement criteria to test.
  #Possible values are:
  # MLSE: Least Squares Phylogenetic Placement
  # ME : Minimum Evolution
  # HYBRID : MLSE then ME
  #criteria: ["MLSE","ME","HYBRID"]
  #!warning, be sure to set criteria VALUES as UPPER CASE
  criteria: [MLSE,ME,HYBRID]

Here we explicitly state that "apples" placements depend on "methods" and "criteria", and that each of them accepts a precise set of values. Future users can get an idea of which parameters can be tested just by reading the config.yaml file.
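
To give an idea of how many runs such a section implies, the sketch below expands the two lists into individual parameter combinations. It assumes that every value of one parameter is combined with every value of the other (the default behaviour of snakemake's expand function); whether PEWO expands them exactly this way is an assumption, and the function name is hypothetical.

```python
from itertools import product
from typing import Dict, List

# Same values as in the config_apples section shown above.
config_apples = {"methods": ["OLS", "BE"], "criteria": ["MLSE", "ME", "HYBRID"]}


def parameter_combinations(section: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return one dict per combination of the values listed in the section."""
    keys = list(section)
    value_lists = [section[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


if __name__ == "__main__":
    combinations = parameter_combinations(config_apples)
    print(len(combinations), "runs")
    for combination in combinations:
        print(combination)
```

With two methods and three criteria, this yields six combinations, which is why adding a value to config.yaml can noticeably increase the runtime of a workflow.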

Snakemake rule

Now that names are reserved for your placement software and that the config file can be used to set which values can be tested, it is time to build a Snakemake rule automating the launch of APPLES in the different PEWO workflows. pewo/templates.py provides general functions to generate the output file names produced by these workflows. To incorporate your software in PEWO, you need to update some of these functions.

Register software

Add your software to pewo/software.py (respect the UPPERCASE = "lowercase" approach):

class PlacementSoftware(Enum):
    EPA = "epa"
    EPANG = "epang"
    [...]
    APPLES = "apples"
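
The UPPERCASE = "lowercase" approach means that the member name is used in python code while the value is the reserved software name. The sketch below uses a reduced stand-in for the class (only three members, defined locally so that the example is self-contained) to show the standard Enum lookups this enables.

```python
from enum import Enum


class PlacementSoftware(Enum):
    """Reduced stand-in for the class in pewo/software.py (illustration only)."""
    EPA = "epa"
    EPANG = "epang"
    APPLES = "apples"


# Lookup by reserved name (the enum value), e.g. a name read from a config file.
software = PlacementSoftware("apples")

# The member name is the uppercase identifier, the value is the reserved name.
print(software.name, software.value)

# An unregistered name raises ValueError, which catches typos early.
try:
    PlacementSoftware("mysoft")
except ValueError:
    print("mysoft is not registered")
```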

Update pewo/templates.py

You need to update a few functions in this file.

First, we need to describe the name convention for new files produced by the software. Each run of the software normally takes a query and a set of parameters as input, and produces an output file. In PEWO, we use the following name convention for output files:

SOFTWARE/CommonArg1/CommonArg2{value}/.../CommonArg1_CommonArg2{value}_..._SpecificArg1{value}_SpecificArg2{value}_SOFTWARE.EXTENSION

get_output_template_args describes software arguments used to form a run of software, and how to retrieve them from the human-readable config representation. By default, every run takes common arguments (referred to as CommonArgX in the example above, also see get_common_template_args) and software-specific arguments (referred to as SpecificArgX in the example). Add specific arguments related to your software, like in the example below:

def get_output_template_args(config: Dict, software: PlacementSoftware, **kwargs) -> Dict[str, Any]:
    # ...
    if software == PlacementSoftware.EPA:
        ...  # existing branches for the other software (elided)
    elif software == PlacementSoftware.APPLES:
        # keys are the names used in the file name templates,
        # values are the lists read from the config file
        template_args["meth"] = config["config_apples"]["methods"]
        template_args["crit"] = config["config_apples"]["criteria"]

        # Add other arguments here if needed. Note that you can create a new name convention
        # for the arguments here, or just use the same one used in the config:
        #
        # template_args.update(config["config_apples"])

Next, you need to update get_queryname_template, which generates a name for every unique combination of software parameters:

def get_queryname_template(config: Dict, software: PlacementSoftware, **kwargs) -> str:
    # ...
    if software == PlacementSoftware.EPA:
        ...  # existing branches for the other software (elided)
    elif software == PlacementSoftware.APPLES:
        # common part of the name, followed by the APPLES-specific arguments
        return get_common_queryname_template(config) + "_meth{meth}_crit{crit}"

In this example, meth and crit are the keys of the dictionary built in get_output_template_args.
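
The placeholders in such a template behave like ordinary python format fields. The sketch below fills one by hand to show the resulting name; the "{pruning}" prefix is a hypothetical stand-in for whatever get_common_queryname_template returns, and the helper name is not part of PEWO.

```python
def apples_queryname_template(common_template: str) -> str:
    """Append the APPLES-specific arguments to a common template."""
    return common_template + "_meth{meth}_crit{crit}"


# "{pruning}" is a hypothetical common part, used here for illustration only.
template = apples_queryname_template("{pruning}")

# Fill the placeholders for one combination of parameter values.
name = template.format(pruning=0, meth="OLS", crit="MLSE")
print(name)  # 0_methOLS_critMLSE
```

In a workflow, the placeholders are not filled by hand: snakemake resolves them as wildcards, one output name per combination of values.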

We also need to describe the output directory name by updating get_experiment_dir_template:

def get_experiment_dir_template(config: Dict, software: PlacementSoftware, **kwargs) -> str:
    # ...
    if software == PlacementSoftware.EPA:
        ...  # existing branches for the other software (elided)
    elif software == PlacementSoftware.APPLES:
        # corresponds to 'workdir/APPLES/{pruning}/meth{meth}_crit{crit}/'
        return os.path.join(software_dir, input_set_dir_template, "meth{meth}_crit{crit}")

After these updates, you will be able to use get_output_template to generate output file names for your rule (see the next section).
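
Putting the directory template and the file name template together gives a full output path that follows the name convention described earlier. The sketch below is an illustration under stated assumptions: both helper functions are hypothetical, "{pruning}" stands in for the common arguments, and posixpath is used instead of os.path only to keep the result identical on every platform. The actual get_output_template in PEWO may assemble paths differently.

```python
import posixpath


def experiment_dir_template(software_dir: str, input_set_dir_template: str) -> str:
    """Directory template: software dir / common part / software-specific part."""
    return posixpath.join(software_dir, input_set_dir_template, "meth{meth}_crit{crit}")


def output_file_template(dir_template: str, queryname_template: str,
                         software_name: str, extension: str) -> str:
    """File template: directory / queryname_SOFTWARE.EXTENSION."""
    filename = f"{queryname_template}_{software_name}.{extension}"
    return posixpath.join(dir_template, filename)


# Hypothetical values for the directory and the common arguments.
dir_template = experiment_dir_template("workdir/APPLES", "{pruning}")
file_template = output_file_template(
    dir_template, "{pruning}_meth{meth}_crit{crit}", "apples", "jplace"
)

# One concrete output path for one combination of parameter values.
example_path = file_template.format(pruning=0, meth="OLS", crit="MLSE")
print(example_path)
```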

Create a snakemake rule

Finally, create the placement rule itself. The way you design the rule will mostly depend on the algorithm behind your placement software. The following inputs and outputs are managed at PEWO level (see @results_structure):

input alignment
input tree
input queries
input alignment+queries (via hmmalign)
output jplace

You may need more than one rule, in particular if there are intermediary steps such as, in the case of RAPPAS, launching an external ancestral reconstruction software and building a database.

The case of APPLES is simpler: a single command-line a) takes as input the reference alignment, the queries aligned to it and the reference tree, and b) produces the jplace result as output.

The rule will follow the following template:

from pewo.software import PlacementSoftware
from pewo.templates import get_output_template, get_log_template

rule placement_apples:
    input:
        # functions or lists of files (the names on the right are placeholders);
        # the keys must match the {input.r}, {input.q} and {input.t} used in the shell command
        r = path_to_reference_alignment,
        q = path_to_aligned_queries,
        t = path_to_reference_tree
    output:
        # generate .jplace output file name based on the input parameters
        jplace = get_output_template(config, PlacementSoftware.APPLES, "jplace")
    log:
        # generate a log file name based on the input parameters
        get_log_template(config, PlacementSoftware.APPLES)
    version: "1.0"
    # uncomment and fill if additional parameters are needed
    # params:
    #     param1 = ...
    #     param2 = ...
    shell:
        # the command-line itself, generalised using previous fields
        """
        run_apples.py -s {input.r} -q {input.q} -t {input.t} -T 1 -m {wildcards.meth} -c {wildcards.crit} -o {output.jplace} &> {log}
        """

4) Optional: visualisation

Construction of new workflows
