Commit 99f1f93f authored by Valentina Galata

updated readme

parent 46961fc8
# PathoFact 1.0
# PathoFact 1.0 (branch SpringClean)
PathoFact is an easy-to-use modular pipeline for the metagenomic analyses of toxins, virulence factors and antimicrobial resistance.
Additionally, PathoFact combines the prediction of these pathogenic factors with the identification of mobile genetic elements.
This provides further depth to the analysis by considering the localization of the genes on mobile genetic elements (MGEs), as well as on the chromosome.
Furthermore, each module (toxins, virulence factors, and antimicrobial resistance) of PathoFact is also a standalone component, making it a flexible and versatile tool.
# Requirements
# Requirements and installation
PathoFact requires a working Python (3.6.4), snakemake (version 5.5.4) and (mini)conda installation.
If snakemake is not yet installed, it can be installed using the provided conda file (snakemake.yaml).
To install, run: conda env create -f snakemake.yaml
The main requirements are:
PathoFact provides the conda environments with the dependencies needed to run the incorporated tools.
Some of the tools themselves, however, still need to be installed.
The following tools need to be installed by the user, and the paths to the tools adjusted within the config.yaml file:
* HMMER-3.2.1
* signalp-4.1
- `git` and `git lfs`
- `conda`
The following tools can either be installed manually or the script can be run to install automatically:
Most other dependencies are either included as a `git submodule` or will be installed automatically by `snakemake` using the `conda` YAML files in `envs/`.
However, some tools need to be installed and/or configured manually.
* deepARG (v1)
* PlasFlow (v1.1)
* VirSorter (v1.0.5)
* DeepVirFinder (v1.0)
## GIT and GIT LFS
It is recommended to install using the script.
If installed manually, make sure that the tools are installed in the folder "scripts" and that the paths match those within the config.yaml.
After the installation of deepARG, make sure to manually adjust the configuration.
For this, go to the directory where the program was saved (in this case the scripts/deeparg-ss directory within the PathoFact directory) and open the files
Replace the path '/home/gustavo1/tmp/deeparg-ss/' with the current directory (deepARG path).
Finally, on Linux systems, to allow diamond to be executed, go to ./bin within the deeparg-ss directory and run:
chmod +x diamond
For more explanations on the deepARG configuration see the deepARG documentation:
## Miniconda (conda)
# Usage
# install miniconda3
chmod u+x
./ # follow the instructions
## Input configuration
The input to the ViruTox pipeline consists of: (i) an amino acid fasta file of translated gene sequences for the prediction of toxins, virulence factors and antimicrobial resistance genes, (ii) a fasta file containing nucleotide sequences of the corresponding contigs for the prediction of MGEs, and (iii) a tab-delimited table with contig names in the first column and the corresponding gene names in the second column, used to combine predictions.
Contig and gene names need to match the original names given in the fasta headers. Furthermore, make sure that no white spaces are present in the fasta headers.
All three input files used by the pipeline for one sample need to be given the same sample name, followed by the suffix .faa (amino acid, gene fasta file), .fna (nucleotide contig fasta file) or .contig (table with contig and gene names).
## Pipeline environment
## Run PathoFact
To run PathoFact, the sample name is given in the config.yaml file at "input_file". If desired, more than one sample can be run simultaneously, for example:
# create the conda environment
conda env create -f=envs/PathoFact.yaml
You can activate and deactivate the environment using `conda activate PathoFact` and `conda deactivate`, respectively.
## PathoFact
Clone the repository and its sub-modules:
# activate git lfs
git lfs install
# clone branch incl. sub-modules
git clone -b SpringClean --recursive
Perform further installation/configuration steps:
# run set-up script
## Dependency: `SignalP`
* input_file: ["SAMPLE_A","SAMPLE_B"]
Required version: `4.1`
In "OUTDIR" the path to the samples is given; the PathoFact results are deposited in the same directory.
To download the tool you need to submit a request at
* input_file: /path/to/samples
- Look for "SignalP" and click on the link
- Click on "Downloads"
- Click on the link for your platform for the version "Version 4.1g"
- Fill and submit the form
In "project" a unique name for your project needs to be given, for example:
After the installation, adjust the path for this tool in `config.yaml` (keyword `signalp`).
# Usage
## Input files
Each sample should have three input files:
- `*.fna`: FASTA file containing nucleotide sequences of the contigs
- no whitespaces in FASTA headers
- for prediction of mobile genetic elements
- `*.faa`: FASTA file containing translated gene sequences, i.e. amino acid sequences
- no whitespaces in FASTA headers
- for prediction of toxins, virulence factors and antimicrobial resistance genes
- `*.contig`: TAB-delimited file containing a mapping from contig ID (1st column) to gene ID (2nd column)
- no header, one gene ID per line
- contig and gene IDs should be the same as in the FASTA files
The files should be located in the same directory.
For each sample, the corresponding input files should have the same basename, e.g. `SAMPLE_A.fna`, `SAMPLE_A.faa` and `SAMPLE_A.contig` for sample `SAMPLE_A`.
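The input requirements listed above can be checked before starting a run. The following is a minimal sketch, not part of PathoFact: the function name `check_sample_inputs` and the wording of the messages are assumptions made for illustration. It only checks that the three files exist, that FASTA headers contain no whitespace, and that the IDs in the `*.contig` table appear in the FASTA headers.

```python
from pathlib import Path


def check_sample_inputs(directory, sample):
    """Return a list of problems found in the three input files of one sample.

    Hypothetical helper, not part of PathoFact itself.
    """
    base = Path(directory)
    problems = []
    ids = {}  # file suffix -> set of sequence IDs taken from the FASTA headers

    for suffix in (".fna", ".faa"):
        path = base / f"{sample}{suffix}"
        if not path.is_file():
            problems.append(f"missing file: {path.name}")
            continue
        ids[suffix] = set()
        with path.open() as handle:
            for line in handle:
                if not line.startswith(">"):
                    continue  # sequence line, nothing to check
                header = line[1:].rstrip("\n")
                if any(ch.isspace() for ch in header):
                    problems.append(
                        f"whitespace in header of {path.name}: {header!r}"
                    )
                ids[suffix].add(header)

    table = base / f"{sample}.contig"
    if not table.is_file():
        problems.append(f"missing file: {table.name}")
        return problems

    with table.open() as handle:
        for number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                problems.append(
                    f"{table.name} line {number}: expected 2 TAB-separated columns"
                )
                continue
            contig, gene = fields
            # Only compare IDs against FASTA files that were actually found.
            if ".fna" in ids and contig not in ids[".fna"]:
                problems.append(
                    f"{table.name} line {number}: unknown contig ID {contig!r}"
                )
            if ".faa" in ids and gene not in ids[".faa"]:
                problems.append(
                    f"{table.name} line {number}: unknown gene ID {gene!r}"
                )
    return problems
```

Calling `check_sample_inputs("/path/to/samples", "SAMPLE_A")` returns an empty list when no problem was detected.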
## Run PathoFact
* project: Project_A_PathoFact
### Configuration
By default, PathoFact runs the complete pipeline for the prediction of virulence factors, toxins and antimicrobial resistance genes.
To run only part of the pipeline, change "w" **within** the "Snakefile" to a different option:
To run PathoFact you need to adjust some parameters in `config.yaml`.
* w = 'complete' (run complete pipeline, default setting)
* w = 'Tox' (run only workflow for Toxin prediction)
* w = 'Vir' (run only workflow for Virulence prediction)
* w = 'AMR' (run only workflow for Antimicrobial resistance and mobile genetic element prediction)
- `input_file`: This is a list of sample names, e.g. `input_file: ["SAMPLE_A","SAMPLE_B"]`
- `project`: A unique project name which will be used as the name of the output directory in `OUTDIR` path (see below).
- `OUTDIR`: Path to directory containing the sample data; the output directory will be created there.
- `workflow`: PathoFact can run the complete pipeline (default) or a specific step:
- "complete": complete pipeline = toxin + virulence + AMR + MGE prediction
- "Tox": toxin prediction
- "Vir": virulence prediction
- "AMR": antimicrobial resistance (AMR) & mobile genetic elements (MGEs) prediction
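The parameters above can be sanity-checked once the configuration has been loaded into a Python mapping. This is a minimal sketch under assumptions: `check_config` and `ALLOWED_WORKFLOWS` are hypothetical names, the config is passed as a plain `dict` (loading the YAML file is left out), and only the four keys described above are checked, although the real `config.yaml` may contain further entries such as the `signalp` path.

```python
# Workflow names as listed for the `workflow` parameter.
ALLOWED_WORKFLOWS = {"complete", "Tox", "Vir", "AMR"}


def check_config(config):
    """Return a list of problems in a PathoFact-style config mapping.

    Hypothetical helper; only the four documented keys are inspected.
    """
    problems = []
    for key in ("input_file", "project", "OUTDIR", "workflow"):
        if key not in config:
            problems.append(f"missing key: {key}")

    samples = config.get("input_file")
    if samples is not None and (not isinstance(samples, list) or not samples):
        problems.append("input_file must be a non-empty list of sample names")

    workflow = config.get("workflow")
    if workflow is not None and workflow not in ALLOWED_WORKFLOWS:
        problems.append(f"unknown workflow: {workflow!r}")
    return problems


# Example values taken from this README; the path is a placeholder.
example = {
    "input_file": ["SAMPLE_A", "SAMPLE_B"],
    "project": "Project_A_PathoFact",
    "OUTDIR": "/path/to/samples",
    "workflow": "complete",
}
print(check_config(example))
```

Returning a list of messages rather than raising on the first error makes it possible to report all configuration problems at once.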
To run the snakemake pipeline an example script is given (, but the following are the basics to run the pipeline:
### Execution
* snakemake -s Snakefile --use-conda -p
Alternatively, one can adjust the number of threads per job (when analysing bigger files it is advised to run on either multiple cores or cores with more memory):
Basic command to run the pipeline using `<cores>` CPUs:
* snakemake -s Snakefile -j [number of threads/jobs] --use-conda -p
# activate the env
conda activate PathoFact
# run the pipeline
# set <cores> to the number of cores to use, e.g. 10
snakemake -s Snakefile --use-conda --reason --cores <cores> -p
**NOTE**: Add parameter `-n` (or `--dry-run`) to the command to see which steps will be executed without running them.
**NOTE**: Add `--configfile <configfile.yaml>` to use a different config file than `config.yaml`.
**NOTE**: It is advised to run the pipeline using multiple CPUs or CPUs with more memory.
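When the pipeline is launched from a script, the command and the optional parameters from the notes above can be assembled programmatically. The sketch below is an illustration only: `snakemake_command` is a hypothetical helper, and it uses only the options that appear in this README (`-s`, `--use-conda`, `--reason`, `--cores`, `-p`, `-n`, `--configfile`).

```python
def snakemake_command(cores, dry_run=False, configfile=None):
    """Build the snakemake call from this README as an argument list.

    Hypothetical helper; it only assembles the command, it does not run it.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    command = [
        "snakemake", "-s", "Snakefile",
        "--use-conda", "--reason",
        "--cores", str(cores), "-p",
    ]
    if dry_run:
        # Show which steps would be executed without running them.
        command.append("-n")
    if configfile is not None:
        # Use a config file other than the default config.yaml.
        command += ["--configfile", configfile]
    return command


print(" ".join(snakemake_command(10, dry_run=True)))
```

The returned list can be passed to `subprocess.run(...)` from within the activated `PathoFact` conda environment; starting with `dry_run=True` is a cheap way to confirm the planned steps first.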
For more options, see the [snakemake documentation](