# PathoFact v1.0 

PathoFact is an easy-to-use modular pipeline for the metagenomic analyses of toxins, virulence factors and antimicrobial resistance. 
Additionally, PathoFact combines the prediction of these pathogenic factors with the identification of mobile genetic elements. 
This provides further depth to the analysis by considering the localization of the genes on mobile genetic elements (MGEs), as well as on the chromosome. 
Furthermore, each module (toxins, virulence factors, and antimicrobial resistance) of PathoFact is also a standalone component, making it a flexible and versatile tool. 

# Requirements and installation

The main requirements are:

- `git` and `git lfs`
- `conda`

Most other dependencies are either included as a `git submodule` or will be installed automatically by `snakemake` using the `conda` YAML files in `envs/`.
However, some tools need to be installed and/or configured manually.

## Miniconda (conda)

```bash
# install miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod u+x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh # follow the instructions
```

## PathoFact

Clone the repository and its sub-modules:

```bash
# activate git lfs
git lfs install
# clone branch incl. sub-modules
git clone -b PathoFact_updates --recursive https://git-r3lab.uni.lu/laura.denies/PathoFact.git
```

## Pipeline environment

```bash
# create the conda environment
conda env create -f=envs/PathoFact.yaml
```

You can activate and deactivate the environment using `conda activate PathoFact` and `conda deactivate`, respectively.

## Dependency: `SignalP`

Required version: `5.0`

To download the tool you need to submit a request at https://services.healthtech.dtu.dk/:

- Look for "SignalP" and click on the link
- Click on "Downloads"
- Click on the link for your platform for the required version (5.0)
- Fill and submit the form

After the installation, adjust the path for this tool in `config.yaml` (keyword `signalp`).

# Usage

## Input files

Each sample should have one input file:

- `*.fna`: FASTA file containing nucleotide sequences of the contigs
    - no whitespaces in FASTA headers
    - used for the prediction of mobile genetic elements and as input for Prodigal
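
The "no whitespaces in FASTA headers" requirement can be checked before starting a run. The following is a minimal sketch, not part of PathoFact: `check_fasta_headers` is a hypothetical helper name, and the two demo files are created only to show both outcomes.

```bash
#!/usr/bin/env bash
# Sketch: report FASTA headers that contain whitespace.
# check_fasta_headers is a hypothetical helper, not part of PathoFact.
check_fasta_headers() {
    local fasta="$1"
    # Header lines start with ">"; -n prints the line number of each offending header.
    if grep -n '^>.*[[:space:]]' "$fasta"; then
        echo "ERROR: whitespace found in headers of $fasta" >&2
        return 1
    fi
    echo "OK: $fasta"
}

# Demo on two throw-away files
tmpdir="$(mktemp -d)"
printf '>contig_1\nACGT\n>contig_2\nGGCC\n' > "$tmpdir/good.fna"
printf '>contig_1 length=4\nACGT\n' > "$tmpdir/bad.fna"

check_fasta_headers "$tmpdir/good.fna"          # no offending headers
check_fasta_headers "$tmpdir/bad.fna" || true   # lists the offending header, returns 1
```

For real data, call the function on each `*.fna` file in your data directory instead of the demo files.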

The following files are generated by PathoFact itself:

- `*.faa`: FASTA file containing translated gene sequences, i.e. amino acid sequences
    - no whitespaces in FASTA headers
    - for prediction of toxins, virulence factors and antimicrobial resistance genes
- `*.contig`: TAB-delimited file containing a mapping from contig ID (1st column) to gene ID (2nd column)
    - no header, one gene ID per line
    - contig and gene IDs should be the same as in the FASTA files
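
To illustrate the `*.contig` layout described above, the following sketch derives such a mapping from the headers of a `*.faa` file. It assumes Prodigal-style gene IDs of the form `<contig ID>_<gene number>`; this is an assumption made for illustration and not necessarily how PathoFact builds the file. The file names and sequences are made up.

```bash
#!/usr/bin/env bash
# Demo input: three genes on two contigs (hypothetical data)
tmpdir="$(mktemp -d)"
printf '>contig_1_1\nMKV\n>contig_1_2\nMAL\n>contig_2_1\nMST\n' > "$tmpdir/SAMPLE_A.faa"

# For each header: gene ID = header without ">",
# contig ID = gene ID without the trailing "_<number>" (assumed naming scheme)
awk '/^>/ {
    gene = substr($1, 2)
    contig = gene
    sub(/_[0-9]+$/, "", contig)
    print contig "\t" gene
}' "$tmpdir/SAMPLE_A.faa" > "$tmpdir/SAMPLE_A.contig"

# One line per gene, no header: contig ID <TAB> gene ID
cat "$tmpdir/SAMPLE_A.contig"
```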

The input files of all samples should be located in the same directory.
For each sample, the corresponding input files should have the same basename, e.g. `SAMPLE_A.fna` for sample `SAMPLE_A`.

**NOTE**: For preprocessing and assembly of metagenomic reads, we suggest using IMP (https://imp.pages.uni.lu/web/).

## Run PathoFact

### Configuration

To run PathoFact you need to adjust some parameters in `config.yaml`.

- `sample`: This is a list of sample names, e.g. `sample: ["SAMPLE_A","SAMPLE_B"]`
- `project`: A unique project name which will be used as the name of the output directory inside the `datapath` directory (see below).
- `datapath`: Path to directory containing the sample data; the output directory will be created there.
- `workflow`: PathoFact can run the complete pipeline (default) or a specific step:
    - "complete": complete pipeline = toxin + virulence + AMR + MGE prediction
    - "Tox": toxin prediction
    - "Vir": virulence prediction
    - "AMR": antimicrobial resistance (AMR) & mobile genetic elements (MGEs) prediction
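
Since sample names follow from the basenames of the `*.fna` files, the `sample` entry can be generated rather than typed by hand. The following is a minimal sketch under that assumption; the demo directory and file names are made up, and the printed line still has to be copied into `config.yaml` manually.

```bash
#!/usr/bin/env bash
# datapath is a throw-away demo directory here; point it to your own data directory.
datapath="$(mktemp -d)"
touch "$datapath/SAMPLE_A.fna" "$datapath/SAMPLE_B.fna"

samples=""
for fna in "$datapath"/*.fna; do
    name="$(basename "$fna" .fna)"              # SAMPLE_A.fna -> SAMPLE_A
    samples="${samples:+$samples,}\"$name\""    # comma-separated, quoted names
done

sample_line="sample: [$samples]"
echo "$sample_line"   # prints: sample: ["SAMPLE_A","SAMPLE_B"]
```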

### Execution

Basic command to run the pipeline using `<cores>` CPUs:

```bash
# activate the env
conda activate PathoFact
# run the pipeline
# set <cores> to the number of cores to use, e.g. 10
snakemake -s Snakefile --use-conda --reason --cores <cores> -p 
```

**NOTE**: Add parameter `-n` (or `--dry-run`) to the command to see which steps will be executed without running them.

**NOTE**: Add `--configfile <configfile.yaml>` to use a different config file than `config.yaml`. 

**NOTE**: It is advised to run the pipeline with multiple CPUs and, where available, on a machine with more memory.

For more options, see the [snakemake documentation](https://snakemake.readthedocs.io/en/stable/index.html).
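
The options from the notes above can be combined into a single invocation. The sketch below only assembles and prints the command; the core count and the config file name `my_config.yaml` are placeholders, not files shipped with PathoFact.

```bash
#!/usr/bin/env bash
cores=4                        # placeholder value
configfile="my_config.yaml"    # hypothetical config file name

# -n requests a dry run; remove it to actually execute the steps
cmd=(snakemake -s Snakefile --use-conda --reason --cores "$cores" -p
     --configfile "$configfile" -n)

echo "${cmd[*]}"
# To launch it from the PathoFact directory with the environment active: "${cmd[@]}"
```

Keeping the command in an array makes it easy to add or drop single options such as `-n` without retyping the whole line.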

### Execution on a cluster

The pipeline can be run on a cluster using `slurm`.
The command can be found in the script `cluster.sh` which can also be used to submit the jobs to the cluster.

```bash
sbatch cluster.sh
```
### Test module

To verify that the pipeline is installed correctly, run the test module.

First, add the path to the SignalP v5.0 installation to the config file `test/test_config.yaml`, then run:

```bash
# activate env
conda activate PathoFact
# run the pipeline
snakemake -s test/Snakefile --use-conda --reason --cores 1 -p
```