# PathoFact 1.0 (branch SpringClean)

PathoFact is an easy-to-use modular pipeline for the metagenomic analyses of toxins, virulence factors and antimicrobial resistance. 
Additionally, PathoFact combines the prediction of these pathogenic factors with the identification of mobile genetic elements. 
This provides further depth to the analysis by considering the localization of the genes on mobile genetic elements (MGEs), as well as on the chromosome. 
Furthermore, each module (toxins, virulence factors, and antimicrobial resistance) of PathoFact is also a standalone component, making it a flexible and versatile tool. 

# Requirements and installation

The main requirements are:

- `git` and `git lfs`
- `conda`

Most other dependencies are either included as a `git submodule` or will be installed automatically by `snakemake` using the `conda` YAML files in `envs/`.
However, some tools need to be installed and/or configured manually.


## Miniconda (conda)

```bash
# install miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod u+x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh # follow the instructions
```

## Pipeline environment

```bash
# create the conda environment
conda env create -f=envs/PathoFact.yaml
```

You can activate and deactivate the environment using `conda activate PathoFact` and `conda deactivate`, respectively.

## PathoFact

Clone the repository and its sub-modules:

```bash
# activate git lfs
git lfs install
# clone branch incl. sub-modules
git clone -b SpringClean --recursive https://git-r3lab.uni.lu/laura.denies/PathoFact.git
```

Perform further installation/configuration steps:

```bash
# run set-up script
./set-up.sh
```

## Dependency: `SignalP`

Required version: `4.1`

To download the tool you need to submit a request at https://services.healthtech.dtu.dk/:

- Look for "SignalP" and click on the link
- Click on "Downloads"
- Click on the link for your platform for the version "Version 4.1g"
- Fill in and submit the form

After the installation, adjust the path for this tool in `config.yaml` (keyword `signalp`).

# Usage

## Input files

Each sample should have three input files:

- `*.fna`: FASTA file containing nucleotide sequences of the contigs
    - no whitespace in FASTA headers
    - for prediction of mobile genetic elements
- `*.faa`: FASTA file containing translated gene sequences, i.e. amino acid sequences
    - no whitespace in FASTA headers
    - for prediction of toxins, virulence factors and antimicrobial resistance genes
- `*.contig`: TAB-delimited file containing a mapping from contig ID (1st column) to gene ID (2nd column)
    - no header, one gene ID per line
    - contig and gene IDs should be the same as in the FASTA files
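
If the gene sequences were predicted with a tool that names genes `<contig ID>_<number>` (as Prodigal does), the `*.contig` mapping can be derived from the `*.faa` headers. The following is a minimal sketch under that naming assumption; the function name `make_contig_map` is hypothetical and not part of PathoFact:

```bash
# Hypothetical helper, NOT part of PathoFact.
# Assumption: gene IDs have the form <contig ID>_<number> (Prodigal-style).
# Usage: make_contig_map SAMPLE_A.faa > SAMPLE_A.contig
make_contig_map() {
    awk '/^>/ {
        gene = substr($1, 2)          # header up to the first whitespace, without ">"
        contig = gene
        sub(/_[0-9]+$/, "", contig)   # strip the trailing _<number> to get the contig ID
        print contig "\t" gene        # 1st column: contig ID, 2nd column: gene ID
    }' "$1"
}
```

If your gene IDs follow a different scheme, the `sub()` pattern has to be adapted accordingly.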

The files should be located in the same directory.
For each sample, the corresponding input files should have the same basename, e.g. `SAMPLE_A.fna`, `SAMPLE_A.faa` and `SAMPLE_A.contig` for sample `SAMPLE_A`.
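
Before starting a run, it can help to check these requirements for each sample. The sketch below is one possible way to do so; `check_sample` is a hypothetical helper, not something shipped with PathoFact, and it only tests the points listed above (files present, no whitespace in FASTA headers, two TAB-separated columns in the mapping file):

```bash
# Hypothetical helper, NOT part of PathoFact.
# Usage: check_sample <input_dir> <sample_name>
# Returns 0 if all checks pass, 1 otherwise; problems are reported on stderr.
check_sample() {
    local dir="$1" sample="$2" status=0 ext f
    # all three input files must exist and be non-empty
    for ext in fna faa contig; do
        f="$dir/$sample.$ext"
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    # FASTA headers must not contain whitespace
    for ext in fna faa; do
        f="$dir/$sample.$ext"
        if [ -f "$f" ] && grep -q '^>.*[[:space:]]' "$f"; then
            echo "whitespace in FASTA header: $f" >&2
            status=1
        fi
    done
    # the mapping file needs exactly two TAB-separated columns per line
    f="$dir/$sample.contig"
    if [ -f "$f" ] && awk -F '\t' 'NF != 2 { bad = 1 } END { exit !bad }' "$f"; then
        echo "expected two TAB-separated columns: $f" >&2
        status=1
    fi
    return "$status"
}
```

For example, `check_sample /path/to/data SAMPLE_A && echo "SAMPLE_A looks fine"`. The check does not verify that the IDs in the mapping file match those in the FASTA files.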

## Run PathoFact

### Configuration

To run PathoFact you need to adjust some parameters in `config.yaml`.

- `input_file`: This is a list of sample names, e.g. `input_file: ["SAMPLE_A","SAMPLE_B"]`
- `project`: A unique project name which will be used as the name of the output directory in `OUTDIR` path (see below).
- `OUTDIR`: Path to directory containing the sample data; the output directory will be created there.
- `workflow`: PathoFact can run the complete pipeline (default) or a specific step:
    - "complete": complete pipeline = toxin + virulence + AMR + MGE prediction
    - "Tox": toxin prediction
    - "Vir": virulence prediction
    - "AMR": antimicrobial resistance (AMR) & mobile genetic elements (MGEs) prediction
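
As an illustration, the parameters above might look as follows in `config.yaml`. This is only a fragment, and it assumes that the keys sit at the top level of the file, as the list suggests; the project name and the path are placeholders, and all other keys of the shipped `config.yaml` (e.g. `signalp`) need to stay in place:

```yaml
# Fragment only: keep every other key from the shipped config.yaml (e.g. signalp).
# Sample names, project name and path are placeholders.
input_file: ["SAMPLE_A", "SAMPLE_B"]
project: my_project            # name of the output directory inside OUTDIR
OUTDIR: /path/to/sample/data   # directory containing SAMPLE_A.fna, SAMPLE_A.faa, ...
workflow: "complete"           # or "Tox", "Vir", "AMR"
```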

### Execution

Basic command to run the pipeline using `<cores>` CPUs:

```bash
# activate the env
conda activate PathoFact
# run the pipeline
# set <cores> to the number of cores to use, e.g. 10
snakemake -s Snakefile --use-conda --reason --cores <cores> -p 
```

**NOTE**: Add parameter `-n` (or `--dry-run`) to the command to see which steps will be executed without running them.

**NOTE**: Add `--configfile <configfile.yaml>` to use a different config file than `config.yaml`. 

**NOTE**: It is advisable to run the pipeline with multiple CPU cores and on a machine with ample memory.
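
The options from the notes above can be combined with the basic command. As a convenience, one could wrap the call in a small shell function; `run_pathofact` below is a hypothetical wrapper, not part of PathoFact, and it assumes that the `PathoFact` conda environment is active and that it is called from the repository directory:

```bash
# Hypothetical wrapper, NOT part of PathoFact.
# Usage: run_pathofact <cores> [additional snakemake options...]
run_pathofact() {
    local cores="$1"
    shift
    # same basic command as above; any further arguments are passed on to snakemake
    snakemake -s Snakefile --use-conda --reason --cores "$cores" -p "$@"
}

# Example (commented out): dry run with 10 cores and an alternative config file
# run_pathofact 10 --configfile my_config.yaml -n
```

Here `my_config.yaml` is a placeholder for your own config file.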

For more options, see the [snakemake documentation](https://snakemake.readthedocs.io/en/stable/index.html).