# PathoFact v1.0 

PathoFact is an easy-to-use modular pipeline for the metagenomic analyses of toxins, virulence factors and antimicrobial resistance. 
Additionally, PathoFact combines the prediction of these pathogenic factors with the identification of mobile genetic elements. 
This provides further depth to the analysis by considering the localization of the genes on mobile genetic elements (MGEs), as well as on the chromosome. 
Furthermore, each module (toxins, virulence factors, and antimicrobial resistance) of PathoFact is also a standalone component, making it a flexible and versatile tool. 

For further information regarding usage and generated reports, please see the [Documentation](https://git-r3lab.uni.lu/laura.denies/PathoFact/-/wikis/home).

# Requirements and installation

The main requirements are:
- `gcc/g++`
- `git` and `git lfs`
- `conda`

Most other dependencies are either included as a `git submodule` or will be installed automatically by `snakemake` using the `conda` YAML files in `envs/`.
However, some tools need to be installed and/or configured manually.
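
As a quick sanity check before installing anything else, the main requirements can be probed from the shell. This is only a minimal sketch: the helper name `missing_tools` is made up for illustration, and it assumes that `git lfs` is provided by an executable called `git-lfs`, which is the usual setup.

```bash
# Hypothetical helper: print every given tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: `git lfs` is backed by an executable named `git-lfs`.
missing="$(missing_tools gcc g++ git git-lfs conda)"
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All main requirements found."
fi
```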

## Miniconda (conda)

```bash
# install miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod u+x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh # follow the instructions
```

## PathoFact

Clone the repository and its sub-modules:

```bash
# activate git lfs
git lfs install
# clone branch incl. sub-modules
git clone -b master --recursive https://git-r3lab.uni.lu/laura.denies/PathoFact.git
```

## Pipeline environment

```bash
# create the conda environment
conda env create -f=envs/PathoFact.yaml
```

You can activate and deactivate the environment using `conda activate PathoFact` and `conda deactivate`, respectively. PathoFact requires `snakemake` version 5.5.4 or later.
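
The version requirement can be checked from within the activated environment. The following is a sketch rather than part of PathoFact: the helper `version_ge` is a made-up name, and the comparison relies on the version sort (`-V`) of GNU `sort`.

```bash
# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on the version sort (-V) of GNU coreutils `sort`.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed snakemake (if any) against the required minimum.
if command -v snakemake >/dev/null 2>&1; then
    if version_ge "$(snakemake --version)" "5.5.4"; then
        echo "snakemake version is sufficient"
    else
        echo "snakemake is older than 5.5.4"
    fi
else
    echo "snakemake not found; activate the PathoFact environment first"
fi
```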

## Dependency: `SignalP`

Required version: `5.0`

To download the tool you need to submit a request at https://services.healthtech.dtu.dk/:

- Look for "SignalP" and click on the link
- Click on "Downloads"
- Click on the link for your platform under "Version 5.0g"
- Fill in and submit the form

After the installation, adjust the path for this tool in `config.yaml` and `test/test_config.yaml` (keyword `signalp`).
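
To confirm that both files were updated, the lines containing the `signalp` keyword can be listed. This is a small sketch: the helper name `show_signalp_setting` is hypothetical, and it makes no assumption about the layout of the config files beyond the keyword named above.

```bash
# Hypothetical helper: list every line mentioning the `signalp` keyword,
# prefixed with the file name and line number.
show_signalp_setting() {
    grep -Hn 'signalp' "$@"
}

# Check both config files, skipping any that are not present.
for cfg in config.yaml test/test_config.yaml; do
    if [ -f "$cfg" ]; then
        show_signalp_setting "$cfg" || echo "no signalp entry found in $cfg"
    fi
done
```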

# Usage

## Input files

Each sample should have one input file:

- `*.fna`: FASTA file containing nucleotide sequences of the contigs
    - no whitespace in FASTA headers (e.g. to strip everything after the first space: `sed -i '/^>/ s/ .*//' file.fna`)
    - used for the prediction of mobile genetic elements and as input for `prodigal`
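
Before running the pipeline it can help to check whether any header still contains whitespace. The sketch below uses a throwaway file with made-up contig names; the helper `count_bad_headers` is hypothetical, and `sed -i` without a backup suffix assumes GNU `sed`.

```bash
# Hypothetical helper: count FASTA header lines (starting with ">")
# that contain whitespace. grep exits non-zero on zero matches, hence `|| true`.
count_bad_headers() {
    grep -c '^>.*[[:space:]]' "$1" || true
}

# Small demonstration on a throwaway file with made-up contig names.
demo="$(mktemp)"
printf '>contig_1 length=100\nACGT\n>contig_2\nTTGA\n' > "$demo"
count_bad_headers "$demo"        # prints 1

# Strip everything after the first space in each header (GNU sed).
sed -i '/^>/ s/ .*//' "$demo"
count_bad_headers "$demo"        # prints 0
```

Note that the `sed` expression only handles spaces; headers containing TAB characters would still be counted by the check.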

The following files are generated by PathoFact itself:

- `*.faa`: FASTA file containing translated gene sequences, i.e. amino acid sequences
    - no whitespaces in FASTA headers
    - for prediction of toxins, virulence factors and antimicrobial resistance genes
- `*.contig`: TAB-delimited file containing a mapping from contig ID (1st column) to gene ID (2nd column)
    - no header, one gene ID per line
    - contig and gene IDs should be the same as in the FASTA files
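
When a module is run standalone with a self-made `*.contig` file, its layout can be checked with a short `awk` call. This is a sketch under the format described above; the helper name `check_contig_map` and the IDs in the demonstration are made up.

```bash
# Hypothetical helper: print every line that does not consist of exactly two
# non-empty, TAB-separated columns. No output means the layout looks fine.
check_contig_map() {
    awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" { print "line " NR ": " $0 }' "$1"
}

# Demonstration with made-up IDs: the third line uses a space instead of a TAB.
map="$(mktemp)"
printf 'contig_1\tgene_1\ncontig_1\tgene_2\ncontig_2 gene_3\n' > "$map"
check_contig_map "$map"
```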

The input files of all samples should be located in the same directory.
For each sample, the corresponding input files should have the same basename, e.g. `SAMPLE_A.fna` for sample `SAMPLE_A`.
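
Because the sample name is simply the basename of the `*.fna` file, the names can be derived from the data directory. The helper `list_samples` below is a hypothetical convenience, not part of PathoFact.

```bash
# Hypothetical helper: print one sample name per *.fna file in directory $1.
list_samples() {
    for f in "$1"/*.fna; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern if nothing matches
        basename "$f" .fna
    done
}

# List the samples found in the current directory.
list_samples .
```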

**NOTE**: For preprocessing and assembly of metagenomic reads, we suggest using IMP (https://imp.pages.uni.lu/web/).

## Run PathoFact

### Configuration

To run PathoFact you need to adjust some parameters in `config.yaml`.

- `sample`: This is a list of sample names, e.g. `sample: ["SAMPLE_A","SAMPLE_B"]`
- `project`: A unique project name which will be used as the name of the output directory inside the `datapath` directory (see below).
- `datapath`: Path to directory containing the sample data; the output directory will be created there.
- `workflow`: PathoFact can run the complete pipeline (default) or a specific step:
    - "complete": complete pipeline = toxin + virulence + AMR + MGE prediction
    - "Tox": toxin prediction
    - "Vir": virulence prediction
    - "AMR": antimicrobial resistance (AMR) & mobile genetic elements (MGEs) prediction
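
For orientation, the four parameters above might look as follows once filled in. This is a sketch with placeholder values, written to a scratch file rather than to `config.yaml`: the project name and path are invented, and whether the keys sit at the top level or are nested under another key is an assumption, so follow the layout of the `config.yaml` shipped with the repository. Other keys, such as `signalp`, are not shown.

```bash
# Write a hypothetical config fragment to a scratch file for review.
# All values are placeholders; the key nesting is an assumption.
fragment="${TMPDIR:-/tmp}/pathofact_config_fragment.yaml"
cat > "$fragment" <<'EOF'
sample: ["SAMPLE_A", "SAMPLE_B"]
project: "my_project"
datapath: "/path/to/samples"
workflow: "complete"
EOF

# Show the fragment so it can be compared with the real config.yaml.
cat "$fragment"
```

With these placeholder values, the input files would be expected at `/path/to/samples/SAMPLE_A.fna` and `/path/to/samples/SAMPLE_B.fna`.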

### Execution

Basic command to run the pipeline using `<cores>` CPUs:

```bash
# activate the env
conda activate PathoFact
# run the pipeline
# set <cores> to the number of cores to use, e.g. 10
snakemake -s Snakefile --use-conda --reason --cores <cores> -p 
```

**NOTE**: Add parameter `-n` (or `--dry-run`) to the command to see which steps will be executed without running them.

**NOTE**: Add `--configfile <configfile.yaml>` to use a different config file than `config.yaml`. 

**NOTE**: We recommend running the pipeline with multiple CPUs and on a machine with ample memory.

For more options, see the [snakemake documentation](https://snakemake.readthedocs.io/en/stable/index.html).
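
The notes above can be combined, for example to dry-run the pipeline with a copy of the config file before starting the real run. The sketch below only prints the resulting command so that it can be reviewed first; the helper name `pathofact_cmd` and the file name `my_config.yaml` are made up for illustration.

```bash
# Hypothetical helper: print the snakemake call for a given number of cores
# and config file. Any further arguments (e.g. -n) are appended unchanged.
pathofact_cmd() {
    local cores="$1" configfile="$2"
    shift 2
    echo "snakemake -s Snakefile --use-conda --reason --cores ${cores} -p --configfile ${configfile}" "$@"
}

# Dry run first, then the real run, both with the same copied config file.
pathofact_cmd 10 my_config.yaml -n
pathofact_cmd 10 my_config.yaml
```

The printed lines can then be pasted into a shell in which the `PathoFact` environment is active, after copying `config.yaml` to `my_config.yaml` and editing the copy.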

### Execution on a cluster

The pipeline can be run on a cluster using `slurm`.
The command can be found in the script `cluster.sh` which can also be used to submit the jobs to the cluster.

```bash
sbatch cluster.sh
```

### Test module

To check that the pipeline is installed correctly, you can run the test module.
First add the path to the SignalP v5.0 installation to the config file `test/test_config.yaml`, then run:

```bash
# activate env
conda activate PathoFact
# run the pipeline
snakemake -s test/Snakefile --use-conda --reason --cores 1 -p
```