# PathoFact v1.0

PathoFact is an easy-to-use modular pipeline for the metagenomic analyses of toxins, virulence factors and antimicrobial resistance.
Additionally, PathoFact combines the prediction of these pathogenic factors with the identification of mobile genetic elements.
This provides further depth to the analysis by considering the localization of the genes on mobile genetic elements (MGEs), as well as on the chromosome.
Furthermore, each module (toxins, virulence factors, and antimicrobial resistance) of PathoFact is also a standalone component, making it a flexible and versatile tool.

For further information regarding usage and generated reports, please see the [Documentation](https://git-r3lab.uni.lu/laura.denies/PathoFact/-/wikis/home)

# Requirements and installation

The main requirements are:

- `gcc/g++`
- `git` and `git lfs`
- `conda (version 4.9.2)`

Most other dependencies are either included as a `git submodule` or will be installed automatically by `snakemake` using the `conda` YAML files in `envs/`.
However, some tools need to be installed and/or configured manually.

## Miniconda (conda)

```bash
# install miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh
chmod u+x Miniconda3-py37_4.9.2-Linux-x86_64.sh
./Miniconda3-py37_4.9.2-Linux-x86_64.sh # follow the instructions
```

## PathoFact

Clone the repository and its sub-modules:

```bash
# activate git lfs
git lfs install
# clone branch incl. sub-modules
git clone -b master --recursive https://git-r3lab.uni.lu/laura.denies/PathoFact.git
```

## Pipeline environment

```bash
# create the conda environment
conda env create -f=envs/PathoFact.yaml
```

You can activate and deactivate the environment using `conda activate PathoFact` and `conda deactivate`, respectively.
PathoFact requires a `snakemake version >= 5.5.4`.
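To confirm that the activated environment meets the `snakemake` version requirement, a check along the following lines can be used. This is a minimal sketch and not part of PathoFact: the helper name `version_ge` is hypothetical, and the comparison assumes GNU `sort` with its `-V` (version sort) option.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of PathoFact):
# succeeds if version $1 is greater than or equal to version $2.
# Assumes GNU sort, which provides -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="5.5.4"

# Only query snakemake if it is on the PATH (i.e. the environment is active).
if command -v snakemake >/dev/null 2>&1; then
    installed="$(snakemake --version)"
    if version_ge "$installed" "$required"; then
        echo "snakemake $installed satisfies >= $required"
    else
        echo "snakemake $installed is older than $required" >&2
    fi
else
    echo "snakemake not found; run 'conda activate PathoFact' first" >&2
fi
```

Version sort is used instead of a plain string comparison so that, for example, `5.10.0` is correctly treated as newer than `5.5.4`.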
## Dependency: `SignalP`

Required version: `5.0`

To download the tool you need to submit a request at https://services.healthtech.dtu.dk/:

- Look for "SignalP" and click on the link
- Click on "Downloads"
- Click on the link for your platform for the version "Version 5.0g"
- Fill in and submit the form

After the installation, adjust the path for this tool in `config.yaml` and `test/test_config.yaml` (keyword `signalp`).

# Usage

## Input files

Each sample should have one input file:

- `*.fna`: FASTA file containing nucleotide sequences of the contigs
    - no whitespace in FASTA headers (e.g. to remove whitespace: `sed -i '/^>/ s/ .*//' file.fna`)
    - for prediction of mobile genetic elements and input for prodigal

The following files are generated by PathoFact itself:

- `*.faa`: FASTA file containing translated gene sequences, i.e. amino acid sequences
    - no whitespace in FASTA headers
    - for prediction of toxins, virulence factors and antimicrobial resistance genes
- `*.contig`: TAB-delimited file containing a mapping from contig ID (1st column) to gene ID (2nd column)
    - no header, one gene ID per line
    - contig and gene IDs should be the same as in the FASTA files

The input files of all samples should be located in the same directory.
For each sample, the corresponding input files should have the same basename, e.g. `SAMPLE_A.fna` for sample `SAMPLE_A`.

**NOTE**: For preprocessing and assembly of metagenomic reads we suggest using IMP (https://imp.pages.uni.lu/web/)

## Run PathoFact

### Configuration

To run PathoFact you need to adjust some parameters in `config.yaml`.

- `sample`: A list of sample names, e.g. `sample: ["SAMPLE_A","SAMPLE_B"]`
- `project`: A unique project name which will be used as the name of the output directory in the `datapath` path (see below).
- `datapath`: Path to the directory containing the sample data; the output directory will be created there.
- `workflow`: PathoFact can run the complete pipeline (default) or a specific step:
    - "complete": complete pipeline = toxin + virulence + AMR + MGE prediction
    - "Tox": toxin prediction
    - "Vir": virulence prediction
    - "AMR": antimicrobial resistance (AMR) & mobile genetic elements (MGEs) prediction

### Execution

Basic command to run the pipeline using `<cores>` CPUs:

```bash
# activate the env
conda activate PathoFact
# run the pipeline
# set <cores> to the number of cores to use, e.g. 10
snakemake -s Snakefile --use-conda --reason --cores <cores> -p
```

**NOTE**: Add parameter `-n` (or `--dry-run`) to the command to see which steps will be executed without running them.

**NOTE**: Add `--configfile <configfile>` to use a different config file than `config.yaml`.

**NOTE**: It is advised to run the pipeline using multiple CPUs or on a machine with more memory.

For more options, see the [snakemake documentation](https://snakemake.readthedocs.io/en/stable/index.html).

### Execution on a cluster

The pipeline can be run on a cluster using `slurm`.
The command can be found in the script `cluster.sh`, which can also be used to submit the jobs to the cluster.

```bash
sbatch cluster.sh
```

### Test module

To check that the pipeline is installed correctly, run the test module.
First add the path to the SignalP v5.0 installation to the config file `test/test_config.yaml`, then run:

```bash
# activate env
conda activate PathoFact
# run the pipeline
snakemake -s test/Snakefile --use-conda --reason --cores 1 -p
```
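
### Checking FASTA headers (optional sketch)

Before running the pipeline on your own data, it can help to verify the "no whitespace in FASTA headers" requirement from the "Input files" section. The following is a minimal sketch, not part of PathoFact: the helper name `count_bad_headers` is hypothetical, the file `SAMPLE_A.fna` is a throw-away example created in a temporary directory, and `sed -i` is assumed to be GNU `sed`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of PathoFact): print the number of FASTA
# header lines (lines starting with '>') that contain a space or a tab.
count_bad_headers() {
    awk '/^>/ && /[ \t]/ { n++ } END { print n + 0 }' "$1"
}

# Demo on a throw-away file; SAMPLE_A.fna is an illustrative name only.
tmpdir="$(mktemp -d)"
printf '>contig_1 length=8\nACGTACGT\n>contig_2\nGGCC\n' > "$tmpdir/SAMPLE_A.fna"

echo "headers with whitespace before: $(count_bad_headers "$tmpdir/SAMPLE_A.fna")"

# Same sed expression as in the "Input files" section: on header lines,
# drop everything from the first space onwards (GNU sed in-place edit).
sed -i '/^>/ s/ .*//' "$tmpdir/SAMPLE_A.fna"

echo "headers with whitespace after: $(count_bad_headers "$tmpdir/SAMPLE_A.fna")"
rm -r "$tmpdir"
```

Note that the `sed` expression only truncates headers at the first space, so a header containing a tab but no space would still be reported by the check.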