All config files are stored in the folder `config/`:
- figures workflow:
  - config YAML file to create figures: `fig.yaml`
## Databases
All database files need to be in the same folder (symlinks are also fine), which is defined via `db_dir` in all sample config files.
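For example, the corresponding entry in a sample config file would look like the following sketch (the folder path is a placeholder):

```yaml
# Placeholder path: all database files (or symlinks to them) live in this folder.
db_dir: "/path/to/databases"
```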
# Workflows
1. Raw data workflow: download public datasets (*Note: GDB is not included*)
...
...
If **not** running this step, make sure to provide the correct path in `config/<sample>/config.yaml`.
Main analysis workflow: given SR and LR FASTQ files, run all the steps to generate the required output. This includes (see the config sketch after this list):
- read preprocessing and QC
- assembly and assembly polishing
- mapping reads to assembly (mapping rate and coverage)
- gene calling and annotation
- additional analyses of the assemblies and their annotations
- taxonomic analysis (optional)
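The steps to run are selected via the `steps` list attribute in the sample config (mentioned again in the GDB section below). A rough sketch, in which every step name except "preprocessing" is a placeholder:

```yaml
# Sketch of the `steps` attribute in config/<sample>/config.yaml.
# "preprocessing" is the only step name confirmed in this README;
# the remaining names are placeholders for the steps listed above.
steps:
  - "preprocessing"
  - "assembly"
  - "annotation"
  - "taxonomy"
```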
The workflow is run per sample and might require a couple of days, depending on the sample, the configuration used, and the available computational resources.
Note that the workflow will create additional output files that are not necessarily required to re-create the figures shown in the manuscript.
- config:
  - per sample:
    - `config/<sample>/config.yaml`
      - change all path parameters (not all databases are required, see above)
    - `config/<sample>/sbatch.yaml`
      - change `SMK_ENV` (see the sketch after this list)
      - if not using `slurm` to submit jobs, remove `--cluster-config` and `--cluster` from the `snakemake` CMD
    - `config/<sample>/slurm.yaml` (only relevant if using `slurm` for job submission)
- workflow: `workflow/`
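As an illustration of the `SMK_ENV` change, a hypothetical snippet of `config/<sample>/sbatch.yaml` (the value and the exact file layout are assumptions):

```yaml
# Hypothetical sketch of config/<sample>/sbatch.yaml:
# SMK_ENV names the environment providing snakemake; the value is a placeholder.
SMK_ENV: "snakemake_env"
```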
All database files need to be in the same folder (symlinks are also fine), which is defined via `db_dir` in all sample config files.
Some databases will be downloaded/created by the pipeline and some are not required (see below).
The UniProtKB/TrEMBL database used (in DIAMOND format) needs to be downloaded, and the name of the `*.dmnd` file has to be set in all sample config files in `diamond:db`.
Other database file names/paths can be defined as empty strings or lists (see the sketch after this list):
- `bbmap:rrna_refs` as an empty list (`bbmap:host_refs` can be kept unchanged)
- `hmm:kegg` as an empty string
- `kraken2:db` as an empty string
- `kaiju:db` as an empty string
- `GTDBTK:DATA` as an empty string
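Put together, the database-related part of a sample config might look like the sketch below; the nesting is inferred from the `section:key` notation above, and the `.dmnd` file name is a placeholder:

```yaml
# Sketch of the database-related entries in config/<sample>/config.yaml.
db_dir: "/path/to/databases"  # folder containing all database files (placeholder)
diamond:
  db: "uniprot_trembl.dmnd"   # placeholder name of the downloaded UniProtKB/TrEMBL DIAMOND file
bbmap:
  rrna_refs: []               # empty list if not used
  # host_refs: keep the shipped value unchanged
hmm:
  kegg: ""                    # empty string if not used
kraken2:
  db: ""                      # empty string if not used
kaiju:
  db: ""                      # empty string if not used
GTDBTK:
  DATA: ""                    # empty string if not used
```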
## Workflows
The bash scripts to run the `snakemake` pipelines assume that `slurm` is used to submit the jobs.
They have to be modified if this is not the case.
For other notes, see `README.md`.
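If `slurm` is used, the per-rule resources presumably come from `config/<sample>/slurm.yaml`, passed to `snakemake` via `--cluster-config`; a minimal sketch in snakemake's cluster-config format, with all keys and values under `__default__` being assumptions:

```yaml
# Hypothetical sketch of config/<sample>/slurm.yaml
# (snakemake --cluster-config format; defaults apply to all rules).
__default__:
  partition: "batch"
  time: "0-12:00:00"
  nodes: 1
  ncpus: 4
```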
## GDB
Note that for GDB, the publicly available metaG/metaT sequencing data has already been processed.
This means that the preprocessing step in the main workflow has to be skipped:
- remove "preprocessing" from the list of the attribute `steps` in `config/gdb/config.yaml` (see the sketch after this list)
- the following files need to exist (can also be symlinks):
  - metaG, SR: `R1.proc.fastq.gz` and `R2.proc.fastq.gz` in `YOURPATH/results/preproc/metag/sr/`
  - metaT, SR: `R1.proc.fastq.gz` and `R2.proc.fastq.gz` in `YOURPATH/results/preproc/metat/sr/`
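For example, if the full `steps` list looked like the sketch in the workflow section above, the GDB config would simply omit the "preprocessing" entry (all remaining names are placeholders):

```yaml
# config/gdb/config.yaml (sketch): "preprocessing" removed from `steps`.
steps:
  - "assembly"
  - "annotation"
  - "taxonomy"
```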