Commit 00ba04e1 authored by Valentina Galata

readme: updated readme; added readme for reproducing results (for pub) (issue #127)

parent 76d42320
@@ -41,6 +41,10 @@ All config files are stored in the folder `config/`:
- figures workflow:
- config YAML file to create figures: `fig.yaml`
## Databases
All database files (or symlinks to them) need to be in a single folder, the one set as `db_dir` in all sample config files.
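
Because symlinks are accepted, databases that already exist elsewhere on disk do not have to be copied. The helper below is a minimal sketch of that idea, not part of the pipeline: `link_db` is a hypothetical name, and both paths are supplied by the caller.

```bash
# Sketch (hypothetical helper): link one database file or folder into db_dir.
# Usage: link_db /absolute/path/to/database /path/to/db_dir
link_db() {
    src="$1"
    db_dir="$2"
    # Refuse to create a dangling link if the source does not exist.
    [ -e "$src" ] || { echo "not found: $src" >&2; return 1; }
    mkdir -p "$db_dir"
    # -n avoids descending into an existing link that points to a folder.
    ln -sfn "$src" "$db_dir/$(basename "$src")"
}
```

Pass an absolute source path so the link stays valid regardless of the directory the workflow is started from.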
# Workflows
1. Raw data workflow: download public datasets (*Note: GDB is not included*)
@@ -89,37 +93,29 @@ If **not** running this step make sure to provide the correct path in `config/<s
Main analysis workflow: given SR and LR FASTQ files, run all the steps to generate required output. This includes:
- read preprocessing
- read preprocessing and QC
- assembly and assembly polishing
- assembly mapping (mapping rate and coverage)
- mapping reads to assembly (mapping rate and coverage)
- gene calling and annotation
- additional analyses of the assemblies and their annotations
- taxonomic analysis (optional)
The workflow is run per sample and may take a couple of days, depending on the sample, the configuration used and the available computational resources.
Note that the workflow will create additional output files not necessarily required to re-create the figures shown in the manuscript.
- config:
- per sample
- `config/<sample>/config.yaml`
- change all path parameters
- change all path parameters (not all databases are required, see above)
- `config/<sample>/sbatch.yaml`
- change `SMK_ENV`
- if not using `slurm` to submit jobs remove `--cluster-config`, `--cluster` from the `snakemake` CMD
- `config/<sample>/slurm.yaml` (only relevant if using `slurm` for job submission)
- workflow: `workflow/`
Note that for GDB, the publicly available metaG/metaT sequencing data have already been processed.
To skip the preprocessing step:
- remove "preprocessing" from the list of the attribute `steps` in `config/gdb/config.yaml`
- the following files need to exist (can also be symlinks)
- metaG, SR: `R1.proc.fastq.gz` and `R2.proc.fastq.gz` in `YOURPATH/results/preproc/metag/sr/`
- metaT, SR: `R1.proc.fastq.gz` and `R2.proc.fastq.gz` in `YOURPATH/results/preproc/metat/sr/`
- metaG, LR: `YOURPATH/results/preproc/metag/lr/lr.proc.fastq.gz`
where `YOURPATH` is the set value for `work_dir` in `config/gdb/config.yaml`.
## Report workflow
This workflow creates plots and an HTML report for a sample using the output of the main workflow.
This workflow creates various summary files, plots and an HTML report for a sample using the output of the main workflow.
- config:
- sample configs used for the main workflow
......
# About
Notes for those trying to reproduce the results from the publication.
# Setup
Download
- the code archive from XXX and extract it
- DIAMOND DB from XXX and extract it
- GDB data from XXX
Install `conda` and create the main `snakemake` environment (see `README.md`).
Clone the `OPERA-MS` repository:
```bash
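# in the code directory: clone the repository, then check out the pinned commit
# (`git clone` does not accept a GitHub `/tree/<commit>` URL)
git clone https://github.com/CSB5/OPERA-MS submodules/operams
git -C submodules/operams checkout c18b4f3c933603a7b35d0ea601a80417fe783964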
```
## Databases
All database files (or symlinks to them) need to be in a single folder, the one set as `db_dir` in all sample config files.
Some databases will be downloaded/created by the pipeline and some are not required (see below).
The UniProtKB/TrEMBL database (DIAMOND format) needs to be downloaded, and the name of its `*.dmnd` file has to be set under `diamond:db` in all sample config files.
Other database file names/paths can be set to empty strings or empty lists:
- `bbmap:rrna_refs` as empty list (`bbmap:host_refs` can be kept unchanged)
- `hmm:kegg` as empty string
- `kraken2:db` as empty string
- `kaiju:db` as empty string
- `GTDBTK:DATA` as empty string
## Workflows
The bash scripts to run the `snakemake` pipelines assume that `slurm` is used to submit the jobs.
They have to be modified if this is not the case.
For other notes, see `README.md`.
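
As a rough illustration of the modification needed without `slurm`, the sketch below strips the two cluster-related options from a `snakemake` command line held in a string. `strip_cluster_opts` is a hypothetical helper, and it assumes each option is followed by a single value (bare or double-quoted); the actual commands in the run scripts may be laid out differently and may need manual editing.

```bash
# Sketch (hypothetical helper): remove `--cluster-config <file>` and
# `--cluster <value>` from a snakemake command line given as one string.
# Assumes each option takes exactly one value, bare or in double quotes.
strip_cluster_opts() {
    printf '%s\n' "$1" | sed -E \
        -e 's/--cluster-config[ =][^ ]+ ?//' \
        -e 's/--cluster[ =]("[^"]*"|[^ ]+) ?//'
}
```

All other options are passed through unchanged, so the result can be compared with the original command before it is used.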
## GDB
Note that for GDB, the publicly available metaG/metaT sequencing data have already been processed.
This means that the preprocessing step in the main workflow has to be skipped:
- remove "preprocessing" from the list of the attribute `steps` in `config/gdb/config.yaml`
- the following files need to exist (can also be symlinks)
- metaG, SR: `R1.proc.fastq.gz` and `R2.proc.fastq.gz` in `YOURPATH/results/preproc/metag/sr/`
- metaT, SR: `R1.proc.fastq.gz` and `R2.proc.fastq.gz` in `YOURPATH/results/preproc/metat/sr/`
- metaG, LR: `YOURPATH/results/preproc/metag/lr/lr.proc.fastq.gz`
where `YOURPATH` is the value set for `work_dir` in `config/gdb/config.yaml`.
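
Before starting the main workflow with preprocessing skipped, it can help to confirm that all five files listed above are in place. The function below is a minimal sketch of such a check; `check_gdb_inputs` is a hypothetical name, and its argument is the `work_dir` value from `config/gdb/config.yaml`.

```bash
# Sketch (hypothetical helper): report expected preprocessed GDB inputs that
# are missing under <work_dir>/results/preproc; returns non-zero if any is.
check_gdb_inputs() {
    work_dir="$1"
    missing=0
    for rel in \
        metag/sr/R1.proc.fastq.gz \
        metag/sr/R2.proc.fastq.gz \
        metat/sr/R1.proc.fastq.gz \
        metat/sr/R2.proc.fastq.gz \
        metag/lr/lr.proc.fastq.gz
    do
        # -e follows symlinks, so a dangling link counts as missing.
        if [ ! -e "$work_dir/results/preproc/$rel" ]; then
            echo "missing: $work_dir/results/preproc/$rel" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_gdb_inputs YOURPATH` prints one line per missing file and succeeds silently when everything is present.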