Commit 90a677ba authored by Valentina Galata

updated readme (issue #127)

parent a860acc0
......@@ -28,25 +28,26 @@ conda env create -f requirements.yml -n "YourEnvName"
## Configs
All config files are stored in the folder `config/`: sub-folders contain files for all the samples and/or pipeline steps.
- samples: `gdb`, `nwc`, `rumen`, `zymo`
- pipeline steps: `rawdata`
*TODO: remove aquifer and gcall*
The sub-folders contain:
- config YAML file(s) (`config(.substep)?.yaml`) for a `Snakemake` workflow
- `slurm` config YAML file(s) (`slurm(.substep)?.yaml`) defining job submission parameters for a `Snakemake` workflow
- bash script(s) to execute a `Snakemake` workflow (`sbatch(.substep)?.sh`)
All config files are stored in the folder `config/`:
- raw data workflow: `rawdata/`
- does not include GDB
- FAST5 and main workflow, per sample: Zymo (`zymo/`), NWC (`nwc/`), GDB (`gdb/`) and Rumen (`rumen/`), each including
- config YAML file (`config*.yaml`) for a `Snakemake` workflow
- `slurm` config YAML file (`slurm*.yaml`) for a `Snakemake` workflow
- bash script (`sbatch*.sh`) to execute a `Snakemake` workflow
- report workflow:
- bash script to create sample reports: `reports.sh`
- figures workflow:
- config YAML file to create figures: `fig.yaml`
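The layout described above can be checked before starting a workflow. The following is a minimal sketch, not part of the repository: the function name `check_sample_configs` is hypothetical, and it assumes that every sample folder is expected to hold at least one file for each of the three patterns.

```bash
# Hypothetical helper: report which expected config files are missing.
# Sample folder names and file patterns are taken from the list above.
# The function body runs in a subshell, so its variables stay local.
check_sample_configs() (
    config_dir="$1"   # path to the config/ folder
    missing=0
    for sample in zymo nwc gdb rumen; do
        for pattern in 'config*.yaml' 'slurm*.yaml' 'sbatch*.sh'; do
            # The pattern is left unquoted on purpose so that the shell expands it;
            # if nothing matches, ls receives the literal pattern and fails.
            if ! ls "${config_dir}/${sample}"/${pattern} > /dev/null 2>&1; then
                echo "missing: ${config_dir}/${sample}/${pattern}"
                missing=$((missing + 1))
            fi
        done
    done
    # Exit status is 0 only if nothing is missing.
    [ "$missing" -eq 0 ]
)
```

Calling `check_sample_configs config` from the repository root would print one line per missing file pattern and return a non-zero status if any are missing.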
# Workflows
1. Download public datasets (*TODO: what about GDB?*)
2. Run the FAST5 workflow (per sample)
3. Run the main analysis workflow (per sample)
4. Create reports (per sample)
5. Create figures for the paper
1. Raw data workflow: download public datasets (*Note: GDB is not included*)
2. FAST5 workflow (per sample): process FAST5 files
3. Main workflow (per sample): data analysis
4. Report workflow (per sample): create sample reports
5. Figures workflow: create figures for the paper
Relevant parameters which have to be changed are listed for each workflow and config file.
Parameters defining system-relevant settings are not listed but should also be changed if required, e.g. the number of threads used by certain tools.
......@@ -55,8 +56,6 @@ Parameters defining system-relevant settings are not listed but should be also b
Download raw data required for the analysis.
*TODO: remove aquifer*
- config: `config/rawdata/`
- `config.yaml`:
- change `work_dir`
......@@ -73,6 +72,7 @@ Process raw FAST5 files of a sample
- do basecalling
This step is **not** required if the long-read FASTQ file is already available.
If **not** running this step, make sure to provide the correct path in `config/<sample>/config.yaml` for the attribute `data:metag:ont:fastq`.
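Before skipping the FAST5 workflow, it may help to confirm that the FASTQ file to be set for `data:metag:ont:fastq` is usable. The sketch below is an illustration only: the function name `check_lr_fastq` is hypothetical, and the gzip check assumes that a compressed file carries a `.gz` suffix.

```bash
# Hypothetical helper: sanity-check a long-read FASTQ file.
# The function body runs in a subshell, so "exit" only leaves the function.
check_lr_fastq() (
    fastq="$1"
    # -s is true only for an existing file with a size greater than zero.
    if [ ! -s "$fastq" ]; then
        echo "not found or empty: $fastq" >&2
        exit 1
    fi
    case "$fastq" in
        *.gz)
            # gzip -t tests the integrity of the compressed file without extracting it.
            gzip -t "$fastq" || { echo "corrupt gzip file: $fastq" >&2; exit 1; }
            ;;
    esac
    echo "OK: $fastq"
)
```

The function only checks that the file exists, is not empty and, if compressed, passes the gzip integrity test; it does not validate the FASTQ records themselves.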
- config:
- per sample
......@@ -87,14 +87,15 @@ This step is **not** required if the long-read FASTQ file is already available.
## Main workflow
Main analysis workflow: given SR and LR FASTQ files, run all the steps to generate required output.
This includes:
Main analysis workflow: given SR and LR FASTQ files, run all the steps to generate required output. This includes:
- read preprocessing
- assembly and assembly polishing
- assembly mapping (mapping rate and coverage)
- gene calling and annotation
- additional analysis of the assemblies and their genes/proteins
- additional analyses of the assemblies and their annotations
The workflow is run per sample and might require a couple of days to run depending on the sample and used configuration.
The workflow is run per sample and might require a couple of days to run depending on the sample, used configuration and available computational resources.
- config:
- per sample
......@@ -106,22 +107,32 @@ The workflow is run per sample and might require a couple of days to run dependi
- `config/<sample>/slurm.yaml` (only relevant if using `slurm` for job submission)
- workflow: `workflow/`
*TODO: remove unused parameters, e.g. canu, gtdbtk etc.*
Note that for GDB, the publicly available metaG/metaT sequencing data has already been processed.
To skip the preprocessing step:
- remove "preprocessing" from the list of the attribute `steps` in `config/gdb/config.yaml`
- make sure the following files exist (they can also be symlinks):
- metaG, SR: `R1.proc.fastq.gz` and `R2.proc.fastq.gz` in `YOURPATH/results/preproc/metag/sr/`
- metaT, SR: `R1.proc.fastq.gz` and `R2.proc.fastq.gz` in `YOURPATH/results/preproc/metat/sr/`
- metaG, LR: `YOURPATH/results/preproc/metag/lr/lr.proc.fastq.gz`
where `YOURPATH` is the value set for `work_dir` in `config/gdb/config.yaml`.
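The expected files can be put in place with symlinks, for example as sketched below. This is an illustration under assumptions: the function name `link_gdb_preproc` and the source file names (`metag.sr.R1.fastq.gz` etc.) are hypothetical and have to be adapted to wherever the processed GDB data is stored; only the target paths are taken from the list above.

```bash
# Hypothetical helper: symlink already processed GDB reads into the
# locations expected by the main workflow.
# The function body runs in a subshell, so "exit" only leaves the function.
link_gdb_preproc() (
    src="$1"        # folder with the processed FASTQ files (file names below are assumed)
    work_dir="$2"   # value of work_dir in config/gdb/config.yaml (YOURPATH)
    # Symlinks to relative paths would dangle, so require an absolute source path.
    case "$src" in
        /*) ;;
        *) echo "source folder must be an absolute path: $src" >&2; exit 1 ;;
    esac
    pre="${work_dir}/results/preproc"
    mkdir -p "${pre}/metag/sr" "${pre}/metat/sr" "${pre}/metag/lr" || exit 1
    # Short reads: metaG and metaT, both mates.
    for omic in metag metat; do
        for mate in R1 R2; do
            ln -sf "${src}/${omic}.sr.${mate}.fastq.gz" "${pre}/${omic}/sr/${mate}.proc.fastq.gz"
        done
    done
    # Long reads: metaG only.
    ln -sf "${src}/metag.lr.fastq.gz" "${pre}/metag/lr/lr.proc.fastq.gz"
    # -e follows symlinks, so a link to a missing source file is reported here.
    status=0
    for f in "${pre}/metag/sr/R1.proc.fastq.gz" "${pre}/metag/sr/R2.proc.fastq.gz" \
             "${pre}/metat/sr/R1.proc.fastq.gz" "${pre}/metat/sr/R2.proc.fastq.gz" \
             "${pre}/metag/lr/lr.proc.fastq.gz"; do
        [ -e "$f" ] || { echo "broken or missing: $f" >&2; status=1; }
    done
    exit "$status"
)
```

A call would look like `link_gdb_preproc /path/to/processed/gdb YOURPATH`, with both paths replaced by the actual locations.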
## Report workflow
This workflow creates plots and an HTML report for a sample using the output of the main workflow.
*TODO*
- config:
- sample configs used for the main workflow
- workflow: `workflow_report/`
To execute this workflow for all samples:
```bash
./config/reports.sh "YourEnvName" "WhereToCreateCondEnvs"
```
## Creating figures for the paper
## Figures workflow
Re-create figures used in the manuscript.
Re-create figures (and tables) used in the manuscript.
This workflow should only be run after the main workflow and the report workflow have been run for all samples.
- config: `config/fig.yaml`
......