# About
Comparing genome and gene reconstruction when using short reads (SR, Illumina) only, long reads (LR, Oxford Nanopore Technology, ONT) only, and a hybrid approach (Hy).
# Setup
```bash
git clone --recurse-submodules https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab
```
[About git submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules)
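If the repository was cloned without `--recurse-submodules`, the submodules can be fetched afterwards, e.g.:
```bash
# initialise and fetch all submodules of an already cloned repository
git submodule update --init --recursive
```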
## Conda
*TODO: other dependencies?*
[Conda user guide](https://docs.conda.io/projects/conda/en/latest/user-guide/index.html)
```bash
# install miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod u+x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh # follow the instructions
```
Create the main `snakemake` environment:
```bash
# create the environment from the requirements file
conda env create -f requirements.yml -n "YourEnvName"
```
# Analysis
## On LCSB HPC server `iris`
Main `conda` environment: `/scratch/users/vgalata/miniconda3/ONT_pilot`
Example commands for `GDB`:
```bash
# activate the main conda environment
conda activate /scratch/users/vgalata/miniconda3/ONT_pilot
# analysis: check which rules will be executed (dry run)
snakemake -s workflow/Snakefile --configfile config/GDB/config.yaml --use-conda --conda-prefix ${CONDA_PREFIX}/pipeline --cores 1 -rpn
# submit jobs using slurm
./config/GDB/sbatch.sh
# report: check which rules will be executed (dry run)
snakemake -s workflow_report/Snakefile --configfile config/GDB/config.yaml --use-conda --conda-prefix ${CONDA_PREFIX}/pipeline --cores 1 -rpn
```
### `conda` environment
If the path specified above does not exist, you can create the environment from `requirements.yaml` and replace the environment path in the `sbatch.sh` files with `"ONT_pilot"`:
```bash
# will create the env. ONT_pilot
conda env create -f requirements.yaml
```
## Configs
All config files are stored in the folder `config/`: the sub-folders contain files for the individual samples and/or pipeline steps (see the example layout below).
- samples: `gdb`, `nwc`, `rumen`, `zymo`
- pipeline steps: `rawdata`
*TODO: remove aquifer and gcall*
The sub-folders contain:
- config YAML file(s) (`config(.substep)?.yaml`) for a `Snakemake` workflow
- `slurm` config YAML file(s) (`slurm(.substep)?.yaml`) defining job submission parameters for a `Snakemake` workflow
- bash script(s) to execute a `Snakemake` workflow (`sbatch(.substep)?.sh`)
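For illustration, the layout for one sample and the `rawdata` step could look roughly as follows (a sketch based on the naming scheme above; the exact set of files per sub-folder may differ):
```
config/
├── <sample>/              # e.g. GDB
│   ├── config.yaml        # main workflow config
│   ├── config.fast5.yaml  # FAST5 workflow config
│   ├── slurm.yaml         # slurm job parameters (main workflow)
│   ├── slurm.fast5.yaml   # slurm job parameters (FAST5 workflow)
│   ├── sbatch.sh          # submission script (main workflow)
│   └── sbatch.fast5.sh    # submission script (FAST5 workflow)
└── rawdata/
    ├── config.yaml
    ├── slurm.yaml
    └── sbatch.sh
```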
# Workflows
1. Download public datasets (*TODO: what about GDB?*)
2. Run the FAST5 workflow (per sample)
3. Run the main analysis workflow (per sample)
4. Create reports (per sample)
5. Create figures for the paper
Relevant parameters which have to be changed are listed below for each workflow and config file.
Parameters defining system-relevant settings, e.g. the number of threads used by certain tools, are not listed but should also be adjusted if required.
## Raw data workflow
Download raw data required for the analysis.
*TODO: remove aquifer*
- config: `config/rawdata/`
- `config.yaml`:
- change `work_dir`
- `sbatch.sh`
- change `SMK_ENV`
- if not using `slurm` to submit jobs, remove `--cluster-config` and `--cluster` from the `snakemake` command (a local run is sketched after this list)
- `slurm.yaml` (only relevant if using `slurm` for job submission)
- workflow: `workflow_rawdata/`
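For example, a local (non-`slurm`) dry run of the raw data workflow could look roughly like this; the `Snakefile` location inside `workflow_rawdata/` and the core count are assumptions:
```bash
# dry run of the raw data workflow without slurm (drop -n to actually execute)
snakemake -s workflow_rawdata/Snakefile --configfile config/rawdata/config.yaml --use-conda --conda-prefix ${CONDA_PREFIX}/pipeline --cores 4 -rpn
```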
## FAST5 workflow
Process the raw FAST5 files of a sample:
- create a multi-FAST5 file from single-FAST5 files
- do basecalling
This step is **not** required if the long-read FASTQ file is already available.
- config:
- per sample
- `config/<sample>/config.fast5.yaml`
- change `work_dir`, `single_fast5_dir`, `multi_fast5_dir`, `basecalling_dir`
- change `guppy:gpu:path` and `guppy:gpu:bin`
- `config/<sample>/sbatch.fast5.sh`
- change `SMK_ENV`
- if not using `slurm` to submit jobs, remove `--cluster-config` and `--cluster` from the `snakemake` command (see the example invocation after this list)
- `config/<sample>/slurm.fast5.yaml` (only relevant if using `slurm` for job submission)
- workflow: `workflow_fast5/`
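Analogously to the `GDB` example above, a dry run of the FAST5 workflow for one sample might look like this (the `Snakefile` path inside `workflow_fast5/` is an assumption):
```bash
# dry run of the FAST5 workflow for one sample (drop -n to actually execute)
snakemake -s workflow_fast5/Snakefile --configfile config/<sample>/config.fast5.yaml --use-conda --conda-prefix ${CONDA_PREFIX}/pipeline --cores 1 -rpn
```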
## Main workflow
Main analysis workflow: given the SR and LR FASTQ files, run all the steps to generate the required output.
This includes:
- read preprocessing
- assembly and assembly polishing
- gene calling and annotation
- additional analysis of the assemblies and their genes/proteins
The workflow is run per sample and might require a couple of days to complete, depending on the sample and the configuration used.
- config:
- per sample
- `config/<sample>/config.yaml`
- change all path parameters
- `config/<sample>/sbatch.sh`
- change `SMK_ENV`
- if not using `slurm` to submit jobs, remove `--cluster-config` and `--cluster` from the `snakemake` command (a `slurm`-based submission is sketched below)
- `config/<sample>/slurm.yaml` (only relevant if using `slurm` for job submission)
- workflow: `workflow/`
*TODO: remove unused parameters, e.g. canu, gtdbtk*
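As a rough sketch, a `slurm`-based submission as done by the `sbatch.sh` scripts combines the `snakemake` call with `--cluster-config` and `--cluster`; the `sbatch` options and the `{cluster.*}` keys below are assumptions and must match the corresponding `slurm.yaml`:
```bash
# sketch of a slurm submission of the main workflow (cluster keys are assumed)
snakemake -s workflow/Snakefile --configfile config/<sample>/config.yaml \
  --use-conda --conda-prefix ${CONDA_PREFIX}/pipeline --cores 28 -rp \
  --cluster-config config/<sample>/slurm.yaml \
  --cluster "sbatch -p {cluster.partition} -t {cluster.time} -N {cluster.nodes} -c {cluster.ncpus}"
```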
## Report workflow
This workflow creates plots and an HTML report for a sample using the output of the main workflow.
*TODO*
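Following the pattern of the `GDB` example above, a per-sample dry run of the report workflow would presumably be:
```bash
# dry run of the report workflow for one sample (drop -n to actually execute)
snakemake -s workflow_report/Snakefile --configfile config/<sample>/config.yaml --use-conda --conda-prefix ${CONDA_PREFIX}/pipeline --cores 1 -rpn
```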
## Creating figures for the paper
*TODO*