# Bioinformatics workflows with snakemake and conda

Author: Sarah Peter

In this tutorial you will learn how to run a [ChIP-seq](https://en.wikipedia.org/wiki/ChIP-sequencing) analysis with the [snakemake workflow engine](https://snakemake.readthedocs.io/en/stable/) on the cluster.

**Disclaimer:** In order to keep this tutorial simple, we use default parameters for the different tools as much as possible. However, for a real analysis you should always adapt the parameters to your dataset. Also, be aware that the results of some steps might be skewed, because we only work on data from one chromosome.

## Table of contents

1. [Set up the environment](#env)
2. [Create snakemake workflow](#snakemake)
   1. [Mapping](#mapping)
   2. [Peak calling](#peaks)
   3. [Generate bigWig files for visualisation](#bigwig)
   4. [Summary rule](#summary)
3. [Cluster configuration for snakemake](#cluster)
   1. [Adjust mapping step to run on multiple threads](#multithreading)
   2. [Configure job parameters with `cluster.yaml`](#job_params)
   3. [Run snakemake with cluster configuration](#cluster_config)
4. [Inspect results in IGV](#igv)
5. [(Optional) Immediately submit all jobs](#immediate_submit)
6. [Useful stuff](#useful)
7. [References](#references)
8. [Acknowledgements](#acknowledgements)

<a name="env"></a>

## Set up the environment

For this tutorial we will use the [`conda` package manager](https://www.anaconda.com/) to install the required tools. 

> Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
>
> Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.
> 
> &mdash; <cite>[Conda  manual](https://docs.conda.io/en/latest/index.html)</cite>

It can encapsulate software and packages in environments, so you can have multiple versions of the same software installed at the same time and avoid incompatibilities between different tools. It also has functionality to easily port and replicate environments, which is important to ensure reproducibility of analyses.

**Attention when dealing with sensitive data:** Anyone can easily contribute installation recipes to the bioconda channel, without identity verification or review by the community. It is therefore possible to insert malicious software. If you use bioconda when processing sensitive data, you should check the recipes to verify that they install the software from the official sources.

We will use conda on two levels in this tutorial. First we use a conda environment to install and run snakemake. Second, inside the snakemake workflow we will define separate conda environments for each step.

1. Connect to the cluster

2. Start an interactive job:

   ```bash
   # `si` is the ULHPC shortcut for requesting an interactive job
   (access)$> si
   ```
   
3. Install conda:

   ```bash
   (node)$> mkdir -p $SCRATCH/downloads
   (node)$> cd $SCRATCH/downloads
   (node)$> wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
   (node)$> chmod u+x Miniconda3-latest-Linux-x86_64.sh
   (node)$> ./Miniconda3-latest-Linux-x86_64.sh 
   ```
   You need to specify your installation destination, e.g. `/home/users/<your_username>/tools/miniconda3`. You must use the **full** path and can**not** use `$HOME/tools/miniconda3`. Answer `yes` to initialize Miniconda3. 

   The installation will modify your `.bashrc` to make conda directly available after each login. To activate the changes now, run

   ```bash
   (node)$> source ~/.bashrc
   ```

   Update conda to the latest version:
   ```bash
   (node)$> conda update conda 
   ```

4. Create a new conda environment and activate it:

   ```bash
   (node)$> conda create -n bioinfo_tutorial
   (node)$> conda activate bioinfo_tutorial
   ```
   Once the environment has been created and activated, you will see that your prompt is now prefixed with `(bioinfo_tutorial)`, showing which environment is active. For the rest of the tutorial, make sure that you always have this environment active.

5. Install snakemake:

   ```bash
   (bioinfo_tutorial) (node)$> conda install -c bioconda -c conda-forge snakemake-minimal
   ```
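
   You can check that the installation worked (a quick sanity check; the exact version printed may differ):

   ```bash
   (bioinfo_tutorial) (node)$> snakemake --version
   ```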

   

<a name="snakemake"></a>

## Create snakemake workflow

> The Snakemake workflow management system is a tool to create **reproducible and scalable** data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.
> 
> &mdash; <cite>[Snakemake manual](https://snakemake.readthedocs.io/en/stable/index.html)</cite>

Snakemake is a very useful tool if you need to combine multiple steps using different software into a coherent workflow. It comes with many features desired for running workflows, like 

* ensuring all input and result files are present
* restarting at a failed step
* rerunning (parts of a) pipeline when (some of) the input changed
* support for wildcards to apply a step to a set of files
* automatically parallelising where possible
* software management
* collecting benchmark data
* modularisation
* creating launcher scripts and submitting jobs to the cluster
* creating a visualisation of the workflow steps (see below)

In this tutorial we will analyse [ChIP-seq data](https://www.ebi.ac.uk/ena/data/view/PRJEB20933) from the paper [Gérard D, Schmidt F, Ginolhac A, Schmitz M, Halder R, Ebert P, Schulz MH, Sauter T, Sinkkonen L. Temporal enhancer profiling of parallel lineages identifies AHR and GLIS1 as regulators of mesenchymal multipotency. *Nucleic Acids Research*, Volume 47, Issue 3, 20 February 2019, Pages 1141–1163, https://doi.org/10.1093/nar/gky1240](https://www.ncbi.nlm.nih.gov/pubmed/30544251) published by our colleagues at LSRU.

We will set up the following workflow:

![DAG](img/dag.png)

To speed up computing time we use source files that only contain sequencing reads that map to chromosome 12. The files for input (control) and H3K4me3 (ChIP) are available on the cluster in the directory `/work/projects/ulhpc-tutorials/bio/snakemake/chip-seq` and the corresponding reference in `/work/projects/ulhpc-tutorials/bio/snakemake/reference`.

We have also already prepared the conda environments for each step in the workflow in `/work/projects/ulhpc-tutorials/bio/snakemake/envs`.

Create a working directory and link the necessary data:

```bash
(node)$> cd $SCRATCH
(node)$> mkdir bioinfo_tutorial
(node)$> cd bioinfo_tutorial
(node)$> ln -s /work/projects/ulhpc-tutorials/bio/snakemake/chip-seq .
(node)$> ln -s /work/projects/ulhpc-tutorials/bio/snakemake/reference .
(node)$> ln -s /work/projects/ulhpc-tutorials/bio/snakemake/envs .
```

<a name="mapping"></a>

### Mapping

> In Snakemake, workflows are specified as Snakefiles. Inspired by GNU Make, a Snakefile contains rules that denote how to create output files from input files. Dependencies between rules are handled implicitly, by matching filenames of input files against output files. Thereby wildcards can be used to write general rules.
> 
> &mdash; <cite>[Snakemake manual - Writing Workflows](https://snakemake.readthedocs.io/en/stable/snakefiles/writing_snakefiles.html)</cite>

> Most importantly, a rule can consist of a name (the name is optional and can be left out, creating an anonymous rule), input files, output files, and a shell command to generate the output from the input, i.e. 
> ```python
> rule NAME:
>     input: "path/to/inputfile", "path/to/other/inputfile"
>     output: "path/to/outputfile", "path/to/another/outputfile"
>     shell: "somecommand {input} {output}"
> ```
> &mdash; <cite>[Snakemake manual - Rules](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html)</cite>

For a detailed explanation of rules, have a look at the official [Snakemake tutorial](https://snakemake.readthedocs.io/en/stable/tutorial/basics.html).

A basic rule for mapping a fastq file with [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) could look like this:

```python
rule mapping:
  input: "chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz"
  output: "bowtie2/H3K4-TC1-ST2-D0.12.bam"
  shell:
    """
    bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
    samtools sort - > {output}
    samtools index {output}
    """
```

Since we have two fastq files to map, we should generalise the rule with wildcards:

```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
  output: "bowtie2/{sample}.bam"
  shell:
    """
    bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
    samtools sort - > {output}
    samtools index {output}
    """
```
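
For illustration, if you later request the concrete target `bowtie2/H3K4-TC1-ST2-D0.12.bam`, snakemake matches it against the `output` pattern and fills in the wildcard accordingly (a sketch of the substitution, not literal snakemake output):

```python
# requested target:  bowtie2/H3K4-TC1-ST2-D0.12.bam
# matched pattern:   bowtie2/{sample}.bam  =>  sample = "H3K4-TC1-ST2-D0.12"
# derived input:     chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz
```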

Now we need to tell snakemake to use a conda environment with bowtie2 and [samtools](http://www.htslib.org/) inside to run this rule. For this purpose there is a specific `conda` directive that can be added to the rule. It accepts a [yaml](https://yaml.org/spec/1.2/spec.html) file that defines the conda environment. 

```python
conda: "envs/bowtie2.yaml"
```

You can easily export existing conda environments to a yaml file with `conda env export` or write the yaml from scratch. For this step the yaml file looks like this:

```yaml
name: bowtie2
channels:
  - bioconda
dependencies:
  - bowtie2
  - samtools
```
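
If you prefer not to write such files by hand, an equivalent yaml can be generated from an existing environment (a quick sketch; the exported file will additionally pin exact package versions and builds):

```bash
# export the active environment to a yaml file
(node)$> conda env export -n bioinfo_tutorial > bioinfo_tutorial.yaml
# recreate the environment elsewhere from that file
(node)$> conda env create -f bioinfo_tutorial.yaml
```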

We also use the `params` directive to define the path to the reference and declutter the command-line call, as well as the `log` directive to define a path to permanently store the output of the execution. This is especially useful in this step to store the bowtie2 mapping statistics, which are just written to the command-line (stderr) otherwise.

To track resource usage we add the `benchmark` directive, which will write performance measures to a tsv file.

Create a file called `Snakefile` in the current directory and open it in your favourite editor, e.g.

```bash
(node)$> nano Snakefile
```

Add the final rule for the mapping:

```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
  output: "bowtie2/{sample}.bam"
  params:
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
  log: "logs/bowtie2_{sample}.log"
  benchmark: "benchmarks/mapping/{sample}.tsv"
  conda: "envs/bowtie2.yaml"
  shell:
    """
    bowtie2 -x {params.idx} -U {input} 2> {log} | \
    samtools sort - > {output}
    samtools index {output}
    """
```

You can test the rule by specifying one of the potential outputs. We first do a dry-run with the option `-n` (combined with `-p` to print the shell commands and `-r` to show the reason for each executed rule).

```bash
(node)$> snakemake -npr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
```

If everything is fine we can run the rule to create the file `bowtie2/INPUT-TC1-ST2-D0.12.bam`:

```bash
(node)$> snakemake -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
```

The first run will take a bit longer, because snakemake creates the conda environment. In subsequent runs it will just activate the existing environment. However, it will recognise if the yaml file changes and then recreate the environment.

Check the mapping statistics and the benchmark report:

```bash
(node)$> cat logs/bowtie2_INPUT-TC1-ST2-D0.12.log
400193 reads; of these:
  400193 (100.00%) were unpaired; of these:
    1669 (0.42%) aligned 0 times
    379290 (94.78%) aligned exactly 1 time
    19234 (4.81%) aligned >1 times
99.58% overall alignment rate

(node)$> cat benchmarks/mapping/INPUT-TC1-ST2-D0.12.tsv
s       h:m:s   max_rss max_vms max_uss max_pss io_in io_out mean_load
19.1737 0:00:19 262.14  1404.55 258.79  258.94  0.00  0.00   0.00
```

After this step your working directory should contain the following files (using the `tree` command):

```
.
|-- benchmarks
|   `-- mapping
|       `-- INPUT-TC1-ST2-D0.12.tsv
|-- bowtie2
|   |-- INPUT-TC1-ST2-D0.12.bam
|   `-- INPUT-TC1-ST2-D0.12.bam.bai
|-- chip-seq -> /work/projects/ulhpc-tutorials/bio/snakemake/chip-seq
|-- envs -> /work/projects/ulhpc-tutorials/bio/snakemake/envs
|-- logs
|   `-- bowtie2_INPUT-TC1-ST2-D0.12.log
|-- reference -> /work/projects/ulhpc-tutorials/bio/snakemake/reference/
|-- Snakefile
```

<a name="peaks"></a>

### Peak calling

The next step in the workflow is to call peaks with [`MACS2`](https://github.com/taoliu/MACS). This tells us where there is enrichment of the ChIP versus the input (control). 

You should always choose the peak caller based on how you expect your enriched regions to look, e.g. narrow or broad peaks.

Besides the list of peaks in [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) format, `MACS2` also produces coverage tracks.

Add the following rule to your `Snakefile`:

```python
rule peak_calling:
  input:
    control = "bowtie2/INPUT-{sample}.bam",
    chip = "bowtie2/H3K4-{sample}.bam"
  output:
    peaks = "output/{sample}_peaks.narrowPeak",
    control_bdg = "macs2/{sample}_control_lambda.bdg",
    chip_bdg = "macs2/{sample}_treat_pileup.bdg"
  conda: "envs/macs2.yaml"
  shell:
    """
    macs2 callpeak -t {input.chip} -c {input.control} -f BAM -g mm -n {wildcards.sample} -B -q 0.01 --outdir macs2

    cp macs2/{wildcards.sample}_peaks.narrowPeak {output.peaks}
    """
```

The conda environment `envs/macs2.yaml` for this step is:

```yaml
name: macs2
channels:
  - bioconda
dependencies:
  - macs2
```

Let's run this step with:

```bash
(node)$> snakemake -pr --use-conda output/TC1-ST2-D0.12_peaks.narrowPeak
```

Note that snakemake will not run the mapping step for `bowtie2/INPUT-TC1-ST2-D0.12.bam` again. It only runs rules for which the output is not present or the input has changed.

After this step your working directory should contain the following files:

```
.
|-- benchmarks
|   `-- mapping
|       |-- H3K4-TC1-ST2-D0.12.tsv
|       `-- INPUT-TC1-ST2-D0.12.tsv
|-- bowtie2
|   |-- H3K4-TC1-ST2-D0.12.bam
|   |-- H3K4-TC1-ST2-D0.12.bam.bai
|   |-- INPUT-TC1-ST2-D0.12.bam
|   `-- INPUT-TC1-ST2-D0.12.bam.bai
|-- chip-seq -> /work/projects/ulhpc-tutorials/bio/snakemake/chip-seq
|-- envs -> /work/projects/ulhpc-tutorials/bio/snakemake/envs
|-- logs
|   |-- bowtie2_H3K4-TC1-ST2-D0.12.log
|   `-- bowtie2_INPUT-TC1-ST2-D0.12.log
|-- macs2
|   |-- TC1-ST2-D0.12_control_lambda.bdg
|   |-- TC1-ST2-D0.12_model.r
|   |-- TC1-ST2-D0.12_peaks.narrowPeak
|   |-- TC1-ST2-D0.12_peaks.xls
|   |-- TC1-ST2-D0.12_summits.bed
|   `-- TC1-ST2-D0.12_treat_pileup.bdg
|-- output
|   `-- TC1-ST2-D0.12_peaks.narrowPeak
|-- reference -> /work/projects/ulhpc-tutorials/bio/snakemake/reference/
|-- Snakefile
```

<a name="bigwig"></a>

### Generate bigWig files for visualisation

For easier visualisation and faster transfer, we convert the two coverage tracks from the `MACS2` output to [bigWig](https://genome.ucsc.edu/goldenpath/help/bigWig.html) format.

Add the following rule to your `Snakefile`:

```python
rule bigwig:
  input: "macs2/{sample}.bdg"
  output: "output/{sample}.bigwig"
  params:
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai"
  conda: "envs/ucsc.yaml"
  shell:
    """
    bedGraphToBigWig {input} {params.idx} {output}
    """
```

The conda environment `envs/ucsc.yaml` for this step is:

```yaml
name: ucsc
channels:
  - bioconda
dependencies:
  - ucsc-bedgraphtobigwig
```

Let's test this step with:

```bash
(node)$> snakemake -pr --use-conda output/TC1-ST2-D0.12_treat_pileup.bigwig
```

This time snakemake will only run the "bigwig" rule for the one file we specified. Note that in this rule the `{sample}` wildcard matches the whole basename of the bedGraph file, e.g. `TC1-ST2-D0.12_treat_pileup`.

After this step the `output` directory should contain the following files:

```
TC1-ST2-D0.12_peaks.narrowPeak
TC1-ST2-D0.12_treat_pileup.bigwig
```

<a name="summary"></a>

### Summary rule

To avoid having to specify the desired output files on the command-line every time, we add one rule with just inputs that lists the result files we want in the end. Since by default snakemake executes the first rule in the Snakefile, we add this rule at the top, and then we don't need to specify anything additional on the command-line.

First, at the **very top** of the Snakefile, define a variable for the name of the sample:

```python
SAMPLE = "TC1-ST2-D0.12"
```

This makes it easier to change the Snakefile and apply it to other datasets. Snakemake is based on Python, so we can use Python code inside the Snakefile. We will use [f-Strings](https://docs.python.org/3.6/reference/lexical_analysis.html#f-strings) to include the variable in the file names, e.g. `f"output/{SAMPLE}_peaks.narrowPeak"` evaluates to `output/TC1-ST2-D0.12_peaks.narrowPeak`.

Add this rule at the top of the `Snakefile` after the line above:

```python
rule all:
  input: f"output/{SAMPLE}_peaks.narrowPeak", f"output/{SAMPLE}_control_lambda.bigwig", f"output/{SAMPLE}_treat_pileup.bigwig"
```
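
If you later extend the workflow to several samples, snakemake's built-in `expand()` helper generates such target lists for you. A sketch (any additional sample names are hypothetical):

```python
SAMPLES = ["TC1-ST2-D0.12"]  # add further sample names here

rule all:
  input:
    expand("output/{sample}_peaks.narrowPeak", sample=SAMPLES),
    # expand() builds the cross product over all given wildcard values
    expand("output/{sample}_{track}.bigwig", sample=SAMPLES, track=["control_lambda", "treat_pileup"])
```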

Finally run the workflow again, this time without a specific target file:

```bash
(node)$> snakemake --use-conda -pr
```

After this step the `output` directory should contain the following files:

```
TC1-ST2-D0.12_control_lambda.bigwig
TC1-ST2-D0.12_peaks.narrowPeak
TC1-ST2-D0.12_treat_pileup.bigwig
```

Snakemake can visualise the dependency graph of the workflow with the following command:

```bash
(node)$> snakemake --dag | dot -Tpdf > dag.pdf
```
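
The `dot` command is provided by graphviz. If it is not available on the node, you can add it to the active conda environment (an assumption about your setup; skip this if `dot` already works):

```bash
(node)$> conda install -c conda-forge graphviz
```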

![DAG](img/dag.png)

<a name="cluster"></a>

## Cluster configuration for snakemake

Until now the workflow just runs on a single CPU on a single machine, which is not very efficient when many more resources are available. To speed up the computation, you should check in the documentation of the software you use how it can scale. For bioinformatics tools the most common option is multithreading.

In this workflow only bowtie2 has the option to run on multiple threads.

<a name="multithreading"></a>

### Adjust mapping step to run on multiple threads

We add the `threads` directive to the snakemake rule for the mapping step, to tell snakemake that this step can use multiple threads.

> The specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. `threads = min(threads, cores)` with `cores` being the number of cores specified at the command line (option `-j`).
>
> &mdash; <cite>[Snakemake manual - Threads](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#threads)</cite>

So the value for threads should be the maximum that is reasonable for the respective software. For many tools the speed-up plateaus at a certain number of threads or even starts to decrease again beyond that. For a regular bowtie2 run 16 is a good maximum, but for this tutorial we will only go up to 4 because we have a small dataset.

In the **mapping rule** in your `Snakefile` add the following line after the `conda` directive:

```python
  threads: 4
```

We also need to add the option `-p {threads}` to the bowtie2 command-line call, to make it actually use those threads:

```python
    bowtie2 -p {threads} -x {params.idx} -U {input} 2> {log} | \
```

so that the complete mapping rule now looks like this:

```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
  output: "bowtie2/{sample}.bam"
  params:
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
  log: "logs/bowtie2_{sample}.log"
  benchmark: "benchmarks/mapping/{sample}.tsv"
  conda: "envs/bowtie2.yaml"
  threads: 4
  shell:
    """
    bowtie2 -p {threads} -x {params.idx} -U {input} 2> {log} | \
    samtools sort - > {output}
    samtools index {output}
    """
```

If we want to rerun the workflow to compare different options, we need to delete the output files, otherwise snakemake will not run the steps again. Since we want to do a bit of testing, let's define a rule that removes all the output:

```python
rule clean:
  shell:  
    """
    rm -rf bowtie2/ macs2/ output/
    """
```

**Warning:** Be careful with `rm -rf` and double-check you're deleting the right directories, since it will remove everything without asking.

Run the clean-up rule:

```bash
(node)$> snakemake clean
```

Quit your current job and start a new one with more cores to test the multithreading:

```bash
(node)$> exit
(access)$> srun --cpu-bind=none -p interactive -t 0-0:15:0 -N 1 -c 6 --ntasks-per-node=1 --pty bash -i
(node)$> conda activate bioinfo_tutorial
(node)$> cd $SCRATCH/bioinfo_tutorial
```

Now we also need to tell snakemake that it has multiple cores available and can run steps multithreaded or run multiple tasks in parallel. This is done with the `-j` option followed by the number of available cores (e.g. the number of cores you have reserved if you run it interactively).

```bash
(node)$> snakemake -j 4 -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
```

You should see in the output that the command-line call of bowtie2 now shows `-p 4`.

Check again the benchmark report:

```bash
(node)$> cat benchmarks/mapping/INPUT-TC1-ST2-D0.12.tsv
s      h:m:s   max_rss max_vms max_uss max_pss io_in io_out mean_load
6.7687 0:00:06 295.01  1728.68 291.64  291.79  0.00  16.00  0.00
```

Notice that the runtime has decreased, but I/O has increased.

**Exercise:** Try several options for `-j` up to the number of cores you reserved (6) and check the bowtie2 command and the values in the benchmark. Don't forget the clean-up between tries.
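
One way to script this comparison (a sketch; it assumes you are still in the interactive job with 6 cores and the `bioinfo_tutorial` environment active):

```bash
for cores in 1 2 4 6; do
  # remove previous outputs so the mapping step runs again
  snakemake clean
  snakemake -j $cores -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
  echo "== benchmark with -j $cores =="
  cat benchmarks/mapping/INPUT-TC1-ST2-D0.12.tsv
done
```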

<a name="job_params"></a>

### Configure job parameters with `cluster.yaml`

Instead of reserving an interactive job and running snakemake inside that job, we want to use snakemake's cluster functionality to make it submit jobs to Slurm. For this we create a configuration file named `cluster.yaml` to define the values for the different `sbatch` options.

Options under the `__default__` header apply to all rules, but it's possible to override them selectively with rule-specific options.

Create the file `cluster.yaml` in the same directory as the `Snakefile` with the following content:

```yaml
__default__:
  time: "0-00:01:00"
  partition: "batch"
  nodes: 1
  ntasks: 1
  ncpus: 1
  job-name: "{rule}"
  output: "slurm-%j-%x.out"
  error: "slurm-%j-%x.err"
mapping:
  ncpus: 4
```

**Attention:** Be aware that `ncpus` should match the `threads` directive in the respective rule. If `ncpus` is less than `threads`, snakemake will reserve only `ncpus` cores but run the rule with the number of threads specified by `threads`.
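
To avoid maintaining the same number in two places, snakemake can also substitute a rule's `threads` value directly into the submit command used in cluster mode (shown in the next section). A sketch using the `{threads}` placeholder instead of a fixed `ncpus`:

```bash
(access)$> snakemake -j 10 -pr --use-conda --cluster "sbatch -c {threads} -t 0-00:01:00 -p batch -N 1"
```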

<a name="cluster_config"></a>

### Run snakemake with cluster configuration

Make sure you quit your job and run the following from the access node.

Now we need to map the variables defined in `cluster.yaml` to the command-line parameters of `sbatch`. Check the documentation on the [HPC website](https://hpc.uni.lu/users/docs/slurm.html#basic-usage-commands) for details about the parameters.

The meaning of the option `-j` changes when running in cluster mode to denote the maximum number of simultaneous jobs.

```bash
(node)$> exit
(access)$> cd $SCRATCH/bioinfo_tutorial
(access)$> conda activate bioinfo_tutorial
(access)$> snakemake clean
 
(access)$> SLURM_ARGS="-p {cluster.partition} -N {cluster.nodes} -n {cluster.ntasks} -c {cluster.ncpus} -t {cluster.time} -J {cluster.job-name} -o {cluster.output} -e {cluster.error}"
 
(access)$> snakemake -j 10 -pr --use-conda --cluster-config cluster.yaml --cluster "sbatch $SLURM_ARGS"
```

Let's have a look at the jobs that were submitted:

```bash
# only job allocations
(access)$> sacct -X --name="mapping","peak_calling","bigwig" --format JobID%15,JobName%15,AllocCPUS,Submit,Start,End,Elapsed

# including all steps
(access)$> sacct --name="mapping","peak_calling","bigwig" --format JobID%15,JobName%15,NTasks,AllocCPUS,Submit,Start,End,Elapsed,MaxVMSize
```

Check the submit and end time to see which jobs were running at the same time and when snakemake waited for jobs to finish.

After this step you should see a bunch of `slurm-XXXXXX-<rule name>.out` and `slurm-XXXXXX-<rule name>.err` files in your working directory, which contain the (error) logs of the different snakemake rules.

<a name="igv"></a>

## Inspect results in IGV

Now that we have completed the workflow, let's have a look at the results.

For visualisation, [download IGV](http://software.broadinstitute.org/software/igv/download), or use any other genome browser of your choice.

To copy the results from the cluster to your laptop, run the following in a local terminal (Linux and macOS) or a MobaXterm local session (Windows) and replace `<your_username>` with your ULHPC user login. For alternative ways to transfer files, see the [documentation on the HPC website](https://hpc.uni.lu/users/docs/filetransfer.html). Pay attention to which directory you are in, so you can find the files again.

```bash
(laptop)$> mkdir bioinfo_tutorial
(laptop)$> cd bioinfo_tutorial

# check where you are
(laptop)$> pwd

# transfer the output directory
(laptop)$> rsync -avz --rsh='ssh -p 8022' <your_username>@access-iris.uni.lu:/scratch/users/<your_username>/bioinfo_tutorial/output .
```

Start IGV and select mouse mm10 as genome in the drop-down menu in the upper left. Go to "File" -> "Load from File…" and select all three files that you have copied from the cluster.

In the search box enter for example "Ahr" to check the signal around one of the genes highlighted in the paper.

Pay attention to the scale on which the two coverage tracks are shown. You can right-click on the track name on the left and select "Autoscale" to adjust the range.

When you hover over the blocks in the `TC1-ST2-D0.12_peaks.narrowPeak` track, you can see additional information about the called peaks, e.g. p-value. The peaks should be at the transcription start sites, because that is what H3K4me3 marks. Pay attention to the arrows in the "Refseq genes" track to see in which direction transcription goes.

![IGV](img/IGV_annotated.jpg)

<a name="immediate_submit"></a>

## (Optional) Immediately submit all jobs

Snakemake has an option to immediately submit all jobs to the cluster and tell the scheduler about the dependencies so they run in the right order. It submits the jobs one-by-one, collecting the job ID of each from the Slurm output, and then forwards those job IDs as dependencies to the follow-up jobs.

Unfortunately snakemake doesn't parse the job submission message from Slurm cleanly, so the dependency lists look like `'Submitted', 'batch', 'job', '374519', 'Submitted', 'batch', 'job', '374520'` instead of being just a list of the job IDs. Therefore, we need a wrapper script to get the dependencies right.

Create a python script called `immediate_submit.py` with the following content:

```python
#!/usr/bin/env python3
import os
import sys

from snakemake.utils import read_job_properties

# last command-line argument is the job script
jobscript = sys.argv[-1]

# all other command-line arguments are the dependencies
dependencies = set(sys.argv[1:-1])

# parse the job script for the job properties that are encoded by snakemake within
job_properties = read_job_properties(jobscript)

# collect all command-line options in an array
cmdline = ["sbatch"]

# set all the slurm submit options as before
slurm_args = " -p {partition} -N {nodes} -n {ntasks} -c {ncpus} -t {time} -J {job-name} -o {output} -e {error} ".format(**job_properties["cluster"])

cmdline.append(slurm_args)

if dependencies:
    cmdline.append("--dependency")
    # only keep numbers in dependencies list
    dependencies = [ x for x in dependencies if x.isdigit() ]
    cmdline.append("afterok:" + ",".join(dependencies))

cmdline.append(jobscript)

os.system(" ".join(cmdline))
```

Besides the dependencies this script now also takes care of all the other Slurm options, so you don't need to define `SLURM_ARGS` anymore in the shell.

Make the script executable:

```bash
(access)$> chmod +x immediate_submit.py
```

Run snakemake with the following command and replace `<your_username>` with your ULHPC user login:

```bash
(access)$> snakemake clean
(access)$> snakemake --cluster-config cluster.yaml -j 50 -pr --use-conda --immediate-submit --notemp --cluster "/scratch/users/<your_username>/bioinfo_tutorial/immediate_submit.py {dependencies}"
```

With `squeue -u <your_username>` you can check the status of the submitted jobs and see when they all have finished.

<a name="useful"></a>

## Useful stuff

* To avoid too much overhead in the number of jobs submitted to Slurm, use the `group` directive to group rules that can run together in a single job (see the sketch after this list).
* If your workflow runs for longer than just a few minutes, run snakemake inside `screen` or prefix it with `nohup`. This prevents the workflow from stopping when your SSH session gets disconnected.
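
For example, the `group` directive is added to a rule just like `conda` or `threads` (a sketch based on the bigwig rule above; the group name `postprocessing` is arbitrary, and connected rules sharing a group name are combined into a single cluster job):

```python
rule bigwig:
  input: "macs2/{sample}.bdg"
  output: "output/{sample}.bigwig"
  params:
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai"
  conda: "envs/ucsc.yaml"
  group: "postprocessing"  # rules with the same group run in one cluster job
  shell:
    """
    bedGraphToBigWig {input} {params.idx} {output}
    """
```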

<a name="references"></a>

## References

* [Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.](http://bioinformatics.oxfordjournals.org/content/28/19/2520)
* [Gérard D, Schmidt F, Ginolhac A, Schmitz M, Halder R, Ebert P, Schulz MH, Sauter T, Sinkkonen L. Temporal enhancer profiling of parallel lineages identifies AHR and GLIS1 as regulators of mesenchymal multipotency. *Nucleic Acids Research*, Volume 47, Issue 3, 20 February 2019, Pages 1141–1163, https://doi.org/10.1093/nar/gky1240](https://www.ncbi.nlm.nih.gov/pubmed/30544251)
* [Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. *Nature Methods*. 2012, 9:357-359.](http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1923.html)
* [Zhang et al. Model-based Analysis of ChIP-Seq (MACS). *Genome Biol* (2008) vol. 9 (9) pp. R137](http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=search&term=18798982%5Bpmid%5D)

<a name="acknowledgements"></a>

## Acknowledgements

Many thanks to Aurélien Ginolhac, Cedric Laczny, Nikola de Lange and Roland Krause for their help in developing this tutorial.