README.md 25.2 KB
Newer Older
1
# Bioinformatics workflows with snakemake and conda
Sarah Peter's avatar
Sarah Peter committed
2

Sarah Peter's avatar
Sarah Peter committed
3 4
Author: Sarah Peter

5
In this tutorial you will learn how to run a [ChIP-seq](https://en.wikipedia.org/wiki/ChIP-sequencing) analysis with the [snakemake workflow engine](https://snakemake.readthedocs.io/en/stable/)  on the cluster.
6 7

**Disclaimer:** In order to keep this tutorial simple, we use default parameters for the different tools as much as possible. However, for a real analysis you should always adapt the parameters to your dataset. Also, be aware that the results of some steps might be screwed, because we only work on data from one chromosome.
Sarah Peter's avatar
Sarah Peter committed
8

Sarah Peter's avatar
Sarah Peter committed
9 10
## Setup the environment

11 12 13 14 15 16 17 18 19
For this tutorial we will use the [`conda` package manager](https://www.anaconda.com/) to install the required tools. 

> Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
>
> Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.
> 
> &mdash; <cite>[Conda  manual](https://docs.conda.io/en/latest/index.html)</cite>

It can encapsulate software and packages in environments, so you can have multiple different versions of a software installed at the same time and avoid incompatibilities between different tools. It also has functionality to easily port and replicate environments, which is important to ensure reproducibility of analyses.
Sarah Peter's avatar
Sarah Peter committed
20

21 22
**Attention when dealing with sensitive data:** Everyone can very easily contribute installation recipies to the bioconda channel, without verified identity or double-checking from the community. Therefore it's possible to insert malicious software. If you use bioconda when processing sensitive data, you should check the recipes to verify that they install software from the official sources.

Sarah Peter's avatar
Sarah Peter committed
23 24
We will use conda on two levels in this tutorial. First we use a conda environment to install and run snakemake. Second, inside the snakemake workflow we will define separate conda environments for each step.

25 26 27
1. Connect to the cluster

2. Start an interactive job:
Sarah Peter's avatar
Sarah Peter committed
28 29 30 31

   ```bash
   (access)$> si
   ```
32 33
   
3. Install conda:
Sarah Peter's avatar
Sarah Peter committed
34

Sarah Peter's avatar
Sarah Peter committed
35
   ```bash
36 37
   (node)$> mkdir -p $SCRATCH/downloads
   (node)$> cd $SCRATCH/downloads
Sarah Peter's avatar
Sarah Peter committed
38 39 40 41
   (node)$> wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
   (node)$> chmod u+x Miniconda3-latest-Linux-x86_64.sh
   (node)$> ./Miniconda3-latest-Linux-x86_64.sh 
   ```
42
   You need to specify your installation destination, e.g. `/home/users/<your_username>/tools/miniconda3`. You must use the **full** path and can**not** user `$HOME/tools/miniconda3`. Answer `yes` to initialize Miniconda3. 
Sarah Peter's avatar
Sarah Peter committed
43 44 45 46 47 48 49 50 51 52 53

   The installation will modify your `.bashrc` to make conda directly available after each login. To activate the changes now, run

   ```bash
   (node)$> source ~/.bashrc
   ```

   Update conda to the latest version:
   ```bash
   (node)$> conda update conda 
   ```
Sarah Peter's avatar
Sarah Peter committed
54

55
4. Create a new conda environment and activate it:
Sarah Peter's avatar
Sarah Peter committed
56 57 58 59 60

   ```bash
   (node)$> conda create -n bioinfo_tutorial
   (node)$> conda activate bioinfo_tutorial
   ```
61
   You can see that your prompt will now be prefixed with `(bioinfo_tutorial)` to show which environment is active. For the rest of the tutorial make sure that you always have this environment active.
62 63

5. Install snakemake:
Sarah Peter's avatar
Sarah Peter committed
64 65

   ```bash
Sarah Peter's avatar
Sarah Peter committed
66
   (node)$> conda install -c bioconda -c conda-forge snakemake-minimal
Sarah Peter's avatar
Sarah Peter committed
67
   ```
68

Sarah Peter's avatar
Sarah Peter committed
69 70 71 72
   

## Create snakemake workflow

73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
> The Snakemake workflow management system is a tool to create **reproducible and scalable** data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.
> 
> &mdash; <cite>[Snakemake manual](https://snakemake.readthedocs.io/en/stable/index.html)</cite>

Snakemake is a very useful tool if you need to combine multiple steps using different software into a coherent workflow. It comes with many features desired for running workflows, like 

* ensuring all input and result files are present
* restarting at a failed step
* rerunning (parts of a) pipeline when (some of) the input changed
* support for wildcards to apply a step to a set of files
* automatically parallelising where possible
* software management
* collecting benchmark data
* modularisation
* creating launcher scripts and submitting jobs to the cluster
* creating a visualisation of the workflow steps (see below)

Sarah Peter's avatar
Sarah Peter committed
90
In this tutorial we will analyse [ChIP-seq data](https://www.ebi.ac.uk/ena/data/view/PRJEB20933) from the paper [Gérard D, Schmidt F, Ginolhac A, Schmitz M, Halder R, Ebert P, Schulz MH, Sauter T, Sinkkonen L. Temporal enhancer profiling of parallel lineages identifies AHR and GLIS1 as regulators of mesenchymal multipotency. *Nucleic Acids Research*, Volume 47, Issue 3, 20 February 2019, Pages 1141–1163, https://doi.org/10.1093/nar/gky1240](https://www.ncbi.nlm.nih.gov/pubmed/30544251) published by our colleagues at LSRU.
Sarah Peter's avatar
Sarah Peter committed
91

92 93 94 95
We will set up the following workflow:

![DAG](img/dag.png)

96
To speed up computing time we use source files that only contain sequencing reads that map to chromosome 12. The files for input (control) and H3K4me3 (ChIP) are available on the cluster in the directory `/work/projects/ulhpc-tutorials/bio/snakemake/chip-seq` and the corresponding reference in `/work/projects/ulhpc-tutorials/bio/snakemake/reference`.
Sarah Peter's avatar
Sarah Peter committed
97

98 99
We have also already prepared the conda environments for each step in the workflow in `/work/projects/ulhpc-tutorials/bio/snakemake/envs`.

Sarah Peter's avatar
Sarah Peter committed
100
Create a working directory and link the necessary data:
Sarah Peter's avatar
Sarah Peter committed
101 102

```bash
103 104 105 106 107 108
(node)$> cd $SCRATCH
(node)$> mkdir bioinfo_tutorial
(node)$> cd bioinfo_tutorial
(node)$> ln -s /work/projects/ulhpc-tutorials/bio/snakemake/chip-seq .
(node)$> ln -s /work/projects/ulhpc-tutorials/bio/snakemake/reference .
(node)$> ln -s /work/projects/ulhpc-tutorials/bio/snakemake/envs .
Sarah Peter's avatar
Sarah Peter committed
109 110 111 112
```

### Mapping

Sarah Peter's avatar
Sarah Peter committed
113 114
> In Snakemake, workflows are specified as Snakefiles. Inspired by GNU Make, a Snakefile contains rules that denote how to create output files from input files. Dependencies between rules are handled implicitly, by matching filenames of input files against output files. Thereby wildcards can be used to write general rules.
> 
115
> &mdash; <cite>[Snakemake manual - Writing Workflows](https://snakemake.readthedocs.io/en/stable/snakefiles/writing_snakefiles.html)</cite>
116

117 118 119 120 121 122 123 124
> Most importantly, a rule can consist of a name (the name is optional and can be left out, creating an anonymous rule), input files, output files, and a shell command to generate the output from the input, i.e. 
> ```python
> rule NAME:
>     input: "path/to/inputfile", "path/to/other/inputfile"
>     output: "path/to/outputfile", "path/to/another/outputfile"
>     shell: "somecommand {input} {output}"
> ```
> &mdash; <cite>[Snakemake manual - Rules](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html)</cite>
Sarah Peter's avatar
Sarah Peter committed
125 126


Sarah Peter's avatar
Sarah Peter committed
127

128
A basic rule for mapping a fastq file with [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) could look like this:
129 130 131

```python
rule mapping:
Sarah Peter's avatar
Sarah Peter committed
132
  input: "chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz"
133
  output: "bowtie2/H3K4-TC1-ST2-D0.12.bam"
134 135
  shell:
    """
136
    bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \  
137 138 139 140
    samtools sort - > {output}
    samtools index {output}
    """
```
Sarah Peter's avatar
Sarah Peter committed
141

142
Since we have two fastq files to map, we should generalise the rule with wildcards:
Sarah Peter's avatar
Sarah Peter committed
143 144 145

```python
rule mapping:
146
  input: "chip-seq/{sample}.fastq.gz"
147
  output: "bowtie2/{sample}.bam"
148 149
  shell:
    """
150
    bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \  
151 152 153 154 155
    samtools sort - > {output}
    samtools index {output}
    """
```

156
Now we need to tell snakemake to use a conda environment with bowtie2 and [samtools](http://www.htslib.org/) inside to run this rule. For this purpose there is a specific `conda` directive that can be added to the rule. It accepts a [yaml](https://yaml.org/spec/1.2/spec.html) file that defines the conda environment. 
157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172

```python
conda: "envs/bowtie2.yaml"
```

You can easily export existing conda environments to a yaml file with `conda env export` or write the yaml from scratch. For this step the yaml file looks like this:

```yaml
name: bowtie2
channels:
  - bioconda
dependencies:
  - bowtie2
  - samtools
```

Sarah Peter's avatar
Sarah Peter committed
173 174 175
We also use the `params` directive to define the path to the reference and declutter the command-line call, as well as the `log` directive to define a path to permanently store the output of the execution. This is especially useful in this step to store the bowtie2 mapping statistics, which are just written to the command-line (stderr) otherwise.

To track resource usage we add the `benchmark` directive, which will write performance measures to a tsv file.
176 177 178 179 180 181 182 183 184 185 186 187 188

Create a file called `Snakefile` in the current directory and open it in your favourite editor, e.g.

```bash
(node)$> nano Snakefile
```

Add the final rule for the mapping:

```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
  output: "bowtie2/{sample}.bam"
Sarah Peter's avatar
Sarah Peter committed
189
  params:
190
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
Sarah Peter's avatar
Sarah Peter committed
191
  log: "logs/bowtie2_{sample}.log"
Sarah Peter's avatar
Sarah Peter committed
192
  benchmark: "benchmarks/mapping/{sample}.tsv"
Sarah Peter's avatar
Sarah Peter committed
193
  conda: "envs/bowtie2.yaml"
Sarah Peter's avatar
Sarah Peter committed
194 195
  shell:
    """
196 197
    bowtie2 -x {params.idx} -U {input} 2> {log} | \
    samtools sort - > {output}
Sarah Peter's avatar
Sarah Peter committed
198 199 200 201
    samtools index {output}
    """
```

Sarah Peter's avatar
Sarah Peter committed
202 203 204
You can test the rule by specifying one of the potential outputs. We first do a dry-run with with option `-n`.

```bash
205
(node)$> snakemake -npr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
Sarah Peter's avatar
Sarah Peter committed
206 207
```

208
If everything is fine we can run the rule to create the file `bowtie2/INPUT-TC1-ST2-D0.12.bam`:
Sarah Peter's avatar
Sarah Peter committed
209 210

```bash
211 212 213
(node)$> snakemake -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
```

214 215
The first run will take a bit longer, because snakemake creates the conda environment. In subsequent runs it will just activate the existing environment. However, it will recognise if the yaml files changes and then recreate the environment.

216
Check the mapping statistics and the benchmark report:
217

Sarah Peter's avatar
Sarah Peter committed
218
```bash
219 220 221 222 223 224 225 226
(node)$> cat logs/bowtie2_INPUT-TC1-ST2-D0.12.log
400193 reads; of these:
  400193 (100.00%) were unpaired; of these:
    1669 (0.42%) aligned 0 times
    379290 (94.78%) aligned exactly 1 time
    19234 (4.81%) aligned >1 times
99.58% overall alignment rate

227
(node)$> cat benchmarks/mapping/INPUT-TC1-ST2-D0.12.tsv
Sarah Peter's avatar
Sarah Peter committed
228 229
s       h:m:s   max_rss max_vms max_uss max_pss io_in io_out mean_load
19.1737 0:00:19 262.14  1404.55 258.79  258.94  0.00  0.00   0.00
Sarah Peter's avatar
Sarah Peter committed
230 231 232 233 234 235
```



### Peak calling

236
The next step in the workflow is to call peaks with [MACS2](https://github.com/taoliu/MACS). This tells us where there is enrichment of the ChIP versus the input (control). 
Sarah Peter's avatar
Sarah Peter committed
237 238 239

You should always choose the peak caller based on how you expect your enriched regions to look like, e.g. narrow or broad peaks.

240
Besides the list of peaks in [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) format, MACS2 also produces coverage tracks.
241

242 243
Add the following rule to your `Snakefile`:

Sarah Peter's avatar
Sarah Peter committed
244 245
```python
rule peak_calling:
Sarah Peter's avatar
Sarah Peter committed
246
  input:
247 248
    control = "bowtie2/INPUT-{sample}.bam",
    chip = "bowtie2/H3K4-{sample}.bam"
Sarah Peter's avatar
Sarah Peter committed
249
  output:
250
    peaks = "output/{sample}_peaks.narrowPeak",
Sarah Peter's avatar
Sarah Peter committed
251 252
    control_bdg = "macs2/{sample}_control_lambda.bdg",
    chip_bdg = "macs2/{sample}_treat_pileup.bdg"
Sarah Peter's avatar
Sarah Peter committed
253
  conda: "envs/macs2.yaml"
254
  shell:
Sarah Peter's avatar
Sarah Peter committed
255
    """
256 257 258
    macs2 callpeak -t {input.chip} -c {input.control} -f BAM -g mm -n {wildcards.sample} -B -q 0.01 --outdir macs2

    cp macs2/{wildcards.sample}_peaks.narrowPeak {output.peaks}
Sarah Peter's avatar
Sarah Peter committed
259
    """
Sarah Peter's avatar
Sarah Peter committed
260
```
Sarah Peter's avatar
Sarah Peter committed
261

262 263 264 265 266 267 268 269 270 271
The conda environment `envs/macs2.yaml` for this step is:

```yaml
name: macs2
channels:
  - bioconda
dependencies:
  - macs2
```

Sarah Peter's avatar
Sarah Peter committed
272 273 274
Let's run this step with:

```bash
275
(node)$> snakemake -pr --use-conda output/TC1-ST2-D0.12_peaks.narrowPeak
Sarah Peter's avatar
Sarah Peter committed
276 277
```

278
Note that snakemake will not run the mapping step for `bowtie2/INPUT-TC1-ST2-D0.12.bam` again. It only runs rules for which the output is not present or the input has changed.
Sarah Peter's avatar
Sarah Peter committed
279

Sarah Peter's avatar
Sarah Peter committed
280

281 282 283 284

### Generate bigWig files for visualisation

For easier visualisation and faster transfer, we convert the two coverage tracks from the MACS2 output to [bigWig](https://genome.ucsc.edu/goldenpath/help/bigWig.html) format.
Sarah Peter's avatar
Sarah Peter committed
285

286 287
Add the following rule to your `Snakefile`:

288 289
```python
rule bigwig:
Sarah Peter's avatar
Sarah Peter committed
290 291
  input: "macs2/{sample}.bdg"
  output: "output/{sample}.bigwig"
292
  params:
293
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai"
Sarah Peter's avatar
Sarah Peter committed
294
  conda: "envs/ucsc.yaml"
295
  shell:
Sarah Peter's avatar
Sarah Peter committed
296 297 298
    """
    bedGraphToBigWig {input} {params.idx} {output}
    """
299 300
```

301 302 303 304 305 306 307 308 309 310
The conda environment `envs/ucsc.yaml` for this step is:

```yaml
name: ucsc
channels:
  - bioconda
dependencies:
  - ucsc-bedgraphtobigwig
```

Sarah Peter's avatar
Sarah Peter committed
311 312 313
Let's test this step with:

```bash
314
(node)$> snakemake -pr --use-conda output/TC1-ST2-D0.12_treat_pileup.bigwig
Sarah Peter's avatar
Sarah Peter committed
315 316 317 318
```

This time snakemake will only run the "bigwig" rule for the one file we specified.

319 320


Sarah Peter's avatar
Sarah Peter committed
321
### Summary rule
322

323
To avoid always having to specify which output file we want on the command-line, we add one rule with just inputs that defines the result files we want to have in the end. Since by default snakemake executes the first rule in the snakefile, we add this rule as the first one to the top and then we don't need to specify anything additional on the command-line.
Sarah Peter's avatar
Sarah Peter committed
324 325

First, at the **very top** of the Snakefile, define a variable for the name of the sample:
326 327

```python
328
SAMPLE = "TC1-ST2-D0.12"
329 330
```

331
This makes it easier to change the Snakefile and apply it to other datasets. Snakemake is based on Python so we can use Python code inside the Snakefile. We will use [f-Strings](https://docs.python.org/3.6/reference/lexical_analysis.html#f-strings) to include the variable in the file names.
332

Sarah Peter's avatar
Sarah Peter committed
333
Add this rule at the top of the `Snakefile` after the line above:
Sarah Peter's avatar
Sarah Peter committed
334

335 336
```python
rule all:
337
  input: f"output/{SAMPLE}_peaks.narrowPeak", f"output/{SAMPLE}_control_lambda.bigwig", f"output/{SAMPLE}_treat_pileup.bigwig"
338
```
Sarah Peter's avatar
Sarah Peter committed
339

Sarah Peter's avatar
Sarah Peter committed
340
Finally run the workflow again, this time without a specific target file:
Sarah Peter's avatar
Sarah Peter committed
341

Sarah Peter's avatar
Sarah Peter committed
342
```bash
Sarah Peter's avatar
Sarah Peter committed
343
(node)$> snakemake --use-conda -pr
Sarah Peter's avatar
Sarah Peter committed
344 345 346 347 348
```

Snakemake can visualise the dependency graph of the workflow with the following command:

```bash
Sarah Peter's avatar
Sarah Peter committed
349
(node)$> snakemake --dag | dot -Tpdf > dag.pdf
Sarah Peter's avatar
Sarah Peter committed
350 351
```

352
![DAG](img/dag.png)
Sarah Peter's avatar
Sarah Peter committed
353 354 355

## Cluster configuration for snakemake

356
Until now the workflow just runs on a single CPU on a single machine, which is not very efficient when we have much more resources available. To speed up the computation you should check in the documentation of the software you use how it can scale. For bioinformatics tools the most common option is multithreading.
Sarah Peter's avatar
Sarah Peter committed
357 358 359

In this workflow only bowtie2 has the option to run on multiple threads.

Sarah Peter's avatar
Sarah Peter committed
360
### Adjust mapping step to run on multiple threads
361

362 363
We add the `thread` directive to the snakemake rule for the mapping step, to tell snakemake that this step can use multiple threads. 

364
> The specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. `threads = min(threads, cores)` with `cores` being the number of cores specified at the command line (option `-j`).
Sarah Peter's avatar
Sarah Peter committed
365
>
366
> &mdash; <cite>[Snakemake manual - Threads](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#threads)</cite>
367 368 369 370

So the value for threads should be the maximum that is reasonable for the respective software. For many software the speed-up plateaus at a certain number of threads or even starts to decrease again. For a regular bowtie2 run 16 is a good maximum, but for this tutorial we will only go up to 4 because we have a small dataset.

We also need to add the option `-p` to the bowtie2 command-line call, to make it actually use those threads.
Sarah Peter's avatar
Sarah Peter committed
371

372 373
Change the mapping rule in your `Snakefile` to the following:

Sarah Peter's avatar
Sarah Peter committed
374 375 376
```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
377
  output: "bowtie2/{sample}.bam"
Sarah Peter's avatar
Sarah Peter committed
378
  params:
379
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
Sarah Peter's avatar
Sarah Peter committed
380
  log: "logs/bowtie2_{sample}.log"
Sarah Peter's avatar
Sarah Peter committed
381
  benchmark: "benchmarks/mapping/{sample}.tsv"
Sarah Peter's avatar
Sarah Peter committed
382 383 384 385
  conda: "envs/bowtie2.yaml"
  threads: 4
  shell:
    """
386 387
    bowtie2 -p {threads} -x {params.idx} -U {input} 2> {log} | \
    samtools sort - > {output}
Sarah Peter's avatar
Sarah Peter committed
388 389 390
    samtools index {output}
    """
```
391

392 393 394 395
If we want to rerun the workflow to compare different options, we need to delete the output files, otherwise snakemake will not run the steps again. Since we want to do a bit of testing, let's define a rule that removes all the output:

```python
rule clean:
396 397 398 399
  shell:  
    """
    rm -rf bowtie2/ macs2/ output/
    """
400 401 402 403 404 405 406 407 408 409
```

**Warning:** Be careful with `rm -rf` and double-check you're deleting the right directories, since it will remove everything without asking.

Run the clean-up rule:

```bash
(node)$> snakemake clean
```

410
Quit your current job and start a new one with more cores to test the multithreading:
411 412 413 414 415 416 417

```bash
(node)$> exit
(access)$> srun --cpu-bind=none -p interactive -t 0-0:15:0 -N 1 -c 6 --ntasks-per-node=1 --pty bash -i
(node)$> conda activate bioinfo_tutorial
(node)$> cd $SCRATCH/bioinfo_tutorial
```
418

419
Now we also need to tell snakemake that it has multiple cores available and can run steps multithreaded or run multiple tasks in parallel. This is done with the  `-j` option followed by the number of available cores (e.g. the number of cores you have reserved if you run it interactively).
420

Sarah Peter's avatar
Sarah Peter committed
421
```bash
422
(node)$> snakemake -j 4 -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
Sarah Peter's avatar
Sarah Peter committed
423 424
```

Sarah Peter's avatar
Sarah Peter committed
425 426
You should see in the output that the command-line call of bowtie2 now shows `-p 4`.

427 428 429 430 431 432 433 434 435 436
Check again the benchmark report:

```bash
(node)$> cat benchmarks/mapping/INPUT-TC1-ST2-D0.12.tsv
s      h:m:s   max_rss max_vms max_uss max_pss io_in io_out mean_load
6.7687 0:00:06 295.01  1728.68 291.64  291.79  0.00  16.00  0.00
```

Notice that the runtime has decreased, but I/O has increased.

437
**Exercise:** Try several options for `-j` up to the number of cores you reserved (6) and check the bowtie2 command and the values in the benchmark. Don't forget the clean-up between the tries.
438

Sarah Peter's avatar
Sarah Peter committed
439 440


441
### Configure job parameters with `cluster.yaml`
Sarah Peter's avatar
Sarah Peter committed
442

443
Instead of reserving an interactive job and running snakemake inside that job, we want to use snakemake's cluster functionality to make it submit jobs to Slurm. For this we create a configuration file named `cluster.yaml` to define the values for the different `sbatch` options.
Sarah Peter's avatar
Sarah Peter committed
444 445 446

Options under the `__default__` header apply to all rules, but it's possible to override them selectively with rule-specific options.

Sarah Peter's avatar
Sarah Peter committed
447 448
Create the file `cluster.yaml` in the same directory as the `Snakefile` with the following content:

449 450 451 452 453 454 455 456 457 458 459 460
```yaml
__default__:
  time: "0-00:01:00"
  partition: "batch"
  nodes: 1
  ntasks: 1
  ncpus: 1
  job-name: "{rule}"
  output: "slurm-%j-%x.out"
  error: "slurm-%j-%x.err"
mapping:
  ncpus: 4
Sarah Peter's avatar
Sarah Peter committed
461
```
462

463
**Attention:** Be aware that `ncpus` should match the `threads` directive in the respective rule. If `ncpus` is less than `threads` snakemake will reserve only  `ncpus` cores, but run the rule on the number of threads specified with `threads` .
464

465 466


Sarah Peter's avatar
Sarah Peter committed
467 468
### Run snakemake with cluster configuration

469
Make sure you quit your job and run the following from the access node.
470

471
Now we need to map the variables defined in `cluster.yaml` to the command-line parameters of `sbatch`. Check the documentation on the [HPC website](https://hpc.uni.lu/users/docs/slurm.html#basic-usage-commands) for details about the parameters.
Sarah Peter's avatar
Sarah Peter committed
472

473
The meaning of the option `-j` changes when running in cluster mode to denote the maximum number of simultaneous jobs.
Sarah Peter's avatar
Sarah Peter committed
474

Sarah Peter's avatar
Sarah Peter committed
475
```bash
Sarah Peter's avatar
Sarah Peter committed
476
(node)$> exit
Sarah Peter's avatar
Sarah Peter committed
477 478
(access)$> cd $SCRATCH/bioinfo_tutorial
(access)$> conda activate bioinfo_tutorial
479
(access)$> snakemake clean
Sarah Peter's avatar
Sarah Peter committed
480
 
481
(access)$> SLURM_ARGS="-p {cluster.partition} -N {cluster.nodes} -n {cluster.ntasks} -c {cluster.ncpus} -t {cluster.time} -J {cluster.job-name} -o {cluster.output} -e {cluster.error}"
Sarah Peter's avatar
Sarah Peter committed
482
 
483
(access)$> snakemake -j 10 -pr --use-conda --cluster-config cluster.yaml --cluster "sbatch $SLURM_ARGS"
Sarah Peter's avatar
Sarah Peter committed
484
```
Sarah Peter's avatar
Sarah Peter committed
485

486
Let's have a look at the jobs that were submitted:
487 488

```bash
489 490 491 492 493
# only job allocations
(access)$> sacct -X --name="mapping","peak_calling","bigwig" --format JobID%15,JobName%15,AllocCPUS,Submit,Start,End,Elapsed

# including all steps
(access)$> sacct --name="mapping","peak_calling","bigwig" --format JobID%15,JobName%15,NTasks,AllocCPUS,Submit,Start,End,Elapsed,MaxVMSize
494 495
```

496
Check the submit and end time to see which jobs were running at the same time and when snakemake waited for jobs to finish.
497 498


Sarah Peter's avatar
Sarah Peter committed
499 500 501

## Inspect results in IGV

Sarah Peter's avatar
Sarah Peter committed
502
Now that we have completed the workflow, let's have a look at the results.
503

504
For visualisation, [download IGV](http://software.broadinstitute.org/software/igv/download), or use any other genome browser of your choice.
Sarah Peter's avatar
Sarah Peter committed
505

506
To copy the results from the cluster to your laptop, run the following in a local terminal (Linux and MacOS) or a MobaXterm local session (Windows) and replace `<your_username>` with your ULHPC user login. For alternative ways to transfer files, see the [documentation on the HPC website](https://hpc.uni.lu/users/docs/filetransfer.html). Pay attention in which directory you are, so you can find the files again.
Sarah Peter's avatar
Sarah Peter committed
507 508 509 510

```bash
(laptop)$> mkdir bioinfo_tutorial
(laptop)$> cd bioinfo_tutorial
511 512 513 514 515 516

# check where you are
(laptop)$> pwd

# transfer the output directory
(laptop)$> rsync -avz --rsh='ssh -p 8022' <your_username>@access-iris.uni.lu:/scratch/users/<your_username>/bioinfo_tutorial/output .
517 518
```

Sarah Peter's avatar
Sarah Peter committed
519
Start IGV and select mouse mm10 as genome in the drop-down menu in the upper left. Go to "File" -> "Load from File…" and select all three files that you have copied from the cluster.
520

521 522 523 524
In the search box enter for example "Ahr" to check the signal around one of the genes highlighted in the paper.

Pay attention to the scale on which the two coverage tracks are shown. You can right-click on the track name on the left and select "Autoscale" to adjust the range.

525 526 527
When you hover over the blocks in the `TC1-ST2-D0.12_peaks.narrowPeak` track, you can see additional information about the called peaks, e.g. p-value. The peaks should be at the transcription start sites, because that is what H3K4me3 marks. Pay attention to the arrows in the "Refseq genes" track to see in which direction transcription goes.

![IGV](img/IGV_annotated.jpg)
528

Sarah Peter's avatar
Sarah Peter committed
529 530


Sarah Peter's avatar
Sarah Peter committed
531 532
## (Optional) Immediately submit all jobs

533
Snakemake has an option to immediately submit all jobs to the cluster and tell the scheduler about the dependencies so they run in the right order. It submits the jobs one-by-one, collecting the job ID of each from the Slurm output, and then forwards those job IDs as dependencies to the follow-up jobs.
534

535
Unfortunately snakemake doesn't parse the job submission message from Slurm cleanly, so the dependency lists look like   ` 'Submitted', 'batch', 'job', '374519', 'Submitted', 'batch', 'job', '374520'` instead of being just a list of the job IDs. Therefore, we need a wrapper script to get the dependencies right.
Sarah Peter's avatar
Sarah Peter committed
536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558

Create a python script called `immediate_submit.py` with the following content:

```python
#!/usr/bin/env python3
import os
import sys

from snakemake.utils import read_job_properties

# last command-line argument is the job script
jobscript = sys.argv[-1]

# all other command-line arguments are the dependencies
dependencies = set(sys.argv[1:-1])

# parse the job script for the job properties that are encoded by snakemake within
job_properties = read_job_properties(jobscript)

# collect all command-line options in an array
cmdline = ["sbatch"]

# set all the slurm submit options as before
559
slurm_args = " -p {partition} -N {nodes} -n {ntasks} -c {ncpus} -t {time} -J {job-name} -o {output} -e {error} ".format(**job_properties["cluster"])
Sarah Peter's avatar
Sarah Peter committed
560 561 562 563 564 565 566 567 568 569 570 571 572 573

cmdline.append(slurm_args)

if dependencies:
    cmdline.append("--dependency")
    # only keep numbers in dependencies list
    dependencies = [ x for x in dependencies if x.isdigit() ]
    cmdline.append("afterok:" + ",".join(dependencies))

cmdline.append(jobscript)

os.system(" ".join(cmdline))
```

574
Besides the dependencies this script now also takes care of all the other Slurm options, so you don't need to define `SLURM_ARGS` anymore in the shell.
575

576 577 578 579 580 581 582
Make the script executable:

```bash
(access)$> chmod +x immediate_submit.py
```

Run snakemake with the following command and replace `<your_username>` with your ULHPC user login:
Sarah Peter's avatar
Sarah Peter committed
583 584

```bash
585
(access)$> snakemake clean
586
(access)$> snakemake --cluster-config cluster.yaml -j 50 -pr --use-conda --immediate-submit --notemp --cluster "/scratch/users/<your_username>/bioinfo_tutorial/immediate_submit.py {dependencies}"
Sarah Peter's avatar
Sarah Peter committed
587 588
```

589 590
With `squeue -u <your_username>` you can check the status of the submitted jobs and see when they all have finished.

Sarah Peter's avatar
Sarah Peter committed
591 592


Sarah Peter's avatar
Sarah Peter committed
593 594
## Useful stuff

595
* To avoid too much overhead in the number of jobs submitted to Slurm, use the`group` directive to group rules that can run together in a single job.
596
* If your workflow runs for longer than just a few minutes, run snakemake inside`screen` or prefix it with `nohup`. This prevents the workflow from stopping when your SSH session get's disconnected.
Sarah Peter's avatar
Sarah Peter committed
597

Sarah Peter's avatar
Sarah Peter committed
598 599


Sarah Peter's avatar
Sarah Peter committed
600 601
## Acknowledgements

602
Many thanks to Aurélien Ginolhac, Cedric Laczny, Nikola de Lange and Roland Krause for their help in developing this tutorial.