# Bioinformatics workflows with snakemake and conda

Author: Sarah Peter

In this tutorial you will learn how to run a ChIP-seq analysis with the [snakemake workflow engine](https://snakemake.readthedocs.io/en/stable/)  on the cluster.

**Disclaimer:** In order to keep this tutorial simple, we use default parameters for the different tools as much as possible. However, for a real analysis you should always adapt the parameters to your dataset. Also, be aware that the results of some steps might be skewed, because we only work on data from one chromosome.

## Set up the environment

For this tutorial we will use the [`conda` package manager](https://www.anaconda.com/) to install the required tools. It encapsulates software and packages in environments, so you can have multiple versions of a tool installed at the same time and avoid incompatibilities between different tools. It also makes it easy to port and replicate environments, which is important to ensure reproducibility of analyses.

We will use conda on two levels in this tutorial. First we use a conda environment to install and run snakemake. Second, inside the snakemake workflow we will define separate conda environments for each step.

1. Start a job on the cluster:

   ```bash
   (laptop)$> ssh iris-cluster
   (access)$> si
   ```

2. Install conda:

   ```bash
   (node)$> mkdir -p $SCRATCH/downloads
   (node)$> cd $SCRATCH/downloads
   (node)$> wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
   (node)$> chmod u+x Miniconda3-latest-Linux-x86_64.sh
   (node)$> ./Miniconda3-latest-Linux-x86_64.sh
   ```
   You need to specify your installation destination, e.g. `/home/users/<your_username>/tools/miniconda3`. You must use the **full** path and can**not** use `$HOME/tools/miniconda3`. Answer `yes` to initialize Miniconda3.

   The installation will modify your `.bashrc` to make conda directly available after each login. To activate the changes now, run

   ```bash
   (node)$> source ~/.bashrc
   ```

   Update conda to the latest version:
   ```bash
   (node)$> conda update conda 
   ```

3. Create a new conda environment and activate it:

   ```bash
   (node)$> conda create -n bioinfo_tutorial
   (node)$> conda activate bioinfo_tutorial
   ```
   You can see that your prompt will now be prefixed with `(bioinfo_tutorial)` to show which environment is active. For the rest of the tutorial make sure that you always have this environment active.
4. Install snakemake:

   ```bash
   (node)$> conda install -c bioconda -c conda-forge snakemake-minimal
   ```
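
To check that the installation worked, you can list your conda environments and print the snakemake version (the exact version you see may differ):

```bash
(node)$> conda env list
(node)$> snakemake --version
```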

## Create snakemake workflow

In this tutorial we will analyse [ChIP-seq data](https://www.ebi.ac.uk/ena/data/view/PRJEB20933) from the paper [Gérard D, Schmidt F, Ginolhac A, Schmitz M, Halder R, Ebert P, Schulz MH, Sauter T, Sinkkonen L. Temporal enhancer profiling of parallel lineages identifies AHR and GLIS1 as regulators of mesenchymal multipotency. *Nucleic Acids Research*, Volume 47, Issue 3, 20 February 2019, Pages 1141–1163, https://doi.org/10.1093/nar/gky1240](https://www.ncbi.nlm.nih.gov/pubmed/30544251) published by our colleagues in LSRU.

To speed up computing time we use source files that only contain sequencing reads that map to chromosome 12. The files for input (control) and H3K4me3 (ChIP) are available on the cluster in the directory `/work/projects/ulhpc-tutorials/bio/snakemake/chip-seq` and the corresponding reference in `/work/projects/ulhpc-tutorials/bio/snakemake/reference`.

We have also already prepared the conda environments for each step in the workflow in `/work/projects/ulhpc-tutorials/bio/snakemake/envs`.

Create a working directory and link the necessary data:

```bash
(node)$> cd $SCRATCH
(node)$> mkdir bioinfo_tutorial
(node)$> cd bioinfo_tutorial
(node)$> ln -s /work/projects/ulhpc-tutorials/bio/snakemake/chip-seq .
(node)$> ln -s /work/projects/ulhpc-tutorials/bio/snakemake/reference .
(node)$> ln -s /work/projects/ulhpc-tutorials/bio/snakemake/envs .
```

### Mapping

> In Snakemake, workflows are specified as Snakefiles. Inspired by GNU Make, a Snakefile contains rules that denote how to create output files from input files. Dependencies between rules are handled implicitly, by matching filenames of input files against output files. Thereby wildcards can be used to write general rules.
> 
> &mdash; <cite>[Snakemake manual - Writing Workflows](https://snakemake.readthedocs.io/en/stable/snakefiles/writing_snakefiles.html)</cite>

> Most importantly, a rule can consist of a name (the name is optional and can be left out, creating an anonymous rule), input files, output files, and a shell command to generate the output from the input, i.e. 
> ```python
> rule NAME:
>     input: "path/to/inputfile", "path/to/other/inputfile"
>     output: "path/to/outputfile", "path/to/another/outputfile"
>     shell: "somecommand {input} {output}"
> ```
> &mdash; <cite>[Snakemake manual - Rules](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html)</cite>



A basic rule for mapping a fastq file with [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) could look like this:

```python
rule mapping:
  input: "chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz"
  output: "bowtie2/H3K4-TC1-ST2-D0.12.bam"
  shell:
    """
    bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
    samtools sort - > {output}
    samtools index {output}
    """
```

Since we have two fastq files to map, we should generalise the rule with wildcards:

```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
  output: "bowtie2/{sample}.bam"
  shell:
    """
    bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
    samtools sort - > {output}
    samtools index {output}
    """
```

Now we need to tell snakemake to use a conda environment with bowtie2 and [samtools](http://www.htslib.org/) inside to run this rule. For this purpose there is a specific `conda` directive that can be added to the rule. It accepts a [yaml](https://yaml.org/spec/1.2/spec.html) file that defines the conda environment. 

```python
conda: "envs/bowtie2.yaml"
```

You can easily export existing conda environments to a yaml file with `conda env export` or write the yaml from scratch. For this step the yaml file looks like this:

```yaml
name: bowtie2
channels:
  - bioconda
dependencies:
  - bowtie2
  - samtools
```
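
If you already have a conda environment with the tools installed, you can use it as a starting point for such a file (a sketch, assuming an existing environment named `bowtie2`; the export pins exact versions and channels, so you will usually want to trim it down to a minimal file like the one above):

```bash
(node)$> conda env export -n bowtie2 > envs/bowtie2.yaml
```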

We also use the `params` directive to define the path to the reference and declutter the command-line call, as well as the `log` directive to define a path to permanently store the output of the execution. This is especially useful in this step to store the bowtie2 mapping statistics, which are just written to the command-line (stderr) otherwise.

To track resource usage we add the `benchmark` directive, which will write performance measures to a tsv file.

Create a file called `Snakefile` in the current directory and open it in your favourite editor, e.g.

```bash
(node)$> nano Snakefile
```

Add the final rule for the mapping:

```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
  output: "bowtie2/{sample}.bam"
  params:
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
  log: "logs/bowtie2_{sample}.log"
  benchmark: "benchmarks/mapping/{sample}.tsv"
  conda: "envs/bowtie2.yaml"
  shell:
    """
    bowtie2 -x {params.idx} -U {input} 2> {log} | \
    samtools sort - > {output}
    samtools index {output}
    """
```

You can test the rule by specifying one of the potential outputs. We first do a dry-run with the option `-n` (the additional options `-p` and `-r` print the shell commands and the reason for each executed rule).

```bash
(node)$> snakemake -npr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
```

If everything is fine we can run the rule to create the file `bowtie2/INPUT-TC1-ST2-D0.12.bam`:

```bash
(node)$> snakemake -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
```

Check the mapping statistics and the benchmark report:

```bash
(node)$> cat logs/bowtie2_INPUT-TC1-ST2-D0.12.log
400193 reads; of these:
  400193 (100.00%) were unpaired; of these:
    1669 (0.42%) aligned 0 times
    379290 (94.78%) aligned exactly 1 time
    19234 (4.81%) aligned >1 times
99.58% overall alignment rate

(node)$> cat benchmarks/mapping/INPUT-TC1-ST2-D0.12.tsv
s       h:m:s   max_rss max_vms max_uss max_pss io_in io_out mean_load
19.1737 0:00:19 262.14  1404.55 258.79  258.94  0.00  0.00   0.00
```



### Peak calling

The next step in the workflow is to call peaks with [MACS2](https://github.com/taoliu/MACS). This tells us where there is enrichment of the ChIP versus the input (control). 

You should always choose the peak caller based on what you expect your enriched regions to look like, e.g. narrow or broad peaks.

Besides the list of peaks in [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) format, MACS2 also produces coverage tracks.

Add the following rule to your `Snakefile`:

```python
rule peak_calling:
  input:
    control = "bowtie2/INPUT-{sample}.bam",
    chip = "bowtie2/H3K4-{sample}.bam"
  output:
    peaks = "output/{sample}_peaks.narrowPeak",
    control_bdg = "macs2/{sample}_control_lambda.bdg",
    chip_bdg = "macs2/{sample}_treat_pileup.bdg"
  conda: "envs/macs2.yaml"
  shell:
    """
    macs2 callpeak -t {input.chip} -c {input.control} -f BAM -g mm -n {wildcards.sample} -B -q 0.01 --outdir macs2

    cp macs2/{wildcards.sample}_peaks.narrowPeak {output.peaks}
    """
```

The conda environment `envs/macs2.yaml` for this step is:

```yaml
name: macs2
channels:
  - bioconda
dependencies:
  - macs2
```

Let's run this step with:

```bash
(node)$> snakemake -pr --use-conda output/TC1-ST2-D0.12_peaks.narrowPeak
```

Note that snakemake will not run the mapping step for `bowtie2/INPUT-TC1-ST2-D0.12.bam` again. It only runs rules for which the output is not present or the input has changed.
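
If you want to see what snakemake would (re)run, or force a step to be recomputed anyway, a dry-run or the `--forcerun` option can be used (a sketch using standard snakemake options):

```bash
# dry-run: reports that there is nothing to be done if all outputs are up to date
(node)$> snakemake -n --use-conda output/TC1-ST2-D0.12_peaks.narrowPeak

# force the mapping rule and everything downstream of it to be recomputed
(node)$> snakemake -pr --use-conda --forcerun mapping output/TC1-ST2-D0.12_peaks.narrowPeak
```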



### Generate bigWig files for visualisation

For easier visualisation and faster transfer, we convert the two coverage tracks from the MACS2 output to [bigWig](https://genome.ucsc.edu/goldenpath/help/bigWig.html) format.

Add the following rule to your `Snakefile`:

```python
rule bigwig:
  input: "macs2/{sample}.bdg"
  output: "output/{sample}.bigwig"
  params:
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai"
  conda: "envs/ucsc.yaml"
  shell:
    """
    bedGraphToBigWig {input} {params.idx} {output}
    """
```

The conda environment `envs/ucsc.yaml` for this step is:

```yaml
name: ucsc
channels:
  - bioconda
dependencies:
  - ucsc-bedgraphtobigwig
```

Let's test this step with:

```bash
(node)$> snakemake -pr --use-conda output/TC1-ST2-D0.12_treat_pileup.bigwig
```

This time snakemake will only run the "bigwig" rule for the one file we specified.



### Summary rule

To avoid always having to specify which output file we want on the command-line, we add one rule with just inputs that defines the result files we want to have in the end. Since snakemake by default executes the first rule in the Snakefile, we add this rule at the very top, and then we don't need to specify anything additional on the command-line.

First, at the **very top** of the Snakefile, define a variable for the name of the sample:

```python
SAMPLE = "TC1-ST2-D0.12"
```

This makes it easier to change the Snakefile and apply it to other datasets. Snakemake is based on Python so we can use Python code inside the Snakefile. We will use [f-Strings](https://docs.python.org/3.6/reference/lexical_analysis.html#f-strings) to include the variable in the file names.

Add this rule at the top of the `Snakefile` after the line above:

```python
rule all:
  input: f"output/{SAMPLE}_peaks.narrowPeak", f"output/{SAMPLE}_control_lambda.bigwig", f"output/{SAMPLE}_treat_pileup.bigwig"
```

Finally run the workflow again, this time without a specific target file:

```bash
(node)$> snakemake --use-conda -pr
```

Snakemake can visualise the dependency graph of the workflow with the following command:

```bash
(node)$> snakemake --dag | dot -Tpng > dag.png
```

![DAG](dag.png)
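
For larger workflows the job-level DAG can become crowded. The `--rulegraph` option draws one node per rule instead of one per job (a sketch, using the same `dot` conversion as above):

```bash
(node)$> snakemake --rulegraph | dot -Tpng > rulegraph.png
```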

## Cluster configuration for snakemake

Until now the workflow just runs on a single CPU on a single machine, which is not very efficient when we have many more resources available. To speed up the computation you should check in the documentation of the software you use how it can scale. For bioinformatics tools the most common option is multithreading.

In this workflow only bowtie2 has the option to run on multiple threads.

### Adjust mapping step to run on multiple threads

We add the `threads` directive to the snakemake rule for the mapping step, to tell snakemake that this step can use multiple threads.

> The specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. `threads = min(threads, cores)` with `cores` being the number of cores specified at the command line (option `-j`).
>
> &mdash; <cite>[Snakemake manual - Threads](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#threads)</cite>

So the value for threads should be the maximum that is reasonable for the respective software. For many tools the speed-up plateaus at a certain number of threads or even starts to decrease again. For a regular bowtie2 run 16 is a good maximum, but for this tutorial we will only go up to 4 because we have a small dataset.

We also need to add the option `-p` to the bowtie2 command-line call, to make it actually use those threads.

Change the mapping rule in your `Snakefile` to the following:

```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
  output: "bowtie2/{sample}.bam"
  params:
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
  log: "logs/bowtie2_{sample}.log"
  benchmark: "benchmarks/mapping/{sample}.tsv"
  conda: "envs/bowtie2.yaml"
  threads: 4
  shell:
    """
    bowtie2 -p {threads} -x {params.idx} -U {input} 2> {log} | \
    samtools sort - > {output}
    samtools index {output}
    """
```

If we want to rerun the workflow to compare different options, we need to delete the output files, otherwise snakemake will not run the steps again. Since we want to do a bit of testing, let's define a rule that removes all the output:

```python
rule clean:
    shell:  
        """
        rm -rf bowtie2/ macs2/ output/
        """
```

**Warning:** Be careful with `rm -rf` and double-check you're deleting the right directories, since it will remove everything without asking.

Run the clean-up rule:

```bash
(node)$> snakemake clean
```

Quit your current job and start a new one with more cores to test the multithreading:

```bash
(node)$> exit
(access)$> srun --cpu-bind=none -p interactive -t 0-0:15:0 -N 1 -c 6 --ntasks-per-node=1 --pty bash -i
(node)$> conda activate bioinfo_tutorial
(node)$> cd $SCRATCH/bioinfo_tutorial
```

Now we also need to tell snakemake that it has multiple cores available and can run steps multithreaded or run multiple tasks in parallel. This is done with the  `-j` option followed by the number of available cores (e.g. the number of cores you have reserved if you run it interactively).

```bash
(node)$> snakemake -j 4 -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
```

You should see in the output that the command-line call of bowtie2 now shows `-p 4`.

Check the benchmark report again:

```bash
(node)$> cat benchmarks/mapping/INPUT-TC1-ST2-D0.12.tsv
s      h:m:s   max_rss max_vms max_uss max_pss io_in io_out mean_load
6.7687 0:00:06 295.01  1728.68 291.64  291.79  0.00  16.00  0.00
```

Notice that the runtime has decreased, but I/O has increased.

**Exercise:** Try several options for `-j` up to the number of cores you reserved (6) and check the bowtie2 command and the values in the benchmark. 
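
One possible way to run this comparison (a sketch, assuming the `clean` rule defined above; each run overwrites the benchmark file, so its content is printed after every iteration):

```bash
(node)$> for j in 1 2 4 6; do
            snakemake clean
            snakemake -j $j -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
            echo "== snakemake -j $j =="
            cat benchmarks/mapping/INPUT-TC1-ST2-D0.12.tsv
         done
```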



### Configure job parameters with `cluster.yaml`

Instead of reserving an interactive job and running snakemake inside that job, we want to use snakemake's cluster functionality to make it submit jobs to SLURM. For this we create a configuration file named `cluster.yaml` to define the values for the different `sbatch` options.

Options under the `__default__` header apply to all rules, but it's possible to override them selectively with rule-specific options.

```yaml
__default__:
  time: "0-00:01:00"
  partition: "batch"
  nodes: 1
  ntasks: 1
  ncpus: 1
  job-name: "{rule}"
  output: "slurm-%j-%x.out"
  error: "slurm-%j-%x.err"
mapping:
  ncpus: 4
```

**Attention:** Be aware that `ncpus` should match the `threads` directive in the respective rule. If `ncpus` is less than `threads`, snakemake will reserve only `ncpus` cores but still run the rule with the number of threads specified by `threads`.



### Run snakemake with cluster configuration

Make sure you quit your job and run the following from the access node.

Now we need to map the variables defined in `cluster.yaml` to the command-line parameters of `sbatch`. Check the documentation on the [HPC website](https://hpc.uni.lu/users/docs/slurm.html#basic-usage-commands) for details about the parameters.

The meaning of the option `-j` changes when running in cluster mode to denote the maximum number of simultaneous jobs.

```bash
(access)$> cd $SCRATCH/bioinfo_tutorial
(access)$> conda activate bioinfo_tutorial
(access)$> snakemake clean

(access)$> SLURM_ARGS="-p {cluster.partition} -N {cluster.nodes} -n {cluster.ntasks} -c {cluster.ncpus} -t {cluster.time} -J {cluster.job-name} -o {cluster.output} -e {cluster.error}"

(access)$> snakemake -j 10 -pr --use-conda --cluster-config cluster.yaml --cluster "sbatch $SLURM_ARGS"
```

Let's have a look at the jobs that SLURM submitted:

```bash
# only job allocations
(access)$> sacct -X --name="mapping","peak_calling","bigwig" --format JobID%15,JobName%15,AllocCPUS,Submit,Start,End,Elapsed

# including all steps
(access)$> sacct --name="mapping","peak_calling","bigwig" --format JobID%15,JobName%15,NTasks,AllocCPUS,Submit,Start,End,Elapsed,MaxVMSize
```

Check the submit and end times to see which jobs were running at the same time and when snakemake waited for jobs to finish.



## Inspect results in IGV

Now that we have completed the workflow, let's have a look at the results.

For visualisation, [download IGV](http://software.broadinstitute.org/software/igv/download), or use any other genome browser of your choice.

To copy the results from the cluster to your laptop, run the following and replace `<your_username>` with your ULHPC user login. Pay attention to which directory you are in, so you can find the files again.

```bash
(laptop)$> mkdir bioinfo_tutorial
(laptop)$> cd bioinfo_tutorial
(laptop)$> rsync -avz iris-cluster:/scratch/users/<your_username>/bioinfo_tutorial/output .
```

Start IGV and select mouse mm10 as genome in the drop-down menu in the upper left. Go to "File" -> "Load from File…" and select all three files that you have copied from the cluster.

In the search box enter for example "Ahr" to check the signal around one of the genes highlighted in the paper.

Pay attention to the scale on which the two coverage tracks are shown. You can right-click on the track name on the left and select "Autoscale" to adjust the range.

TODO: screenshot of IGV



## (Optional) Immediately submit all jobs

Snakemake has an option to immediately submit all jobs to the cluster and tell the scheduler about the dependencies so they run in the right order. It submits the jobs one-by-one, collecting the job ID of each from the slurm output, and then forwards those job IDs as dependencies to the follow-up jobs.

Unfortunately snakemake doesn't parse the job submission message from slurm cleanly, so the dependency lists look like `'Submitted', 'batch', 'job', '374519', 'Submitted', 'batch', 'job', '374520'` instead of being just a list of the job IDs. Therefore, we need a wrapper script to get the dependencies right.

Create a python script called `immediate_submit.py` with the following content:

```python
#!/usr/bin/env python3
import os
import sys

from snakemake.utils import read_job_properties

# last command-line argument is the job script
jobscript = sys.argv[-1]

# all other command-line arguments are the dependencies
dependencies = set(sys.argv[1:-1])

# parse the job script for the job properties that are encoded by snakemake within
job_properties = read_job_properties(jobscript)

# collect all command-line options in an array
cmdline = ["sbatch"]

# set all the slurm submit options as before
slurm_args = " -p {partition} -N {nodes} -n {ntasks} -c {ncpus} -t {time} -J {job-name} -o {output} -e {error} ".format(**job_properties["cluster"])

cmdline.append(slurm_args)

if dependencies:
    cmdline.append("--dependency")
    # only keep numbers in dependencies list
    dependencies = [ x for x in dependencies if x.isdigit() ]
    cmdline.append("afterok:" + ",".join(dependencies))

cmdline.append(jobscript)

os.system(" ".join(cmdline))
```

Besides the dependencies this script now also takes care of all the other slurm options, so you don't need to define `SLURM_ARGS` anymore in the shell.

Make the script executable:

```bash
(access)$> chmod +x immediate_submit.py
```

Run snakemake with the following command and replace `<your_username>` with your ULHPC user login:

```bash
(access)$> snakemake --cluster-config cluster.yaml -j 50 -pr --use-conda --immediate-submit --notemp --cluster "/scratch/users/<your_username>/bioinfo_tutorial/immediate_submit.py {dependencies}"
```

With `squeue -u <your_username>` you can check the status of the submitted jobs and see when they all have finished.
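
For example (a sketch; the format string is optional and only changes how the columns are displayed):

```bash
(access)$> squeue -u <your_username> -o "%.10i %.20j %.10T %.10M"
```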



## Useful stuff

* To avoid too much overhead in the number of jobs submitted to SLURM, use the `group` directive to group rules that can run together in a single job.
* If your workflow runs for longer than just a few minutes, run snakemake inside `screen` or prefix it with `nohup`. This prevents the workflow from stopping when your SSH session gets disconnected (see the example below).
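
For example, to keep the cluster-mode run from the previous sections going after you log out (a sketch; it assumes `SLURM_ARGS` is still defined as above):

```bash
(access)$> nohup snakemake -j 10 -pr --use-conda --cluster-config cluster.yaml --cluster "sbatch $SLURM_ARGS" > snakemake.log 2>&1 &
```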



## Acknowledgements

Many thanks to Nikola de Lange, Aurélien Ginolhac, Roland Krause and Cedric Laczny for their help in developing this tutorial.