README.md 27.8 KB
Newer Older
1
# Bioinformatics workflows with snakemake and conda
Sarah Peter's avatar
Sarah Peter committed
2

Sarah Peter's avatar
Sarah Peter committed
3 4
Author: Sarah Peter

5
In this tutorial you will learn how to run a [ChIP-seq](https://en.wikipedia.org/wiki/ChIP-sequencing) analysis with the [snakemake workflow engine](https://snakemake.readthedocs.io/en/stable/).
6

Roland Krause's avatar
Roland Krause committed
7
**Disclaimer:** In order to keep this tutorial simple, we use default parameters for the different tools as much as possible. However, for a real analysis you should always adapt the parameters to your dataset. Also, be aware that the results of some steps might be skewed, because we only work on data from one chromosome.
Sarah Peter's avatar
Sarah Peter committed
8

Sarah Peter's avatar
Add TOC  
Sarah Peter committed
9 10
## Table of contents

Sarah Peter's avatar
Sarah Peter committed
11
1. [Install conda](#conda)
12 13
2. [Setup the environment](#env)
3. [Create snakemake workflow](#snakemake)
Sarah Peter's avatar
Add TOC  
Sarah Peter committed
14 15 16
   1. [Mapping](#mapping)
   2. [Peak calling](#peaks)
   3. [Generate bigWig files for visualisation](#bigwig)
Sarah Peter's avatar
Sarah Peter committed
17 18
   4. [Summary rule](#summary)
4. [Inspect results in IGV](#igv)
Sarah Peter's avatar
Sarah Peter committed
19 20 21 22 23
5. [Exercises](#exercises)
6. [(Optional) Revert the changes to your environment](#revert)
7. [Useful stuff](#useful)
8. [References](#references)
9. [Acknowledgements](#acknowledgements)
24

Sarah Peter's avatar
Sarah Peter committed
25
<a name="conda"></a>
26

Sarah Peter's avatar
Sarah Peter committed
27
## Install conda
28

Sarah Peter's avatar
Sarah Peter committed
29
For this tutorial we will use the [`conda` package manager](https://www.anaconda.com/) to install the required tools.
Sarah Peter's avatar
Sarah Peter committed
30 31 32 33 34 35 36 37 38 39 40 41 42

> Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
>
> Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.
>
> &mdash; <cite>[Conda  manual](https://docs.conda.io/en/latest/index.html)</cite>

It can encapsulate software and packages in environments, so you can have multiple different versions of a software installed at the same time and avoid incompatibilities between different tools. It also has functionality to easily port and replicate environments, which is important to ensure reproducibility of analyses.

**Attention when dealing with sensitive data:** Everyone can very easily contribute installation recipies to the bioconda channel, without verified identity or double-checking from the community. Therefore it's possible to insert malicious software. If you use bioconda when processing sensitive data, you should check the recipes to verify that they install software from the official sources.

We will use conda on two levels in this tutorial. First we use a conda environment to install and run snakemake. Second, inside the snakemake workflow we will define separate conda environments for each step.

Sarah Peter's avatar
Sarah Peter committed
43
### Set up VirtualBox
44

Sarah Peter's avatar
Sarah Peter committed
45
Independent of the operating system you are using, we recommend you set up a VirtualBox as described in our [instructions](virtualbox.md). This ensures that the conda installation does not affect your local environment and all steps in the tutorial work as expected.
Sarah Peter's avatar
Sarah Peter committed
46

Sarah Peter's avatar
Sarah Peter committed
47
In case you still wish to install conda natively on your machine **at your own risk**, you can check [these installation instructions](conda_installation.md).
Sarah Peter's avatar
Sarah Peter committed
48

Sarah Peter's avatar
Sarah Peter committed
49
### Install conda in the VirtualBox VM
Sarah Peter's avatar
Sarah Peter committed
50

Sarah Peter's avatar
Sarah Peter committed
51
* Start the VM in VirtualBox.
Sarah Peter's avatar
Sarah Peter committed
52

Sarah Peter's avatar
Sarah Peter committed
53
* Start the `Powershell` (Windows) or other terminal application (MacOS, Linux).
54

Sarah Peter's avatar
Sarah Peter committed
55
* Connect to the VM with
56

Sarah Peter's avatar
Sarah Peter committed
57
  ```bash
Sarah Peter's avatar
Sarah Peter committed
58
  ssh -p 2222 tutorial@127.0.0.1
Sarah Peter's avatar
Sarah Peter committed
59
  ```
60

Sarah Peter's avatar
Sarah Peter committed
61
* Password is `lcsblcsb` if you have not changed it.
62

Sarah Peter's avatar
Sarah Peter committed
63
Once connected to the VM, follow these steps to install conda:
Sarah Peter's avatar
Add TOC  
Sarah Peter committed
64

Sarah Peter's avatar
Sarah Peter committed
65 66 67
```bash
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ chmod u+x Miniconda3-latest-Linux-x86_64.sh
Sarah Peter's avatar
Sarah Peter committed
68
$ ./Miniconda3-latest-Linux-x86_64.sh
Sarah Peter's avatar
Sarah Peter committed
69
```
Sarah Peter's avatar
Sarah Peter committed
70 71

You need to specify your installation destination, e.g. `/home/tutorial/tools/miniconda3`. You must use the **full** path and can**not** use `$HOME/tools/miniconda3`. Answer `yes` to initialize Miniconda3.
Sarah Peter's avatar
Sarah Peter committed
72 73 74 75 76 77 78

The installation will modify your `.bashrc` to make conda directly available after each login. To activate the changes now, run

```bash
$ source ~/.bashrc
```

Sarah Peter's avatar
Add TOC  
Sarah Peter committed
79 80
<a name="env"></a>

Sarah Peter's avatar
Sarah Peter committed
81
## Setup the environment
Sarah Peter's avatar
Sarah Peter committed
82

Sarah Peter's avatar
Sarah Peter committed
83
1. Update conda to the latest version:
Sarah Peter's avatar
Sarah Peter committed
84
   ```bash
Sarah Peter's avatar
Sarah Peter committed
85
   $ conda update conda
Sarah Peter's avatar
Sarah Peter committed
86
   ```
Sarah Peter's avatar
Sarah Peter committed
87

Sarah Peter's avatar
Sarah Peter committed
88
2. Create a new conda environment and activate it:
Sarah Peter's avatar
Sarah Peter committed
89 90

   ```bash
91 92
   $ conda create -n bioinfo_tutorial
   $ conda activate bioinfo_tutorial
Sarah Peter's avatar
Sarah Peter committed
93
   ```
Sarah Peter's avatar
Sarah Peter committed
94
   After validation of the creation step and once activated, you can see that your prompt will now be prefixed with `(bioinfo_tutorial)` to show which environment is active. For the rest of the tutorial make sure that you always have this environment active.
95

96
3. Make sure Python does not pick up packages in your home directory:
97 98

   ```bash
99
   (bioinfo_tutorial) $ export PYTHONNOUSERSITE=True
100 101
   ```

Sarah Peter's avatar
Sarah Peter committed
102
4. Install snakemake:
Sarah Peter's avatar
Sarah Peter committed
103 104

   ```bash
105
   (bioinfo_tutorial) $ conda install -c bioconda -c conda-forge snakemake-minimal
Sarah Peter's avatar
Sarah Peter committed
106
   ```
107

Sarah Peter's avatar
Sarah Peter committed
108

Sarah Peter's avatar
Sarah Peter committed
109

Sarah Peter's avatar
Add TOC  
Sarah Peter committed
110 111
<a name="snakemake"></a>

Sarah Peter's avatar
Sarah Peter committed
112 113
## Create snakemake workflow

114
> The Snakemake workflow management system is a tool to create **reproducible and scalable** data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.
Sarah Peter's avatar
Sarah Peter committed
115
>
116 117
> &mdash; <cite>[Snakemake manual](https://snakemake.readthedocs.io/en/stable/index.html)</cite>

Sarah Peter's avatar
Sarah Peter committed
118
Snakemake is a very useful tool if you need to combine multiple steps using different software into a coherent workflow. It comes with many features desired for running workflows, like
119 120 121 122 123 124 125 126 127 128 129 130

* ensuring all input and result files are present
* restarting at a failed step
* rerunning (parts of a) pipeline when (some of) the input changed
* support for wildcards to apply a step to a set of files
* automatically parallelising where possible
* software management
* collecting benchmark data
* modularisation
* creating launcher scripts and submitting jobs to the cluster
* creating a visualisation of the workflow steps (see below)

Sarah Peter's avatar
Sarah Peter committed
131
In this tutorial we will analyse [ChIP-seq data](https://www.ebi.ac.uk/ena/data/view/PRJEB20933) from the paper [Gérard D, Schmidt F, Ginolhac A, Schmitz M, Halder R, Ebert P, Schulz MH, Sauter T, Sinkkonen L. Temporal enhancer profiling of parallel lineages identifies AHR and GLIS1 as regulators of mesenchymal multipotency. *Nucleic Acids Research*, Volume 47, Issue 3, 20 February 2019, Pages 1141–1163, https://doi.org/10.1093/nar/gky1240](https://www.ncbi.nlm.nih.gov/pubmed/30544251) published by our colleagues at LSRU.
Sarah Peter's avatar
Sarah Peter committed
132

133 134 135 136
We will set up the following workflow:

![DAG](img/dag.png)

137
To speed up computing time we use source files that only contain sequencing reads that map to chromosome 12. The files for input (control) and H3K4me3 (ChIP) are available on our [WebDAV](https://webdav-r3lab.uni.lu/public/biocore/snakemake_tutorial/) server.
Sarah Peter's avatar
Sarah Peter committed
138

Sarah Peter's avatar
Sarah Peter committed
139
Create a working directory and download the necessary data:
Sarah Peter's avatar
Sarah Peter committed
140 141

```bash
142
$ mkdir bioinfo_tutorial
Sarah Peter's avatar
Sarah Peter committed
143
$ cd bioinfo_tutorial
144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
$ curl https://webdav-r3lab.uni.lu/public/biocore/snakemake_tutorial/snakemake_tutorial_data.tar.gz.gpg -o snakemake_tutorial_data.tar.gz.gpg
$ curl https://webdav-r3lab.uni.lu/public/biocore/snakemake_tutorial/snakemake_tutorial_data.tar.gz.gpg.md5 -o snakemake_tutorial_data.tar.gz.gpg.md5
```

Check the md5 sum:

```bash
$ md5sum -c snakemake_tutorial_data.tar.gz.gpg.md5
snakemake_tutorial_data.tar.gz.gpg: OK
```

Make sure it says `OK`.

Then decrypt the data with gpg and finally unpack it. The passkey or passphrase for decryption is `elixirLU`. In real situations the passphrase should always be transmitted in a secure way and separate from the data!

```bash
$ gpg -o snakemake_tutorial_data.tar.gz --decrypt snakemake_tutorial_data.tar.gz.gpg
161
$ tar -xzf snakemake_tutorial_data.tar.gz
Sarah Peter's avatar
Sarah Peter committed
162 163
```

164
Your working directory should have the following layout now (using the `tree` command):
Sarah Peter's avatar
Sarah Peter committed
165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186

```
.
|-- chip-seq
|   |-- H3K4-TC1-ST2-D0.12.fastq.gz
|   `-- INPUT-TC1-ST2-D0.12.fastq.gz
|-- envs
|   |-- bowtie2.yaml
|   |-- macs2.yaml
|   `-- ucsc.yaml
`-- reference
    |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.1.bt2
    |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.2.bt2
    |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.3.bt2
    |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.4.bt2
    |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa
    |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai
    |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.1.bt2
    `-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.2.bt2
```


Sarah Peter's avatar
Add TOC  
Sarah Peter committed
187 188
<a name="mapping"></a>

Sarah Peter's avatar
Sarah Peter committed
189 190
### Mapping

Sarah Peter's avatar
Sarah Peter committed
191
> In Snakemake, workflows are specified as Snakefiles. Inspired by GNU Make, a Snakefile contains rules that denote how to create output files from input files. Dependencies between rules are handled implicitly, by matching filenames of input files against output files. Thereby wildcards can be used to write general rules.
Sarah Peter's avatar
Sarah Peter committed
192
>
193
> &mdash; <cite>[Snakemake manual - Writing Workflows](https://snakemake.readthedocs.io/en/stable/snakefiles/writing_snakefiles.html)</cite>
194

Sarah Peter's avatar
Sarah Peter committed
195
> Most importantly, a rule can consist of a name (the name is optional and can be left out, creating an anonymous rule), input files, output files, and a shell command to generate the output from the input, i.e.
196 197 198 199 200 201 202
> ```python
> rule NAME:
>     input: "path/to/inputfile", "path/to/other/inputfile"
>     output: "path/to/outputfile", "path/to/another/outputfile"
>     shell: "somecommand {input} {output}"
> ```
> &mdash; <cite>[Snakemake manual - Rules](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html)</cite>
Sarah Peter's avatar
Sarah Peter committed
203

204
For a detailed explanation of rules, have a look at the official [Snakemake tutorial](https://snakemake.readthedocs.io/en/stable/tutorial/basics.html).
Sarah Peter's avatar
Sarah Peter committed
205

206
A basic rule for mapping a fastq file with [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) could look like this:
207 208 209

```python
rule mapping:
Sarah Peter's avatar
Sarah Peter committed
210
  input: "chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz"
211
  output: "bowtie2/H3K4-TC1-ST2-D0.12.bam"
212 213
  shell:
    """
214
    bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
215 216 217 218
    samtools sort - > {output}
    samtools index {output}
    """
```
Sarah Peter's avatar
Sarah Peter committed
219

220
In the `shell` section we first map the fastq file to the mouse reference genome with `bowtie2`. The output is a sam file that is forwarded to [`samtools`](http://www.htslib.org/) using pipes (`|`) for sorting and conversion to bam format. The `\` is just escaping the line break, so we can break the line in the rule without actually breaking the line in the shell command. The output of the sort is then written to the `output` file. Finally we also create an index for the output bam file, again with `samtools`.
Sarah Peter's avatar
Sarah Peter committed
221

222 223 224 225
Create a file called `Snakefile` in the current directory, open it in your favourite editor, e.g.

```bash
$ nano Snakefile
226 227
```

228
Paste the above rule into the file and save it.
229

230
Let us try to run this rule with:
231 232 233

```bash
$ snakemake -pr -j 1
234 235
```

236
We will see that the execution fails, because we do not have `bowtie2` or `samtools` installed (the error is `/bin/bash: bowtie2: command not found`).
237

238
We could now just install `bowtie2` and `samtools` in our current conda environment. We would have to do the same for all other tools that are used in the analysis as well. However, it is much smarter to make proper use of conda environments and separate the tools into individual environments to avoid conflicts. For example `snakemake` itself conflicts with `macs2`, which we will use in a later step for peak calling, because `snakemake` requires Python 3, while `macs2` only works on Python 2.
239 240 241

We will create one conda environment for each `snakemake` rule and then we will tell `snakemake` in each rule which environment it should use. Fortunately, we do not need to create the environments manually, we just need to provide `snakemake` a [yaml](https://yaml.org/spec/1.2/spec.html) file that defines the conda environment. It will then create the environment itself.

242 243 244 245 246 247 248 249 250 251 252
You can easily export existing conda environments to a yaml file with `conda env export` or write the yaml from scratch. For this step the yaml file looks like this:

```yaml
name: bowtie2
channels:
  - bioconda
dependencies:
  - bowtie2
  - samtools
```

253 254
The yaml files for all rules are already included in the tutorial data you downloaded and can be found in the `envs` subdirectory.

255
To specify the yaml file for the conda environment, there is a specific `conda` directive that can be added to the rule. Update your `Snakefile` to match the following (edit the existing rule, do not add a second rule):
256 257 258 259 260 261 262 263

```python
rule mapping:
  input: "chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz"
  output: "bowtie2/H3K4-TC1-ST2-D0.12.bam"
  conda: "envs/bowtie2.yaml"
  shell:
    """
264
    bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
265 266 267 268 269 270 271 272
    samtools sort - > {output}
    samtools index {output}
    """
```

We should now be able to successfully run our mapping rule, if we tell `snakemake` to use conda environments with the `--use-conda` option:

```bash
273
$ snakemake -pr -j 1 --use-conda
274 275 276 277
```

The first run will take a bit longer, because snakemake creates the conda environment. In subsequent runs it will just activate the existing environment. However, it will recognise if the yaml files changes and then recreate the environment.

278
Since we have two fastq files to map, we should generalise the rule with wildcards. Update your `Snakefile` to match the following (edit the existing rule, do not add a second rule):
279 280 281 282 283 284 285 286

```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
  output: "bowtie2/{sample}.bam"
  conda: "envs/bowtie2.yaml"
  shell:
    """
287
    bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
288 289 290 291 292
    samtools sort - > {output}
    samtools index {output}
    """
```

293
Let us try to run this rule:
Sarah Peter's avatar
Sarah Peter committed
294

295 296 297
```bash
$ snakemake -pr -j 1 --use-conda
```
298

299
Now snakemake gives an error, because it does not know how to replace the `{sample}` wildcard. We need to tell it explicitly which output file we want to generate:
300

301
```bash
302 303 304 305 306 307 308
$ snakemake -j 1 -pr --use-conda bowtie2/H3K4-TC1-ST2-D0.12.bam
```

However, since this output file is already present, snakemake will not do anything. So let us specify the output file of our second sample instead:

```bash
$ snakemake -j 1 -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
309
```
310

311 312 313 314 315
If you want to force snakemake to execute the rule again, you can either add the option `-f` or delete the output file.

At this point we have a basic rule, but there are a couple of nice additional directives that we will have a look at in the following.

The first one is the `params` directive, which can be used to define parameters separately from the rule body. Update your `Snakefile` to match the following (edit the existing rule, do not add a second rule):
316 317 318 319 320

```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
  output: "bowtie2/{sample}.bam"
Sarah Peter's avatar
Sarah Peter committed
321
  params:
322
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
Sarah Peter's avatar
Sarah Peter committed
323
  conda: "envs/bowtie2.yaml"
Sarah Peter's avatar
Sarah Peter committed
324 325
  shell:
    """
326
    bowtie2 -x {params.idx} -U {input} | \
327
    samtools sort - > {output}
Sarah Peter's avatar
Sarah Peter committed
328 329 330 331
    samtools index {output}
    """
```

332 333 334
Here we can use it to define the path to the reference and declutter the command-line call. It is possible to use wildcards in `params`, as well as other advanced functionality (e.g. functions) like for `input` and `output`.

Rerunning the rule will not result in any change:
Sarah Peter's avatar
Sarah Peter committed
335 336

```bash
337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359
$ snakemake -j 1 -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
Building DAG of jobs...
Nothing to be done.
Complete log: /home/tutorial/bioinfo_tutorial/.snakemake/log/2020-05-12T153550.679522.snakemake.log
```


The second directive is the `log` directive, which defines a path to permanently store the output of the execution. This is especially useful in this step to store the bowtie2 mapping statistics, which are just written to the command-line (stderr) otherwise. Update your `Snakefile` to match the following (edit the existing rule, do not add a second rule):

```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
  output: "bowtie2/{sample}.bam"
  params:
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
  log: "logs/bowtie2_{sample}.log"
  conda: "envs/bowtie2.yaml"
  shell:
    """
    bowtie2 -x {params.idx} -U {input} 2> {log} | \
    samtools sort - > {output}
    samtools index {output}
    """
Sarah Peter's avatar
Sarah Peter committed
360 361
```

362 363 364
Be aware that we need to explicitly redirect the stderr output of bowtie2 to the log file with `2> {log}`. As you might remember from the introduction of the rule, the regular output (stdout) of bowtie2 is forwarded to samtools with the pipe.

Let us force snakemake to rerun this rule, such that we get the log file:
Sarah Peter's avatar
Sarah Peter committed
365 366

```bash
367
snakemake -f -j 1 -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
368 369
```

370
You can inspect the log file using `cat` or your favourite editor:
371

Sarah Peter's avatar
Sarah Peter committed
372
```bash
373
$ cat logs/bowtie2_INPUT-TC1-ST2-D0.12.log
374 375 376 377 378 379
400193 reads; of these:
  400193 (100.00%) were unpaired; of these:
    1669 (0.42%) aligned 0 times
    379290 (94.78%) aligned exactly 1 time
    19234 (4.81%) aligned >1 times
99.58% overall alignment rate
380 381 382
```

As the final directive we will add `benchmark` to track resource usage. It writes performance measures to a tsv file. Update your `Snakefile` to match the following (edit the existing rule, do not add a second rule):
383

384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410
```python
rule mapping:
  input: "chip-seq/{sample}.fastq.gz"
  output: "bowtie2/{sample}.bam"
  params:
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
  log: "logs/bowtie2_{sample}.log"
  benchmark: "benchmarks/mapping/{sample}.tsv"
  conda: "envs/bowtie2.yaml"
  shell:
    """
    bowtie2 -x {params.idx} -U {input} 2> {log} | \
    samtools sort - > {output}
    samtools index {output}
    """
```

Let us force run our final mapping rule again for both samples, so we get the log files and benchmark reports:

```bash
$ snakemake -f -j 1 -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
$ snakemake -f -j 1 -pr --use-conda bowtie2/H3K4-TC1-ST2-D0.12.bam
```

Check the benchmark report:

```bash
411
$ cat benchmarks/mapping/INPUT-TC1-ST2-D0.12.tsv
Sarah Peter's avatar
Sarah Peter committed
412
s       h:m:s   max_rss max_vms max_uss max_pss io_in io_out mean_load
413
28.0666 0:00:28 214.23  1171.48 204.68  205.76  0.00  0.00   52.47
Sarah Peter's avatar
Sarah Peter committed
414 415
```

416
After this step your working directory should contain the following files:
417 418 419 420 421

```
.
|-- benchmarks
|   `-- mapping
422
|       |-- H3K4-TC1-ST2-D0.12.tsv
423 424
|       `-- INPUT-TC1-ST2-D0.12.tsv
|-- bowtie2
425 426
|   |-- H3K4-TC1-ST2-D0.12.bam
|   |-- H3K4-TC1-ST2-D0.12.bam.bai
427 428
|   |-- INPUT-TC1-ST2-D0.12.bam
|   `-- INPUT-TC1-ST2-D0.12.bam.bai
Sarah Peter's avatar
Sarah Peter committed
429 430 431 432 433 434 435
|-- chip-seq
|   |-- H3K4-TC1-ST2-D0.12.fastq.gz
|   `-- INPUT-TC1-ST2-D0.12.fastq.gz
|-- envs
|   |-- bowtie2.yaml
|   |-- macs2.yaml
|   `-- ucsc.yaml
436
|-- logs
437
|   |-- bowtie2_H3K4-TC1-ST2-D0.12.log
438
|   `-- bowtie2_INPUT-TC1-ST2-D0.12.log
Sarah Peter's avatar
Sarah Peter committed
439 440 441 442 443 444 445 446 447 448
|-- reference
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.1.bt2
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.2.bt2
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.3.bt2
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.4.bt2
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.1.bt2
|   `-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.2.bt2
`-- Snakefile
449 450
```

Sarah Peter's avatar
Add TOC  
Sarah Peter committed
451
<a name="peaks"></a>
Sarah Peter's avatar
Sarah Peter committed
452 453 454

### Peak calling

Sarah Peter's avatar
Sarah Peter committed
455
The next step in the workflow is to call peaks with [`MACS2`](https://github.com/taoliu/MACS). This tells us where there is enrichment of the ChIP versus the input (control).
Sarah Peter's avatar
Sarah Peter committed
456 457 458

You should always choose the peak caller based on how you expect your enriched regions to look like, e.g. narrow or broad peaks.

Aurélien Ginolhac's avatar
Aurélien Ginolhac committed
459
Besides the list of peaks in [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) format, `MACS2` also produces coverage tracks.
460

461 462
Add the following rule to your `Snakefile`:

Sarah Peter's avatar
Sarah Peter committed
463 464
```python
rule peak_calling:
Sarah Peter's avatar
Sarah Peter committed
465
  input:
466 467
    control = "bowtie2/INPUT-{sample}.bam",
    chip = "bowtie2/H3K4-{sample}.bam"
Sarah Peter's avatar
Sarah Peter committed
468
  output:
469
    peaks = "output/{sample}_peaks.narrowPeak",
Sarah Peter's avatar
Sarah Peter committed
470 471
    control_bdg = "macs2/{sample}_control_lambda.bdg",
    chip_bdg = "macs2/{sample}_treat_pileup.bdg"
Sarah Peter's avatar
Sarah Peter committed
472
  conda: "envs/macs2.yaml"
473
  shell:
Sarah Peter's avatar
Sarah Peter committed
474
    """
475 476 477
    macs2 callpeak -t {input.chip} -c {input.control} -f BAM -g mm -n {wildcards.sample} -B -q 0.01 --outdir macs2

    cp macs2/{wildcards.sample}_peaks.narrowPeak {output.peaks}
Sarah Peter's avatar
Sarah Peter committed
478
    """
Sarah Peter's avatar
Sarah Peter committed
479
```
Sarah Peter's avatar
Sarah Peter committed
480

481 482 483 484 485 486 487 488 489 490
The conda environment `envs/macs2.yaml` for this step is:

```yaml
name: macs2
channels:
  - bioconda
dependencies:
  - macs2
```

491 492 493 494 495 496 497
Let us first do a dry-run with with option `-n` to make sure everything is ok:

```bash
$ snakemake -npr -j 1 --use-conda output/TC1-ST2-D0.12_peaks.narrowPeak
```

If there are no errors, remove the `n` to execute the peak calling:
Sarah Peter's avatar
Sarah Peter committed
498 499

```bash
500
$ snakemake -pr -j 1 --use-conda output/TC1-ST2-D0.12_peaks.narrowPeak
Sarah Peter's avatar
Sarah Peter committed
501 502
```

503
Note that snakemake will not run the mapping steps again, since it only runs rules for which the output is not present or the input has changed.
Sarah Peter's avatar
Sarah Peter committed
504

505 506 507 508 509 510 511 512 513 514 515 516 517
After this step your working directory should contain the following files:

```
.
|-- benchmarks
|   `-- mapping
|       |-- H3K4-TC1-ST2-D0.12.tsv
|       `-- INPUT-TC1-ST2-D0.12.tsv
|-- bowtie2
|   |-- H3K4-TC1-ST2-D0.12.bam
|   |-- H3K4-TC1-ST2-D0.12.bam.bai
|   |-- INPUT-TC1-ST2-D0.12.bam
|   `-- INPUT-TC1-ST2-D0.12.bam.bai
Sarah Peter's avatar
Sarah Peter committed
518 519 520 521 522 523 524
|-- chip-seq
|   |-- H3K4-TC1-ST2-D0.12.fastq.gz
|   `-- INPUT-TC1-ST2-D0.12.fastq.gz
|-- envs
|   |-- bowtie2.yaml
|   |-- macs2.yaml
|   `-- ucsc.yaml
525 526 527 528 529 530 531 532 533 534 535 536
|-- logs
|   |-- bowtie2_H3K4-TC1-ST2-D0.12.log
|   `-- bowtie2_INPUT-TC1-ST2-D0.12.log
|-- macs2
|   |-- TC1-ST2-D0.12_control_lambda.bdg
|   |-- TC1-ST2-D0.12_model.r
|   |-- TC1-ST2-D0.12_peaks.narrowPeak
|   |-- TC1-ST2-D0.12_peaks.xls
|   |-- TC1-ST2-D0.12_summits.bed
|   `-- TC1-ST2-D0.12_treat_pileup.bdg
|-- output
|   `-- TC1-ST2-D0.12_peaks.narrowPeak
Sarah Peter's avatar
Sarah Peter committed
537 538 539 540 541 542 543 544 545 546
|-- reference
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.1.bt2
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.2.bt2
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.3.bt2
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.4.bt2
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai
|   |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.1.bt2
|   `-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.2.bt2
`-- Snakefile
547 548
```

Sarah Peter's avatar
Add TOC  
Sarah Peter committed
549
<a name="bigwig"></a>
550 551 552

### Generate bigWig files for visualisation

Aurélien Ginolhac's avatar
Aurélien Ginolhac committed
553
For easier visualisation and faster transfer, we convert the two coverage tracks from the `MACS2` output to [bigWig](https://genome.ucsc.edu/goldenpath/help/bigWig.html) format.
Sarah Peter's avatar
Sarah Peter committed
554

555 556
Add the following rule to your `Snakefile`:

557 558
```python
rule bigwig:
Sarah Peter's avatar
Sarah Peter committed
559 560
  input: "macs2/{sample}.bdg"
  output: "output/{sample}.bigwig"
561
  params:
562
    idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai"
Sarah Peter's avatar
Sarah Peter committed
563
  conda: "envs/ucsc.yaml"
564
  shell:
Sarah Peter's avatar
Sarah Peter committed
565 566 567
    """
    bedGraphToBigWig {input} {params.idx} {output}
    """
568 569
```

570 571 572 573 574 575 576 577 578 579
The conda environment `envs/ucsc.yaml` for this step is:

```yaml
name: ucsc
channels:
  - bioconda
dependencies:
  - ucsc-bedgraphtobigwig
```

Sarah Peter's avatar
Sarah Peter committed
580 581 582
Let's test this step with:

```bash
583
$ snakemake -pr -j 1 --use-conda output/TC1-ST2-D0.12_treat_pileup.bigwig
Sarah Peter's avatar
Sarah Peter committed
584 585 586 587
```

This time snakemake will only run the "bigwig" rule for the one file we specified.

588 589 590 591 592 593 594
After this step the `output` directory should contain the following files:

```
TC1-ST2-D0.12_peaks.narrowPeak
TC1-ST2-D0.12_treat_pileup.bigwig
```

Sarah Peter's avatar
Add TOC  
Sarah Peter committed
595
<a name="summary"></a>
596

Sarah Peter's avatar
Sarah Peter committed
597
### Summary rule
598

599
To avoid always having to specify which output file we want on the command-line, we add one rule with just inputs that defines the result files we want to have in the end. Since by default snakemake executes the first rule in the snakefile, we add this rule as the first one to the top and then we don't need to specify anything additional on the command-line.
Sarah Peter's avatar
Sarah Peter committed
600 601

First, at the **very top** of the Snakefile, define a variable for the name of the sample:
602 603

```python
604
SAMPLE = "TC1-ST2-D0.12"
605 606
```

607
This makes it easier to change the Snakefile and apply it to other datasets. Snakemake is based on Python so we can use Python code inside the Snakefile. We will use [f-Strings](https://docs.python.org/3.6/reference/lexical_analysis.html#f-strings) to include the variable in the file names.
608

Sarah Peter's avatar
Sarah Peter committed
609
Add this rule at the top of the `Snakefile` after the line above:
Sarah Peter's avatar
Sarah Peter committed
610

611 612
```python
rule all:
613
  input: f"output/{SAMPLE}_peaks.narrowPeak", f"output/{SAMPLE}_control_lambda.bigwig", f"output/{SAMPLE}_treat_pileup.bigwig"
614
```
Sarah Peter's avatar
Sarah Peter committed
615

Sarah Peter's avatar
Sarah Peter committed
616
Finally run the workflow again, this time without a specific target file:
Sarah Peter's avatar
Sarah Peter committed
617

Sarah Peter's avatar
Sarah Peter committed
618
```bash
619
$ snakemake --use-conda -pr -j 1
Sarah Peter's avatar
Sarah Peter committed
620 621
```

622 623 624 625 626 627 628 629
After this step the `output` directory should contain the following files:

```
TC1-ST2-D0.12_control_lambda.bigwig
TC1-ST2-D0.12_peaks.narrowPeak
TC1-ST2-D0.12_treat_pileup.bigwig
```

Sarah Peter's avatar
Sarah Peter committed
630
Snakemake can visualise the dependency graph of the workflow with the following command (if `graphviz` is installed):
Sarah Peter's avatar
Sarah Peter committed
631 632

```bash
633
$ snakemake --dag | dot -Tpdf > dag.pdf
Sarah Peter's avatar
Sarah Peter committed
634 635
```

636
![DAG](img/dag.png)
Sarah Peter's avatar
Sarah Peter committed
637

Sarah Peter's avatar
Add TOC  
Sarah Peter committed
638
<a name="igv"></a>
Sarah Peter's avatar
Sarah Peter committed
639 640 641

## Inspect results in IGV

Sarah Peter's avatar
Sarah Peter committed
642
Now that we have completed the workflow, let's have a look at the results.
643

644
For visualisation, [download IGV](http://software.broadinstitute.org/software/igv/download), or use any other genome browser of your choice.
Sarah Peter's avatar
Sarah Peter committed
645

646
Start IGV and select mouse mm10 as genome in the drop-down menu in the upper left. Go to "File" -> "Load from File…" and select all three files in the `output` folder.
647

648 649 650 651
In the search box enter for example "Ahr" to check the signal around one of the genes highlighted in the paper.

Pay attention to the scale on which the two coverage tracks are shown. You can right-click on the track name on the left and select "Autoscale" to adjust the range.

652 653 654
When you hover over the blocks in the `TC1-ST2-D0.12_peaks.narrowPeak` track, you can see additional information about the called peaks, e.g. p-value. The peaks should be at the transcription start sites, because that is what H3K4me3 marks. Pay attention to the arrows in the "Refseq genes" track to see in which direction transcription goes.

![IGV](img/IGV_annotated.jpg)
655

Sarah Peter's avatar
Sarah Peter committed
656 657 658 659 660 661
<a name="exercises"></a>

## Exercises

You can find some exercises for this tutorial [here](exercises.md).

662 663 664 665 666 667 668 669 670 671 672 673 674 675
<a name="revert"></a>

## (Optional) Revert the changes to your environment

If you want to stop conda from always being active:

```bash
(access)$> conda init --reverse
```

In case you want to get rid of conda completely, you can now also delete the directory where you installed it (default is `$HOME/miniconda3`).



Sarah Peter's avatar
Add TOC  
Sarah Peter committed
676
<a name="useful"></a>
Sarah Peter's avatar
Sarah Peter committed
677

Sarah Peter's avatar
Sarah Peter committed
678 679
## Useful stuff

Sarah Peter's avatar
Sarah Peter committed
680
* If `PYTHONNOUSERSITE` is set, Python won’t add the [`user site-packages directory`](https://docs.python.org/3/library/site.html#site.USER_SITE) to [`sys.path`](https://docs.python.org/3/library/sys.html#sys.path). If it's not set, Python will pick up packages from the user site-packages before packages from conda environments. This can lead to errors if package versions are incompatible and you cannot be sure anymore which version of a software/package you are using.
Sarah Peter's avatar
Sarah Peter committed
681

Sarah Peter's avatar
Add TOC  
Sarah Peter committed
682
<a name="references"></a>
Sarah Peter's avatar
Sarah Peter committed
683

684 685 686 687 688 689 690
## References

* [Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.](http://bioinformatics.oxfordjournals.org/content/28/19/2520)
* [Gérard D, Schmidt F, Ginolhac A, Schmitz M, Halder R, Ebert P, Schulz MH, Sauter T, Sinkkonen L. Temporal enhancer profiling of parallel lineages identifies AHR and GLIS1 as regulators of mesenchymal multipotency. *Nucleic Acids Research*, Volume 47, Issue 3, 20 February 2019, Pages 1141–1163, https://doi.org/10.1093/nar/gky1240](https://www.ncbi.nlm.nih.gov/pubmed/30544251)
* [Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. *Nature Methods*. 2012, 9:357-359.](http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1923.html)
* [Zhang et al. Model-based Analysis of ChIP-Seq (MACS). *Genome Biol* (2008) vol. 9 (9) pp. R137](http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=search&term=18798982%5Bpmid%5D)

Sarah Peter's avatar
Add TOC  
Sarah Peter committed
691
<a name="acknowledgements"></a>
692

Sarah Peter's avatar
Sarah Peter committed
693 694
## Acknowledgements

695
Many thanks to Aurélien Ginolhac, Cedric Laczny, Nikola de Lange and Roland Krause for their help in developing this tutorial.