# Bioinformatics workflows with snakemake and conda
Author: Sarah Peter
In this tutorial you will learn how to run a [ChIP-seq](https://en.wikipedia.org/wiki/ChIP-sequencing) analysis with the [snakemake workflow engine](https://snakemake.readthedocs.io/en/stable/).
**Disclaimer:** In order to keep this tutorial simple, we use default parameters for the different tools as much as possible. However, for a real analysis you should always adapt the parameters to your dataset. Also, be aware that the results of some steps might be skewed, because we only work on data from one chromosome.
## Table of contents
1. [Install conda](#conda)
2. [Setup the environment](#env)
3. [Create snakemake workflow](#snakemake)
1. [Mapping](#mapping)
2. [Peak calling](#peaks)
3. [Generate bigWig files for visualisation](#bigwig)
4. [Summary rule](#summary)
4. [Inspect results in IGV](#igv)
5. [Exercises](#exercises)
6. [(Optional) Revert the changes to your environment](#revert)
7. [Useful stuff](#useful)
8. [References](#references)
9. [Acknowledgements](#acknowledgements)
## Install conda
For this tutorial we will use the [`conda` package manager](https://www.anaconda.com/) to install the required tools.
> Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
>
> Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.
>
> — [Conda manual](https://docs.conda.io/en/latest/index.html)
It can encapsulate software and packages in environments, so you can have multiple different versions of a software installed at the same time and avoid incompatibilities between different tools. It also has functionality to easily port and replicate environments, which is important to ensure reproducibility of analyses.
**Attention when dealing with sensitive data:** Everyone can very easily contribute installation recipies to the bioconda channel, without verified identity or double-checking from the community. Therefore it's possible to insert malicious software. If you use bioconda when processing sensitive data, you should check the recipes to verify that they install software from the official sources.
We will use conda on two levels in this tutorial. First we use a conda environment to install and run snakemake. Second, inside the snakemake workflow we will define separate conda environments for each step.
### Set up VirtualBox
Independent of the operating system you are using, we recommend you set up a VirtualBox as described in our [instructions](virtualbox.md). This ensures that the conda installation does not affect your local environment and all steps in the tutorial work as expected.
In case you still wish to install conda natively on your machine **at your own risk**, you can check [these installation instructions](conda_installation.md).
### Install conda in the VirtualBox VM
* Start the VM in VirtualBox.
* Start the `Powershell` (Windows) or other terminal application (MacOS, Linux).
* Connect to the VM with
```bash
ssh -p 2222 tutorial@127.0.0.1
```
* Password is `lcsblcsb` if you have not changed it.
Once connected to the VM, follow these steps to install conda:
```bash
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ chmod u+x Miniconda3-latest-Linux-x86_64.sh
$ ./Miniconda3-latest-Linux-x86_64.sh
```
You need to specify your installation destination, e.g. `/home/tutorial/tools/miniconda3`. You must use the **full** path and can**not** use `$HOME/tools/miniconda3`. Answer `yes` to initialize Miniconda3.
The installation will modify your `.bashrc` to make conda directly available after each login. To activate the changes now, run
```bash
$ source ~/.bashrc
```
## Setup the environment
1. Update conda to the latest version:
```bash
$ conda update conda
```
2. Create a new conda environment and activate it:
```bash
$ conda create -n bioinfo_tutorial
$ conda activate bioinfo_tutorial
```
After validation of the creation step and once activated, you can see that your prompt will now be prefixed with `(bioinfo_tutorial)` to show which environment is active. For the rest of the tutorial make sure that you always have this environment active.
3. Make sure Python does not pick up packages in your home directory:
```bash
(bioinfo_tutorial) $ export PYTHONNOUSERSITE=True
```
4. Install snakemake:
```bash
(bioinfo_tutorial) $ conda install -c bioconda -c conda-forge snakemake-minimal
```
## Create snakemake workflow
> The Snakemake workflow management system is a tool to create **reproducible and scalable** data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.
>
> — [Snakemake manual](https://snakemake.readthedocs.io/en/stable/index.html)
Snakemake is a very useful tool if you need to combine multiple steps using different software into a coherent workflow. It comes with many features desired for running workflows, like
* ensuring all input and result files are present
* restarting at a failed step
* rerunning (parts of a) pipeline when (some of) the input changed
* support for wildcards to apply a step to a set of files
* automatically parallelising where possible
* software management
* collecting benchmark data
* modularisation
* creating launcher scripts and submitting jobs to the cluster
* creating a visualisation of the workflow steps (see below)
In this tutorial we will analyse [ChIP-seq data](https://www.ebi.ac.uk/ena/data/view/PRJEB20933) from the paper [Gérard D, Schmidt F, Ginolhac A, Schmitz M, Halder R, Ebert P, Schulz MH, Sauter T, Sinkkonen L. Temporal enhancer profiling of parallel lineages identifies AHR and GLIS1 as regulators of mesenchymal multipotency. *Nucleic Acids Research*, Volume 47, Issue 3, 20 February 2019, Pages 1141–1163, https://doi.org/10.1093/nar/gky1240](https://www.ncbi.nlm.nih.gov/pubmed/30544251) published by our colleagues at LSRU.
We will set up the following workflow:

To speed up computing time we use source files that only contain sequencing reads that map to chromosome 12. The files for input (control) and H3K4me3 (ChIP) are available on our [WebDAV](https://webdav-r3lab.uni.lu/public/biocore/snakemake_tutorial/) server.
Create a working directory and download the necessary data:
```bash
$ mkdir bioinfo_tutorial
$ cd bioinfo_tutorial
$ curl https://webdav-r3lab.uni.lu/public/biocore/snakemake_tutorial/snakemake_tutorial_data.tar.gz.gpg -o snakemake_tutorial_data.tar.gz.gpg
$ curl https://webdav-r3lab.uni.lu/public/biocore/snakemake_tutorial/snakemake_tutorial_data.tar.gz.gpg.md5 -o snakemake_tutorial_data.tar.gz.gpg.md5
```
Check the md5 sum:
```bash
$ md5sum -c snakemake_tutorial_data.tar.gz.gpg.md5
snakemake_tutorial_data.tar.gz.gpg: OK
```
Make sure it says `OK`.
Then decrypt the data with gpg and finally unpack it. The passkey or passphrase for decryption is `elixirLU`. In real situations the passphrase should always be transmitted in a secure way and separate from the data!
```bash
$ gpg -o snakemake_tutorial_data.tar.gz --decrypt snakemake_tutorial_data.tar.gz.gpg
$ tar -xzf snakemake_tutorial_data.tar.gz
```
Your working directory should have the following layout now (using the `tree` command):
```
.
|-- chip-seq
| |-- H3K4-TC1-ST2-D0.12.fastq.gz
| `-- INPUT-TC1-ST2-D0.12.fastq.gz
|-- envs
| |-- bowtie2.yaml
| |-- macs2.yaml
| `-- ucsc.yaml
`-- reference
|-- Mus_musculus.GRCm38.dna_sm.chromosome.12.1.bt2
|-- Mus_musculus.GRCm38.dna_sm.chromosome.12.2.bt2
|-- Mus_musculus.GRCm38.dna_sm.chromosome.12.3.bt2
|-- Mus_musculus.GRCm38.dna_sm.chromosome.12.4.bt2
|-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa
|-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai
|-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.1.bt2
`-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.2.bt2
```
### Mapping
> In Snakemake, workflows are specified as Snakefiles. Inspired by GNU Make, a Snakefile contains rules that denote how to create output files from input files. Dependencies between rules are handled implicitly, by matching filenames of input files against output files. Thereby wildcards can be used to write general rules.
>
> — [Snakemake manual - Writing Workflows](https://snakemake.readthedocs.io/en/stable/snakefiles/writing_snakefiles.html)
> Most importantly, a rule can consist of a name (the name is optional and can be left out, creating an anonymous rule), input files, output files, and a shell command to generate the output from the input, i.e.
> ```python
> rule NAME:
> input: "path/to/inputfile", "path/to/other/inputfile"
> output: "path/to/outputfile", "path/to/another/outputfile"
> shell: "somecommand {input} {output}"
> ```
> — [Snakemake manual - Rules](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html)
For a detailed explanation of rules, have a look at the official [Snakemake tutorial](https://snakemake.readthedocs.io/en/stable/tutorial/basics.html).
A basic rule for mapping a fastq file with [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) could look like this:
```python
rule mapping:
input: "chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz"
output: "bowtie2/H3K4-TC1-ST2-D0.12.bam"
shell:
"""
bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
samtools sort - > {output}
samtools index {output}
"""
```
In the `shell` section we first map the fastq file to the mouse reference genome with `bowtie2`. The output is a sam file that is forwarded to [`samtools`](http://www.htslib.org/) using pipes (`|`) for sorting and conversion to bam format. The `\` is just escaping the line break, so we can break the line in the rule without actually breaking the line in the shell command. The output of the sort is then written to the `output` file. Finally we also create an index for the output bam file, again with `samtools`.
Create a file called `Snakefile` in the current directory, open it in your favourite editor, e.g.
```bash
$ nano Snakefile
```
Paste the above rule into the file and save it.
Let us try to run this rule with:
```bash
$ snakemake -pr -j 1
```
We will see that the execution fails, because we do not have `bowtie2` or `samtools` installed (the error is `/bin/bash: bowtie2: command not found`).
We could now just install `bowtie2` and `samtools` in our current conda environment. We would have to do the same for all other tools that are used in the analysis as well. However, it is much smarter to make proper use of conda environments and separate the tools into individual environments to avoid conflicts. For example `snakemake` itself conflicts with `macs2`, which we will use in a later step for peak calling, because `snakemake` requires Python 3, while `macs2` only works on Python 2.
We will create one conda environment for each `snakemake` rule and then we will tell `snakemake` in each rule which environment it should use. Fortunately, we do not need to create the environments manually, we just need to provide `snakemake` a [yaml](https://yaml.org/spec/1.2/spec.html) file that defines the conda environment. It will then create the environment itself.
You can easily export existing conda environments to a yaml file with `conda env export` or write the yaml from scratch. For this step the yaml file looks like this:
```yaml
name: bowtie2
channels:
- bioconda
dependencies:
- bowtie2
- samtools
```
The yaml files for all rules are already included in the tutorial data you downloaded and can be found in the `envs` subdirectory.
To specify the yaml file for the conda environment, there is a specific `conda` directive that can be added to the rule. Update your `Snakefile` to match the following (edit the existing rule, do not add a second rule):
```python
rule mapping:
input: "chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz"
output: "bowtie2/H3K4-TC1-ST2-D0.12.bam"
conda: "envs/bowtie2.yaml"
shell:
"""
bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
samtools sort - > {output}
samtools index {output}
"""
```
We should now be able to successfully run our mapping rule, if we tell `snakemake` to use conda environments with the `--use-conda` option:
```bash
$ snakemake -pr -j 1 --use-conda
```
The first run will take a bit longer, because snakemake creates the conda environment. In subsequent runs it will just activate the existing environment. However, it will recognise if the yaml files changes and then recreate the environment.
Since we have two fastq files to map, we should generalise the rule with wildcards. Update your `Snakefile` to match the following (edit the existing rule, do not add a second rule):
```python
rule mapping:
input: "chip-seq/{sample}.fastq.gz"
output: "bowtie2/{sample}.bam"
conda: "envs/bowtie2.yaml"
shell:
"""
bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
samtools sort - > {output}
samtools index {output}
"""
```
Let us try to run this rule:
```bash
$ snakemake -pr -j 1 --use-conda
```
Now snakemake gives an error, because it does not know how to replace the `{sample}` wildcard. We need to tell it explicitly which output file we want to generate:
```bash
$ snakemake -j 1 -pr --use-conda bowtie2/H3K4-TC1-ST2-D0.12.bam
```
However, since this output file is already present, snakemake will not do anything. So let us specify the output file of our second sample instead:
```bash
$ snakemake -j 1 -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
```
If you want to force snakemake to execute the rule again, you can either add the option `-f` or delete the output file.
At this point we have a basic rule, but there are a couple of nice additional directives that we will have a look at in the following.
The first one is the `params` directive, which can be used to define parameters separately from the rule body. Update your `Snakefile` to match the following (edit the existing rule, do not add a second rule):
```python
rule mapping:
input: "chip-seq/{sample}.fastq.gz"
output: "bowtie2/{sample}.bam"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
conda: "envs/bowtie2.yaml"
shell:
"""
bowtie2 -x {params.idx} -U {input} | \
samtools sort - > {output}
samtools index {output}
"""
```
Here we can use it to define the path to the reference and declutter the command-line call. It is possible to use wildcards in `params`, as well as other advanced functionality (e.g. functions) like for `input` and `output`.
Rerunning the rule will not result in any change:
```bash
$ snakemake -j 1 -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
Building DAG of jobs...
Nothing to be done.
Complete log: /home/tutorial/bioinfo_tutorial/.snakemake/log/2020-05-12T153550.679522.snakemake.log
```
The second directive is the `log` directive, which defines a path to permanently store the output of the execution. This is especially useful in this step to store the bowtie2 mapping statistics, which are just written to the command-line (stderr) otherwise. Update your `Snakefile` to match the following (edit the existing rule, do not add a second rule):
```python
rule mapping:
input: "chip-seq/{sample}.fastq.gz"
output: "bowtie2/{sample}.bam"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
log: "logs/bowtie2_{sample}.log"
conda: "envs/bowtie2.yaml"
shell:
"""
bowtie2 -x {params.idx} -U {input} 2> {log} | \
samtools sort - > {output}
samtools index {output}
"""
```
Be aware that we need to explicitly redirect the stderr output of bowtie2 to the log file with `2> {log}`. As you might remember from the introduction of the rule, the regular output (stdout) of bowtie2 is forwarded to samtools with the pipe.
Let us force snakemake to rerun this rule, such that we get the log file:
```bash
snakemake -f -j 1 -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
```
You can inspect the log file using `cat` or your favourite editor:
```bash
$ cat logs/bowtie2_INPUT-TC1-ST2-D0.12.log
400193 reads; of these:
400193 (100.00%) were unpaired; of these:
1669 (0.42%) aligned 0 times
379290 (94.78%) aligned exactly 1 time
19234 (4.81%) aligned >1 times
99.58% overall alignment rate
```
As the final directive we will add `benchmark` to track resource usage. It writes performance measures to a tsv file. Update your `Snakefile` to match the following (edit the existing rule, do not add a second rule):
```python
rule mapping:
input: "chip-seq/{sample}.fastq.gz"
output: "bowtie2/{sample}.bam"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
log: "logs/bowtie2_{sample}.log"
benchmark: "benchmarks/mapping/{sample}.tsv"
conda: "envs/bowtie2.yaml"
shell:
"""
bowtie2 -x {params.idx} -U {input} 2> {log} | \
samtools sort - > {output}
samtools index {output}
"""
```
Let us force run our final mapping rule again for both samples, so we get the log files and benchmark reports:
```bash
$ snakemake -f -j 1 -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
$ snakemake -f -j 1 -pr --use-conda bowtie2/H3K4-TC1-ST2-D0.12.bam
```
Check the benchmark report:
```bash
$ cat benchmarks/mapping/INPUT-TC1-ST2-D0.12.tsv
s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load
28.0666 0:00:28 214.23 1171.48 204.68 205.76 0.00 0.00 52.47
```
After this step your working directory should contain the following files:
```
.
|-- benchmarks
| `-- mapping
| |-- H3K4-TC1-ST2-D0.12.tsv
| `-- INPUT-TC1-ST2-D0.12.tsv
|-- bowtie2
| |-- H3K4-TC1-ST2-D0.12.bam
| |-- H3K4-TC1-ST2-D0.12.bam.bai
| |-- INPUT-TC1-ST2-D0.12.bam
| `-- INPUT-TC1-ST2-D0.12.bam.bai
|-- chip-seq
| |-- H3K4-TC1-ST2-D0.12.fastq.gz
| `-- INPUT-TC1-ST2-D0.12.fastq.gz
|-- envs
| |-- bowtie2.yaml
| |-- macs2.yaml
| `-- ucsc.yaml
|-- logs
| |-- bowtie2_H3K4-TC1-ST2-D0.12.log
| `-- bowtie2_INPUT-TC1-ST2-D0.12.log
|-- reference
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.1.bt2
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.2.bt2
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.3.bt2
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.4.bt2
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.1.bt2
| `-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.2.bt2
`-- Snakefile
```
### Peak calling
The next step in the workflow is to call peaks with [`MACS2`](https://github.com/taoliu/MACS). This tells us where there is enrichment of the ChIP versus the input (control).
You should always choose the peak caller based on how you expect your enriched regions to look like, e.g. narrow or broad peaks.
Besides the list of peaks in [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) format, `MACS2` also produces coverage tracks.
Add the following rule to your `Snakefile`:
```python
rule peak_calling:
input:
control = "bowtie2/INPUT-{sample}.bam",
chip = "bowtie2/H3K4-{sample}.bam"
output:
peaks = "output/{sample}_peaks.narrowPeak",
control_bdg = "macs2/{sample}_control_lambda.bdg",
chip_bdg = "macs2/{sample}_treat_pileup.bdg"
conda: "envs/macs2.yaml"
shell:
"""
macs2 callpeak -t {input.chip} -c {input.control} -f BAM -g mm -n {wildcards.sample} -B -q 0.01 --outdir macs2
cp macs2/{wildcards.sample}_peaks.narrowPeak {output.peaks}
"""
```
The conda environment `envs/macs2.yaml` for this step is:
```yaml
name: macs2
channels:
- bioconda
dependencies:
- macs2
```
Let us first do a dry-run with with option `-n` to make sure everything is ok:
```bash
$ snakemake -npr -j 1 --use-conda output/TC1-ST2-D0.12_peaks.narrowPeak
```
If there are no errors, remove the `n` to execute the peak calling:
```bash
$ snakemake -pr -j 1 --use-conda output/TC1-ST2-D0.12_peaks.narrowPeak
```
Note that snakemake will not run the mapping steps again, since it only runs rules for which the output is not present or the input has changed.
After this step your working directory should contain the following files:
```
.
|-- benchmarks
| `-- mapping
| |-- H3K4-TC1-ST2-D0.12.tsv
| `-- INPUT-TC1-ST2-D0.12.tsv
|-- bowtie2
| |-- H3K4-TC1-ST2-D0.12.bam
| |-- H3K4-TC1-ST2-D0.12.bam.bai
| |-- INPUT-TC1-ST2-D0.12.bam
| `-- INPUT-TC1-ST2-D0.12.bam.bai
|-- chip-seq
| |-- H3K4-TC1-ST2-D0.12.fastq.gz
| `-- INPUT-TC1-ST2-D0.12.fastq.gz
|-- envs
| |-- bowtie2.yaml
| |-- macs2.yaml
| `-- ucsc.yaml
|-- logs
| |-- bowtie2_H3K4-TC1-ST2-D0.12.log
| `-- bowtie2_INPUT-TC1-ST2-D0.12.log
|-- macs2
| |-- TC1-ST2-D0.12_control_lambda.bdg
| |-- TC1-ST2-D0.12_model.r
| |-- TC1-ST2-D0.12_peaks.narrowPeak
| |-- TC1-ST2-D0.12_peaks.xls
| |-- TC1-ST2-D0.12_summits.bed
| `-- TC1-ST2-D0.12_treat_pileup.bdg
|-- output
| `-- TC1-ST2-D0.12_peaks.narrowPeak
|-- reference
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.1.bt2
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.2.bt2
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.3.bt2
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.4.bt2
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai
| |-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.1.bt2
| `-- Mus_musculus.GRCm38.dna_sm.chromosome.12.rev.2.bt2
`-- Snakefile
```
### Generate bigWig files for visualisation
For easier visualisation and faster transfer, we convert the two coverage tracks from the `MACS2` output to [bigWig](https://genome.ucsc.edu/goldenpath/help/bigWig.html) format.
Add the following rule to your `Snakefile`:
```python
rule bigwig:
input: "macs2/{sample}.bdg"
output: "output/{sample}.bigwig"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai"
conda: "envs/ucsc.yaml"
shell:
"""
bedGraphToBigWig {input} {params.idx} {output}
"""
```
The conda environment `envs/ucsc.yaml` for this step is:
```yaml
name: ucsc
channels:
- bioconda
dependencies:
- ucsc-bedgraphtobigwig
```
Let's test this step with:
```bash
$ snakemake -pr -j 1 --use-conda output/TC1-ST2-D0.12_treat_pileup.bigwig
```
This time snakemake will only run the "bigwig" rule for the one file we specified.
After this step the `output` directory should contain the following files:
```
TC1-ST2-D0.12_peaks.narrowPeak
TC1-ST2-D0.12_treat_pileup.bigwig
```
### Summary rule
To avoid always having to specify which output file we want on the command-line, we add one rule with just inputs that defines the result files we want to have in the end. Since by default snakemake executes the first rule in the snakefile, we add this rule as the first one to the top and then we don't need to specify anything additional on the command-line.
First, at the **very top** of the Snakefile, define a variable for the name of the sample:
```python
SAMPLE = "TC1-ST2-D0.12"
```
This makes it easier to change the Snakefile and apply it to other datasets. Snakemake is based on Python so we can use Python code inside the Snakefile. We will use [f-Strings](https://docs.python.org/3.6/reference/lexical_analysis.html#f-strings) to include the variable in the file names.
Add this rule at the top of the `Snakefile` after the line above:
```python
rule all:
input: f"output/{SAMPLE}_peaks.narrowPeak", f"output/{SAMPLE}_control_lambda.bigwig", f"output/{SAMPLE}_treat_pileup.bigwig"
```
Finally run the workflow again, this time without a specific target file:
```bash
$ snakemake --use-conda -pr -j 1
```
After this step the `output` directory should contain the following files:
```
TC1-ST2-D0.12_control_lambda.bigwig
TC1-ST2-D0.12_peaks.narrowPeak
TC1-ST2-D0.12_treat_pileup.bigwig
```
Snakemake can visualise the dependency graph of the workflow with the following command (if `graphviz` is installed):
```bash
$ snakemake --dag | dot -Tpdf > dag.pdf
```

## Inspect results in IGV
Now that we have completed the workflow, let's have a look at the results.
For visualisation, [download IGV](http://software.broadinstitute.org/software/igv/download), or use any other genome browser of your choice.
Start IGV and select mouse mm10 as genome in the drop-down menu in the upper left. Go to "File" -> "Load from File…" and select all three files in the `output` folder.
In the search box enter for example "Ahr" to check the signal around one of the genes highlighted in the paper.
Pay attention to the scale on which the two coverage tracks are shown. You can right-click on the track name on the left and select "Autoscale" to adjust the range.
When you hover over the blocks in the `TC1-ST2-D0.12_peaks.narrowPeak` track, you can see additional information about the called peaks, e.g. p-value. The peaks should be at the transcription start sites, because that is what H3K4me3 marks. Pay attention to the arrows in the "Refseq genes" track to see in which direction transcription goes.

## Exercises
You can find some exercises for this tutorial [here](exercises.md).
## (Optional) Revert the changes to your environment
If you want to stop conda from always being active:
```bash
(access)$> conda init --reverse
```
In case you want to get rid of conda completely, you can now also delete the directory where you installed it (default is `$HOME/miniconda3`).
## Useful stuff
* If `PYTHONNOUSERSITE` is set, Python won’t add the [`user site-packages directory`](https://docs.python.org/3/library/site.html#site.USER_SITE) to [`sys.path`](https://docs.python.org/3/library/sys.html#sys.path). If it's not set, Python will pick up packages from the user site-packages before packages from conda environments. This can lead to errors if package versions are incompatible and you cannot be sure anymore which version of a software/package you are using.
## References
* [Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.](http://bioinformatics.oxfordjournals.org/content/28/19/2520)
* [Gérard D, Schmidt F, Ginolhac A, Schmitz M, Halder R, Ebert P, Schulz MH, Sauter T, Sinkkonen L. Temporal enhancer profiling of parallel lineages identifies AHR and GLIS1 as regulators of mesenchymal multipotency. *Nucleic Acids Research*, Volume 47, Issue 3, 20 February 2019, Pages 1141–1163, https://doi.org/10.1093/nar/gky1240](https://www.ncbi.nlm.nih.gov/pubmed/30544251)
* [Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. *Nature Methods*. 2012, 9:357-359.](http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1923.html)
* [Zhang et al. Model-based Analysis of ChIP-Seq (MACS). *Genome Biol* (2008) vol. 9 (9) pp. R137](http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=search&term=18798982%5Bpmid%5D)
## Acknowledgements
Many thanks to Aurélien Ginolhac, Cedric Laczny, Nikola de Lange and Roland Krause for their help in developing this tutorial.