Commit 7438d842 authored by Sarah Peter's avatar Sarah Peter

Starting to build first rule step by step

parent 7876c4f3
......@@ -216,26 +216,28 @@ rule mapping:
"""
```
Since we have two fastq files to map, we should generalise the rule with wildcards:
In the `shell` section we first map the fastq file to the mouse reference genome with `bowtie2`. The output is a sam file that is forwarded to [`samtools`](http://www.htslib.org/) using pipes (`|`) for sorting and conversion to bam format. The `\` is just escaping the line break, so we can break the line in the rule without actually breaking the line in the shell command. The output of the sort is then written to the `output` file. Finally we also create an index for the output bam file, again with `samtools`.
```python
rule mapping:
input: "chip-seq/{sample}.fastq.gz"
output: "bowtie2/{sample}.bam"
shell:
"""
bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
samtools sort - > {output}
samtools index {output}
"""
Create a file called `Snakefile` in the current directory, open it in your favourite editor, e.g.
```bash
$ nano Snakefile
```
Now we need to tell snakemake to use a conda environment with bowtie2 and [samtools](http://www.htslib.org/) inside to run this rule. For this purpose there is a specific `conda` directive that can be added to the rule. It accepts a [yaml](https://yaml.org/spec/1.2/spec.html) file that defines the conda environment.
and then paste the above rule and save the file.
```python
conda: "envs/bowtie2.yaml"
If we now try to run this rule with
```bash
$ snakemake -pr -j 1
```
we will see that the execution fails, because we do not have `bowtie2` or `samtools` installed (the error is `/bin/bash: bowtie2: command not found`).
We could now just install `bowtie2` and `samtools` in our current conda environment. We would have to do the same for all other tools that are used in the analysis as well. However, it is much smarter to make proper use of conda environments and separate the tools into individual environments, to avoid conflicts. For example `snakemake` itself conflicts with `macs2`, which we will use in a later step for peak calling, because `snakemake` requires Python 3, while `macs2` only works on Python 2.
We will create one conda environment for each `snakemake` rule and then we will tell `snakemake` in each rule which environment it should use. Fortunately, we do not need to create the environments manually, we just need to provide `snakemake` a [yaml](https://yaml.org/spec/1.2/spec.html) file that defines the conda environment. It will then create the environment itself.
You can easily export existing conda environments to a yaml file with `conda env export` or write the yaml from scratch. For this step the yaml file looks like this:
```yaml
......@@ -247,6 +249,46 @@ dependencies:
- samtools
```
The yaml files for all rules are already included in the tutorial data you downloaded and can be found in the `envs` subdirectory.
To specify the yaml file for the conda environment, there is a specific `conda` directive that can be added to the rule.
```python
rule mapping:
input: "chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz"
output: "bowtie2/H3K4-TC1-ST2-D0.12.bam"
conda: "envs/bowtie2.yaml"
shell:
"""
bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
samtools sort - > {output}
samtools index {output}
"""
```
We should now be able to successfully run our mapping rule, if we tell `snakemake` to use conda environments with the `--use-conda` option:
```bash
$ snakemake -pr --use-conda
```
The first run will take a bit longer, because snakemake creates the conda environment. In subsequent runs it will just activate the existing environment. However, it will recognise if the yaml files changes and then recreate the environment.
Since we have two fastq files to map, we should generalise the rule with wildcards:
```python
rule mapping:
input: "chip-seq/{sample}.fastq.gz"
output: "bowtie2/{sample}.bam"
conda: "envs/bowtie2.yaml"
shell:
"""
bowtie2 -x reference/Mus_musculus.GRCm38.dna_sm.chromosome.12 -U {input} | \
samtools sort - > {output}
samtools index {output}
"""
```
We also use the `params` directive to define the path to the reference and declutter the command-line call, as well as the `log` directive to define a path to permanently store the output of the execution. This is especially useful in this step to store the bowtie2 mapping statistics, which are just written to the command-line (stderr) otherwise.
To track resource usage we add the `benchmark` directive, which will write performance measures to a tsv file.
......@@ -288,8 +330,6 @@ If everything is fine we can run the rule to create the file `bowtie2/INPUT-TC1-
$ snakemake -pr --use-conda bowtie2/INPUT-TC1-ST2-D0.12.bam
```
The first run will take a bit longer, because snakemake creates the conda environment. In subsequent runs it will just activate the existing environment. However, it will recognise if the yaml files changes and then recreate the environment.
Check the mapping statistics and the benchmark report:
```bash
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment