Commit 38cf16b8 authored by Roland Krause's avatar Roland Krause

Merge branch 'exercises' into 'master'

Exercises

See merge request !3
parents a7e3883d e4e81fc2
# Exercises
## Quality control
1. Add a rule that generates [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) reports for the two input fastq.gz files.
* The tool `fastqc` is available from the `bioconda` conda channel.
* An example command-line to use `fastqc` is
```bash
$ fastqc chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz
```
* The output are two files in the same directory as the input file, `H3K4-TC1-ST2-D0.12_fastqc.html` and `H3K4-TC1-ST2-D0.12_fastqc.zip`.
* You might want to copy at least the html file to the `output` directory.
* Do not forget to add at least one of the output files to the summary rule or explicitly specify it on the command-line.
2. Add a rule that generates alignment statistics for the two bam files with [Picard](https://broadinstitute.github.io/picard/) [CollectAlignmentSummaryMetrics](https://broadinstitute.github.io/picard/command-line-overview.html#CollectAlignmentSummaryMetrics).
* The tool `picard` is available from the `bioconda` conda channel.
* An example command-line to run `picard` is
```bash
$ picard CollectAlignmentSummaryMetrics R=reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa I=bowtie2/H3K4-TC1-ST2-D0.12.bam O=output/alignment_metrics_H3K4-TC1-ST2-D0.12.txt
```
* You can use the `params` directive again to specify the reference file.
* You might want to save the output that picard prints on the command-line (stderr) to a log file.
## Peak annotation
1. Add a rule that uses [bedtools](https://bedtools.readthedocs.io/en/latest/index.html) [intersect](https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html) to check with which genes the peaks overlap.
* The tool `bedtools` is available from the `bioconda` conda channel.
* You will need a file in BED format that contains the regions of the genes.
```bash
$ cd reference
$ wget "https://webdav-r3lab.uni.lu/public/biocore/snakemake_tutorial/UCSC_mm10_refseq.chromosome.12.bed"
$ cd ..
```
You can get this data also directly from the [UCSC Table Browser](https://genome.ucsc.edu/cgi-bin/hgTables), but you will need to fix the chromosome naming (remove `chr`).
* An example command-line to use `bedtools intersect` is
```bash
$ bedtools intersect -wb -a output/TC1-ST2-D0.12_peaks.narrowPeak -b reference/UCSC_mm10_refseq.chromosome.12.bed
```
* Be aware that `bedtools` writes the output to the command-line (stdout), so you have to manually redirect it to a file with `>`.
* (Optional) Since our peaks are expected to be at transcription start sites (TSS) and not necessarily within the gene body, you might want to extend the regions in the gene BED file with `bedtools slop` to cover a few kb upstream of the gene start.
```bash
$ bedtools slop -l 2000 -r 0 -s -i reference/UCSC_mm10_refseq.chromosome.12.bed -g reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai
```
Create a separate rule for this, but reuse the conda environment with `bedtools`.
* (Optional) Extract the list of unique genes from the `bedtools intersect` output (column 14) using `cut`, `sort` and `uniq`.
2. Add a rule that draws a heatmap of the ChIP signal around the genes using [deepTools](https://deeptools.readthedocs.io/en/latest/index.html).
* The toolsuite `deepTools` is available from the `bioconda` conda channel.
* You will need the `computeMatrix` and `plotHeatmap` tools.
* An example command-line to generate a plot is
```bash
$ computeMatrix scale-regions -S output/TC1-ST2-D0.12_treat_pileup.bigwig -R reference/UCSC_mm10_refseq.chromosome.12.bed -b 3000 -m 5000 -a 1000 -o deeptools/TC1-ST2-D0.12_matrix.gz
$ plotHeatmap -m deeptools/TC1-ST2-D0.12_matrix.gz -o output/TC1-ST2-D0.12_deeptools_plot.png
```
* (Optional) Generate a properly normalised bigwig track first, using `bamCompare`, and use this as input for `computeMatrix`.
```bash
$ bamCompare -b1 bowtie2/H3K4-TC1-ST2-D0.12.bam -b2 bowtie2/INPUT-TC1-ST2-D0.12.bam -o deeptools/TC1-ST2-D0.12_normalised.bigwig -of bigwig --effectiveGenomeSize 120129022
```
* You can put all commands in a single rule or create one rule per command. Since we do not reuse any of the intermediate outputs, a single rule works fine, but otherwise it would be better to split up the rule.
## Shell script into Snakefile
1. Have a look at the following shell script.
The script is adapted from the first few steps of the Galaxy atac-seq tutorial (https://training.galaxyproject.org/training-material/topics/epigenetics/tutorials/atac-seq/tutorial.html)
```bash
#!/bin/sh
wget https://zenodo.org/record/3270536/files/SRR891268_R1.fastq.gz
wget https://zenodo.org/record/3270536/files/SRR891268_R2.fastq.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz
# Alignment
bowtie2-build chr22.fa.gz hg38_chr22
bowtie2 -x hg38_chr22 -1 SRR891268_R1.fastq.gz -2 SRR891268_R2.fastq.gz | \
samtools sort -n - > SRR891268.bam
# Peak Calling
Genrich -t SRR891268.bam -o SRR891268.narrowPeak \
-e "chrM" -f SRR891268.log -m 30 -j\
-a 20 -r \
-k SRR891268.bg
```
Atac-seq is similar to ChIP-seq an epigenomics NGS technique to identify open chromatin. Since the sequencing is in a paired-end mode (instead of single-end like we had in the previous data set from the tutorial) other parameters are needed for the alignment with `bowtie2`.
* `-1` indicates the forward read,
* while `-2` indicates the reverse read.
<br/>
`Genrich` has an advantage over `MACS2` since it is offering to do the necessary filtering steps during peak calling.
* `-e` allows to exclude chromosomes that are specified. In this case the mitochondrial chromosome (chrM).
* `-m` filters low quality reads.
* `-j` specifies that it has to run in the atac-seq mode.
* `-a` is the minimum AUC for a peak.
* `-r` removes PCR duplicates.
* `-k` creates a bedgraph-ish file for pileups.
<br/>
2. Write the script into a Snakefile.
* Create a new directory `atac-seq` in your home directory for this (see the code below)
```bash
$ cd
$ mkdir atac-seq
$ cd atac-seq
```
* The peak caller `Genrich` is available from the `bioconda` conda channel
3. How can you improve the workflow in the context of data management? Think about data structures and log files.
SAMPLE = "TC1-ST2-D0.12"
rule all:
input: f"output/{SAMPLE}_peaks.narrowPeak", f"output/{SAMPLE}_control_lambda.bigwig", f"output/{SAMPLE}_treat_pileup.bigwig", f"output/H3K4-{SAMPLE}_fastqc.html", f"output/INPUT-{SAMPLE}_fastqc.html", f"output/alignment_metrics_H3K4-{SAMPLE}.txt", f"output/alignment_metrics_INPUT-{SAMPLE}.txt"
rule mapping:
input: "chip-seq/{sample}.fastq.gz"
output: "bowtie2/{sample}.bam"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
log: "logs/bowtie2_{sample}.log"
benchmark: "benchmarks/mapping/{sample}.tsv"
conda: "envs/bowtie2.yaml"
shell:
"""
bowtie2 -x {params.idx} -U {input} 2> {log} | \
samtools sort - > {output}
samtools index {output}
"""
rule peak_calling:
input:
control = "bowtie2/INPUT-{sample}.bam",
chip = "bowtie2/H3K4-{sample}.bam"
output:
peaks = "output/{sample}_peaks.narrowPeak",
control_bdg = "macs2/{sample}_control_lambda.bdg",
chip_bdg = "macs2/{sample}_treat_pileup.bdg"
conda: "envs/macs2.yaml"
shell:
"""
macs2 callpeak -t {input.chip} -c {input.control} -f BAM -g mm -n {wildcards.sample} -B -q 0.01 --outdir macs2
cp macs2/{wildcards.sample}_peaks.narrowPeak {output.peaks}
"""
rule bigwig:
input: "macs2/{sample}.bdg"
output: "output/{sample}.bigwig"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai"
conda: "envs/ucsc.yaml"
shell:
"""
bedGraphToBigWig {input} {params.idx} {output}
"""
rule fastqc:
input: "chip-seq/{sample}.fastq.gz"
output:
html = "output/{sample}_fastqc.html",
zip = "chip-seq/{sample}_fastqc.zip"
conda: "envs/fastqc.yaml"
shell:
"""
fastqc {input}
cp chip-seq/{wildcards.sample}_fastqc.html {output.html}
"""
rule alignment_metrics:
input: "bowtie2/{sample}.bam"
output: "output/alignment_metrics_{sample}.txt"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa"
conda: "envs/picard.yaml"
log: "logs/alignment_metrics_{sample}.log"
shell:
"""
picard CollectAlignmentSummaryMetrics R={params.idx} I={input} O={output} 2> {log}
"""
name: fastqc
channels:
- bioconda
dependencies:
- fastqc
name: picard
channels:
- bioconda
dependencies:
- picard
SAMPLE = "TC1-ST2-D0.12"
rule all:
input: f"output/{SAMPLE}_peaks.narrowPeak", f"output/{SAMPLE}_control_lambda.bigwig", f"output/{SAMPLE}_treat_pileup.bigwig", f"output/H3K4-{SAMPLE}_fastqc.html", f"output/INPUT-{SAMPLE}_fastqc.html", f"output/alignment_metrics_H3K4-{SAMPLE}.txt", f"output/alignment_metrics_INPUT-{SAMPLE}.txt", f"output/{SAMPLE}_peaks_intersect_genes.tsv", f"output/{SAMPLE}_deeptools_plot.png"
rule mapping:
input: "chip-seq/{sample}.fastq.gz"
output: "bowtie2/{sample}.bam"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
log: "logs/bowtie2_{sample}.log"
benchmark: "benchmarks/mapping/{sample}.tsv"
conda: "envs/bowtie2.yaml"
shell:
"""
bowtie2 -x {params.idx} -U {input} 2> {log} | \
samtools sort - > {output}
samtools index {output}
"""
rule peak_calling:
input:
control = "bowtie2/INPUT-{sample}.bam",
chip = "bowtie2/H3K4-{sample}.bam"
output:
peaks = "output/{sample}_peaks.narrowPeak",
control_bdg = "macs2/{sample}_control_lambda.bdg",
chip_bdg = "macs2/{sample}_treat_pileup.bdg"
conda: "envs/macs2.yaml"
shell:
"""
macs2 callpeak -t {input.chip} -c {input.control} -f BAM -g mm -n {wildcards.sample} -B -q 0.01 --outdir macs2
cp macs2/{wildcards.sample}_peaks.narrowPeak {output.peaks}
"""
rule bigwig:
input: "macs2/{sample}.bdg"
output: "output/{sample}.bigwig"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai"
conda: "envs/ucsc.yaml"
shell:
"""
bedGraphToBigWig {input} {params.idx} {output}
"""
rule fastqc:
input: "chip-seq/{sample}.fastq.gz"
output:
html = "output/{sample}_fastqc.html",
zip = "chip-seq/{sample}_fastqc.zip"
conda: "envs/fastqc.yaml"
shell:
"""
fastqc {input}
cp chip-seq/{wildcards.sample}_fastqc.html {output.html}
"""
rule alignment_metrics:
input: "bowtie2/{sample}.bam"
output: "output/alignment_metrics_{sample}.txt"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa"
conda: "envs/picard.yaml"
log: "logs/alignment_metrics_{sample}.log"
shell:
"""
picard CollectAlignmentSummaryMetrics R={params.idx} I={input} O={output} 2> {log}
"""
rule slopBed:
input: "reference/UCSC_mm10_refseq.chromosome.12.bed"
output: "reference/UCSC_mm10_refseq.chromosome.12.extended.bed"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai"
conda: "envs/bedtools.yaml"
shell:
"""
bedtools slop -l 2000 -r 0 -s -i {input} -g {params.idx} > {output}
"""
rule intersect:
input:
peaks = "output/{sample}_peaks.narrowPeak",
genes = "reference/UCSC_mm10_refseq.chromosome.12.extended.bed"
output:
full = "output/{sample}_peaks_intersect_genes.tsv",
genes = "output/{sample}_peaks_intersect_genes.genelist.tsv"
conda: "envs/bedtools.yaml"
shell:
"""
bedtools intersect -wb -a {input.peaks} -b {input.genes} > {output.full}
cut -f 14 {output.full} | sort | uniq > {output.genes}
"""
rule plotSignal:
input:
control = "bowtie2/INPUT-{sample}.bam",
chip = "bowtie2/H3K4-{sample}.bam"
output:
bigwig = "deeptools/{sample}_normalised.bigwig",
matrix = "deeptools/{sample}_matrix.gz",
plot = "output/{sample}_deeptools_plot.png"
params:
regions = "reference/UCSC_mm10_refseq.chromosome.12.bed"
conda: "envs/deeptools.yaml"
shell:
"""
bamCompare -b1 {input.chip} -b2 {input.control} -o {output.bigwig} -of bigwig --effectiveGenomeSize 120129022
computeMatrix scale-regions -S {output.bigwig} -R {params.regions} -b 3000 -m 5000 -a 1000 -o {output.matrix}
plotHeatmap -m {output.matrix} -o {output.plot}"
"""
name: bedtools
channels:
- bioconda
dependencies:
- bedtools
name: deeptools
channels:
- bioconda
dependencies:
- deeptools
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment