Commit c9821d9b authored by Sarah Peter's avatar Sarah Peter

Add bedtools intersect exercise

parent 91636abc
......@@ -7,7 +7,7 @@
* An example command-line to use `fastqc` is
```
```bash
$ fastqc chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz
```
......@@ -23,7 +23,7 @@
* An example command-line to run `picard` is
```
```bash
$ picard CollectAlignmentSummaryMetrics R=reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa I=bowtie2/H3K4-TC1-ST2-D0.12.bam O=output/alignment_metrics_H3K4-TC1-ST2-D0.12.txt
```
......@@ -31,3 +31,37 @@
* You might want to save the output that picard prints on the command-line (stderr) to a log file.
## Peak annotation
1. Add a rule that uses [bedtools](https://bedtools.readthedocs.io/en/latest/index.html) [intersect](https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html) to check with which genes the peaks overlap.
* The tool `bedtools` is available from the `bioconda` conda channel.
* You will need a file in BED format that contains the regions of the genes.
```bash
$ cd reference
$ wget "https://webdav-r3lab.uni.lu/public/biocore/snakemake_tutorial/UCSC_mm10_refseq.chromosome.12.bed"
$ cd ..
```
You can get this data also directly from the [UCSC Table Browser](https://genome.ucsc.edu/cgi-bin/hgTables), but you will need to fix the chromosome naming (remove `chr`).
* An example command-line to use `bedtools intersect` is
```bash
$ bedtools intersect -wb -a output/TC1-ST2-D0.12_peaks.narrowPeak -b reference/UCSC_mm10_refseq.chromosome.12.bed
```
* Be aware that `bedtools` writes the output to the command-line (stdout), so you have to manually redirect it to a file with `>`.
* (Optional) Since our peaks are expected to be at transcription start sites (TSS) and not necessarily within the gene body, you might want to extend the regions in the gene BED file with `bedtools slop` to cover also the TSS.
```bash
$ bedtools slop -l 2000 -r 0 -s -i reference/UCSC_mm10_refseq.chromosome.12.bed -g reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai
```
Create a separate rule for this, but reuse the conda environment with `bedtools`.
* (Optional) Extract the list of unique genes from the `bedtools intersect` output (column 14) using `cut`, `sort` and `uniq`.
SAMPLE = "TC1-ST2-D0.12"
rule all:
input: f"output/{SAMPLE}_peaks.narrowPeak", f"output/{SAMPLE}_control_lambda.bigwig", f"output/{SAMPLE}_treat_pileup.bigwig", f"output/H3K4-{SAMPLE}_fastqc.html", f"output/INPUT-{SAMPLE}_fastqc.html", f"output/alignment_metrics_H3K4-{SAMPLE}.txt", f"output/alignment_metrics_INPUT-{SAMPLE}.txt", f"output/{SAMPLE}_peaks_intersect_genes.tsv"
rule mapping:
input: "chip-seq/{sample}.fastq.gz"
output: "bowtie2/{sample}.bam"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12"
log: "logs/bowtie2_{sample}.log"
benchmark: "benchmarks/mapping/{sample}.tsv"
conda: "envs/bowtie2.yaml"
shell:
"""
bowtie2 -x {params.idx} -U {input} 2> {log} | \
samtools sort - > {output}
samtools index {output}
"""
rule peak_calling:
input:
control = "bowtie2/INPUT-{sample}.bam",
chip = "bowtie2/H3K4-{sample}.bam"
output:
peaks = "output/{sample}_peaks.narrowPeak",
control_bdg = "macs2/{sample}_control_lambda.bdg",
chip_bdg = "macs2/{sample}_treat_pileup.bdg"
conda: "envs/macs2.yaml"
shell:
"""
macs2 callpeak -t {input.chip} -c {input.control} -f BAM -g mm -n {wildcards.sample} -B -q 0.01 --outdir macs2
cp macs2/{wildcards.sample}_peaks.narrowPeak {output.peaks}
"""
rule bigwig:
input: "macs2/{sample}.bdg"
output: "output/{sample}.bigwig"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai"
conda: "envs/ucsc.yaml"
shell:
"""
bedGraphToBigWig {input} {params.idx} {output}
"""
rule fastqc:
input: "chip-seq/{sample}.fastq.gz"
output:
html = "output/{sample}_fastqc.html",
zip = "chip-seq/{sample}_fastqc.zip"
conda: "envs/fastqc.yaml"
shell:
"""
fastqc {input}
cp chip-seq/{wildcards.sample}_fastqc.html {output.html}
"""
rule alignment_metrics:
input: "bowtie2/{sample}.bam"
output: "output/alignment_metrics_{sample}.txt"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa"
conda: "envs/picard.yaml"
log: "logs/alignment_metrics_{sample}.log"
shell:
"""
picard CollectAlignmentSummaryMetrics R={params.idx} I={input} O={output} 2> {log}
"""
rule slopBed:
input: "reference/UCSC_mm10_refseq.chromosome.12.bed"
output: "reference/UCSC_mm10_refseq.chromosome.12.extended.bed"
params:
idx = "reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai"
conda: "envs/bedtools.yaml"
shell:
"""
bedtools slop -l 2000 -r 0 -s -i {input} -g {params.idx} > {output}
"""
rule intersect:
input:
peaks = "output/{sample}_peaks.narrowPeak",
genes = "reference/UCSC_mm10_refseq.chromosome.12.extended.bed"
output:
full = "output/{sample}_peaks_intersect_genes.tsv",
genes = "output/{sample}_peaks_intersect_genes.genelist.tsv"
conda: "envs/bedtools.yaml"
shell:
"""
bedtools intersect -wb -a {input.peaks} -b {input.genes} > {output.full}
cut -f 14 {output.full} | sort | uniq > {output.genes}
"""
name: bedtools
channels:
- bioconda
dependencies:
- bedtools
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment