Commit 80d1282a authored by AntonieV

changes on deseq2 for comparisons of samples across the groups from each antibody, adaptations for snakemake workflow catalog, more plot descriptions for report, cleanup configs, README and ToDo's
parent b780951c
...@@ -22,7 +22,7 @@ resources: ...@@ -22,7 +22,7 @@ resources:
build: GRCh38 build: GRCh38
# for testing data a specific chromosome can be selected # for testing data a specific chromosome can be selected
chromosome: chromosome:
# specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), default: 1.2.2 # specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), e.g. 1.2.2
igenomes_release: 1.2.2 igenomes_release: 1.2.2
# if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9 # if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9
macs-gsize: macs-gsize:
...@@ -45,7 +45,7 @@ params: ...@@ -45,7 +45,7 @@ params:
picard_metrics: picard_metrics:
activate: True activate: True
deseq2: deseq2:
# optional to run vst transform instead of rlog # set to True to use the vst transformation instead of the rlog transformation for the DESeq2 analysis
vst: True vst: True
peak-annotation-analysis: peak-annotation-analysis:
activate: True activate: True
...@@ -53,6 +53,21 @@ params: ...@@ -53,6 +53,21 @@ params:
activate: True activate: True
consensus-peak-analysis: consensus-peak-analysis:
activate: True activate: True
# samtools view parameters:
# if duplicates should be removed in this filtering, add "-F 0x0400" to the params
# if for each read, you only want to retain a single (best) mapping, add "-q 1" to params
# if you would like to restrict analysis to certain regions (e.g. excluding other "blacklisted" regions),
# the -L option is automatically activated if a path to a blacklist of the given genome exists in the
# downloaded "resources/ref/igenomes.yaml" or has been provided via the parameter
# "config['resources']['ref']['blacklist']" in this configuration file
samtools-view-se: "-b -F 0x004"
samtools-view-pe: "-b -F 0x004 -G 0x009 -f 0x001"
plotfingerprint:
# Number of bins sampled from the genome, for which the number of overlapping reads is computed for the fingerprint plot
number-of-samples: 500000
# optional parameters for picard's CollectMultipleMetrics from sorted, filtered and merged bam files in post analysis step
# see https://gatk.broadinstitute.org/hc/en-us/articles/360037594031-CollectMultipleMetrics-Picard-
collect-multiple-metrics: VALIDATION_STRINGENCY=LENIENT
# TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets # TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets
# these cutadapt parameters need to contain the required flag(s) for # these cutadapt parameters need to contain the required flag(s) for
# the type of adapter(s) to trim, i.e.: # the type of adapter(s) to trim, i.e.:
......
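Note on the `samtools-view-se` and `samtools-view-pe` defaults added in this hunk: they combine SAM flag bit masks. The following is a minimal Python sketch of how those masks select reads. It assumes the documented `samtools view` semantics (`-f` requires all bits, `-F` rejects any bit, `-G` rejects only if all bits are set). The function names are hypothetical and not part of the workflow.

```python
# SAM flag bits relevant to the configured filters (values from the SAM specification).
PAIRED = 0x001         # read is paired in sequencing
UNMAPPED = 0x004       # read itself is unmapped
MATE_UNMAPPED = 0x008  # mate of the read is unmapped
DUPLICATE = 0x400      # read is a PCR or optical duplicate


def passes(flag, require=0, exclude_any=0, exclude_all=0):
    """Mimic samtools view -f (require), -F (exclude_any) and -G (exclude_all)."""
    if (flag & require) != require:
        return False
    if flag & exclude_any:
        return False
    if exclude_all and (flag & exclude_all) == exclude_all:
        return False
    return True


def passes_se(flag):
    # "-F 0x004": drop unmapped reads
    return passes(flag, exclude_any=UNMAPPED)


def passes_pe(flag):
    # "-F 0x004 -G 0x009 -f 0x001": keep mapped, paired reads;
    # 0x009 = PAIRED | MATE_UNMAPPED, so paired reads with an unmapped mate are dropped
    return passes(flag, require=PAIRED, exclude_any=UNMAPPED,
                  exclude_all=PAIRED | MATE_UNMAPPED)
```

Adding `-F 0x0400`, as suggested in the config comment, corresponds to including `DUPLICATE` in `exclude_any`.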
...@@ -6,36 +6,36 @@ D 1 SRR1635446 ILLUMINA ...@@ -6,36 +6,36 @@ D 1 SRR1635446 ILLUMINA
E 1 SRR1635447 ILLUMINA E 1 SRR1635447 ILLUMINA
F 1 SRR1635448 ILLUMINA F 1 SRR1635448 ILLUMINA
G 1 SRR1635449 ILLUMINA G 1 SRR1635449 ILLUMINA
H 2 SRR1635450 ILLUMINA H 1 SRR1635450 ILLUMINA
I 1 SRR1635451 ILLUMINA I 1 SRR1635451 ILLUMINA
J 2 SRR1635452 ILLUMINA J 1 SRR1635452 ILLUMINA
K 1 SRR1635453 ILLUMINA K 1 SRR1635453 ILLUMINA
L 2 SRR1635454 ILLUMINA L 1 SRR1635454 ILLUMINA
M 1 SRR1635455 ILLUMINA M 1 SRR1635455 ILLUMINA
N 2 SRR1635456 ILLUMINA N 1 SRR1635456 ILLUMINA
O 1 SRR1635457 ILLUMINA O 1 SRR1635457 ILLUMINA
P 2 SRR1635458 ILLUMINA P 1 SRR1635458 ILLUMINA
Q 1 SRR1635459 ILLUMINA Q 1 SRR1635459 ILLUMINA
R 2 SRR1635460 ILLUMINA R 1 SRR1635460 ILLUMINA
S 1 SRR1635461 ILLUMINA S 1 SRR1635461 ILLUMINA
T 2 SRR1635462 ILLUMINA T 1 SRR1635462 ILLUMINA
U 1 SRR1635463 ILLUMINA U 1 SRR1635463 ILLUMINA
V 2 SRR1635464 ILLUMINA V 1 SRR1635464 ILLUMINA
W 1 SRR1635465 ILLUMINA W 1 SRR1635465 ILLUMINA
X 2 SRR1635466 ILLUMINA X 1 SRR1635466 ILLUMINA
Y 1 SRR1635467 ILLUMINA Y 1 SRR1635467 ILLUMINA
Z 2 SRR1635468 ILLUMINA Z 1 SRR1635468 ILLUMINA
AA 1 SRR1635469 ILLUMINA AA 1 SRR1635469 ILLUMINA
AB 2 SRR1635470 ILLUMINA AB 1 SRR1635470 ILLUMINA
AC 1 SRR1635471 ILLUMINA AC 1 SRR1635471 ILLUMINA
AD 2 SRR1635472 ILLUMINA AD 1 SRR1635472 ILLUMINA
AE 1 SRR1635473 ILLUMINA AE 1 SRR1635473 ILLUMINA
AF 2 SRR1635474 ILLUMINA AF 1 SRR1635474 ILLUMINA
AG 1 SRR1635435 ILLUMINA AG 1 SRR1635435 ILLUMINA
AH 2 SRR1635436 ILLUMINA AH 1 SRR1635436 ILLUMINA
AI 1 SRR1635437 ILLUMINA AI 1 SRR1635437 ILLUMINA
AJ 2 SRR1635438 ILLUMINA AJ 1 SRR1635438 ILLUMINA
AK 1 SRR1635439 ILLUMINA AK 1 SRR1635439 ILLUMINA
AL 2 SRR1635440 ILLUMINA AL 1 SRR1635440 ILLUMINA
AM 1 SRR1635441 ILLUMINA AM 1 SRR1635441 ILLUMINA
AN 2 SRR1635442 ILLUMINA AN 1 SRR1635442 ILLUMINA
...@@ -17,9 +17,9 @@ resources: ...@@ -17,9 +17,9 @@ resources:
release: 101 release: 101
# Genome build # Genome build
build: R64-1-1 build: R64-1-1
# for testing data only chromosome 21 is selected # for testing data a single chromosome can be selected
chromosome: chromosome:
# specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), default: 1.2.2 # specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), e.g. 1.2.2
igenomes_release: 1.2.2 igenomes_release: 1.2.2
# if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9 # if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9
macs-gsize: macs-gsize:
...@@ -42,7 +42,7 @@ params: ...@@ -42,7 +42,7 @@ params:
picard_metrics: picard_metrics:
activate: True activate: True
deseq2: deseq2:
# optional to run vst transform instead of rlog # set to True to use the vst transformation instead of the rlog transformation for the DESeq2 analysis
vst: True vst: True
peak-annotation-analysis: peak-annotation-analysis:
activate: True activate: True
...@@ -50,6 +50,21 @@ params: ...@@ -50,6 +50,21 @@ params:
activate: True activate: True
consensus-peak-analysis: consensus-peak-analysis:
activate: True activate: True
# samtools view parameters:
# if duplicates should be removed in this filtering, add "-F 0x0400" to the params
# if for each read, you only want to retain a single (best) mapping, add "-q 1" to params
# if you would like to restrict analysis to certain regions (e.g. excluding other "blacklisted" regions),
# the -L option is automatically activated if a path to a blacklist of the given genome exists in the
# downloaded "resources/ref/igenomes.yaml" or has been provided via the parameter
# "config['resources']['ref']['blacklist']" in this configuration file
samtools-view-se: "-b -F 0x004"
samtools-view-pe: "-b -F 0x004 -G 0x009 -f 0x001"
plotfingerprint:
# Number of bins sampled from the genome, for which the number of overlapping reads is computed for the fingerprint plot
number-of-samples: 500000
# optional parameters for picard's CollectMultipleMetrics from sorted, filtered and merged bam files in post analysis step
# see https://gatk.broadinstitute.org/hc/en-us/articles/360037594031-CollectMultipleMetrics-Picard-
collect-multiple-metrics: VALIDATION_STRINGENCY=LENIENT
# TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets # TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets
# these cutadapt parameters need to contain the required flag(s) for # these cutadapt parameters need to contain the required flag(s) for
# the type of adapter(s) to trim, i.e.: # the type of adapter(s) to trim, i.e.:
......
...@@ -19,7 +19,7 @@ resources: ...@@ -19,7 +19,7 @@ resources:
build: R64-1-1 build: R64-1-1
# for testing data a specific chromosome can be selected # for testing data a specific chromosome can be selected
chromosome: VII chromosome: VII
# specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), default: 1.2.2 # specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), e.g. 1.2.2
igenomes_release: 1.2.2 igenomes_release: 1.2.2
# if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9 # if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9
macs-gsize: macs-gsize:
...@@ -42,7 +42,7 @@ params: ...@@ -42,7 +42,7 @@ params:
picard_metrics: picard_metrics:
activate: True activate: True
deseq2: deseq2:
# optional to run vst transform instead of rlog # set to True to use the vst transformation instead of the rlog transformation for the DESeq2 analysis
vst: False vst: False
peak-annotation-analysis: peak-annotation-analysis:
activate: True activate: True
...@@ -50,6 +50,21 @@ params: ...@@ -50,6 +50,21 @@ params:
activate: True activate: True
consensus-peak-analysis: consensus-peak-analysis:
activate: True activate: True
# samtools view parameters:
# if duplicates should be removed in this filtering, add "-F 0x0400" to the params
# if for each read, you only want to retain a single (best) mapping, add "-q 1" to params
# if you would like to restrict analysis to certain regions (e.g. excluding other "blacklisted" regions),
# the -L option is automatically activated if a path to a blacklist of the given genome exists in the
# downloaded "resources/ref/igenomes.yaml" or has been provided via the parameter
# "config['resources']['ref']['blacklist']" in this configuration file
samtools-view-se: "-b -F 0x004"
samtools-view-pe: "-b -F 0x004 -G 0x009 -f 0x001"
plotfingerprint:
# Number of bins sampled from the genome, for which the number of overlapping reads is computed for the fingerprint plot
number-of-samples: 500000
# optional parameters for picard's CollectMultipleMetrics from sorted, filtered and merged bam files in post analysis step
# see https://gatk.broadinstitute.org/hc/en-us/articles/360037594031-CollectMultipleMetrics-Picard-
collect-multiple-metrics: VALIDATION_STRINGENCY=LENIENT
# TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets # TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets
# these cutadapt parameters need to contain the required flag(s) for # these cutadapt parameters need to contain the required flag(s) for
# the type of adapter(s) to trim, i.e.: # the type of adapter(s) to trim, i.e.:
......
...@@ -22,7 +22,7 @@ resources: ...@@ -22,7 +22,7 @@ resources:
build: GRCh38 build: GRCh38
# for testing data a specific chromosome can be selected # for testing data a specific chromosome can be selected
chromosome: chromosome:
# specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), default: 1.2.2 # specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), e.g. 1.2.2
igenomes_release: 1.2.2 igenomes_release: 1.2.2
# if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9 # if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9
macs-gsize: macs-gsize:
...@@ -45,7 +45,7 @@ params: ...@@ -45,7 +45,7 @@ params:
picard_metrics: picard_metrics:
activate: True activate: True
deseq2: deseq2:
# optional to run vst transform instead of rlog # set to True to use the vst transformation instead of the rlog transformation for the DESeq2 analysis
vst: True vst: True
peak-annotation-analysis: peak-annotation-analysis:
activate: True activate: True
...@@ -53,6 +53,21 @@ params: ...@@ -53,6 +53,21 @@ params:
activate: True activate: True
consensus-peak-analysis: consensus-peak-analysis:
activate: True activate: True
# samtools view parameters:
# if duplicates should be removed in this filtering, add "-F 0x0400" to the params
# if for each read, you only want to retain a single (best) mapping, add "-q 1" to params
# if you would like to restrict analysis to certain regions (e.g. excluding other "blacklisted" regions),
# the -L option is automatically activated if a path to a blacklist of the given genome exists in the
# downloaded "resources/ref/igenomes.yaml" or has been provided via the parameter
# "config['resources']['ref']['blacklist']" in this configuration file
samtools-view-se: "-b -F 0x004"
samtools-view-pe: "-b -F 0x004 -G 0x009 -f 0x001"
plotfingerprint:
# Number of bins sampled from the genome, for which the number of overlapping reads is computed for the fingerprint plot
number-of-samples: 500000
# optional parameters for picard's CollectMultipleMetrics from sorted, filtered and merged bam files in post analysis step
# see https://gatk.broadinstitute.org/hc/en-us/articles/360037594031-CollectMultipleMetrics-Picard-
collect-multiple-metrics: VALIDATION_STRINGENCY=LENIENT
# TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets # TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets
# these cutadapt parameters need to contain the required flag(s) for # these cutadapt parameters need to contain the required flag(s) for
# the type of adapter(s) to trim, i.e.: # the type of adapter(s) to trim, i.e.:
......
...@@ -22,7 +22,7 @@ resources: ...@@ -22,7 +22,7 @@ resources:
build: GRCh38 build: GRCh38
# for testing data a specific chromosome can be selected # for testing data a specific chromosome can be selected
chromosome: 21 chromosome: 21
# specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), default: 1.2.2 # specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), e.g. 1.2.2
igenomes_release: 1.2.2 igenomes_release: 1.2.2
# if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9 # if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9
macs-gsize: macs-gsize:
...@@ -45,7 +45,7 @@ params: ...@@ -45,7 +45,7 @@ params:
picard_metrics: picard_metrics:
activate: True activate: True
deseq2: deseq2:
# optional to run vst transform instead of rlog # set to True to use the vst transformation instead of the rlog transformation for the DESeq2 analysis
vst: True vst: True
peak-annotation-analysis: peak-annotation-analysis:
activate: True activate: True
...@@ -53,6 +53,21 @@ params: ...@@ -53,6 +53,21 @@ params:
activate: True activate: True
consensus-peak-analysis: consensus-peak-analysis:
activate: True activate: True
# samtools view parameters:
# if duplicates should be removed in this filtering, add "-F 0x0400" to the params
# if for each read, you only want to retain a single (best) mapping, add "-q 1" to params
# if you would like to restrict analysis to certain regions (e.g. excluding other "blacklisted" regions),
# the -L option is automatically activated if a path to a blacklist of the given genome exists in the
# downloaded "resources/ref/igenomes.yaml" or has been provided via the parameter
# "config['resources']['ref']['blacklist']" in this configuration file
samtools-view-se: "-b -F 0x004"
samtools-view-pe: "-b -F 0x004 -G 0x009 -f 0x001"
plotfingerprint:
# Number of bins sampled from the genome, for which the number of overlapping reads is computed for the fingerprint plot
number-of-samples: 500000
# optional parameters for picard's CollectMultipleMetrics from sorted, filtered and merged bam files in post analysis step
# see https://gatk.broadinstitute.org/hc/en-us/articles/360037594031-CollectMultipleMetrics-Picard-
collect-multiple-metrics: VALIDATION_STRINGENCY=LENIENT
# TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets # TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets
# these cutadapt parameters need to contain the required flag(s) for # these cutadapt parameters need to contain the required flag(s) for
# the type of adapter(s) to trim, i.e.: # the type of adapter(s) to trim, i.e.:
......
# Snakemake workflow: chipseq # Snakemake workflow: chipseq
[![Snakemake](https://img.shields.io/badge/snakemake-≥5.14.0-brightgreen.svg)](https://snakemake.bitbucket.io) [![Snakemake](https://img.shields.io/badge/snakemake-≥6.4.0-brightgreen.svg)](https://snakemake.github.io)
[![Build Status](https://travis-ci.org/snakemake-workflows/chipseq.svg?branch=master)](https://travis-ci.org/snakemake-workflows/chipseq) [![GitHub actions status](https://github.com/snakemake-workflows/chipseq/workflows/Tests/badge.svg?branch=master)](https://github.com/snakemake-workflows/chipseq/actions?query=branch%3Amaster+workflow%3ATests)
This is the template for a new Snakemake workflow. Replace this text with a comprehensive description covering the purpose and domain. This workflow is a Snakemake port of the [nextflow chipseq pipeline](https://nf-co.re/chipseq) and performs ChIP-seq peak-calling, QC and differential analysis.
Insert your code into the respective folders, i.e. `scripts`, `rules`, and `envs`. Define the entry point of the workflow in the `Snakefile` and the main configuration in the `config.yaml` file.
## Authors
* Antonie Vietor (@AntonieV)
## Usage ## Usage
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and, if available, its DOI (see above). The usage of this workflow is described in the [Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog/?usage=snakemake-workflows/chipseq).
### Step 1: Obtain a copy of this workflow
1. Create a new github repository using this workflow [as a template](https://help.github.com/en/articles/creating-a-repository-from-a-template).
2. [Clone](https://help.github.com/en/articles/cloning-a-repository) the newly created repository to your local system, into the place where you want to perform the data analysis.
### Step 2: Configure workflow
Configure the workflow according to your needs via editing the files in the `config/` folder. Adjust `config.yaml` to configure the workflow execution, and `samples.tsv` to specify your sample setup.
### Step 3: Install Snakemake
Install Snakemake using [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html):
conda create -c bioconda -c conda-forge -n snakemake snakemake
For installation details, see the [instructions in the Snakemake documentation](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html).
### Step 4: Execute workflow
Activate the conda environment:
conda activate snakemake
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Execute the workflow locally via
snakemake --use-conda --cores $N
using `$N` cores or run it in a cluster environment via
snakemake --use-conda --cluster qsub --jobs 100
or
snakemake --use-conda --drmaa --jobs 100
If you not only want to fix the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above.
See the [Snakemake documentation](https://snakemake.readthedocs.io/en/stable/executable.html) for further details.
### Step 5: Investigate results
After successful execution, you can create a self-contained interactive HTML report with all results via:
snakemake --report report.html
This report can, e.g., be forwarded to your collaborators.
An example (using some trivial test data) can be seen [here](https://cdn.rawgit.com/snakemake-workflows/rna-seq-kallisto-sleuth/master/.test/report.html).
### Step 6: Commit changes
Whenever you change something, don't forget to commit the changes back to your github copy of the repository:
git commit -a
git push
### Step 7: Obtain updates from upstream
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.
1. Once, register the upstream repository in your local copy: `git remote add -f upstream git@github.com:snakemake-workflows/chipseq.git` or `git remote add -f upstream https://github.com/snakemake-workflows/chipseq.git` if you do not have setup ssh keys.
2. Update the upstream version: `git fetch upstream`.
3. Create a diff with the current version: `git diff HEAD upstream/master workflow > upstream-changes.diff`.
4. Investigate the changes: `vim upstream-changes.diff`.
5. Apply the modified diff via: `git apply upstream-changes.diff`.
6. Carefully check whether you need to update the config files: `git diff HEAD upstream/master config`. If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
### Step 8: Contribute back
In case you have also changed or added steps, please consider contributing them back to the original repository:
1. [Fork](https://help.github.com/en/articles/fork-a-repo) the original repo to a personal or lab account.
2. [Clone](https://help.github.com/en/articles/cloning-a-repository) the fork to your local system, to a different place than where you ran your analysis.
3. Copy the modified files from your analysis to the clone of your fork, e.g., `cp -r workflow path/to/fork`. Make sure to **not** accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary.
4. Commit and push your changes to your fork.
5. Create a [pull request](https://help.github.com/en/articles/creating-a-pull-request) against the original repository.
## Testing
Test cases are in the subfolder `.test`. They are automatically executed via continuous integration with [Github Actions](https://github.com/features/actions).
If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository and its DOI (see above).
# General settings
To configure this workflow, modify ``config/config.yaml`` according to your needs, following the explanations provided in the file.
# Sample sheet
Add samples to `config/samples.tsv`. For each sample, the columns `sample`, `group`, `control`, and `antibody` have to be defined.
* Samples / IP (immunoprecipitations) within the same `group` represent replicates and must have the same antibody and the same control.
* Controls / Input are listed like samples, but they do not have entries in the columns for `control` and `antibody`.
* The identifier of each control has to be noted in the column `sample`.
* For all samples, the identifiers of the corresponding controls have to be given in the `control` column (see example below).
**Sample sheet example**:
* Samples / IP: A, B and C
* Controls / Input: D and E
| sample | group | batch_effect | control | antibody |
|--------|--------|--------------|---------|----------|
| A | TNFa | batch1 | D | p65 |
| B | TNFa | batch2 | D | p65 |
| C | E2TNFa | batch1 | E | p65 |
| D | TNFa | batch1 | | |
| E | E2TNFa | batch1 | | |
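The rules for the sample sheet can be checked before running the workflow. The following is a minimal sketch using only the Python standard library. It is not the workflow's own validation, and the function names `read_sheet` and `check_sample_sheet` are hypothetical. It treats every row without `control` and `antibody` entries as a control, as described above.

```python
import csv
import io

# The example sheet from above, tab-separated as in config/samples.tsv.
EXAMPLE_SHEET = (
    "sample\tgroup\tbatch_effect\tcontrol\tantibody\n"
    "A\tTNFa\tbatch1\tD\tp65\n"
    "B\tTNFa\tbatch2\tD\tp65\n"
    "C\tE2TNFa\tbatch1\tE\tp65\n"
    "D\tTNFa\tbatch1\t\t\n"
    "E\tE2TNFa\tbatch1\t\t\n"
)


def read_sheet(text):
    """Parse a tab-separated sample sheet into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def check_sample_sheet(rows):
    """Return a list of problems; an empty list means all rules hold."""
    problems = []
    # Controls are the rows without entries for `control` and `antibody`.
    controls = {r["sample"] for r in rows if not r["control"] and not r["antibody"]}
    seen = {}  # group -> (control, antibody) of the first IP sample in that group
    for r in rows:
        if r["sample"] in controls:
            continue
        if r["control"] not in controls:
            problems.append(f"{r['sample']}: control {r['control']!r} is not listed as a control")
        pair = (r["control"], r["antibody"])
        # Replicates within a group must share the same control and antibody.
        if seen.setdefault(r["group"], pair) != pair:
            problems.append(f"{r['sample']}: differs from other replicates in group {r['group']!r}")
    return problems


print(check_sample_sheet(read_sheet(EXAMPLE_SHEET)))
```

To check a real sheet, pass the contents of `config/samples.tsv` to `read_sheet` instead of `EXAMPLE_SHEET`.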
# Unit sheet
For each sample, add one or more sequencing units (runs or lanes) to the unit sheet `config/units.tsv`. For each unit, the columns `sample`, `unit`, `platform` and either `fq1` (single-end reads) or `fq1` and `fq2` (paired-end reads) or `sra_accession` have to be defined.
* Each unit has a name specified in column `unit`, which can be e.g. a running number, or an actual run, lane or replicate id.
* Each unit has a `sample` name, which associates it with the biological sample it comes from.
* For single-end reads define for each unit either a path to a FASTQ file (column `fq1`) or define an SRA (sequence read archive) accession (starting with e.g. ERR or SRR) by using the column `sra_accession`.
* For paired-end reads define for each unit either two paths to FASTQ files (columns `fq1`, `fq2`) or define an SRA accession (column `sra_accession`).
* In case SRA accession numbers are used, the pipeline will automatically download the corresponding reads from SRA. If both local files and SRA accession are available, the local files will be preferred.
* The platform column needs to contain the used sequencing platform (one of 'CAPILLARY', 'LS454', 'ILLUMINA', 'SOLID', 'HELICOS', 'IONTORRENT', 'ONT', 'PACBIO').
**Unit sheet example for single-end reads:**
| sample | unit | fq1 | fq2 | sra_accession | platform |
|--------|------|----------------------|-----|---------------|----------|
| A | 1 | data/A-run1.fastq.gz | | | ILLUMINA |
| B | 1 | data/B-run1.fastq.gz | | | ILLUMINA |
| B | 2 | data/B-run2.fastq.gz | | | ILLUMINA |
| C | 1 | data/C-run1.fastq.gz | | | ILLUMINA |
**Unit sheet example for paired-end reads:**
| sample | unit | fq1 | fq2 | sra_accession | platform |
|--------|------|------------------------|------------------------|---------------|----------|
| A | 1 | data/A-run1_1.fastq.gz | data/A-run1_2.fastq.gz | | ILLUMINA |
| B | 1 | data/B-run1_1.fastq.gz | data/B-run1_2.fastq.gz | | ILLUMINA |
| B | 2 | data/B-run2_1.fastq.gz | data/B-run2_2.fastq.gz | | ILLUMINA |
| C | 1 | data/C-run1_1.fastq.gz | data/C-run1_2.fastq.gz | | ILLUMINA |
**Unit sheet example with SRA download (single-end or paired-end reads):**
| sample | unit | fq1 | fq2 | sra_accession | platform |
|--------|------|-----|-----|---------------|----------|
| A | 1 | | | SRR1635456 | ILLUMINA |
| B | 1 | | | SRR1635457 | ILLUMINA |
| B | 2 | | | SRR1635458 | ILLUMINA |
| C | 1 | | | SRR1635439 | ILLUMINA |
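The precedence between local FASTQ files and SRA accessions can be sketched as a small lookup. This is an illustrative Python sketch, not the workflow's actual implementation, and the function name `reads_source` is hypothetical. Falling back to the SRA accession when a paired-end unit lacks `fq2` is an assumption of this sketch.

```python
PLATFORMS = {"CAPILLARY", "LS454", "ILLUMINA", "SOLID",
             "HELICOS", "IONTORRENT", "ONT", "PACBIO"}


def reads_source(unit, single_end):
    """Decide where the reads of one unit come from.

    `unit` is one row of units.tsv as a dict; empty cells are "" or None.
    Returns ("local", [paths]) or ("sra", [accession]).
    """
    if unit.get("platform") not in PLATFORMS:
        raise ValueError(f"unknown platform: {unit.get('platform')!r}")
    fq1 = (unit.get("fq1") or "").strip()
    fq2 = (unit.get("fq2") or "").strip()
    sra = (unit.get("sra_accession") or "").strip()
    # Local files are preferred whenever they are complete for the read type.
    if fq1 and (single_end or fq2):
        return ("local", [fq1] if single_end else [fq1, fq2])
    if sra:
        return ("sra", [sra])
    raise ValueError("unit needs fq1 (and fq2 for paired-end reads) or sra_accession")


unit = {"sample": "A", "unit": "1", "fq1": "data/A-run1.fastq.gz", "fq2": "",
        "sra_accession": "SRR1635456", "platform": "ILLUMINA"}
print(reads_source(unit, single_end=True))
```

Here the unit has both a local file and an accession, so the local file is used for single-end reads.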
# This file should contain everything to configure the workflow on a global scale. # This file contains everything to configure the workflow on a global scale.
# In case of sample based data, it should be complemented by a samples.tsv file that contains # The sample based data must be complemented by a samples.tsv file that contains
# one row per sample. It can be parsed easily via pandas. # one row per sample. It can be parsed easily via pandas.
samples: "config/samples.tsv" samples: "config/samples.tsv"
units: "config/units.tsv" # to download reads from SRA the accession numbers (see https://www.ncbi.nlm.nih.gov/sra) of samples must be given in units.tsv # to download reads from SRA the accession numbers (see https://www.ncbi.nlm.nih.gov/sra) of samples must be given in units.tsv
units: "config/units.tsv"
single_end: False single_end: False
resources: resources:
...@@ -16,9 +17,9 @@ resources: ...@@ -16,9 +17,9 @@ resources:
release: 101 release: 101
# Genome build # Genome build
build: R64-1-1 build: R64-1-1
# for testing data only chromosome 21 is selected # for testing data a specific chromosome can be selected
chromosome: chromosome:
# specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), default: 1.2.2 # specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), e.g. 1.2.2
igenomes_release: 1.2.2 igenomes_release: 1.2.2
# if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9 # if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9
macs-gsize: macs-gsize:
...@@ -41,7 +42,7 @@ params: ...@@ -41,7 +42,7 @@ params:
picard_metrics: picard_metrics:
activate: True activate: True
deseq2: deseq2:
# optional to run vst transform instead of rlog # set to True to use the vst transformation instead of the rlog transformation for the DESeq2 analysis
vst: False vst: False
peak-annotation-analysis: peak-annotation-analysis:
activate: True activate: True
...@@ -53,9 +54,17 @@ params: ...@@ -53,9 +54,17 @@ params:
# if duplicates should be removed in this filtering, add "-F 0x0400" to the params # if duplicates should be removed in this filtering, add "-F 0x0400" to the params
# if for each read, you only want to retain a single (best) mapping, add "-q 1" to params # if for each read, you only want to retain a single (best) mapping, add "-q 1" to params
# if you would like to restrict analysis to certain regions (e.g. excluding other "blacklisted" regions), # if you would like to restrict analysis to certain regions (e.g. excluding other "blacklisted" regions),
# the -L option is automatically activated if a path to a blacklist of the given genome exists in "config/igenomes.yaml" or has been entered there # the -L option is automatically activated if a path to a blacklist of the given genome exists in the
# downloaded "resources/ref/igenomes.yaml" or has been provided via the parameter
# "config['resources']['ref']['blacklist']" in this configuration file
samtools-view-se: "-b -F 0x004" samtools-view-se: "-b -F 0x004"
samtools-view-pe: "-b -F 0x004 -G 0x009 -f 0x001" samtools-view-pe: "-b -F 0x004 -G 0x009 -f 0x001"
plotfingerprint:
# Number of bins sampled from the genome, for which the number of overlapping reads is computed for the fingerprint plot
number-of-samples: 500000
# optional parameters for picard's CollectMultipleMetrics from sorted, filtered and merged bam files in post analysis step
# see https://gatk.broadinstitute.org/hc/en-us/articles/360037594031-CollectMultipleMetrics-Picard-
collect-multiple-metrics: VALIDATION_STRINGENCY=LENIENT
# TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets # TODO: move adapter parameters into a `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets
# these cutadapt parameters need to contain the required flag(s) for # these cutadapt parameters need to contain the required flag(s) for
# the type of adapter(s) to trim, i.e.: # the type of adapter(s) to trim, i.e.:
......
sample group batch_effect control antibody sample group batch_effect control antibody
A treated batch1 D BCATENIN
B untreated batch2 D BCATENIN
C treated batch1 D TCF4
D untreated batch2
sample unit fragment_len_mean fragment_len_sd fq1 fq2 sra_accession platform sample unit fragment_len_mean fragment_len_sd fq1 fq2 sra_accession platform
A 1 resources/reads/a.chr21.1.fq resources/reads/a.chr21.2.fq ILLUMINA
B 1 resources/reads/b.chr21.1.fq resources/reads/b.chr21.2.fq ILLUMINA
B 2 300 14 resources/reads/b.chr21.1.fq ILLUMINA
C 1 resources/reads/a.chr21.1.fq resources/reads/a.chr21.2.fq ILLUMINA
D 1 resources/reads/b.chr21.1.fq resources/reads/b.chr21.2.fq ILLUMINA
**HOMER** peak annotation summary plot is generated by calculating the proportion of {{snakemake.config["params"]["peak-analysis"]}} peaks assigned to genomic features by `HOMER annotatePeaks.pl <http://homer.ucsd.edu/homer/ngs/annotation.html>`_. **HOMER** peak annotation summary plot is generated by calculating the proportion of
{{snakemake.config["params"]["peak-analysis"]}} peaks assigned to genomic features by
`HOMER annotatePeaks.pl <http://homer.ucsd.edu/homer/ngs/annotation.html>`_.
**`Base distribution by cycle plot
<https://gatk.broadinstitute.org/hc/en-us/articles/360042477312-CollectBaseDistributionByCycle-Picard->`_ (Picard)** is
used for alignment-level quality control and shows the nucleotide distribution per cycle of the bam files after
filtering, sorting, merging and removing orphans. For any cycle within the reads, the relative proportions of nucleotides
should reflect the AT:GC content. Flattish lines are expected for all nucleotides; any spikes would suggest a
systematic sequencing error. For more information about the `collected Picard metrics
<https://gatk.broadinstitute.org/hc/en-us/articles/360037594031-CollectMultipleMetrics-Picard->`_, please
see the `documentation <https://broadinstitute.github.io/picard/>`_.
**MACS2 and bedtools** merged consensus {{ snakemake.wildcards.peak }} peaks plot is generated by calculating the proportion of intersection size assigned to {{ snakemake.wildcards.samples }} for {{ snakemake.wildcards.antibody }}. **MACS2 and bedtools** merged consensus {{ snakemake.wildcards.peak }} peaks plot is generated by calculating the
proportion of intersection size assigned to {{ snakemake.wildcards.samples }} for {{ snakemake.wildcards.antibody }}.
**`MA plot <https://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf#Rfn.plotMA>`_ (FDR 0.01)**