Unverified Commit 1345df8a authored by Antonie Vietor's avatar Antonie Vietor Committed by GitHub

Consensus peak analysis (#6)

* preseq and picard collectalignmentsummarymetrics added

* changed PICARD COLLECTALIGNMENTSUMMARYMETRICS to PICARD COLLECTMULTIPLEMETRICS and integrated the new wrapper as a temporary script

* integration of next steps with temporary use of future wrappers as scripts

* wrapper integration for collectmultiplemetrics and genomecov, rule to create an igv-file from bigWig files

* deeptools and phantompeakqualtools integration

* phantompeakqualtools, headers and multiqc integration

* draft for integration of additional plots from phantompeakqualtools data into multiqc

* Changes according to review #4

* Cross-correlation plots are grouped now. Changes in the description of the plots.

* change to newer wrapper versions in all rules and code cleanup

* Changes according to review #4, temporary matplotlib dependency for multiqc added, github actions added

* github actions

* test config and test data

* changes according to PR #4

* update config

* more logs added

* lint: Mixed rules and functions in same snakefile -> moved a part of the rule_all input to common.smk, input functions for rule_all added

* undo the changes of the last commit

* moved all functions from Snakefile to common.smk

* --cache flag added for github actions

* --cache flag added for github actions

* snakemake_output_cache location added

* test snakemake_output_cache location

* another test snakemake_output_cache location

* another test snakemake_output_cache location

* set cache in github actions

* fix: dependencies in pysam resulted in ContextualVersionConflict in multiqc

* test: set cache location in github actions

* removed config files in .test from gitignore

* pysam dependencies and changes for github actions

* directory for ngs-test-data added

* gitmodules

* config

* test submodules

* test submodules

* config added

* directory for snakemake output cache changed

* cache location removed

* creating directory for snakemake output cache in github actions

* test cache directory with mkdir and chmod

* code cleanup github actions

* code cleanup github actions

* conda-forge channel added to pysam env

* conda-forge channel added to pysam env

* rule phantompeak added in a script instead of a shell command via Rscript

* testing on saccharomyces cerevisiae data set with deactivated preseq_lc_extrap rule

* r-base environment added to rule phantompeak_correlation

* changed genome data back to human, added rule for downloading single chromosomes from the reference genome (to generate smaller test data)

* rule preseq_lc_extrap activated again, changed genome data back to human, added rule for downloading single chromosomes from the reference genome (to generate smaller test data)

* adopted changes from the bam_post_analysis branch, control grouping for samples and an input rule to get sample-control combinations

* minimal cleanup

* adjustment of the plot_fingerprint: input function, matching of each treatment sample to its control sample for common output, integration of the JSD calculation, new wildcard for the control samples
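The treatment-to-control matching described in this commit can be sketched as a small helper (a simplified illustration only; the real input function works on the workflow's samples table, and all names here are placeholders):

```python
# Sketch: match each treatment sample to its control sample, as the
# plot_fingerprint input function does. Rows with an empty "control" field
# are themselves controls (see the samples.tsv layout elsewhere in this diff).

def sample_control_pairs(samples):
    """Return (treatment, control) pairs for all rows that name a control."""
    return [(row["sample"], row["control"]) for row in samples if row.get("control")]

samples = [
    {"sample": "A", "control": "D"},
    {"sample": "B", "control": "D"},
    {"sample": "D", "control": ""},  # control sample: not paired
]
```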

* changes on wildcard handling for controls

* rule for macs2 callpeak added

* rule for bedtools intersect added, drafts for multiqc peaks count added

* broad and narrow option handling via config, additional rule for narrow peaks output, peaks count and frip score for multiqc, peaks for igv
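For reference, the FRiP score mentioned here is conventionally the fraction of mapped reads that fall inside called peaks. A trivial sketch (in the workflow the two counts would come from bedtools intersect and the alignment statistics; here they are plain numbers):

```python
# FRiP (fraction of reads in peaks) = reads overlapping peaks / total mapped reads.

def frip_score(reads_in_peaks, total_mapped_reads):
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return reads_in_peaks / total_mapped_reads
```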

* adaptation and integration of plot scripts for results from homer and macs2 analysis, script for plot_peaks_count and its integration in the snakemake report, integration of older plots in the snakemake report

* changes for linter

* changes for linter

* changes for linter

* changes for linter

* changes for linter

* changes on input functions and on params parsing for rules plot_macs_qc and plot_homer_annotatepeaks, peaks wildcard added to all outputs

* test for the behavior of the linter

* test for the behavior of the linter

* changes for the linter

* test for the linter

* refactoring the config variable, restoring the input functions

* plot for FRiP score and some reports added, plot for annotatepeaks summary as draft added

* plot for homer annotatepeaks summary and report description, changes on frip score and peak count plots, changes according to PR #5

* some code cleanup

* changes for PR #5 added

* rules for merging peaks added

* output logic for peak analysis and consensus peak analysis integrated, rule macs2_merged_expand added, config for output logic adapted, samples.tsv adapted to groups

* functions for checking exists_multiple_groups and exists_replicates adapted to perform the check per antibody, integration of igenomes.yaml for the macs_gsize param, blacklist .bed files that igenomes.yaml references added, separate activation param for optional outputs added to the config file, rule for consensus peak .saf and .bed files, plot for intersected consensus peaks, rule to create an igv file for the consensus peak analysis
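The consensus-peak idea behind min-reps-consensus and macs2_merged_expand can be sketched in a few lines: merge overlapping peak intervals from all replicates and keep merged intervals supported by enough distinct replicates. This is an in-memory, single-chromosome illustration; the real workflow operates on sorted/merged BED files:

```python
# Simplified consensus-peak sketch: merge overlapping (start, end) peaks from
# several replicates and keep intervals backed by at least `min_reps`
# distinct replicates.

def consensus_peaks(replicate_peaks, min_reps=1):
    # replicate_peaks: one list of (start, end) tuples per replicate
    events = []
    for rep_id, peaks in enumerate(replicate_peaks):
        for start, end in peaks:
            events.append((start, end, rep_id))
    events.sort()
    merged = []  # entries: (start, end, set of supporting replicate ids)
    for start, end, rep_id in events:
        if merged and start <= merged[-1][1]:
            last = merged[-1]
            merged[-1] = (last[0], max(last[1], end), last[2] | {rep_id})
        else:
            merged.append((start, end, {rep_id}))
    return [(s, e) for s, e, reps in merged if len(reps) >= min_reps]
```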

* a few minor corrections

* rules for creating genome-filter - integration of blacklist to the workflow

* homer annotation for consensus peaks, grouping of samples to antibodies, rule and script for featureCounts

* bedtools sort bug fixes

* draft for deseq2-analysis step, does not work yet

* removed submodule ngs-test-data

* removed submodule test-datasets

* removed submodule atacseq

* new test data sets added, integration of new data in samples and units tables, rule for modification of the featurecounts outputs, integration and modification of the featurecounts_deseq2.R script, integration of released wrappers

* changes for linting

* changes for linting

* wildcards handling for rule feature_counts, update on snakemake-github-action

* the new test data for PE now work correctly, ToDo: featurecounts_deseq2.R script still needs to be fixed, parallelization not yet possible

* integration of sra download for SE test data sets, integration of accession numbers to cutadapt and fastqc, fixes on blacklists integration

* workflow adjustment up to step 5, samples.tsv and units.tsv added for large real dataset

* automation of the sra download without the need for an entry in config, additional column for the sra accession number in units.tsv and its handling, changes according to suggestions, bug fixes
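The units.tsv handling this commit describes amounts to: parse the table (via pandas, as the config comments note) and, per unit, fall back from local fastq paths to the sra_accession column. A sketch with an inline table excerpt (column names follow the units.tsv shown in this diff; the helper name is illustrative):

```python
import io

import pandas as pd

# Minimal units.tsv excerpt (tab-separated); the real file has more columns.
units_tsv = (
    "sample\tunit\tfq1\tfq2\tsra_accession\n"
    "A\t1\treads/a.1.fq\treads/a.2.fq\t\n"
    "B\t1\t\t\tSRR1635444\n"
)
units = pd.read_csv(io.StringIO(units_tsv), sep="\t", dtype=str).set_index(
    ["sample", "unit"]
)

def reads_for(sample, unit):
    """Return local fastq paths if present, otherwise the SRA accession to fetch."""
    row = units.loc[(sample, unit)]
    if pd.notna(row["fq1"]):
        fqs = [row["fq1"]] + ([row["fq2"]] if pd.notna(row["fq2"]) else [])
        return {"local": fqs}
    return {"sra": row["sra_accession"]}
```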

* adaptation of the workflow for se: bypassing of orphan_rm and the stats from it, adaptations in post-analysis step

* splitting and merging the workflow for se and pe data at the orphan-removal step

* changes

* changes

* changes

* split se and pe workflow at orphan remove step

* adjustments for se data in CollectMultipleMetrics and genomecov, extraction of the fragment size for genomecov from samtools stats
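Extracting the fragment size from samtools stats, as this commit does for genomecov, means reading the "SN" summary lines of the report (format: `SN<TAB>key:<TAB>value`). A sketch with a fabricated excerpt of such a report:

```python
# Sketch: pull the mean insert (fragment) size out of `samtools stats` output.
# SAMPLE_STATS is an illustrative excerpt, not real data.

SAMPLE_STATS = """\
SN\traw total sequences:\t2000000
SN\tinsert size average:\t298.6
SN\tinsert size standard deviation:\t13.9
"""

def insert_size_average(stats_text):
    for line in stats_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "SN" and fields[1] == "insert size average:":
            return float(fields[2])
    raise ValueError("insert size average not found")
```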

* workflow has been adapted for se data

* changes on design file and configuration for debugging deseq2 analysis, changes on github actions configuration

* changes on design file and configuration for debugging deseq2 analysis, changes on github actions configuration

* testing changes on design file and config

* testing changes on design file and config

* testing changes on design file: ambiguous samples

* deseq2 analysis established

* changes for linting

* changes for linting

* changes for github actions

* changes for github actions

* changes for github actions

* changes for github actions

* changes for github actions

* changes for github actions

* changes for github actions

* changes for github actions

* changes for github actions

* changes for github actions

* changes for github actions

* changes on common.smk for configfile

* changes on common.smk for configfile

* changes on common.smk for configfile

* changes on common.smk for configfile

* changes on common.smk for configfile

* changes on common.smk for configfile

* changes on directory structure for config

* changes on directory structure for config

* Integration of controls for DESeq2 analysis, debugging of DESeq2 FDR calculation, separate output of each plot, reduced dataset created, debugging of compute matrix output for large data

* config and data of reduced dataset

* typo removed on main.yaml

* new minimal single end test dataset from chromosome 14

* some minor changes for github actions

* some minor changes on units.tsv and debugging mode removed

* new minimal paired end test dataset from chromosome VII saccharomyces cerevisiae

* some minor changes on github actions

* igenomes download, parser for igenomes, blacklist download and an option to customize the blacklist for a specific chromosome added, changes according to PR

* changes for linting

* changes for linting

* changes for linting

* some changes to satisfy the linter

* separate rules for igenomes and blacklist download and blacklist handling

* draft for creating checkpoint and rules to handle igenomes, blacklists and macs_gsize value

* testing blacklist handling on new snakemake release

* blacklists and genome size handling with snakemake 6.4.0 release
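The blacklist and genome-size handling described in the last few commits boils down to a lookup order: use the config value when set, otherwise the igenomes entry for the genome build. A rough sketch (the dict stands in for the parsed igenomes.yaml; keys and values here are illustrative, not the actual igenomes content):

```python
# Sketch of the macs_gsize / blacklist resolution order: explicit config value
# first, then the igenomes entry for the build, else None. IGENOMES is a
# stand-in for the parsed igenomes.yaml with made-up values.

IGENOMES = {
    "GRCh38": {"macs_gsize": 2.7e9, "blacklist": "GRCh38-blacklist.bed"},
    "R64-1-1": {"macs_gsize": 1.2e7, "blacklist": None},
}

def resolve(build, config, key):
    """Return config[key] if set, otherwise the igenomes value for the build."""
    if config.get(key) not in (None, ""):
        return config[key]
    return IGENOMES.get(build, {}).get(key)
```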

* test data for single end changed to chromosome 21 for data reduction

* test refactoring for github actions

* some minor changes on blacklist formatting

* some additional minor changes on blacklist formatting

* minimal changes on blacklist formatting

* test data for single end reads changed back to chromosome 14

* single end test data on chromosome 21

* additional reduction of single end test data

* some minor changes for testing

* some minor changes for testing

* some minor changes for testing

* some minor changes for testing

* some minor changes in github actions

* some minor changes in github actions

* changes according to PR
parent bc1ebc47
@@ -14,7 +14,7 @@ jobs:
steps:
- uses: actions/checkout@v1
- name: Lint workflow
uses: snakemake/snakemake-github-action@v1.9.0
uses: snakemake/snakemake-github-action@v1.19.0
with:
directory: .
snakefile: workflow/Snakefile
@@ -26,27 +26,78 @@ jobs:
runs-on: ubuntu-latest
needs: Linting
steps:
- uses: actions/checkout@v1
- name: Checkout submodules
uses: textbook/git-checkout-submodule-action@2.0.0
- name: Checkout repository
uses: actions/checkout@v1
- name: Test dry run for a large single end workflow
uses: snakemake/snakemake-github-action@v1.19.0
with:
directory: .test
snakefile: workflow/Snakefile
args: "--use-conda -n --show-failed-logs -j 10 --conda-cleanup-pkgs cache --conda-frontend mamba"
- name: Test workflow (local test data)
uses: snakemake/snakemake-github-action@v1.9.0
- name: Test minimized single end workflow (on local reduced SRA files for a single chromosome - homo sapiens)
uses: snakemake/snakemake-github-action@v1.19.0
with:
directory: .test
snakefile: workflow/Snakefile
args: "--use-conda --cache --show-failed-logs -j 10 --conda-cleanup-pkgs cache --conda-frontend mamba"
args: "--configfile .test/config_single_end_reduced/config.yaml --use-conda --cache --show-failed-logs -j 10 --conda-cleanup-pkgs cache --conda-frontend mamba"
stagein: |
export TMPDIR=/tmp
rm -rf .test/resources .test/results
export SNAKEMAKE_OUTPUT_CACHE=/snakemake-cache
mkdir -p -m a+rw $SNAKEMAKE_OUTPUT_CACHE
#
# # Test for single end reads with larger data sets and download of SRA files. It can be included for heavy duty testing on dedicated machines.
#
# - name: Test single end workflow (test data sra-download)
# uses: snakemake/snakemake-github-action@v1.19.0
# with:
# directory: .test
# snakefile: workflow/Snakefile
# args: "--configfile .test/config_single_end/config.yaml --use-conda --cache --show-failed-logs -j 10 --conda-cleanup-pkgs cache --conda-frontend mamba"
# stagein: |
# export TMPDIR=/tmp
# rm -rf .test/resources .test/results
# export SNAKEMAKE_OUTPUT_CACHE=/snakemake-cache
# mkdir -p -m a+rw $SNAKEMAKE_OUTPUT_CACHE
- name: Test minimized paired end workflow (on local reduced SRA files for a single chromosome - saccharomyces cerevisiae)
uses: snakemake/snakemake-github-action@v1.19.0
with:
directory: .test
snakefile: workflow/Snakefile
args: "--configfile .test/config_paired_end_reduced/config.yaml --use-conda --cache --show-failed-logs -j 10 --conda-cleanup-pkgs cache --conda-frontend mamba"
stagein: |
export TMPDIR=/tmp
rm -rf .test/resources .test/results
export SNAKEMAKE_OUTPUT_CACHE=/snakemake-cache
mkdir -p -m a+rw $SNAKEMAKE_OUTPUT_CACHE
#
# # Test for paired end reads with larger datasets from git submodules. It can be included for heavy duty testing on dedicated machines.
#
# - uses: actions/checkout@v1
# - name: Checkout submodules
# uses: textbook/git-checkout-submodule-action@2.0.0
#
# - name: Test paired end workflow (submodule test data)
# uses: snakemake/snakemake-github-action@v1.19.0
# with:
# directory: .test
# snakefile: workflow/Snakefile
# args: "--configfile .test/config_paired_end/config.yaml --use-conda --cache --show-failed-logs -j 10 --conda-cleanup-pkgs cache --conda-frontend mamba"
# stagein: |
# export TMPDIR=/tmp
# rm -rf .test/resources .test/results
# export SNAKEMAKE_OUTPUT_CACHE=/snakemake-cache
# mkdir -p -m a+rw $SNAKEMAKE_OUTPUT_CACHE
- name: Test report
uses: snakemake/snakemake-github-action@v1.9.0
uses: snakemake/snakemake-github-action@v1.19.0
with:
directory: .test
snakefile: workflow/Snakefile
args: "--report report.zip"
args: "--report report.zip --configfile .test/config_paired_end_reduced/config.yaml"
stagein: |
export TMPDIR=/tmp
rm -rf .test/resources .test/results
[submodule ".test/ngs-test-data"]
path = .test/ngs-test-data
url = https://github.com/snakemake-workflows/ngs-test-data
# For testing the workflow with larger datasets for paired end reads once a dedicated GitHub actions machine is set up.
[submodule ".test/data/atacseq/test-datasets"]
path = .test/data/atacseq/test-datasets
url = https://github.com/nf-core/test-datasets.git
branch = atacseq
[submodule ".test/data/chipseq/test-datasets"]
path = .test/data/chipseq/test-datasets
url = https://github.com/nf-core/test-datasets.git
branch = chipseq
@@ -2,8 +2,13 @@
# In case of sample based data, it should be complemented by a samples.tsv file that contains
# one row per sample. It can be parsed easily via pandas.
samples: "config/samples.tsv"
# to download reads from SRA, the accession numbers (see https://www.ncbi.nlm.nih.gov/sra) of the samples must be given in units.tsv
# dataset for testing this workflow with single end reads:
# https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA255509&o=acc_s%3Aa
units: "config/units.tsv"
single_end: True
# config for a large single end data set
resources:
ref:
# Number of chromosomes to consider for calling.
@@ -15,11 +20,39 @@ resources:
release: 101
# Genome build
build: GRCh38
# for testing data a specific chromosome can be selected
chromosome:
# specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), default: 1.2.2
igenomes_release: 1.2.2
# if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9
macs-gsize:
# if igenomes.yaml cannot be used, a path to a custom blacklist can be specified here
blacklist:
params:
lc_extrap: True
# choose "narrow" or "broad" for macs2 callpeak analysis, for documentation and source code please see https://github.com/macs3-project/MACS
peak-analysis: "broad"
# Number of biological replicates required from a given condition for a peak to contribute to a consensus peak
min-reps-consensus: 1
callpeak:
p-value: 0.5
q-value:
deeptools-plots:
  # when activated, the plot profile and heatmap plot are generated; this involves a matrix calculation that requires a lot of working memory.
activate: True
lc_extrap:
activate: True
picard_metrics:
activate: True
deseq2:
  # optionally run the vst transform instead of rlog
vst: True
peak-annotation-analysis:
activate: True
peak-qc:
activate: True
consensus-peak-analysis:
activate: True
# TODO: move adapter parameters into an `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets
# these cutadapt parameters need to contain the required flag(s) for
# the type of adapter(s) to trim, i.e.:
sample condition batch_effect control antibody
A treated batch1 D BCATENIN
B untreated batch2 D BCATENIN
C treated batch1 D TCF4
D untreated batch2
sample group batch_effect control antibody
A Veh batch1 AG ERa
B Veh batch2 AH ERa
C E2 batch1 AI ERa
D E2 batch2 AJ ERa
E TNFa batch1 AK ERa
F TNFa batch2 AL ERa
G E2_TNFa batch1 AM ERa
H E2_TNFa batch2 AN ERa
I Veh batch1 AG p65
J Veh batch2 AH p65
K E2 batch1 AI p65
L E2 batch2 AJ p65
M TNFa batch1 AK p65
N TNFa batch2 AL p65
O E2_TNFa batch1 AM p65
P E2_TNFa batch2 AN p65
Q Veh batch1 AG FoxA1
R Veh batch2 AH FoxA1
S E2 batch1 AI FoxA1
T E2 batch2 AJ FoxA1
U TNFa batch1 AK FoxA1
V TNFa batch2 AL FoxA1
W E2_TNFa batch1 AM FoxA1
X E2_TNFa batch2 AM FoxA1
Y E2_TNFa batch1 AM ERa
Z E2_TNFa batch1 AM ERa
AA E2_TNFa batch1 AM ERa
AB E2_TNFa batch1 AM ERa
AC E2_TNFa batch2 AN ERa
AD E2_TNFa batch2 AN ERa
AE E2_TNFa batch2 AN ERa
AF E2_TNFa batch2 AN ERa
AG Veh batch1
AH Veh batch2
AI E2 batch1
AJ E2 batch2
AK TNFa batch1
AL TNFa batch2
AM E2_TNFa batch1
AN E2_TNFa batch2
{
"filters" : [
{
"id" : "mismatch",
"tag" : "NM:<=4"
}
],
"rule" : " mismatch "
}
sample unit fragment_len_mean fragment_len_sd fq1 fq2 platform
A 1 ngs-test-data/reads/a.chr21.1.fq ngs-test-data/reads/a.chr21.2.fq ILLUMINA
B 1 ngs-test-data/reads/a.chr21.1.fq ngs-test-data/reads/a.chr21.2.fq ILLUMINA
B 2 300 14 ngs-test-data/reads/b.chr21.1.fq ILLUMINA
C 1 ngs-test-data/reads/a.chr21.1.fq ngs-test-data/reads/a.chr21.2.fq ILLUMINA
D 1 ngs-test-data/reads/b.chr21.1.fq ngs-test-data/reads/b.chr21.2.fq ILLUMINA
sample unit fragment_len_mean fragment_len_sd fq1 fq2 sra_accession platform
A 1 SRR1635443 ILLUMINA
B 1 SRR1635444 ILLUMINA
C 1 300 14 SRR1635445 ILLUMINA
D 1 SRR1635446 ILLUMINA
E 1 SRR1635447 ILLUMINA
F 1 SRR1635448 ILLUMINA
G 1 SRR1635449 ILLUMINA
H 2 SRR1635450 ILLUMINA
I 1 SRR1635451 ILLUMINA
J 2 SRR1635452 ILLUMINA
K 1 SRR1635453 ILLUMINA
L 2 SRR1635454 ILLUMINA
M 1 SRR1635455 ILLUMINA
N 2 SRR1635456 ILLUMINA
O 1 SRR1635457 ILLUMINA
P 2 SRR1635458 ILLUMINA
Q 1 SRR1635459 ILLUMINA
R 2 SRR1635460 ILLUMINA
S 1 SRR1635461 ILLUMINA
T 2 SRR1635462 ILLUMINA
U 1 SRR1635463 ILLUMINA
V 2 SRR1635464 ILLUMINA
W 1 SRR1635465 ILLUMINA
X 2 SRR1635466 ILLUMINA
Y 1 SRR1635467 ILLUMINA
Z 2 SRR1635468 ILLUMINA
AA 1 SRR1635469 ILLUMINA
AB 2 SRR1635470 ILLUMINA
AC 1 SRR1635471 ILLUMINA
AD 2 SRR1635472 ILLUMINA
AE 1 SRR1635473 ILLUMINA
AF 2 SRR1635474 ILLUMINA
AG 1 SRR1635435 ILLUMINA
AH 2 SRR1635436 ILLUMINA
AI 1 SRR1635437 ILLUMINA
AJ 2 SRR1635438 ILLUMINA
AK 1 SRR1635439 ILLUMINA
AL 2 SRR1635440 ILLUMINA
AM 1 SRR1635441 ILLUMINA
AN 2 SRR1635442 ILLUMINA
# This file should contain everything to configure the workflow on a global scale.
# In case of sample based data, it should be complemented by a samples.tsv file that contains
# one row per sample. It can be parsed easily via pandas.
samples: "config_paired_end_reduced/samples.tsv"
units: "config_paired_end_reduced/units.tsv"
single_end: False
# config for paired end data set for testing
resources:
ref:
# Number of chromosomes to consider for calling.
# The first n entries of the FASTA will be considered.
n_chromosomes: 17
# Ensembl species name
species: saccharomyces_cerevisiae
# Ensembl release
release: 101
# Genome build
build: R64-1-1
  # for testing data a specific chromosome can be selected
chromosome:
# specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), default: 1.2.2
igenomes_release: 1.2.2
# if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9
macs-gsize:
# if igenomes.yaml cannot be used, a path to a custom blacklist can be specified here
blacklist:
params:
# choose "narrow" or "broad" for macs2 callpeak analysis, for documentation and source code please see https://github.com/macs3-project/MACS
peak-analysis: "narrow"
# Number of biological replicates required from a given condition for a peak to contribute to a consensus peak
min-reps-consensus: 1
callpeak:
p-value: 0.5
q-value:
deeptools-plots:
  # when activated, the plot profile and heatmap plot are generated; this involves a matrix calculation that requires a lot of working memory.
activate: True
lc_extrap:
activate: True
picard_metrics:
activate: True
deseq2:
  # optionally run the vst transform instead of rlog
vst: True
peak-annotation-analysis:
activate: True
peak-qc:
activate: True
consensus-peak-analysis:
activate: True
# TODO: move adapter parameters into an `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets
# these cutadapt parameters need to contain the required flag(s) for
# the type of adapter(s) to trim, i.e.:
# * https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types
# * `-a` for 3' adapter in the forward reads
# * `-g` for 5' adapter in the forward reads
# * `-b` for adapters anywhere in the forward reads
# also, separate capitalised letter flags are required for adapters in
# the reverse reads of paired end sequencing:
# * https://cutadapt.readthedocs.io/en/stable/guide.html#trimming-paired-end-reads
cutadapt-se: "-g AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
# reasoning behind parameters:
# * `-e 0.005`: the default cutadapt maximum error rate of `0.2` is far too high; for Illumina
# data the error rate is more in the range of `0.005`, and setting it accordingly should avoid
# false positive adapter matches
# * `--overlap 7`: the cutadapt default minimum overlap of `5` led to trimming at the level
# expected from adapter matches occurring by chance
cutadapt-pe: "-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -g AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -G AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
cutadapt-others: "-e 0.005 --overlap 7"
sample group batch_effect control antibody
A T0 batch1 E SPT5
B T0 batch2 E SPT5
C T15 batch1 F SPT5
D T15 batch2 F SPT5
E T0 batch1
F T15 batch1
{
"filters" : [
{
"id" : "mismatch",
"tag" : "NM:<=4"
}
],
"rule" : " mismatch "
}
sample unit fragment_len_mean fragment_len_sd fq1 fq2 sra_accession platform
A 1 data/atacseq/test-datasets/testdata/SRR1822153_1.fastq.gz data/atacseq/test-datasets/testdata/SRR1822153_2.fastq.gz ILLUMINA
B 1 data/atacseq/test-datasets/testdata/SRR1822154_1.fastq.gz data/atacseq/test-datasets/testdata/SRR1822154_2.fastq.gz ILLUMINA
C 1 300 14 data/atacseq/test-datasets/testdata/SRR1822157_1.fastq.gz data/atacseq/test-datasets/testdata/SRR1822157_2.fastq.gz ILLUMINA
D 1 data/atacseq/test-datasets/testdata/SRR1822158_1.fastq.gz data/atacseq/test-datasets/testdata/SRR1822158_2.fastq.gz ILLUMINA
E 1 data/chipseq/test-datasets/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R1.fastq.gz data/chipseq/test-datasets/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R2.fastq.gz ILLUMINA
F 1 data/chipseq/test-datasets/testdata/SRR5204810_Spt5-ChIP_Input2_SacCer_ChIP-Seq_ss100k_R1.fastq.gz data/chipseq/test-datasets/testdata/SRR5204810_Spt5-ChIP_Input2_SacCer_ChIP-Seq_ss100k_R2.fastq.gz ILLUMINA
# This file should contain everything to configure the workflow on a global scale.
# In case of sample based data, it should be complemented by a samples.tsv file that contains
# one row per sample. It can be parsed easily via pandas.
samples: "config_paired_end_reduced/samples.tsv"
units: "config_paired_end_reduced/units.tsv"
single_end: False
# config for paired end data set for testing
resources:
ref:
# Number of chromosomes to consider for calling.
# The first n entries of the FASTA will be considered.
n_chromosomes: 17
# Ensembl species name
species: saccharomyces_cerevisiae
# Ensembl release
release: 101
# Genome build
build: R64-1-1
# for testing data a specific chromosome can be selected
chromosome: VII
# specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), default: 1.2.2
igenomes_release: 1.2.2
# if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9
macs-gsize:
# if igenomes.yaml cannot be used, a path to a custom blacklist can be specified here
blacklist:
params:
# choose "narrow" or "broad" for macs2 callpeak analysis, for documentation and source code please see https://github.com/macs3-project/MACS
peak-analysis: "broad"
# Number of biological replicates required from a given condition for a peak to contribute to a consensus peak
min-reps-consensus: 1
callpeak:
p-value: 0.5
q-value:
deeptools-plots:
  # when activated, the plot profile and heatmap plot are generated; this involves a matrix calculation that requires a lot of working memory.
activate: True
lc_extrap:
activate: False
picard_metrics:
activate: True
deseq2:
  # optionally run the vst transform instead of rlog
vst: False
peak-annotation-analysis:
activate: True
peak-qc:
activate: True
consensus-peak-analysis:
activate: True
# TODO: move adapter parameters into an `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets
# these cutadapt parameters need to contain the required flag(s) for
# the type of adapter(s) to trim, i.e.:
# * https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types
# * `-a` for 3' adapter in the forward reads
# * `-g` for 5' adapter in the forward reads
# * `-b` for adapters anywhere in the forward reads
# also, separate capitalised letter flags are required for adapters in
# the reverse reads of paired end sequencing:
# * https://cutadapt.readthedocs.io/en/stable/guide.html#trimming-paired-end-reads
cutadapt-se: "-g AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
# reasoning behind parameters:
# * `-e 0.005`: the default cutadapt maximum error rate of `0.2` is far too high; for Illumina
# data the error rate is more in the range of `0.005`, and setting it accordingly should avoid
# false positive adapter matches
# * `--overlap 7`: the cutadapt default minimum overlap of `5` led to trimming at the level
# expected from adapter matches occurring by chance
cutadapt-pe: "-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -g AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -G AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
cutadapt-others: "-e 0.005 --overlap 7"
{
"filters" : [
{
"id" : "paired_end",
"isPaired" : "true"
},
{
"id" : "mismatch",
"tag" : "NM:<=4"
},
{
"id" : "min_size",
"insertSize" : ">=-2000"
},
{
"id" : "max_size",
"insertSize" : "<=2000"
}
],
"rule" : " (paired_end & mismatch & min_size & max_size) | (!paired_end & mismatch) "
}
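The JSON above is a bamtools filter script: each entry in `filters` tests one read property, and `rule` combines them with boolean logic. The semantics of this particular rule can be sketched in Python (the per-read dict and its key names are illustrative, not the bamtools API):

```python
# Sketch of the rule "(paired_end & mismatch & min_size & max_size)
# | (!paired_end & mismatch)" from the JSON filter script above, evaluated
# on an illustrative per-read property dict.

def keep_read(read):
    mismatch = read["nm_tag"] <= 4                   # "NM:<=4"
    size_ok = -2000 <= read["insert_size"] <= 2000   # min_size & max_size
    if read["is_paired"]:
        return mismatch and size_ok
    return mismatch
```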
sample group batch_effect control antibody
A T0 batch1 E SPT5
B T0 batch2 E SPT5
C T15 batch1 E SPT5
D T15 batch2 E SPT5
E T0 batch1
{
"filters" : [
{
"id" : "mismatch",
"tag" : "NM:<=4"
}
],
"rule" : " mismatch "
}
sample unit fragment_len_mean fragment_len_sd fq1 fq2 sra_accession platform
A 1 data/paired_end_test_data/A-1_vii_1.fastq.gz data/paired_end_test_data/A-1_vii_2.fastq.gz ILLUMINA
B 1 data/paired_end_test_data/B-1_vii_1.fastq.gz data/paired_end_test_data/B-1_vii_2.fastq.gz ILLUMINA
C 1 300 14 data/paired_end_test_data/C-1_vii_1.fastq.gz data/paired_end_test_data/C-1_vii_2.fastq.gz ILLUMINA
D 1 data/paired_end_test_data/D-1_vii_1.fastq.gz data/paired_end_test_data/D-1_vii_2.fastq.gz ILLUMINA
E 1 data/paired_end_test_data/E-1_vii_1.fastq.gz data/paired_end_test_data/E-1_vii_2.fastq.gz ILLUMINA
# This file should contain everything to configure the workflow on a global scale.
# In case of sample based data, it should be complemented by a samples.tsv file that contains
# one row per sample. It can be parsed easily via pandas.
samples: "config_single_end/samples.tsv"
# to download reads from SRA, the accession numbers (see https://www.ncbi.nlm.nih.gov/sra) of the samples must be given in units.tsv
# dataset for testing this workflow with single end reads:
# https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA255509&o=acc_s%3Aa
units: "config_single_end/units.tsv"
single_end: True
# config for a small single end data set for testing
resources:
ref:
# Number of chromosomes to consider for calling.
# The first n entries of the FASTA will be considered.
n_chromosomes: 25
# Ensembl species name
species: homo_sapiens
# Ensembl release
release: 101
# Genome build
build: GRCh38
# for testing data a specific chromosome can be selected
chromosome:
# specify release version number of igenomes list to use (see https://github.com/nf-core/chipseq/releases), default: 1.2.2
igenomes_release: 1.2.2
# if igenomes.yaml cannot be used, a value for the mappable or effective genome size can be specified here, e.g. macs-gsize: 2.7e9
macs-gsize:
# if igenomes.yaml cannot be used, a path to a custom blacklist can be specified here
blacklist:
params:
# choose "narrow" or "broad" for macs2 callpeak analysis, for documentation and source code please see https://github.com/macs3-project/MACS
peak-analysis: "broad"
# Number of biological replicates required from a given condition for a peak to contribute to a consensus peak
min-reps-consensus: 1
callpeak:
p-value: 0.5
q-value:
deeptools-plots:
  # when activated, the plot profile and heatmap plot are generated; this involves a matrix calculation that requires a lot of working memory.
activate: True
lc_extrap:
activate: True
picard_metrics:
activate: True
deseq2:
  # optionally run the vst transform instead of rlog
vst: True
peak-annotation-analysis:
activate: True
peak-qc:
activate: True
consensus-peak-analysis:
activate: True
# TODO: move adapter parameters into an `adapter` column in units.tsv and check for its presence via the units.schema.yaml -- this enables unit-specific adapters, e.g. when integrating multiple datasets
# these cutadapt parameters need to contain the required flag(s) for
# the type of adapter(s) to trim, i.e.:
# * https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types
# * `-a` for 3' adapter in the forward reads
# * `-g` for 5' adapter in the forward reads
# * `-b` for adapters anywhere in the forward reads
# also, separate capitalised letter flags are required for adapters in
# the reverse reads of paired end sequencing:
# * https://cutadapt.readthedocs.io/en/stable/guide.html#trimming-paired-end-reads
cutadapt-se: "-g AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
# reasoning behind parameters:
# * `-e 0.005`: the default cutadapt maximum error rate of `0.2` is far too high; for Illumina
# data the error rate is more in the range of `0.005`, and setting it accordingly should avoid
# false positive adapter matches
# * `--overlap 7`: the cutadapt default minimum overlap of `5` led to trimming at the level
# expected from adapter matches occurring by chance
cutadapt-pe: "-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -g AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -G AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
cutadapt-others: "-e 0.005 --overlap 7"