Unverified Commit 268eb612 authored by Antonie Vietor, committed by GitHub

Bam post analysis (#4)



* preseq and picard collectalignmentsummarymetrics added

* changed PICARD COLLECTALIGNMENTSUMMARYMETRICS to PICARD COLLECTMULTIPLEMETRICS and integrated the new wrapper as a temporary script

* integration of next steps with temporary use of future wrappers as scripts

* wrapper integration for collectmultiplemetrics and genomecov, rule to create an igv-file from bigWig files

* deeptools and phantompeakqualtools integration

* phantompeakqualtools, headers and multiqc integration

* draft for integration of additional plots from phantompeakqualtools data into multiqc

* Changes according to review of #4

* Cross-correlation plots are grouped now. Changes in the description of the plots.

* changed to newer wrapper versions in all rules and code cleanup

* Changes according to review of #4, temporary matplotlib dependency added to multiqc, github actions added

* github actions

* test config and test data

* changes according to PR #4

* update config

* more logs added

* lint: mixed rules and functions in the same snakefile -> moved part of the rule_all input to common.smk, input functions for rule_all added

* undo the changes of the last commit

* moved all functions from Snakefile to common.smk

* --cache flag added for github actions

* --cache flag added for github actions

* snakemake_output_cache location added

* test snakemake_output_cache location

* another test snakemake_output_cache location

* another test snakemake_output_cache location

* set cache in github actions

* fix: dependencies in pysam resulted in ContextualVersionConflict in multiqc

* test: set cache location in github actions

* removed config files in .test from gitignore

* pysam dependencies and changes for github actions

* directory for ngs-test-data added

* gitmodules

* config

* test submodules

* test submodules

* config added

* directory for snakemake output cache changed

* cache location removed

* creating directory for snakemake output cache in github actions

* test cache directory with mkdir and chmod

* code cleanup github actions

* code cleanup github actions

* conda-forge channel added to pysam env

* conda-forge channel added to pysam env

* rule phantompeak now runs via a script instead of a shell command invoking Rscript

* testing on a Saccharomyces cerevisiae data set with the preseq_lc_extrap rule deactivated

* r-base environment added to rule phantompeak_correlation

* changed genome data back to human, added rule for downloading single chromosomes from the reference genome (to generate smaller test data)

* rule preseq_lc_extrap activated again, changed genome data back to human, added rule for downloading single chromosomes from the reference genome (to generate smaller test data)

* Add some TODOs for later

* reflect that compute_matrix requests two threads in the extra params (see the sketch after this list)

* also cache bwa_index, as this can take quite some time
Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>
parent 4607b5fc
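One point in the commit list above is worth a sketch: keeping a rule's `threads` declaration in sync with the parallelism requested in its extra parameters. A minimal illustration for deepTools computeMatrix follows; the rule name, paths, and env file are assumptions, not the workflow's actual code — only the threads/`--numberOfProcessors` relationship is the point being made.

```python
# Hypothetical sketch -- rule name, paths, and env file are assumptions.
# deepTools computeMatrix parallelizes via -p/--numberOfProcessors, passed on
# the command line ("extra params"), so Snakemake's `threads` must be set to
# the same value for the scheduler to actually reserve those cores.
rule compute_matrix:
    input:
        bed="resources/ref/genes.bed",
        bigwig=expand("results/big_wig/{sample}.bigWig", sample=samples.index),
    output:
        "results/deeptools/matrix.gz",
    threads: 2  # must agree with --numberOfProcessors below
    log:
        "logs/deeptools/compute_matrix.log",
    conda:
        "../envs/deeptools.yaml"
    shell:
        "computeMatrix scale-regions --numberOfProcessors {threads} "
        "-R {input.bed} -S {input.bigwig} "
        "--outFileName {output} 2> {log}"
```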
name: Tests

on:
  push:
    branches:
      - master
  pull_request:
    branches: [ master ]
    branches_ignore: []

jobs:
  Linting:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - name: Lint workflow
        uses: snakemake/snakemake-github-action@v1.9.0
        with:
          directory: .
          snakefile: workflow/Snakefile
          args: "--lint"
          stagein: |
            export TMPDIR=/tmp

  Testing:
    runs-on: ubuntu-latest
    needs: Linting
    steps:
      - uses: actions/checkout@v1
      - name: Checkout submodules
        uses: textbook/git-checkout-submodule-action@2.0.0
      - name: Test workflow (local test data)
        uses: snakemake/snakemake-github-action@v1.9.0
        with:
          directory: .test
          snakefile: workflow/Snakefile
          args: "--use-conda --cache --show-failed-logs -j 10 --conda-cleanup-pkgs cache --conda-frontend mamba"
          stagein: |
            export TMPDIR=/tmp
            rm -rf .test/resources .test/results
            export SNAKEMAKE_OUTPUT_CACHE=/snakemake-cache
            mkdir -p -m a+rw $SNAKEMAKE_OUTPUT_CACHE
      - name: Test report
        uses: snakemake/snakemake-github-action@v1.9.0
        with:
          directory: .test
          snakefile: workflow/Snakefile
          args: "--report report.zip"
          stagein: |
            export TMPDIR=/tmp
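The `--cache` flag in the test step above only takes effect for rules that declare themselves cacheable; their results land under `$SNAKEMAKE_OUTPUT_CACHE`, which the stagein creates. The commit list mentions caching `bwa_index` because indexing "can take quite some time". A minimal sketch of such a rule, assuming the rule shape and wrapper version (this is not the workflow's actual code):

```python
rule bwa_index:
    input:
        "resources/ref/genome.chr21.fa",
    output:
        multiext("resources/ref/genome.chr21.fa", ".amb", ".ann", ".bwt", ".pac", ".sa"),
    log:
        "logs/bwa_index.log",
    # cache: True opts this rule into Snakemake's between-workflow cache;
    # combined with --cache on the command line, its output is stored in
    # $SNAKEMAKE_OUTPUT_CACHE and reused across CI runs instead of recomputed.
    cache: True
    wrapper:
        "0.66.0/bio/bwa/index"  # wrapper version assumed for illustration
```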
*
results
results/*
!analysis/README.md
!config
!config/*
!resources
!resources/*
!workflow
!workflow/**
!.gitignore
!.gitattributes
!.editorconfig
!.travis.yml
!.test
!.test/config/*
!.test/ngs-test-data
!LICENSE
!README.md
[submodule ".test/ngs-test-data"]
	path = .test/ngs-test-data
	url = https://github.com/snakemake-workflows/ngs-test-data
{
  "filters": [
    {
      "id": "paired_end",
      "isPaired": "true"
    },
    {
      "id": "mismatch",
      "tag": "NM:<=4"
    },
    {
      "id": "min_size",
      "insertSize": ">=-2000"
    },
    {
      "id": "max_size",
      "insertSize": "<=2000"
    }
  ],
  "rule": " (paired_end & mismatch & min_size & max_size) | (!paired_end & mismatch) "
}
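The JSON above defines named bamtools filters and combines them in the `rule` expression: paired reads must be properly paired, have at most 4 mismatches (`NM:<=4`), and an insert size within ±2000 bp, while unpaired reads only need to pass the mismatch filter. A hypothetical sketch of applying it (rule name, paths, and env file are assumptions; `bamtools filter -script` is the documented way to load such a JSON):

```python
rule bamtools_filter_json:
    input:
        "results/merged/{sample}.bam",  # assumed location of the unfiltered BAM
    output:
        "results/filtered/{sample}.bam",
    params:
        # the JSON filter definition shown above
        json="config/bamtools_filtering_rules.json",
    log:
        "logs/filtering/{sample}.log",
    conda:
        "../envs/bamtools.yaml"  # hypothetical env pinning bamtools
    shell:
        "bamtools filter -in {input} -out {output} -script {params.json} 2> {log}"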
# This file should contain everything to configure the workflow on a global scale.
# In case of sample-based data, it should be complemented by a samples.tsv file that contains
# one row per sample. It can be parsed easily via pandas.
samples: "config/samples.tsv"
units: "config/units.tsv"

resources:
  ref:
    # Number of chromosomes to consider for calling.
    # The first n entries of the FASTA will be considered.
    n_chromosomes: 25
    # Ensembl species name
    species: homo_sapiens
    # Ensembl release
    release: 101
    # Genome build
    build: GRCh38

params:
  lc_extrap: True
  # TODO: move adapter parameters into an `adapter` column in units.tsv and check for its
  # presence via units.schema.yaml -- this enables unit-specific adapters, e.g. when
  # integrating multiple datasets
  # these cutadapt parameters need to contain the required flag(s) for
  # the type of adapter(s) to trim, i.e.:
  # * https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types
  # * `-a` for a 3' adapter in the forward reads
  # * `-g` for a 5' adapter in the forward reads
  # * `-b` for adapters anywhere in the forward reads
  # also, separate capitalised letter flags are required for adapters in
  # the reverse reads of paired-end sequencing:
  # * https://cutadapt.readthedocs.io/en/stable/guide.html#trimming-paired-end-reads
  cutadapt-se: "-g AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
  # reasoning behind parameters:
  # * `-e 0.005`: the default cutadapt maximum error rate of `0.2` is far too high; for Illumina
  #   data the error rate is more in the range of `0.005`, and setting it accordingly should avoid
  #   false positive adapter matches
  # * `--overlap 7`: the cutadapt default minimum overlap of `5` led to trimming at the level
  #   expected from adapter matches occurring by chance
  cutadapt-pe: "-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -g AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -G AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
  cutadapt-others: "-e 0.005 --overlap 7"
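A minimal sketch of how the `cutadapt-pe` and `cutadapt-others` entries above could be combined in a paired-end trimming rule. The input function and rule shape are assumptions; the output paths follow the `results/trimmed/...` pattern expected by the workflow's `all_input` function further below:

```python
rule cutadapt_pe:
    input:
        get_paired_fastqs,  # hypothetical input function resolving fq1/fq2 from units.tsv
    output:
        fastq1="results/trimmed/{sample}-{unit}.1.fastq.gz",
        fastq2="results/trimmed/{sample}-{unit}.2.fastq.gz",
        qc="results/trimmed/{sample}-{unit}.pe.qc.txt",
    params:
        adapters=config["params"]["cutadapt-pe"],
        extra=config["params"]["cutadapt-others"],
    log:
        "logs/cutadapt/{sample}-{unit}.pe.log",
    conda:
        "../envs/cutadapt.yaml"  # hypothetical env pinning cutadapt
    shell:
        "cutadapt {params.adapters} {params.extra} "
        "-o {output.fastq1} -p {output.fastq2} "
        "{input} > {output.qc} 2> {log}"
```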
sample  condition  batch_effect  control  antibody
A       treated    batch1        D        BCATENIN
B       untreated  batch2        D        BCATENIN
C       treated    batch1        D        TCF4
D       untreated  batch2
sample  unit  fragment_len_mean  fragment_len_sd  fq1                               fq2                               platform
A       1                                         ngs-test-data/reads/a.chr21.1.fq  ngs-test-data/reads/a.chr21.2.fq  ILLUMINA
B       1                                         ngs-test-data/reads/b.chr21.1.fq  ngs-test-data/reads/b.chr21.2.fq  ILLUMINA
B       2     300                14               ngs-test-data/reads/b.chr21.1.fq                                    ILLUMINA
C       1                                         ngs-test-data/reads/a.chr21.1.fq  ngs-test-data/reads/a.chr21.2.fq  ILLUMINA
D       1                                         ngs-test-data/reads/b.chr21.1.fq  ngs-test-data/reads/b.chr21.2.fq  ILLUMINA
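The config comment above notes that these sheets "can be parsed easily via pandas". A minimal sketch of that parsing, consistent with the `samples`/`units` indices and the `is_single_end` helper used in the Snakefile below (the exact code in common.smk may differ):

```python
import pandas as pd

# samples.tsv: one row per sample, indexed by sample name
samples = pd.read_csv("config/samples.tsv", sep="\t").set_index("sample", drop=False)

# units.tsv: one row per sequencing unit, indexed by (sample, unit);
# dtype=str keeps unit identifiers like "1" as strings
units = pd.read_csv("config/units.tsv", sep="\t", dtype=str).set_index(
    ["sample", "unit"], drop=False
)


def is_single_end(sample, unit):
    """A unit is single-end when its fq2 cell is empty (NaN after parsing)."""
    return pd.isnull(units.loc[(sample, unit), "fq2"])
```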
Subproject commit 8083775249918d187e375ae9800d81842ec3c3e9
@@ -6,10 +6,18 @@ units: "config/units.tsv"
resources:
  ref:
    # path to reference genome file in FASTA format
    genome: "resources/ref/genome.chr21.fa"
    # Number of chromosomes to consider for calling.
    # The first n entries of the FASTA will be considered.
    n_chromosomes: 25
    # Ensembl species name
    species: homo_sapiens
    # Ensembl release
    release: 101
    # Genome build
    build: GRCh38

params:
  lc_extrap: True
  # these cutadapt parameters need to contain the required flag(s) for
  # the type of adapter(s) to trim, i.e.:
  # * https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types
...
sample  condition  batch_effect
A       treated    batch1
B       untreated  batch1
C       treated    batch2
D       untreated  batch2

sample  condition  batch_effect  control  antibody
A       treated    batch1        D        BCATENIN
B       untreated  batch2        D        BCATENIN
C       treated    batch1        D        TCF4
D       untreated  batch2
sample  unit  fragment_len_mean  fragment_len_sd  fq1                           fq2                           platform
A       1                                         raw/a.chr21.1.fq              raw/a.chr21.2.fq              ILLUMINA
B       1                                         raw/b.chr21.1.fq              raw/b.chr21.2.fq              ILLUMINA
B       2     300                14               raw/b.chr21.1.fq                                            ILLUMINA
C       1                                         raw/a.chr21.1.fq              raw/a.chr21.2.fq              ILLUMINA
D       1                                         raw/b.chr21.1.fq              raw/b.chr21.2.fq              ILLUMINA
A       1                                         resources/reads/a.chr21.1.fq  resources/reads/a.chr21.2.fq  ILLUMINA
B       1                                         resources/reads/b.chr21.1.fq  resources/reads/b.chr21.2.fq  ILLUMINA
B       2     300                14               resources/reads/b.chr21.1.fq                                ILLUMINA
C       1                                         resources/reads/a.chr21.1.fq  resources/reads/a.chr21.2.fq  ILLUMINA
D       1                                         resources/reads/b.chr21.1.fq  resources/reads/b.chr21.2.fq  ILLUMINA
@@ -2,57 +2,14 @@
# After configuring, running snakemake -n in a clone of this repository should successfully execute a dry-run of the workflow.
include: "rules/common.smk"
include: "rules/ref.smk"
include: "rules/qc.smk"
include: "rules/cutadapt.smk"
include: "rules/mapping.smk"
include: "rules/filtering.smk"
include: "rules/stats.smk"
include: "rules/utils.smk"
def all_input(wildcards):
    wanted_input = []

    # QC with fastQC and multiQC
    wanted_input.extend(["results/qc/multiqc/multiqc.html"])

    # trimming reads
    for (sample, unit) in units.index:
        if is_single_end(sample, unit):
            wanted_input.extend(
                expand(
                    [
                        "results/trimmed/{sample}-{unit}.fastq.gz",
                        "results/trimmed/{sample}-{unit}.se.qc.txt"
                    ],
                    sample=sample,
                    unit=unit
                )
            )
        else:
            wanted_input.extend(
                expand(
                    [
                        "results/trimmed/{sample}-{unit}.1.fastq.gz",
                        "results/trimmed/{sample}-{unit}.2.fastq.gz",
                        "results/trimmed/{sample}-{unit}.pe.qc.txt"
                    ],
                    sample=sample,
                    unit=unit
                )
            )

    # mapping, merging and filtering bam-files
    for sample in samples.index:
        wanted_input.extend(
            expand(
                [
                    "results/orphan_rm_sorted/{sample}.bam"
                ],
                sample=sample
            )
        )

    return wanted_input

include: "rules/post-analysis.smk"

rule all:
    input: all_input
channels:
  - conda-forge
dependencies:
  - curl ==7.71

channels:
  - conda-forge
  - defaults
dependencies:
  - gawk ==5.1.0

channels:
  - bioconda
  - defaults
dependencies:
  - perl-getopt-long ==2.50

channels:
  - conda-forge
dependencies:
  - r-base ==4.0

channels:
  - conda-forge
  - bioconda
dependencies:
  - phantompeakqualtools ==1.2.2
  - samtools ==1.10
  - r-snow ==0.4_3
  - r-snowfall ==1.84_6.1
  - r-bitops ==1.0_6
  - r-catools ==1.18.0
  - bioconductor-rsamtools ==2.4.0

channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - pysam ==0.15.4
  - python ==3.7
  - pysam ==0.16
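The phantompeakqualtools environment above ships `run_spp.R` together with the snow/snowfall packages it needs for parallel execution. A hypothetical sketch of the `phantompeak` rule the commit list refers to (rule shape and file paths are assumptions; the flags are standard run_spp.R options):

```python
rule phantompeak:
    input:
        bam="results/orphan_rm_sorted/{sample}.bam",  # filtered BAM, per all_input above
    output:
        spp="results/phantompeakqualtools/{sample}.spp.out",
        pdf="results/phantompeakqualtools/{sample}.spp.pdf",
    threads: 4
    log:
        "logs/phantompeakqualtools/{sample}.log",
    conda:
        "../envs/phantompeakqualtools.yaml"  # the environment listed above
    shell:
        # -p parallelizes the cross-correlation computation via snow/snowfall
        "run_spp.R -c={input.bam} -savp={output.pdf} -out={output.spp} "
        "-p={threads} 2> {log}"
```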
# parent_id: spp
# parent_name: 'ChIP-seq processing pipeline'
# id: nsc_coefficient
# section_name: 'Normalized strand cross-correlation coefficient (NSC)'
# description: "The normalized strand cross-correlation coefficient is a quality control metric based on the strand-shift cross-correlation plot.
#     The barplot is generated using the run_spp.R script from
#     <a href='https://github.com/kundajelab/phantompeakqualtools' target='_blank'>phantompeakqualtools</a> and shows each sample on the y-axis
#     with its normalized strand cross-correlation coefficient on the x-axis.
#     The coefficient is defined as the maximal cross-correlation value divided by the background cross-correlation.
#     Higher NSC values indicate stronger enrichment.
#     For more information about the calculation and interpretation of this value, please see
#     <a href='https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionV/lessons/CC_metrics_extra.html' target='_blank'>Quality Metrics Based on Cross-Correlation</a>."
# format: 'tsv'
# plot_type: 'bargraph'
# pconfig:
#     title: 'Normalized strand cross-correlation coefficient'
#     ylab: 'NSC coefficient'
#     ymin: 1
#     tt_decimals: 1

# parent_id: spp
# parent_name: 'ChIP-seq processing pipeline'
# id: rsc_coefficient
# section_name: 'Relative strand cross-correlation coefficient (RSC)'
# description: "The relative strand cross-correlation coefficient is a quality control metric based on the strand-shift cross-correlation plot.
#     The barplot is generated using the run_spp.R script from
#     <a href='https://github.com/kundajelab/phantompeakqualtools' target='_blank'>phantompeakqualtools</a> and shows each sample on the y-axis
#     with its relative strand cross-correlation coefficient on the x-axis.
#     The coefficient is defined as the maximal cross-correlation value divided by the phantom-peak cross-correlation value,
#     after subtracting the background cross-correlation from both values.
#     RSC values >1 indicate high enrichment.
#     For more information about the calculation and interpretation of this value, please see
#     <a href='https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionV/lessons/CC_metrics_extra.html' target='_blank'>Quality Metrics Based on Cross-Correlation</a>."
# format: 'tsv'
# plot_type: 'bargraph'
# pconfig:
#     title: 'Relative strand cross-correlation coefficient'
#     ylab: 'RSC coefficient'
#     ymin: 0
#     tt_decimals: 1
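MultiQC picks these sections up through its custom-content mechanism: the commented header is prepended to the corresponding data table, and the resulting file carries an `_mqc` suffix so MultiQC recognizes it during directory scanning. A hypothetical sketch of that step (rule name and paths are assumptions):

```python
rule phantompeak_multiqc_input:
    input:
        header="workflow/header/nsc_header.txt",  # the commented header shown above
        data="results/phantompeakqualtools/{sample}.nsc.tsv",  # assumed per-sample NSC table
    output:
        # the _mqc suffix marks this file as MultiQC custom content
        "results/phantompeakqualtools/{sample}.nsc_mqc.tsv",
    shell:
        "cat {input.header} {input.data} > {output}"
```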