Commit f5956222 authored by Susana MARTINEZ

upload all
The **CrisprPrediction** module identifies CRISPR elements, i.e. spacers and repeats, and CRISPR-flanking sequences from preprocessed reads (input of CRASS) and assembled contigs (input of metaCRT), removes redundancy within the CRISPR elements, and identifies protospacer-containing contigs.
* **Inputs**: from IMP results; MG and MT preprocessed reads, MGMT co-assembled contigs, MT contigs
* **Steps and outputs**:
- results from [CRASS](http://ctskennerton.github.io/crass/); prediction of CRISPR elements from the IMP preprocessed reads --> CRISPR spacers, CRISPR repeats and CRISPR flanks.
- results from [metaCRT](http://omics.informatics.indiana.edu/CRISPR/); prediction of CRISPR arrays and CRISPR elements from MGMT co-assembled and MT contigs --> CRISPR spacers and CRISPR repeats.
- customized extraction of CRISPR flanks from metaCRT CRISPR arrays.
- removal of redundancy within the CRISPR elements, using [CD-HIT](http://weizhongli-lab.org/cd-hit/), in three independent steps:
i) collection (from CRASS and metaCRT results) and clustering of CRISPR spacers
ii) collection and clustering of CRISPR repeats
iii) collection and clustering of CRISPR flanks
- [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi) of CRISPR elements against the MGMT co-assembled contigs
- identification of protospacer-containing contigs (PSpCCs)
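The dereplication and protospacer identification above can be sketched as follows. The `cd-hit-est`/`blastn` command lines are shown as comments with illustrative file names and thresholds (the authoritative parameters are in the workflow rules below); the PSpCC filter is demonstrated on a toy hits table:

```shell
# Hypothetical sketch of element dereplication + protospacer search;
# real parameters live in the CrisprPrediction rules.
#   cd-hit-est -i ALL_spacers.fa -o ALL_spacers-merged.fa -c 0.9 -d 0
#   makeblastdb -in contigs.fa -dbtype nucl
#   blastn -task blastn-short -query ALL_spacers-merged.fa -db contigs.fa \
#          -outfmt '6 qseqid sseqid pident qcovs' > spacer_hits.tsv
# Toy hits table (qseqid, sseqid, pident, qcovs):
printf 'sp1\tctg_1\t97.0\t100\nsp2\tctg_2\t88.0\t60\nsp3\tctg_1\t100.0\t95\n' > spacer_hits.tsv
# A protospacer-containing contig (PSpCC) is any contig hit by a spacer at
# high identity and query coverage (here >=95% for both):
awk -F'\t' '$3 >= 95 && $4 >= 95 {print $2}' spacer_hits.tsv | sort -u   # -> ctg_1
```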
* **Snakemake workflow "CrisprPrediction"**
```
# Source dependencies and define number of threads
source /home/users/smartinezarbas/git/gitlab/CRISPR_MGE_pipeline/src/preload_modules.sh
export THREADS='8'
# Run snakemake (variables defining paths to input files and output directories are in workflows/CrisprPrediction)
snakemake -j $THREADS -pfk crispr_workflow.done -s workflows/CrisprPrediction
```
The **MgePrediction** module identifies bacteriophage (VirSorter and VirFinder) and plasmid (cBar and PlasFlow) sequences from assembled contigs.
* **Inputs**: from IMP results; MGMT co-assembled contigs, MT contigs
* **Steps and outputs**:
- prediction of **phages** by [VirSorter](https://github.com/simroux/VirSorter) and [VirFinder](https://github.com/jessieren/VirFinder)
- prediction of **plasmids** by [cBar](http://csbl.bmb.uga.edu/~ffzhou/cBar/) and [PlasFlow](https://github.com/smaegol/PlasFlow)
* **Snakemake workflow "MgePrediction"**
```
#!/bin/bash -l
#OAR -n mgePrediction
#OAR -l nodes=1/core=8,walltime=120
source /home/users/smartinezarbas/git/gitlab/CRISPR_MGE_pipeline/src/preload_modules.sh
export THREADS='8'
export TS_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results'
export DB_FA_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results/Assemblies'
export TS_SAMPLES='TS2 TD2'
export MGE_OUTDIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_prediction'
snakemake -j 8 -pf plasmid_phage_prediction_workflow.done -s workflows/MgePrediction
```
# CRISPR-MGE pipeline
The CRISPR-MGE pipeline identifies CRISPRs from reads and contigs, and invasive mobile genetic elements (iMGEs), in particular bacteriophages and plasmids, from contigs. CRISPR elements are then assigned to previously defined representative genomes (ReGes), and finally iMGE-host networks are provided.
## The pipeline contains the following workflows:
* [CRISPR prediction](CRISPR-prediction.md)
* **MgePrediction**: identification of bacteriophage (VirSorter and VirFinder) and plasmid (cBar and PlasFlow) sequences from assembled contigs.
* **MgeDereplication**: collection of predicted MGEs and redundancy removal.
* **MgeRemapping**: remapping of all the reads against the MGE contigs.
* **MgeHostLink**: identification of candidate hosts, their spacers composition and the link with the protospacer-containing contigs.
## Dependencies
- [CRASS](http://ctskennerton.github.io/crass/)
- [metaCRT](http://omics.informatics.indiana.edu/CRISPR/)
- [CD-HIT](http://weizhongli-lab.org/cd-hit/)
- [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi)
- [VirSorter](https://github.com/simroux/VirSorter)
- [VirFinder](https://github.com/jessieren/VirFinder)
- [PlasFlow](https://github.com/smaegol/PlasFlow)
- [cBar](http://csbl.bmb.uga.edu/~ffzhou/cBar/)
- bwa
- featureCounts
- R version 3.4.0: packages `tidyverse`, `ggplot2`, `reshape2`, (...)
## 2- **MgePrediction workflow**
* **Inputs**: from IMP results; MGMT co-assembled contigs, MT contigs
* **Steps and outputs**:
- prediction of **phages** by [VirSorter](https://github.com/simroux/VirSorter) and [VirFinder](https://github.com/jessieren/VirFinder)
- prediction of **plasmids** by [cBar](http://csbl.bmb.uga.edu/~ffzhou/cBar/) and [PlasFlow](https://github.com/smaegol/PlasFlow)
* **Launcher** (adjust memory requests, input paths, etc.)
```
#!/bin/bash -l
#OAR -n mgePrediction
#OAR -l nodes=1/core=8,walltime=120
source /home/users/smartinezarbas/git/gitlab/CRISPR_MGE_pipeline/src/preload_modules.sh
export THREADS='8'
export TS_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results'
export DB_FA_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results/Assemblies'
export TS_SAMPLES='TS2 TD2'
export MGE_OUTDIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_prediction'
snakemake -j 8 -pf plasmid_phage_prediction_workflow.done -s workflows/MgePrediction
```
## 3- **MgeDereplication workflow**
* **Inputs**: all phage and plasmid predictions
* **Steps and outputs**:
- collection of the lists of phages and plasmid predictions, and the PSpCCs
- fetch the sequences
- run CD-HIT (parameters within rule collapse\_candidate\_mobile\_elements.rule)
- generate table of MGE info (contig name, type, etc.)
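The collection/fetch/cluster steps above can be sketched as follows. File names are illustrative and the `cd-hit-est` call is commented out; the ID-based sequence fetch is demonstrated on a toy FASTA:

```shell
# Hypothetical sketch of MGE dereplication; names are illustrative.
#   cat phage_ids.txt plasmid_ids.txt pspcc_ids.txt | sort -u > mge_ids.txt
#   cd-hit-est -i mge_candidates.fa -o mge_unique.fa -c 0.99 -d 0
# Fetch sequences by ID from a FASTA with plain awk (toy data):
printf '>ctg_1\nACGT\n>ctg_2\nGGCC\n>ctg_3\nTTAA\n' > contigs.fa
printf 'ctg_1\nctg_3\n' > mge_ids.txt
# First pass stores wanted headers; second pass toggles printing per record
awk 'NR==FNR {want[">"$1]; next} /^>/ {keep = ($1 in want)} keep' mge_ids.txt contigs.fa
# -> >ctg_1 / ACGT / >ctg_3 / TTAA
```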
* **Launcher** (adjust memory request, input paths, etc.)
```
#!/bin/bash -l
#OAR -n mgeDereplication
#OAR -l nodes=1/core=8,walltime=120
source /home/users/smartinezarbas/git/gitlab/CRISPR_MGE_pipeline/src/preload_modules.sh
export THREADS='8'
export TS_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results'
export DB_FA_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results/Assemblies'
export TS_SAMPLES='TS2 TD2'
export MGE_OUTDIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_prediction'
snakemake -j 8 -pf mge_dereplication_workflow.done -s workflows/MgeDereplication
```
## 4- **MgeRemapping**
* **Inputs**: preprocessed MG and MT reads, list of unique MGEs
* **Steps**:
- annotation
- index
- mapping reads to the mge contigs
- featureCounts: gene and contig levels
- calculate abundance: average depth of coverage per contig of MG and MT
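The remapping steps above can be sketched as follows. The `bwa`/`featureCounts` invocations are shown as comments with illustrative names; the per-contig average-depth calculation is demonstrated on toy `samtools depth`-style output:

```shell
# Hypothetical sketch of the remapping step; sample names are illustrative.
#   bwa index mge_unique.fa
#   bwa mem -t "$THREADS" mge_unique.fa mg.r1.fq mg.r2.fq | samtools sort -o mg.bam
#   featureCounts -a ALL_MGEs_annotation.gff -o gene_counts.txt mg.bam
# Average depth of coverage per contig from `samtools depth`-style output
# (columns: contig, position, depth); toy input:
printf 'ctg_1\t1\t10\nctg_1\t2\t20\nctg_2\t1\t5\n' > depth.txt
awk -F'\t' '{s[$1]+=$3; n[$1]++} END {for (c in s) printf "%s\t%.1f\n", c, s[c]/n[c]}' depth.txt | sort
# -> ctg_1 15.0 / ctg_2 5.0
```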
* **Launcher**
```
#!/bin/bash -l
#OAR -n mgeRemapping_test
#OAR -l nodes=1/core=12,walltime=96
source /home/users/smartinezarbas/git/gitlab/CRISPR_MGE_pipeline/src/preload_modules.sh
export THREADS='3'
export TS_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results'
export TS_SAMPLES='TS2 TD2'
export IGE_DIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_dereplication'
export OUTDIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_remapping'
snakemake -j 4 -pfk mge_remapping_workflow.done -s workflows/MgeRemapping
```
## 5- **MgeHostLink**
* **Inputs**: outputs from all the previous workflows + output of the bin dereplication (to link a bin with a CRISPR, at least some contigs must be assigned to bins)
* **Steps**
- blast of bins against CRISPR flanks (from workflow **CrisprPrediction**)
- blast of bins against CRISPR repeats (from workflow **CrisprPrediction**)
- identification of hosts: bins matching both flank and repeat sequences, filtered by coverage and identity
- link CRISPR spacers to protospacers (formatting and adding info to protospacers identified in workflow **CrisprPrediction**)
- link spacers with candidate hosts
- link hosts with targeted protospacers
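The host-identification logic above can be sketched as follows; identifiers and file names are illustrative. Bins are BLASTed against the dereplicated flanks and repeats, and a bin supported by both a flank and a repeat match (after coverage/identity filtering) becomes a candidate host:

```shell
# Hypothetical sketch of host identification; identifiers are illustrative.
#   blastn -query ALL_flanks-merged.fa -db bins.fa \
#          -outfmt '6 qseqid sseqid pident qcovs' > raw_flank_hits.tsv
# Toy filtered hit tables (bin, element), intersected to find bins
# supported by both a flank and a repeat match:
printf 'bin_1\tflank_3\nbin_2\tflank_7\n' > flank_hits.tsv
printf 'bin_1\trepeat_5\nbin_3\trepeat_9\n' > repeat_hits.tsv
comm -12 <(cut -f1 flank_hits.tsv | sort -u) <(cut -f1 repeat_hits.tsv | sort -u)   # -> bin_1
```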
* **Launcher**
```
#!/bin/bash -l
#OAR -n mgeHostLink_test
#OAR -l nodes=1/core=1,walltime=4
source /home/users/smartinezarbas/git/gitlab/CRISPR_MGE_pipeline/src/preload_modules.sh
export THREADS='3'
export TS_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results'
export TS_SAMPLES='TS2 TD2'
export DB_FA_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results/Assemblies'
export BIN_DICT='metadata/bin_conversion.tsv'
export OUTDIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_CrispHost_link'
snakemake -npf mge_host_link_workflow.done -s workflows/MgeHostLink
```
## 6- **Not added**: additional downstream analyses that require manual inspection
- abundance of bins/ReGes per sample
- PCoA with/without envfit of physico-chemical parameters
- gene expression normalization
- CRISPR-Cas genes visualization
- Other plots: ratios (plasmid/phage, gain/loss of spacers over time), heatmaps of presence/absence of spacers, etc.
{
"threads": 12,
"memory_total_gb": 384,
"memory_per_core_gb": 32,
"tmp_dir": "/tmp",
"sample": "test",
"outputdir": "/output",
"db_path": "/mnt/nfs/projects/ecosystem_biology/local_tools/IMP/dependencies/prokka/db",
"prokka": {
"databases": [
"cm/Bacteria.i1i",
"genus/Staphylococcus.phr",
"hmm/CLUSTERS.hmm.h3f",
"kingdom/Archaea/sprot.phr"
]
},
"kegg": {
"db_ec2pthy": "http://rest.kegg.jp/link/ec/pathway",
"db_hierarchy": "http://rest.kegg.jp/list/pathway"
},
"binning": {
"pk": 10,
"nn": 4
},
"dRep": {
"completeness": 0.6,
"strain_htr": 101,
"P_ani": 0.6,
"S_ani": 0.965
}
}
#!/bin/bash -l
#OAR -n crisprPrediction_test
#OAR -l nodes=1/core=8,walltime=48
source /home/users/smartinezarbas/git/gitlab/CRISPR_MGE_pipeline/src/preload_modules.sh
export THREADS='8'
export TS_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results'
export DB_FA_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results/Assemblies'
export TS_SAMPLES='TS2 TD2'
export CRISPR_OUTDIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/CRISPR_prediction'
snakemake -j 8 -pf crispr_workflow.done -s workflows/CrisprPrediction
#!/bin/bash -l
#OAR -n mgeDereplication_test
#OAR -l nodes=1/core=6,walltime=6
source /home/users/smartinezarbas/git/gitlab/CRISPR_MGE_pipeline/src/preload_modules.sh
export THREADS='6'
export TS_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results'
export DB_FA_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results/Assemblies'
export TS_SAMPLES='TS2 TD2'
export VIRSORTER_DIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_prediction/VirSorter'
export VIRFINDER_DIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_prediction/VirFinder'
export PLASF_DIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_prediction/PlasFlow'
export CBAR_DIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_prediction/cBar'
export WORK_DIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_dereplication'
snakemake --allow-ambiguity -pf mge_dereplication_workflow.done -s workflows/MgeDereplication
#!/bin/bash -l
#OAR -n mgeHostLink_test
#OAR -l nodes=1/core=1,walltime=4
source /home/users/smartinezarbas/git/gitlab/CRISPR_MGE_pipeline/src/preload_modules.sh
export THREADS='1'
export TS_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results'
export TS_SAMPLES='TS2 TD2'
export DB_FA_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results/Assemblies'
# BIN_DICT should be generated after the dereplication of the bins
export BIN_DICT='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results/dRepALL/RepresentativeBins/ReGes_dictionary.tsv'
# For this workflow, HOST_DB should contain fasta of dereplicated genomes or bins
export HOST_DB='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results/dRepALL/RepresentativeBins/dereplicated_genomes/ALL_representative_genomes.fa'
export CRISPR_ELEMENTS_DIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/ALL'
export DREP_DONE='no'
export OUTDIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_CrisprHost_link'
snakemake -pf mge_host_link_workflow.done -s workflows/MgeHostLink
#!/bin/bash -l
#OAR -n mgePrediction_test
#OAR -l nodes=1/core=6,walltime=48
source /home/users/smartinezarbas/git/gitlab/CRISPR_MGE_pipeline/src/preload_modules.sh
export THREADS='6'
export TS_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results'
export DB_FA_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results/Assemblies'
export TS_SAMPLES='TS2 TD2'
export CRISPR_OUTDIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE'
snakemake -j 6 -pf plasmid_phage_prediction_workflow.done -s workflows/MgePrediction
#snakemake -pf VirFinder/TS2/TS2_mt_unmapped.done -s workflows/MgePrediction
#snakemake -pf PlasFlow/TS2/TS2_mgmt.done -s workflows/MgePrediction
#snakemake -pf cBar/TS2/TS2_mgmt.done -s workflows/MgePrediction
#!/bin/bash -l
#OAR -n mgeRemapping_test
#OAR -l nodes=1/core=12,walltime=96
source /home/users/smartinezarbas/git/gitlab/CRISPR_MGE_pipeline/src/preload_modules.sh
export THREADS='3'
export TS_DIR='/work/users/smartinezarbas/comparative_analysis/AmazonRiver/IMP_results'
export TS_SAMPLES='TS2 TD2'
export IGE_DIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_dereplication'
export OUTDIR='/scratch/users/smartinezarbas/AmazonRiverCRISPR_MGE/MGE_remapping'
snakemake -j 4 -pfk mge_remapping_workflow.done -s workflows/MgeRemapping
#snakemake -j 4 -pfk Annotations/ALL_MGEs_annotation.gff -s workflows/MgeRemapping
#snakemake -j 4 -pfk Index/ALL_MGEs-merged.fa -s workflows/MgeRemapping
#snakemake -j 4 -pfk Calculations/ContigLevel/ALL_mg_contig_depth.txt -s workflows/MgeRemapping
rule filter_protospacers:
input:
"ALL/spacers/ALL_spacers-merged_x_ALL_MGMT.assembly.merged.tsv",
"ALL/repeats/ALL_repeats-merged_x_ALL_MGMT.assembly.merged.tsv"
output:
"ALL/protospacers/ALL_pspcc.tsv",
"ALL/protospacers/ALL_pspcc_qcov95_pident95.tsv",
"ALL/protospacers/ALL_pspcc_qcov95_pident95.gff"
shell:
"""
source {SRCDIR}/R_env.sh
Rscript {SRCDIR}/filterProtospacers.R {input[0]} {input[1]} {output[0]} {output[1]}
Rscript {SRCDIR}/pspcc2gff.R {output[1]} {output[2]}
source {SRCDIR}/unload_R_env.sh
"""
rule get_metaCRT_flanks:
input:
"ALL/metaCRT/metacrt_contigs-mgmt.out",
"%s/ALL-mgmt.assembly.merged.fa" % DB_FA_DIR
output:
"ALL/metaCRT/metacrt_contigs-mgmt_flanks.fa",
"db/ALL-mgmt.assembly.merged.fa"
shell:
"""
mkdir -p db
rsync -av {input[1]} {output[1]}
source {SRCDIR}/R_env.sh
Rscript {SRCDIR}/parser_metaCRTout.R {input[0]} {TMPDIR}/flanks.bed {TMPDIR}/flanks.tsv
bedtools getfasta -fi {output[1]} -bed {TMPDIR}/flanks.bed -fo {TMPDIR}/flanks.fa
Rscript {SRCDIR}/merge_bedfasta.R {TMPDIR}/flanks.bed {TMPDIR}/flanks.fa {TMPDIR}/metacrt_contigs-mgmt_flanks.tsv
awk '{{print \">\"$1\"\\n\"$2}}' {TMPDIR}/metacrt_contigs-mgmt_flanks.tsv > {output[0]}
touch {output}
source {SRCDIR}/unload_R_env.sh
"""
rule blast_flanks:
input:
fa = "ALL/flanks/ALL_flanks-merged.fa",
so = "db/ALL-mgmt.assembly.merged.fa",
sp = "ALL/spacers/blast_spacers.done",
done = "db/formatdb.done"
output:
blastout = "ALL/flanks/ALL_flanks-merged_x_ALL_MGMT.assembly.merged.tsv",
done = "ALL/flanks/blast_flanks.done"
shell:
"""
FILESIZE=$(stat -c%s "{input.fa}")
BSIZE=$(echo $(($FILESIZE/{THREADS}/2)))
cat {input.fa} | parallel --gnu --plain -j {THREADS} --block ${{BSIZE}} --recstart '>' --pipe blastn -query - -db {input.so} -task 'blastn' -outfmt "6\ qseqid\ sseqid\ pident\ length\ mismatch\ gapopen\ qstart\ qend\ qlen\ qcovs\ sstart\ send\ slen\ evalue\ bitscore" >> {output.blastout}
sed -i '1i qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tqlen\tqcovs\tsstart\tsend\tslen\tevalue\tbitscore' {output.blastout}
touch {output.done}
"""
rule blast_repeats:
input:
fa = "ALL/repeats/ALL_repeats-merged.fa",
so = "db/ALL-mgmt.assembly.merged.fa",
log = "db/formatdb.done"
output:
blastout = "ALL/repeats/ALL_repeats-merged_x_ALL_MGMT.assembly.merged.tsv",
done = "ALL/repeats/blast_repeats.done"
shell:
"""
FILESIZE=$(stat -c%s "{input.fa}")
BSIZE=$(echo $(($FILESIZE/{THREADS}/2)))
cat {input.fa} | parallel --gnu --plain -j {THREADS} --block ${{BSIZE}} --recstart '>' --pipe blastn -query - -db {input.so} -task 'blastn-short' -outfmt "6\ qseqid\ sseqid\ pident\ length\ mismatch\ gapopen\ qstart\ qend\ qlen\ qcovs\ sstart\ send\ slen\ evalue\ bitscore" >> {output.blastout}
sed -i '1i qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tqlen\tqcovs\tsstart\tsend\tslen\tevalue\tbitscore' {output.blastout}
touch {output.done}
"""
rule blast_spacers:
input:
fa = "ALL/spacers/ALL_spacers-merged.fa",
so = "db/ALL-mgmt.assembly.merged.fa",
dr = "ALL/repeats/blast_repeats.done",
done = "db/formatdb.done"
output:
blastout = "ALL/spacers/ALL_spacers-merged_x_ALL_MGMT.assembly.merged.tsv",
done = "ALL/spacers/blast_spacers.done"
shell:
"""
FILESIZE=$(stat -c%s "{input.fa}")
BSIZE=$(echo $(($FILESIZE/{THREADS}/2)))
cat {input.fa} | parallel --gnu --plain -j {THREADS} --block ${{BSIZE}} --recstart '>' --pipe blastn -query - -db {input.so} -task 'blastn-short' -gapopen 10 -penalty -1 -dust "no" -evalue 1 -outfmt \"6\ qseqid\ sseqid\ pident\ length\ mismatch\ gapopen\ qstart\ qend\ qlen\ qcovs\ sstart\ send\ slen\ evalue\ bitscore\" >> {output.blastout}
sed -i '1i qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tqlen\tqcovs\tsstart\tsend\tslen\tevalue\tbitscore' {output.blastout}
touch {output.done}
"""
rule merge_flanks:
input:
expand("ALL/{ts_sample}/crass_mg-reads/crass_mg-reads_{ts_sample}_flanks.fa", ts_sample = TS_SAMPLES),
expand("ALL/{ts_sample}/crass_mt-reads/crass_mt-reads_{ts_sample}_flanks.fa", ts_sample = TS_SAMPLES),
"ALL/metaCRT/metacrt_contigs-mgmt_flanks.fa"
output:
"ALL/flanks/ALL_flanks.fa",
"ALL/flanks/ALL_flanks-merged.fa",
"ALL/flanks/ALL_flanks-merged.fa.clstr",
"ALL/flanks/ALL_flanks-merged.fa.tbl",
"ALL/flanks/ALL_flanks-merged.fa.tbl2seq"
shell:
"""
cat {input} > {output[0]}
sed -i "s/^>_/>/g" {output[0]}
cd-hit-est -T {THREADS} -M $(({MEMTOTAL}*1000)) -i {output[0]} -o {output[1]} -c 0.99 -d 0 -aL 0.975 -aS 0.975 -s 0.975
# cd-hit-lap -i {output[0]} -o {output[1]}
# This provides a table of the mappings for the clustered sequences
clstr_sql_tbl.pl {output[2]} {output[3]}
join -1 1 -2 3 -t$'\t' <(paste <(grep "^>" {output[2]} | sed -e "s/>Cluster //g") <(grep "\*" {output[2]} | cut -f2 -d " " | sed -e "s/>//g" | sed -e "s/\.\.\.//g")) <(cat {output[3]}) > {output[4]}
"""
rule merge_repeats:
input:
expand("ALL/{ts_sample}/crass_mt-reads/crass_mt-reads_{ts_sample}_repeats.fa", ts_sample = TS_SAMPLES),
expand("ALL/{ts_sample}/crass_mg-reads/crass_mg-reads_{ts_sample}_repeats.fa", ts_sample = TS_SAMPLES),
expand("ALL/{ts_sample}/metacrt_contigs-mt/metacrt_contigs-mt_repeats.fa", ts_sample = TS_SAMPLES),
"ALL/metaCRT/metacrt_contigs-mgmt_repeats.fa"
output:
"ALL/repeats/ALL_repeats.fa",
"ALL/repeats/ALL_repeats-merged.fa",
"ALL/repeats/ALL_repeats-merged.fa.clstr",
"ALL/repeats/ALL_repeats-merged.fa.tbl",
"ALL/repeats/ALL_repeats-merged.fa.tbl2seq"
shell:
"""
cat {input} > {output[0]}
sed -i "s/^>_/>/g" {output[0]}
cd-hit-est -i {output[0]} -o {output[1]} -c 0.8 -d 0 -s 0.75
# This provides a table of the mappings for the clustered sequences
clstr_sql_tbl.pl {output[2]} {output[3]}
join -1 1 -2 3 -t$'\t' <(paste <(grep "^>" {output[2]} | sed -e "s/>Cluster //g") <(grep "\*" {output[2]} | cut -f2 -d " " | sed -e "s/>//g" | sed -e "s/\.\.\.//g")) <(cat {output[3]}) > {output[4]}
"""
rule merge_spacers:
input:
expand("ALL/{ts_sample}/crass_mt-reads/crass_mt-reads_{ts_sample}_spacers.fa", ts_sample = TS_SAMPLES),
expand("ALL/{ts_sample}/crass_mg-reads/crass_mg-reads_{ts_sample}_spacers.fa", ts_sample = TS_SAMPLES),
expand("ALL/{ts_sample}/metacrt_contigs-mt/metacrt_contigs-mt_spacers.fa", ts_sample = TS_SAMPLES),
"ALL/metaCRT/metacrt_contigs-mgmt_spacers.fa"
output:
"ALL/spacers/ALL_spacers.fa",
"ALL/spacers/ALL_spacers-merged.fa",
"ALL/spacers/ALL_spacers-merged.fa.clstr",
"ALL/spacers/ALL_spacers-merged.fa.tbl",
"ALL/spacers/ALL_spacers-merged.fa.tbl2seq",
"ALL/spacers/ReSpa2MeSpa.tsv"
shell:
"""
cat {input} > {output[0]}
sed -i "s/^>_/>/g" {output[0]}
cd-hit-est -i {output[0]} -o {output[1]} -c 0.9 -d 0 -s 1 -aL 1 -aS 1
# This provides a table of the mappings for the clustered sequences
clstr_sql_tbl.pl {output[2]} {output[3]}
join -1 1 -2 3 -t$'\t' <(paste <(grep "^>" {output[2]} | sed -e "s/>Cluster //g") <(grep "\*" {output[2]} | cut -f2 -d " " | sed -e "s/>//g" | sed -e "s/\.\.\.//g")) <(cat {output[3]}) > {output[4]}
cat {output[4]} | cut -f 2,3 > {output[5]}
"""