## This is the story of how harmless curiosity about Oxford Nanopore Technologies (ONT) led to an awesome learning experience!

- Documented here for those who wish to follow in our footsteps.

# It started with a conversation titled, "What is the current state of the ONT data from 2018 and 2019?"

- And here, the journey begins with me (SBB) trying to understand how different assembly methods involving short-read and/or long-read sequences influence the number and kind of proteins recovered,
- and whether one is an improvement over the other.

## Chapter I - Running clustering analyses using [Prodigal](https://github.com/hyattpd/Prodigal) output

* CCL built an amazing and comprehensive [Snakefile](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/original_CCL_Snakefileurl)
* Everything from assemblies to gene calls to genome coverage was computed using short reads and long reads from a Generous Donor B (GDB) sample collected in 2018 and in 2019
* The ensuing story is told from my perspective, including the minor role I played in the downstream analyses

# The story starts with working in my own folders

```
cd /scratch/users/sbusi/ONT
```

# Symlinking Cedric's data into my folder, then modifying files such as the Snakefile

```
ln -sd /scratch/users/claczny/ont/fecal_pilot_testing/results/annotation/proteins/ cedric_ont_proteins

# included a rule to run cd-hit
rule cdhit:
    input:
        db1=expand("{datadir}/flye/lr/merged/barcode07/assembly.faa", datadir=DATA_DIR),
        db2=expand("{datadir}/megahit/NEB2_MG_S17/final.contigs.faa", datadir=DATA_DIR),
        db3=expand("{datadir}/metaspades_hybrid/lr_barcode07-sr_NEB2_MG_S17/contigs.faa", datadir=DATA_DIR)
    output:
        g12=expand("{resultsdir}/flye_megahit_novel.fasta", resultsdir=RESULTS_DIR),
        g23=expand("{resultsdir}/megahit_metaspades_novel.fasta", resultsdir=RESULTS_DIR),
        g13=expand("{resultsdir}/flye_metaspades_novel.fasta", resultsdir=RESULTS_DIR),
        g21=expand("{resultsdir}/megahit_flye_novel.fasta", resultsdir=RESULTS_DIR),
        g32=expand("{resultsdir}/metaspades_megahit_novel.fasta", resultsdir=RESULTS_DIR),
        g31=expand("{resultsdir}/metaspades_flye_novel.fasta", resultsdir=RESULTS_DIR),
        gout=expand("{resultsdir}/cd-hit.done", resultsdir=RESULTS_DIR)
    conda: "/scratch/users/sbusi/cd-hit.yml"
    shell:
        """
        date
        # cd-hit-2d is directional: -o keeps the -i2 sequences that are not similar to any -i sequence,
        # so every pair of assemblies is run in both orders
        cd-hit-2d -i {input.db1} -i2 {input.db2} -o {output.g12} -c 0.9 -n 5 -d 0 -M 16000 -T 4
        cd-hit-2d -i {input.db2} -i2 {input.db3} -o {output.g23} -c 0.9 -n 5 -d 0 -M 16000 -T 4
        cd-hit-2d -i {input.db1} -i2 {input.db3} -o {output.g13} -c 0.9 -n 5 -d 0 -M 16000 -T 4
        cd-hit-2d -i {input.db2} -i2 {input.db1} -o {output.g21} -c 0.9 -n 5 -d 0 -M 16000 -T 4
        cd-hit-2d -i {input.db3} -i2 {input.db2} -o {output.g32} -c 0.9 -n 5 -d 0 -M 16000 -T 4
        cd-hit-2d -i {input.db3} -i2 {input.db1} -o {output.g31} -c 0.9 -n 5 -d 0 -M 16000 -T 4
        touch {output.gout}
        date
        """

# ran the file as follows:
./run_snakemake.sh

# subsequently used auxiliary scripts to merge the cluster (output) files
cd /scratch/users/sbusi/ONT/cd_hit_output

# making plots for all .clstr files (http://weizhongli-lab.org/lab-wiki/doku.php?id=cd-hit-user-guide#cd-hit-2d)
# note: as written, the first list of ranges skips 300-499
for file in *.clstr
do
    echo "$file"
    plot_len1.pl "$file" \
        1,2-4,5-9,10-19,20-49,50-99,100-299,500-99999 \
        10-59,60-149,150-499,500-1999,2000-999999
done >> cluster_plots

# merged multiple clstr files
clstr_merge.pl flye_megahit_novel.fasta.clstr flye_metaspades_novel.fasta.clstr megahit_metaspades_novel.fasta.clstr > merged.clstr

# plotting
plot_len1.pl test.clstr \
    1,2-4,5-9,10-19,20-49,50-99,100-299,500-99999 \
    10-59,60-149,150-499,500-1999,2000-999999
```
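
A quick sanity check on a `.clstr` file can be handy alongside the plots. The sketch below is not one of the CD-HIT auxiliary scripts: `clstr_summary` is a hypothetical helper, and it assumes the usual `.clstr` layout (a `>Cluster N` line followed by one line per member sequence). It tallies clusters, member sequences, and clusters with a single member.

```shell
#!/usr/bin/env bash
# clstr_summary: hypothetical helper, not part of CD-HIT.
# Assumes each cluster starts with a ">Cluster N" line, followed by one line per member.
clstr_summary() {
    awk '
        /^>Cluster/ {
            if (n == 1) singletons++   # the cluster that just ended had one member
            clusters++
            n = 0
            next
        }
        { n++; members++ }             # every other line is a member sequence
        END {
            if (n == 1) singletons++   # account for the final cluster
            printf "clusters=%d members=%d singletons=%d\n", clusters, members, singletons
        }
    ' "$1"
}
```

It would be called as `clstr_summary merged.clstr`, or inside the same `for file in *.clstr` loop used for the plots.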

- However, having to re-run cd-hit-2d on the same samples with the inputs swapped was not only cumbersome; depending on the order of the inputs, it also produced non-overlapping sets of FASTA sequences.

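
One way to see how much the input order matters is to count the records in each `*_novel.fasta` and compare the two directions of every pair. A minimal sketch, assuming the six output files sit in the current directory; `count_seqs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# count_seqs: hypothetical helper; prints the number of FASTA records (header lines) in a file
count_seqs() {
    grep -c '^>' "$1" || true   # grep exits non-zero when the count is 0, so keep going
}

# Report the record count for every pairwise output found in this directory
for f in *_novel.fasta; do
    [ -e "$f" ] || continue     # skip the literal pattern when nothing matches
    printf '%s\t%s\n' "$f" "$(count_seqs "$f")"
done
```

Comparing, say, `flye_megahit_novel.fasta` against `megahit_flye_novel.fasta` in that listing shows the asymmetry at a glance.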
## So, I decided to pursue [MMseqs2](https://github.com/soedinglab/MMseqs2) instead

* The output of this is now incorporated into the [updated_SNAKEFILE](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/2019_GDB/updated_SNAKEFILE) that will be used for the analyses going forward.
* see also [MMSEQ_RULES](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/2019_GDB/rules/MMSEQ_RULES)
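
For readers who want the gist without opening the rules file, here is a generic sketch of the idea, not a copy of MMSEQ_RULES: pool the protein sets with their assembler of origin tagged in the header, then cluster once, so that no ordered pairs are needed. The helper names `tag_and_pool` and `cluster_pooled`, the labels, and the thresholds (picked to echo the `-c 0.9` used with cd-hit-2d) are all assumptions.

```shell
#!/usr/bin/env bash
# tag_and_pool and cluster_pooled are hypothetical helpers for illustration only.

tag_and_pool() {
    # usage: tag_and_pool OUT LABEL=FILE [LABEL=FILE ...]
    local out=$1; shift
    : > "$out"                       # start from an empty pooled file
    local pair label file
    for pair in "$@"; do
        label=${pair%%=*}
        file=${pair#*=}
        # prefix every header with its label, e.g. ">flye|gene_1"
        sed "s/^>/>${label}|/" "$file" >> "$out"
    done
}

cluster_pooled() {
    # usage: cluster_pooled POOLED.faa OUT_PREFIX TMP_DIR
    # Identity and coverage thresholds are assumptions, not the values from the repository.
    mmseqs easy-cluster "$1" "$2" "$3" --min-seq-id 0.9 -c 0.8 --threads 4
}
```

A run might look like `tag_and_pool pooled.faa flye=assembly.faa megahit=final.contigs.faa metaspades=contigs.faa` followed by `cluster_pooled pooled.faa clusterRes tmp`. The labels in the resulting cluster table then show which assemblers contributed to each cluster.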