Susheel Busi · a47bcb4d
--- a/ONT_pilot_w_GDB_samples.md
+++ b/ONT_pilot_w_GDB_samples.md
+## This is the story of how harmless curiosity about Oxford Nanopore Technology led to an awesome learning experience!
+- Documented here for those who wish to follow in our footsteps.
+
+# It started with a conversation titled, "what is the current state of the ONT data from 2018 and 2019?"
+- And here the journey begins, with me (SBB) trying to understand how different assembly methods involving short-read and/or long-read sequences influence the number and kind of proteins recovered.
+- And whether one approach is an improvement over the other.
+
+## Chapter I - Running clustering analyses using [Prodigal](https://github.com/hyattpd/Prodigal) output
+*  CCL built an amazing and comprehensive [Snakefile](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/original_CCL_Snakefileurl)
+*  Everything from assemblies to gene calls to genome coverage was estimated using short reads and long reads from a Generous donor B (GDB) sample collected in 2018 and in 2019
+*  The ensuing story is being told from my perspective, including the minor role I will have played in the downstream analyses
+
+# The story starts with first working in my own folders
+```
+cd /scratch/users/sbusi/ONT
+```
+# Symlinking Cedric's data to mine and then modifying files such as the Snakefile
+```
+ln -sd /scratch/users/claczny/ont/fecal_pilot_testing/results/annotation/proteins/ cedric_ont_proteins
+
+# included a rule to run cd-hit
+rule cdhit:
+   input:
+     db1=expand("{datadir}/flye/lr/merged/barcode07/assembly.faa", datadir=DATA_DIR),
+     db2=expand("{datadir}/megahit/NEB2_MG_S17/final.contigs.faa", datadir=DATA_DIR),
+     db3=expand("{datadir}/metaspades_hybrid/lr_barcode07-sr_NEB2_MG_S17/contigs.faa", datadir=DATA_DIR)
+   output:
+     g12=expand("{resultsdir}/flye_megahit_novel.fasta", resultsdir=RESULTS_DIR),
+     g23=expand("{resultsdir}/megahit_metaspades_novel.fasta", resultsdir=RESULTS_DIR),
+     g13=expand("{resultsdir}/flye_metaspades_novel.fasta", resultsdir=RESULTS_DIR),
+     g21=expand("{resultsdir}/megahit_flye_novel.fasta", resultsdir=RESULTS_DIR),
+     g32=expand("{resultsdir}/metaspades_megahit_novel.fasta", resultsdir=RESULTS_DIR),
+     g31=expand("{resultsdir}/metaspades_flye_novel.fasta", resultsdir=RESULTS_DIR),
+     gout=expand("{resultsdir}/cd-hit.done", resultsdir=RESULTS_DIR)
+   conda:  "/scratch/users/sbusi/cd-hit.yml"
+   shell:  """
+           date
+           cd-hit-2d -i {input.db1} -i2 {input.db2} -o {output.g12} -c 0.9 -n 5 -d 0 -M 16000 -T 4
+           cd-hit-2d -i {input.db2} -i2 {input.db3} -o {output.g23} -c 0.9 -n 5 -d 0 -M 16000 -T 4
+           cd-hit-2d -i {input.db1} -i2 {input.db3} -o {output.g13} -c 0.9 -n 5 -d 0 -M 16000 -T 4
+           cd-hit-2d -i {input.db2} -i2 {input.db1} -o {output.g21} -c 0.9 -n 5 -d 0 -M 16000 -T 4
+           cd-hit-2d -i {input.db3} -i2 {input.db2} -o {output.g32} -c 0.9 -n 5 -d 0 -M 16000 -T 4
+           cd-hit-2d -i {input.db3} -i2 {input.db1} -o {output.g31} -c 0.9 -n 5 -d 0 -M 16000 -T 4
+           touch {output.gout}
+           date
+           """
+
+# ran the file as follows:
+./run_snakemake.sh 
+
+# subsequently used auxiliary scripts to merge the cluster (output) files
+cd /scratch/users/sbusi/ONT/cd_hit_output
+
+# making plots for all .clstr files (http://weizhongli-lab.org/lab-wiki/doku.php?id=cd-hit-user-guide#cd-hit-2d)
+for file in *.clstr 
+do      
+    echo "$file"
+    plot_len1.pl "$file"  \
+           1,2-4,5-9,10-19,20-49,50-99,100-299,500-99999   \
+                 10-59,60-149,150-499,500-1999,2000-999999
+done >> cluster_plots
+
+# merged multiple clstr files
+clstr_merge.pl flye_megahit_novel.fasta.clstr flye_metaspades_novel.fasta.clstr megahit_metaspades_novel.fasta.clstr > merged.clstr
+# plotting
+plot_len1.pl merged.clstr  \
+           1,2-4,5-9,10-19,20-49,50-99,100-299,500-99999   \
+                 10-59,60-149,150-499,500-1999,2000-999999
+```
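The `.clstr` files that `plot_len1.pl` summarises are plain text: a `>Cluster N` header followed by one line per member sequence. As a rough illustration of what sits behind those plots, the sketch below counts members per cluster in Python. The function names and the example records (`protA`, `protB`, `protC`) are made up for illustration and are not part of the pipeline; it assumes the standard CD-HIT cluster layout.

```python
from collections import Counter


def cluster_sizes(clstr_text):
    """Return the number of member sequences in each cluster, in file order."""
    sizes = []
    for line in clstr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster header starts a fresh member count.
            sizes.append(0)
        elif sizes:
            # Every other non-empty line is one member of the current cluster.
            sizes[-1] += 1
    return sizes


def size_histogram(sizes):
    """Map cluster size -> how many clusters have that size."""
    return Counter(sizes)


# Hypothetical records, only to show the expected layout of a .clstr file.
EXAMPLE = """>Cluster 0
0\t120aa, >protA... *
1\t118aa, >protB... at 95.00%
>Cluster 1
0\t300aa, >protC... *
"""

if __name__ == "__main__":
    sizes = cluster_sizes(EXAMPLE)
    print(sizes)
    print(dict(size_histogram(sizes)))
```

For a real file, the text could be read with `open("merged.clstr").read()` and passed to `cluster_sizes`.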
+- However, re-running the same cd-hit-2d comparisons with the sample order swapped was not only cumbersome: depending on the orientation of the samples, the runs also produced non-overlapping sets of FASTA sequences
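To make the bookkeeping concrete: three assemblies give six ordered pairs, because `cd-hit-2d -i A -i2 B` and `cd-hit-2d -i B -i2 A` are separate runs with separate outputs. The sketch below only builds the command strings, using the same flags as the `cdhit` rule above; it does not run anything. The input file names (`flye.faa` and so on) are placeholders, not the real paths, and the helper name is hypothetical.

```python
from itertools import permutations

# Placeholder inputs; the real files are the .faa paths in the cdhit rule.
FAA_BY_LABEL = {
    "flye": "flye.faa",
    "megahit": "megahit.faa",
    "metaspades": "metaspades.faa",
}


def cdhit_2d_commands(faa_by_label, identity=0.9, word_size=5,
                      memory_mb=16000, threads=4):
    """Build one cd-hit-2d command string per ordered pair of assemblies."""
    commands = []
    for first, second in permutations(faa_by_label, 2):
        out = f"{first}_{second}_novel.fasta"
        commands.append(
            f"cd-hit-2d -i {faa_by_label[first]} -i2 {faa_by_label[second]} "
            f"-o {out} -c {identity} -n {word_size} -d 0 "
            f"-M {memory_mb} -T {threads}"
        )
    return commands


if __name__ == "__main__":
    for command in cdhit_2d_commands(FAA_BY_LABEL):
        print(command)
```

With n assemblies this grows as n × (n − 1) runs, which is part of why the pairwise approach became unwieldy.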
+
+## So, I decided to pursue [mmseqs2](https://github.com/soedinglab/MMseqs2) instead
+*  The output of this is now incorporated into the [updated_SNAKEFILE](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/2019_GDB/updated_SNAKEFILE) that will be used for the analyses going forward. 
+*  See also [MMSEQ_RULES](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/2019_GDB/rules/MMSEQ_RULES)
\ No newline at end of file