... | ... | @@ -64,7 +64,10 @@ plot_len1.pl test.clstr \ |
|
|
|
|
|
# So, I decided to pursue [mmseqs2](https://github.com/soedinglab/MMseqs2) instead
|
|
|
* The output of this is now incorporated into the [updated_SNAKEFILE](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/2019_GDB/updated_SNAKEFILE) that will be used for the analyses going forward.
|
|
|
* see also [MMSEQ_RULES](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/2019_GDB/rules/MMSEQ_RULES)
|
|
|
* see also the following:
|
|
|
1. [MMSEQ_RULES](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/2019_GDB/rules/MMSEQ_RULES)
|
|
|
2. [prepare_plot_files.sh](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/checkpoint_snakefile/scripts/prepare_plot_files.sh)
|
|
|
3. [mmseq_plots.R](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/checkpoint_snakefile/scripts/mmseq_plots.R)
|
|
|
|
|
|
## And this is where the fun begins..
|
|
|
- Like any good research story - one question always leads to another, and one answer opens up the rabbit hole of unending possibilities
|
... | ... | @@ -109,6 +112,160 @@ snakemake -p --use-conda COVERAGE_OF_REFERENCES |
|
|
less default.log
|
|
|
```
|
|
|
|
|
|
## Chapter III - What about 'non-methylation-aware' basecalling, a.k.a. "non-mod" basecalling?
|
|
|
- The question arose as to what effect this might have
|
|
|
- What better way to answer this than to test it?!
|
|
|
```
|
|
|
#########################
|
|
|
### NO_MOD_basecalled ###
|
|
|
#########################
|
|
|
- Running the basecalled data that was generated without any "modifications" or "methylation-awareness"
|
|
|
- To do this, we first have to move the "results" folder to a new name
|
|
|
- then create a new "results" and keep the "basecalled_NO_MOD", but rename this folder to "basecalled"
|
|
|
- Can be done via script as follows:
|
|
|
|
|
|
cd /scratch/users/sbusi/ONT/cedric_ont_basecalling/
|
|
|
[./move_results_folder.sh](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/checkpoint_snakefile/2019_GDB/scripts/move_results_folder.sh)
|
|
|
|
|
|
# removing snakemake 'touched' files to re-run freshly the "non-mod-basecalled" data
|
|
|
rm annotate.done
|
|
|
rm assemble_and_coverage.done
|
|
|
rm basecall_merge_qc.done
|
|
|
rm coverage_of_references.done
|
|
|
|
|
|
# due to QosGrpCPUlimit, replaced 'batch' with 'bigmem' in "cluster.json"
|
|
|
# Running the 'Snakefile' using the wrappers
|
|
|
./src/snakemake_run_use_conda.sh BASECALL_MERGE_QC
|
|
|
./src/snakemake_run_use_conda.sh COVERAGE_OF_REFERENCES
|
|
|
./src/snakemake_run_use_conda.sh ASSEMBLE_AND_COVERAGE
|
|
|
./src/snakemake_run_use_conda.sh TEST
|
|
|
./src/snakemake_run_use_conda.sh TEST_PRODIGAL
|
|
|
./src/snakemake_run_use_conda.sh TEST_DIAMOND
|
|
|
./src/snakemake_run_use_conda.sh POLISH_AND_COVERAGE
|
|
|
./src/snakemake_run_use_conda.sh DB_MMSEQ2
|
|
|
./src/snakemake_run_use_conda.sh COMPARE_MMSEQ2
|
|
|
./src/snakemake_run_use_conda.sh CONVERT_MMSEQ2
|
|
|
./src/snakemake_run_use_conda.sh PREPARE_FILES
|
|
|
./src/snakemake_run_use_conda.sh PLOT_MMSEQ2
|
|
|
```
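The wrapper calls above run strictly one after another, so they can be expressed as a loop that stops at the first failing target. This is a minimal sketch, assuming that `./src/snakemake_run_use_conda.sh` takes a single target name and exits with a non-zero status on failure (neither is confirmed here). `run_targets` and the `RUNNER` variable are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run workflow targets in order and stop at the first
# failure. RUNNER defaults to the wrapper used above; that the wrapper exits
# non-zero when a target fails is an assumption.
RUNNER="${RUNNER:-./src/snakemake_run_use_conda.sh}"

run_targets() {
    local target
    for target in "$@"; do
        echo ">>> $target"
        if ! "$RUNNER" "$target"; then
            echo "target $target failed, stopping" >&2
            return 1
        fi
    done
}

# Usage (commented out so that sourcing this file has no side effects):
# run_targets BASECALL_MERGE_QC COVERAGE_OF_REFERENCES ASSEMBLE_AND_COVERAGE
```

Stopping at the first failure avoids launching downstream targets on incomplete inputs; the remaining targets can be passed again once the failing one is fixed.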
|
|
|
|
|
|
## Chapter IV - Metagenome-assembled genomes (MAGs)
|
|
|
- When you're having so much fun, why give up so soon?!
|
|
|
- We had assemblies, gene calls, coverages, and comparisons between gene calls
|
|
|
- So, we decided to go the full monty, a.k.a., "can we bin MAGs and will there be differences (if any)?"
|
|
|
- Note: the differences were expected, but we are scientists - we need data to validate!!
|
|
|
```
|
|
|
###################
|
|
|
##### Binning #####
|
|
|
###################
|
|
|
- We decided to bin using [MaxBin2](https://sourceforge.net/projects/maxbin2/) and [MetaBAT2](https://github.com/songweizhi/Katana_cmds/wiki/MetaBAT2)
|
|
|
- Next, [DASTool](https://github.com/cmks/DAS_Tool) was used to identify 'high-quality, non-redundant' bins
|
|
|
- However, before we get to the fun parts, we first had to map the short and long reads to the "hybrid" assembly generated with [metaSPAdes](https://github.com/ablab/spades)
|
|
|
- The rules for [MAPPING](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/checkpoint_snakefile/2019_GDB/rules/MAPPING_RULES) and [BINNING](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/checkpoint_snakefile/2019_GDB/rules/BINNING_RULES) provide our strategy to tackle this
|
|
|
- Note: we decided to test both [bwa-mem2](https://github.com/bwa-mem2/bwa-mem2) and [minimap2](https://github.com/lh3/minimap2) as mappers to see how they perform and affect our analyses
|
|
|
|
|
|
####################
|
|
|
##### Taxonomy #####
|
|
|
####################
|
|
|
- Now that we had MAGs, the logical thing to do was to assess their completeness and contamination, among other metrics
|
|
|
- [CheckM](https://github.com/Ecogenomics/CheckM) was used to estimate completeness and contamination
|
|
|
- Then we assigned taxonomy using [GTDB-Tk](https://github.com/Ecogenomics/GTDBTk)
|
|
|
- See the rule here: [TAXONOMY_RULES](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/checkpoint_snakefile/2019_GDB/rules/TAXONOMY_RULES)
|
|
|
```
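Once completeness and contamination estimates exist, bins are often filtered with simple thresholds. This is a minimal sketch, assuming a tab-separated table with one header row and the columns bin name, completeness (%) and contamination (%). The table layout, the function name `filter_bins` and the default cut-offs (more than 90% complete, less than 5% contaminated, a commonly used "high-quality" convention) are illustrative assumptions, not the settings used in TAXONOMY_RULES.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list bins that pass completeness/contamination cut-offs.
# Expected input (an assumed layout, not CheckM's native output):
#   bin<TAB>completeness<TAB>contamination, with one header line.
filter_bins() {
    local table="$1" min_comp="${2:-90}" max_cont="${3:-5}"
    # "+ 0" forces numeric comparison; NR > 1 skips the header line
    awk -F'\t' -v c="$min_comp" -v x="$max_cont" \
        'NR > 1 && $2 + 0 > c + 0 && $3 + 0 < x + 0 { print $1 }' "$table"
}

# Usage (commented out; "bin_stats.tsv" is a placeholder file name):
# filter_bins bin_stats.tsv 90 5
```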
|
|
|
|
|
|
## Chapter V - Let's throw in some METATRANSCRIPTOMICS
|
|
|
- Since we had corresponding metaT data for the 2018-GDB sample, we decided to map the metaT reads to the different assemblies
|
|
|
- See [METAT_RULES](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/checkpoint_snakefile/2019_GDB/rules/METAT_RULES)
|
|
|
```
|
|
|
#########
|
|
|
# metaT #
|
|
|
#########
|
|
|
# Running the metaT processing and mapping as follows:
|
|
|
./src/snakemake_run_use_conda.sh METAT
|
|
|
|
|
|
# calculating average coverage of the assemblies via metaT was done as indicated below
|
|
|
cat sr/GDB_2018_metaT_reads-x-NEB2_MG_S17-megahit_contigs.avg_cov.txt | awk '{x+=$2; next} END{print x/NR}'
|
|
|
#5.52716
|
|
|
cat sr/GDB_2018_metaT_reads-x-lr_barcode07_sr_NEB2_MG_S17-metaspades_hybrid_contigs.avg_cov.txt | awk '{x+=$2; next} END{print x/NR}'
|
|
|
#3.02177
|
|
|
cat lr/GDB_2018_metaT_reads-x-barcode07-flye_contigs.avg_cov.txt | awk '{x+=$2; next} END{print x/NR}'
|
|
|
#30.7486
|
|
|
```
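The awk one-liners above average the second column of each coverage table. This is a minimal sketch of the same calculation as a reusable function, assuming (as the one-liners do) that every line is a data row with the coverage value in column 2. `mean_cov` is a hypothetical name, and the guard against empty files is an addition.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mean of column 2 over all lines of a coverage table.
# Prints "NA" for an empty file instead of dividing by zero.
mean_cov() {
    awk '{ sum += $2 } END { if (NR > 0) print sum / NR; else print "NA" }' "$1"
}

# Usage (commented out; the path is one of the files used above):
# mean_cov lr/GDB_2018_metaT_reads-x-barcode07-flye_contigs.avg_cov.txt
```

Passing the file name to awk directly also removes the need for `cat` and a pipe.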
|
|
|
|
|
|
## Chapter VI - the "big kahuna"
|
|
|
- This part of the story is very much "serendipitous"
|
|
|
- Having a 1200-line Snakefile can be amazing, unless for some unknown reason the entire run is triggered even when the output files are already present
|
|
|
- Because these unnecessary re-runs cost time, we decided to break the Snakefile down into "modular" sections
|
|
|
- All the rules were split into [rules](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/tree/checkpoint_snakefile/2019_GDB%2Frules) and [workflows](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/tree/checkpoint_snakefile/2019_GDB%2Fworkflows)
|
|
|
```
|
|
|
####################
|
|
|
# MODULAR WORKFLOW #
|
|
|
####################
|
|
|
cd /scratch/users/sbusi/ONT/cedric_ont_basecalling
|
|
|
snakemake -np -s [updated_SNAKEFILE](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/checkpoint_snakefile/updated_SNAKEFILE)
|
|
|
```
|
|
|
|
|
|
## Chapter IX - The miscellaneous or nearly-forgotten side projects
|
|
|
- Due to the multifaceted nature of the beast, i.e. the modular workflow, we tested several aspects separately
|
|
|
- For example: since we used two mappers (bwa-mem2 and minimap2) for the reads, we binned each sample separately for each mapper
|
|
|
- Additionally, we needed to merge bam files for the "hybrid"-binning, so we compared bins using [sourmash](https://github.com/dib-lab/sourmash) and compared assemblies using [quast](https://github.com/ablab/quast)
|
|
|
|
|
|
##### SOURMASH #####
|
|
|
- To check if the bins across different metaspades_hybrid 'bam' files are similar
|
|
|
- to decide whether to go with sr_bam or merged_bam files
|
|
|
- therefore, comparing ONLY the {mapper}_{reads}_metaspades_hybrid vs {mapper}_{reads}_metaspades_hybrid bins
|
|
|
```
|
|
|
cd /scratch/users/sbusi/ONT/cedric_ont_basecalling/non_mod_basecalled_results
|
|
|
cd Binning
|
|
|
mkdir -p sourmash/bins
|
|
|
cd /scratch/users/sbusi/ONT/cedric_ont_basecalling/non_mod_basecalled_results/Binning/sourmash/bins
|
|
|
|
|
|
for file in `cat list`
|
|
|
do
|
|
|
for i in `ls /scratch/users/sbusi/ONT/cedric_ont_basecalling/non_mod_basecalled_results/Binning/"$file"/dastool_output/"$file"_DASTool_bins/*.fa`
|
|
|
do
|
|
|
j=`echo $i | cut -d'.' -f2,3`
|
|
|
ln -s "$i" "$file"_"$j"
|
|
|
done
|
|
|
done
|
|
|
|
|
|
# renaming files that do not have the '.fa' extension but are in reality fasta files
|
|
|
for f in *_sub; do
|
|
|
mv -- "$f" "${f%_sub}_sub.fa"
|
|
|
done
|
|
|
|
|
|
# Running sourmash
|
|
|
cd /scratch/users/sbusi/ONT/cedric_ont_basecalling/non_mod_basecalled_results/Binning/sourmash
|
|
|
si # interactive
|
|
|
conda activate sourmash
|
|
|
sourmash compute -k 31 bins/*.fa
|
|
|
sourmash compare *.sig -o ont_hybrid_bins_sourmash_comp
|
|
|
# saving labels to: ont_hybrid_bins_sourmash_comp.labels.txt
|
|
|
# saving distance matrix to: ont_hybrid_bins_sourmash_comp
|
|
|
# sourmash compare *.sig --csv ont_hybrid_bins_sourmash_comp.csv
|
|
|
sourmash plot ont_hybrid_bins_sourmash_comp --labels
|
|
|
sourmash plot ont_hybrid_bins_sourmash_comp --labels --pdf
|
|
|
# downloaded the .csv file to the "Desktop" and used an additional Rscript for plotting
|
|
|
```
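The nested loop above collects the DAS Tool bins of every sample into one folder via symlinks. This is a minimal sketch of the same idea using shell globs instead of parsing `ls` output, assuming the directory layout shown above (`<root>/<sample>/dastool_output/<sample>_DASTool_bins/*.fa`). `link_bins` is a hypothetical name, and the link names here are simply `<sample>_<bin file name>`, which differs from the `cut`-based naming used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: symlink every DAS Tool bin of the given samples into
# one destination folder. "root" should be an absolute path so that the
# links resolve from inside the destination folder.
link_bins() {
    local root="$1" dest="$2" sample bin
    shift 2
    mkdir -p "$dest"
    for sample in "$@"; do
        for bin in "$root/$sample/dastool_output/${sample}_DASTool_bins"/*.fa; do
            # an unmatched glob stays literal, so skip anything that is not a file
            [ -f "$bin" ] || continue
            ln -s "$bin" "$dest/${sample}_$(basename "$bin")"
        done
    done
}

# Usage (commented out; "list" holds one sample name per line, as above):
# link_bins /scratch/users/sbusi/ONT/cedric_ont_basecalling/non_mod_basecalled_results/Binning bins $(cat list)
```

Prefixing each link with the sample name keeps bins from different samples apart even when their file names are identical.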
|
|
|
|
|
|
##### QUAST #####
|
|
|
- Testing assembly stats using quast
|
|
|
```
|
|
|
si # interactive session on hpc-IRIS
|
|
|
conda activate quast
|
|
|
cd /scratch/users/sbusi/ONT/cedric_ont_basecalling/Binning
|
|
|
metaquast.py --max-ref-number 0 --threads 24 *fna -o non_mod_basecalled_quast_results
|
|
|
# downloaded the folder to desktop at ~/Documents/Nanopore_ONT/.
|
|
|
cd /scratch/users/sbusi/ONT/cedric_ont_basecalling/mod_basecalled_results/assembly
|
|
|
ln -sd flye/lr/merged/barcode07/assembly.fasta flye.fa
|
|
|
ln -sd megahit/NEB2_MG_S17/final.contigs.fa megahit.fa
|
|
|
ln -sd metaspades_hybrid/lr_barcode07-sr_NEB2_MG_S17/contigs.fasta metaspades.fa
|
|
|
|
|
|
metaquast.py --max-ref-number 0 --threads 24 *fa -o methylation_basecalled_quast_results
|
|
|
# downloaded the folder to desktop at ~/Documents/Nanopore_ONT/.
|
|
|
```
|
|
|
|
|
|
##### GITLAB #####
|
|
|
- copying all changed files to gitlab: https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab
|
|
|
```
|
... | ... | |