ONT_pilot_gitlab issues
https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues
2021-09-01T10:55:30+02:00

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/129
ppc
2021-09-01T10:55:30+02:00
Laurent Heirendt laurent.heirendt@uni.lu
@cylon-x ppc
cylon-x

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/128
utils: ave_gene_cov bug
2021-04-21T14:51:08+02:00
Valentina Galata valentina.galata@uni.lu
Fix function `ave_gene_cov` in `utils.py`: https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/workflow/scripts/utils.py#L284
1. Why use `if contig_id != ""`?: https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/workflow/scripts/utils.py#L299
2. No average coverage is computed for genes/proteins from the last contig because the `if` statement is not reached after the end of the file.
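Point 2 is a common pitfall when aggregating sorted records in a single pass: the group that is still open when the input ends has to be flushed after the loop. A minimal sketch of the pattern, assuming hypothetical `(contig_id, coverage)` records that are already sorted by contig. This is not the actual `ave_gene_cov` code, and the function name `mean_cov_per_contig` is made up for illustration:

```python
from typing import Dict, Iterable, Optional, Tuple


def mean_cov_per_contig(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average coverage per contig for records sorted by contig ID (hypothetical input)."""
    means: Dict[str, float] = {}
    current_id: Optional[str] = None
    total = 0.0
    count = 0
    for contig_id, cov in records:
        if contig_id != current_id:
            # A new contig starts: store the result of the previous one, if any.
            if current_id is not None:
                means[current_id] = total / count
            current_id, total, count = contig_id, 0.0, 0
        total += cov
        count += 1
    # Flush the last contig: the check inside the loop is never reached for it.
    if current_id is not None:
        means[current_id] = total / count
    return means


print(mean_cov_per_contig([("c1", 1.0), ("c1", 3.0), ("c2", 5.0)]))
```

Using `None` as the "no contig seen yet" marker also avoids comparing against an empty string, which is what point 1 questions.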
* [x] fix the code
* [x] re-create the per-gene coverage result files
* [x] compare to prev. result files
* [x] re-run the report workflow
* [x] re-run the figure workflow
* [x] update notes
* [x] add relevant per-gene cov. files to result archive (also update its README)
* [x] re-create the result archive (update on figshare)
* [x] re-create the code archive (update on figshare)
* [x] update figures (if needed) in the manuscript
Manuscript - v2
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/127
Manuscript submission (preprint, Genome Biology)
2021-04-27T13:52:09+02:00
Valentina Galata valentina.galata@uni.lu
### Finalizing submission:
* [x] manuscript: feedback from
  * [x] MC
  * [x] RH
  * [x] BK
  * [x] PW
* [x] code: update README: add missing details
* [x] code: update README: how to skip the preprocessing step for GDB
* [x] code: code to create results archive w/ relevant files
* [x] code: add a tag
* [x] code: metaP data processing and credentials
* [x] raw data: submit GDB metaG data
* [x] raw data: submit GDB metaT data
* [x] raw data: submit GDB metaP data
* [x] figshare: code: create/submit code archive
* [x] figshare: results: create/submit results archive
* [x] manuscript: add data/code links
* [x] manuscript: v2 feedback from PW
* [x] manuscript: finalize (fig. res. etc.)
* [x] cover letter
### Preprint
* [x] submit to [bioRxiv](https://www.biorxiv.org/)
### Submission to Genome Biology
* [x] transfer preprint to Genome Biology
Manuscript - v2
Susheel Busi
Valentina Galata valentina.galata@uni.lu
2021-04-30

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/126
Analysis: metaT cov w.r.t. gene length
2021-03-25T08:27:18+01:00
Valentina Galata valentina.galata@uni.lu
MetaT coverage of the CDS of proteins should also be computed w.r.t. sequence length, i.e. how much of the transcript is covered.
Stretch goal

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/125
Figure: typo in barrnap kingdom names
2021-03-19T14:17:26+01:00
Valentina Galata valentina.galata@uni.lu
Replace "Archea" by "Archaea" in [this script](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/workflow_report/scripts/const.R#L100)
* [x] fix typo
* [x] recreate figures
Manuscript - v2
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/124
Bug: bbmap: quality encoding offset for LR (GDB, preprocessing)
2021-03-24T07:22:12+01:00
Valentina Galata valentina.galata@uni.lu
Using `bbmap`'s parameters `ignorebadquality qin=64 qout=64` when processing long reads appears to be wrong,
i.e. they need to be changed in [this line](https://git-r3lab.uni.lu/susheel.busi/ont_pilot_gitlab/-/blob/master/workflow/rules/preprocessing.smk#L170) and the results need to be updated.
Since this rule is only used for GDB (to remove host contamination), only the LR/HY results for GDB need to be updated.
Proof:
- the quality strings differ between the input and output FASTQ files: some characters were replaced by `@`
- `testformat2.sh` from the `bbmap` tools reports a quality offset of 33 when run with `qin=33`, but fails or prints warnings if the offset is not set or is set to 64
Checking file format:
```bash
testformat2.sh trim=f sketch=f merge=f /scratch/users/vgalata/gdb/basecalling/lr.fastq.gz
# Warning! Changed from ASCII-33 to ASCII-64 on input 8: 56 -> 25
# Up to 641 prior reads may have been generated with incorrect qualities.
# If this is a problem you may wish to re-run with the flag 'qin=33' or 'qin=64'.
#
# The ASCII quality encoding offset (64) is not set correctly, or the reads are corrupt; quality value below -5.
# Please re-run with the flag 'qin=33', 'ignorebadquality', or '-da'.
# Problematic read number 641:
# [...]
# Offset=64
# java.lang.Exception: Aborting.
# [...]
```
```bash
testformat2.sh qin=33 trim=f sketch=f merge=f /scratch/users/vgalata/gdb/basecalling/lr.fastq.gz
# Format fastq
# Compression gz
# Interleaved false
# [...]
# QualOffset 33
# [...]
```
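The offset can also be sanity-checked without `bbmap`. The sketch below is a heuristic and not what `testformat2.sh` does internally: it relies on the fact that quality characters with ASCII codes below 59 cannot occur in offset-64 encodings. The function names and the example quality string are made up for illustration. The second function illustrates one possible explanation for the `@` characters, under the assumption that negative qualities are clamped to 0 when offset-33 data is read as offset 64 and written back with offset 64:

```python
from typing import Iterable, Optional


def guess_quality_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Guess the ASCII quality offset (33 or 64) from FASTQ quality strings.

    Heuristic: codes below 59 only occur with offset 33; if every code is
    at least 64, offset 64 is assumed. Returns None if undecidable.
    """
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        return None
    lowest = min(codes)
    if lowest < 59:
        return 33
    if lowest >= 64:
        return 64
    return None


def reencode_as_offset64(qual: str) -> str:
    """Read a quality string as offset 64, clamp negatives to 0, write as offset 64.

    Assumption for illustration only: this mimics reading with a wrong offset
    while ignoring bad qualities; it is not bbmap's implementation.
    """
    return "".join(chr(max(ord(ch) - 64, 0) + 64) for ch in qual)


offset33_example = "#$%&'()*+5:?DI"  # made-up offset-33 quality string
print(guess_quality_offset([offset33_example]))
print(reencode_as_offset64(offset33_example))
```

Under this assumption, every character below `@` (ASCII 64) in the input collapses to `@` in the output, which would match the changed quality strings described above.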
**TODOs**
* [x] change parameters in the rule
* [x] remove preprocessed LR files (link `lr.proc.fastq.gz` and file `lr.nohost.fastq.gz`) for GDB
* [x] re-run "preprocessing" for LR for GDB
* [x] re-run "assembly" for GDB
* [x] re-run "mapping" for GDB
* [x] re-run "annotation" for GDB
* [x] re-run "analysis" for GDB
* [x] re-run "taxonomy" for GDB
* [x] re-create reports
* [x] re-create GDB extra-analysis: `rgi`
* [x] re-create GDB extra analysis: `barrnap`, metaT
* [x] re-create GDB extra-analysis: metaT ave. cov. of unique `mmseqs2` proteins
* [x] re-create metaP results
* [x] re-create paper figures
Manuscript - v2
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/123
Figure: GDB, nudged RGI hit to ARO:30004454 in Flye
2021-03-02T15:05:25+01:00
Valentina Galata valentina.galata@uni.lu
Figure to show the one nudged hit to `ARO:30004454` in `Flye` in sample `GDB`.
See also notes `notes/gdb_rgi_aro3004454_flye.md`.
- metaT coverage
- sequences of ARO, Prodigal's protein, "new" protein
Manuscript - v2
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/122
Manuscript: discovery of novel taxa in GDB
2021-03-25T08:27:40+01:00
Valentina Galata valentina.galata@uni.lu
One of the preprocessing steps was to remove rRNA reads from the metaT data using `bbduk`.
However, this step did not remove reads showing a certain level of dissimilarity to the references used.
Some rRNA genes have rather high metaT coverage, which can be used to find "novel taxa", i.e. those not covered by the rRNA references used.
Stretch goal

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/121
Data: mean metaT cov for "unique" genes/proteins (mmseqs2, GDB)
2021-02-26T12:31:17+01:00
Valentina Galata valentina.galata@uni.lu
Collect the data: ave. metaT coverage for the genes/proteins identified as "unique" using `mmseqs2`.
> unique proteins = proteins from a cluster which contains only proteins from one assembly
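
The definition above can be sketched in a few lines. This assumes a hypothetical mapping from cluster ID to `(assembly, protein_id)` members; the real `mmseqs2` cluster output would first need to be parsed into this shape, and all names and IDs here are made up for illustration:

```python
from typing import Dict, List, Tuple

Member = Tuple[str, str]  # (assembly name, protein ID)


def unique_proteins(clusters: Dict[str, List[Member]]) -> Dict[str, List[str]]:
    """Return, per assembly, the proteins from clusters containing only that assembly."""
    unique: Dict[str, List[str]] = {}
    for members in clusters.values():
        assemblies = {assembly for assembly, _ in members}
        if len(assemblies) == 1:
            assembly = next(iter(assemblies))
            unique.setdefault(assembly, []).extend(pid for _, pid in members)
    return unique


example = {
    "c1": [("flye", "p1"), ("raven", "p2")],  # shared between two assemblies
    "c2": [("flye", "p3"), ("flye", "p4")],   # only flye proteins
    "c3": [("raven", "p5")],                  # only raven proteins
}
print(unique_proteins(example))
```

The resulting protein IDs could then be used to look up the corresponding per-gene metaT coverage values.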
Manuscript - v2
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/120
Investigation: did assembly polishing work?
2021-01-25T11:18:32+01:00
Valentina Galata valentina.galata@uni.lu
Did assembly polishing work?
Current assembly polishing strategy:
- LR: racon/LR, racon/SR (x4), medaka/LR
- Hy: racon/SR (x5)
AMR analysis
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/119
Analysis/Figure: protein clustering
2021-01-27T16:14:43+01:00
Valentina Galata valentina.galata@uni.lu
Cluster **all** proteins, i.e. from all assemblies, together.
Generate summary files and plots showing number of "shared" and unique proteins.
Can potentially replace pairwise assembly comparisons with `cdhit` and `diamond`.
AMR analysis
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/118
Analysis/Figure: cov of RGI hits
2021-01-22T14:19:01+01:00
Valentina Galata valentina.galata@uni.lu
Collect required data and plot coverage of genes/proteins with RGI hits.
AMR analysis
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/117
Binning - methylation vs non-methylation
2021-01-14T08:39:41+01:00
Susheel Busi
- Hypothesis: `methylation status does not affect binning, especially for bacteria`
- To be tested in the future.
- Previous `binning` folders from methylation-aware and non-methylation-aware basecalling are stored in the `work` folder for reference and use
Stretch goal
Susheel Busi

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/116
Error: assembly alignment error in mummer/dnadiff (rumen, LR assemblies)
2021-01-19T09:07:29+01:00
Valentina Galata valentina.galata@uni.lu
`dnadiff` aborts with an error for `flye` and `raven`
```
Building alignments
ERROR: failed to merge alignments at position 30798
Please file a bug report
ERROR: Failed to run nucmer, aborting.
```
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/115
Error: segmentation fault for hmmsearch (rumen/aquifer, LR assemblies)
2021-01-19T09:07:25+01:00
Valentina Galata valentina.galata@uni.lu
`hmmsearch` using KEGG HMMs crashes with "Segmentation fault" for `rumen` (`flye`, `raven`) and `aquifer` (`flye`).
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/114
IMP3 - GDB metaG/metaT
2020-12-10T11:21:03+01:00
Susheel Busi
TODO
- [x] Run `IMP3` on GDB metaG and metaT data
Results archived on isilon
- Path: `/mnt/isilon/projects/ecosystem_biology/ONT_pilot/users/sbusi/gdb_imp3.tar.gz`

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/113
Prodigal benchmark
2021-01-19T09:07:17+01:00
Valentina Galata valentina.galata@uni.lu
Compare `prodigal` predictions to a ground truth.
Use a couple of genomes from NCBI to run `prodigal` on a FASTA of concatenated genomes and compare the predictions to the genome annotations (GFF files).
Susheel Busi

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/112
Report/Figure: saving upsetr plots
2021-01-12T10:58:41+01:00
Valentina Galata valentina.galata@uni.lu
Currently, the `UpSetR` plots are not saved to PDF (only in the rendered HTML report).
Use `print(...)` when creating plots to save them to PDF:
```R
pdf(...)
print(UpSetR::upset(...))
dev.off()
```
Also, the plot object can be saved like the other plots and plotted later using `print(...)`. At the moment, the plotting function is called in the R Markdown files and the rendering script.
The bright future
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/111
Zymo: extra analysis
2021-01-19T09:07:10+01:00
Valentina Galata valentina.galata@uni.lu
To investigate the discrepancy in gene content in different assemblies, focus on `zymo`, where the reference genomes are available.
See comments below for ideas on what should/can be done.
The bright future
Valentina Galata valentina.galata@uni.lu

https://git-r3lab.uni.lu/ESB/ont_pilot_gitlab/-/issues/110
Assembly: canu's genomeSize parameter test
2020-12-08T13:50:00+01:00
Valentina Galata valentina.galata@uni.lu
Test `canu` with different values for `genomeSize` to see if it affects the results.
The bright future
Susheel Busi