Commit 3ace90aa authored by Valentina Galata's avatar Valentina Galata
Browse files

notes: updated rrna notes in gdb_barrnap_metat.md (issue #124)

parent 1d6efd3f
......@@ -27,24 +27,24 @@ Script to compute average coverage from per-base coevarge file and GFF file gene
For each assembly, get the metaT coverage and GFF files and compute average metaT coverage:
```bash
cat "tools.txt" | while read rtype tool; do
cov="/scratch/users/vgalata/gdb/results/mapping/metat/${rtype}/${tool}/ASSEMBLY.POLISHED.sr.cov.perbase"
gff="${rtype}.${tool}.barrnap.gff"
cat $(find "../../annotation/barrnap/${rtype}/${tool}/" -type f -name "*.gff") | grep -v "^#" > ${gff}
echo "${gff}, ${cov}"
python ~/projects/ONT_pilot/notes/barrnap_rrna_avecov.py ${gff} ${cov} "${rtype}.${tool}.barrnap.metat.tsv"
cov="/scratch/users/vgalata/gdb/results/mapping/metat/${rtype}/${tool}/ASSEMBLY.POLISHED.sr.cov.perbase"
gff="${rtype}.${tool}.barrnap.gff"
cat $(find "/scratch/users/vgalata/gdb/results/annotation/barrnap/${rtype}/${tool}/" -type f -name "*.gff") | grep -v "^#" > ${gff}
echo "${gff}, ${cov}"
python ~/projects/ONT_pilot/notes/barrnap_rrna_avecov.py ${gff} ${cov} "${rtype}.${tool}.barrnap.metat.tsv"
done
```
Printing number of number of rRNA hits (+ header) and those with ave. metaT cov. above 50:
```bash
find . -type f -name "*.metat.tsv" | sort | while read f; do
echo "$(wc -l ${f}), $(awk -F'\t' '$8 > 50 {print $0}' ${f} | wc -l)"
echo "$(wc -l ${f}), $(awk -F'\t' '$8 > 50 {print $0}' ${f} | wc -l)"
done
# 712 ./hy.metaspadeshybrid.barrnap.metat.tsv, 74
# 743 ./hy.operamsmegahit.barrnap.metat.tsv, 88
# 633 ./hy.operamsmetaspades.barrnap.metat.tsv, 75
# 952 ./lr.flye.barrnap.metat.tsv, 117
# 1399 ./lr.raven.barrnap.metat.tsv, 120
# 741 ./hy.operamsmegahit.barrnap.metat.tsv, 90
# 639 ./hy.operamsmetaspades.barrnap.metat.tsv, 80
# 908 ./lr.flye.barrnap.metat.tsv, 118
# 1386 ./lr.raven.barrnap.metat.tsv, 122
# 540 ./sr.megahit.barrnap.metat.tsv, 74
# 413 ./sr.metaspades.barrnap.metat.tsv, 64
```
......@@ -54,105 +54,11 @@ Note that
- i.e. those appearing in multiple GFF files with identical information (start/end, strand, attributes etc.)
- number of rows includes also the header
# Bacterial 23S rRNA gene on contig contig_4826:1.0 (LR, Flye assembly)
**NOTE: Looked promising but seems to be not very "novel" after all: see below.**
As an example of a rRNA gene with high metaT coverage.
Average metaT coverage for rRNA genes found on contig `contig_4826:1.0` in `Flye` assembly:
```bash
# metaT coverage
grep -F "contig_4826:1.0-36062.0" lr.flye.barrnap.metat.tsv
# contig_4826:1.0-36062.0 31057 31160 - Name=5S_rRNA;product=5S ribosomal RNA 31057_31160 contig_4826:1.0-36062.0__31057_31160 0.0
# contig_4826:1.0-36062.0 31374 34249 - Name=23S_rRNA;product=23S ribosomal RNA 31374_34249 contig_4826:1.0-36062.0__31374_34249 8817.937413073714
# contig_4826:1.0-36062.0 31379 34238 - Name=23S_rRNA;product=23S ribosomal RNA 31379_34238 contig_4826:1.0-36062.0__31379_34238 8867.245454545455
# contig_4826:1.0-36062.0 31468 32477 - Name=28S_rRNA;product=28S ribosomal RNA (partial);note=aligned only 34 percent of the 28S ribosomal RNA 31468_32477 contig_4826:1.0-36062.0__31468_32477 234.7792079207921
# contig_4826:1.0-36062.0 31580 32477 - Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 56 percent of the 16S ribosomal RNA 31580_32477 contig_4826:1.0-36062.0__31580_32477 264.0612472160356
# contig_4826:1.0-36062.0 32498 33547 - Name=28S_rRNA;product=28S ribosomal RNA (partial);note=aligned only 36 percent of the 28S ribosomal RNA 32498_33547 contig_4826:1.0-36062.0__32498_33547 23926.06
# contig_4826:1.0-36062.0 33136 33794 - Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 41 percent of the 16S ribosomal RNA 33136_33794 contig_4826:1.0-36062.0__33136_33794 0.8209408194233687
# contig_4826:1.0-36062.0 34563 36108 - Name=16S_rRNA;product=16S ribosomal RNA 34563_36108 contig_4826:1.0-36062.0__34563_36108 0.0
# contig_4826:1.0-36062.0 34564 36100 - Name=16S_rRNA;product=16S ribosomal RNA 34564_36100 contig_4826:1.0-36062.0__34564_36100 0.0
# contig_4826:1.0-36062.0 34570 35425 - Name=18S_rRNA;product=18S ribosomal RNA (partial);note=aligned only 45 percent of the 18S ribosomal RNA 34570_35425 contig_4826:1.0-36062.0__34570_35425 0.0
# contig_4826:1.0-36062.0 34571 35605 - Name=12S_rRNA;product=12S ribosomal RNA 34571_35605 contig_4826:1.0-36062.0__34571_35605 0.0
```
- some hits are only partial
- some hits have high ave. metaT coverage
If ignoring the partial hits there are two similar 23S rRNA gene hits with high ave. metaT cov.:
- contig_4826:1.0-36062.0, 31374-34249 <-- hit to bacterial database
- contig_4826:1.0-36062.0, 31379-34238 <-- hit to archeal database
```bash
grep -F "contig_4826:1.0-36062.0" /scratch/users/vgalata/gdb/results/annotation/barrnap/lr/flye/bac.gff
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 31057 31160 8.9e-10 - . Name=5S_rRNA;product=5S ribosomal RNA
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 31374 34249 0 - . Name=23S_rRNA;product=23S ribosomal RNA
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 34563 36108 2.5e-290 - . Name=16S_rRNA;product=16S ribosomal RNA
grep -F "contig_4826:1.0-36062.0" /scratch/users/vgalata/gdb/results/annotation/barrnap/lr/flye/arc.gff
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 31057 31160 8.9e-10 - . Name=5S_rRNA;product=5S ribosomal RNA
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 31379 34238 0 - . Name=23S_rRNA;product=23S ribosomal RNA
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 34564 36100 9.3e-163 - . Name=16S_rRNA;product=16S ribosomal RNA
```
Focusing on the bacterial hit, i.e. `contig_4826:1.0-36062.0:31373-34249(-)`:
Nucl. sequence from `/scratch/users/vgalata/gdb/results/annotation/barrnap/lr/flye/bac.fa`:
```fasta
>23S_rRNA::contig_4826:1.0-36062.0:31373-34249(-)
ACGATTAAGGGCGCGCGGTGGATGCCTAGGCCGGAGGCGAAGAAGGACGCGATAAGCCGCGATAGTCCACGGCGAGGTGCAAAACCCTTGACCCGTGGGTTTCCGAATGGGGGCAACCCGGCCGTCTAGATGACGGTCATGCATTATCATGCAGGCGAACGCGGGGAACTGAAACATCTCATTACCCGCAGGAGAAGAAAACAAAATGTGATTCCCCCAGTAGTGGCGAGCGAACGGGGAAGAGCCCAAACCGCTGCCGTTTCGGCGACGGCGGGGTCATGGGACCGCGACATTCGACGGAAGACGAAGTGCAATTACCTGGAAAGGTAAGTCGTGGAGGGTGACAGCCCCGTGCACGACACCTCTTCTTAGATAGCGGAACTCGAGTAGGGCGGGACACGAGAAATCCTGCCTGAATTCGCCAGGACCTCCTGGCAAGGCTAAATACTCCTGAGAGACCGATAGTGAACCAGTACCGCGAGGGAAAGGTGAAAAGTACCCCGAGCAGGGGGTGAAATAGACCCTGAACCCGCGCGCCTACAAGCGGTCGGAGCCACGCGATGTGGTGACGGCGTGCCTTTGCATAATGAGCCTACGAGTTAGTCGTCACCAGCGAGGTTAAGGGCTTCAGGCCCGTCAGCCGAAGTGAAAGCGAGCCTTAACAGGGCGTAAAGTTGGTGGCGCTAGACGCTAAACCTTGTGGATCTACCCACGAGCAGGTTGAAGTCGCGGTAACACGTGATGGAGGACCGAACCGGTAAACGTTGAAAAGTTTTCGGATGACTCGCGGGTAGGGGTGAAAGGCCAATCAAACTGGGAAATAGCTCGTACCTCCGAAATGCATTTCGGTGCAGCCTGTGATGTTCTGTTCCAGAGGTAGAGCTACTGGTTGGACGCGAGGGCTTCACCGCCTATCAAATCCGGATAAACCCGAATGCTGGAACACGAAATCATGGAGTGAGCCCGCGGGTGCTAAGGTCCGCGGACGAGAGGGAAGAACCCGGACCACCGGCTAAGGTCCCGGATGGACAGTTAAGTTGATCAAACGAGGTGGGATCGCGGAGACAGCTAGGATGTTGGCTTGGAAGCAGCCATTCATTTAAAGAGTGCGTAATGTTACTCACTAGTCGAGCGATCCCGCTGTGGGATAATACACGGGCATAAAACTGTCAACCGAAGCCGTGGGGTTAGCTTTATTGCTAACCGGTAGGAGAGCATTCCTGCCAGCTTTGAAGGTCACCCGCGAGAGTGACTGGAGCGGCATGAAAAGCAAATGTAGGCATAAGTAACGATAATGGGGGCGAGAACCCCCCACGCCGCAAGACCAAGGCTTCCCCGGCAATGTCAATCAGCCGGGGGTCAGCCGGCCTCTAAGGCTCACCCGAATGGGGATGCCGACGAGAAACGGGTTAACATTCCCGTGCTTCTCGTCACCGTGACGCGGTGACGGGGTGATGAATGGACCGCGCGCTGACGGAATAGCGCGTTGAAGGCCGTACCCTTTGAGGGTGGTAGTCAAGCACGCCACCCGAGGCGACAGCCGATAGTACCTCGAGGCCTCGGCCGAGGGGATAGCGTCCAGGCAATTAACCCCCTAGAAAACCCGCTAAACTTCAAGTGACGAGGAACCGTACTGTAAACGGACACACGTGGTCGGGTAGAACATACCAAGGCGCTCTGAGTGATTCATGGTCAAGGAACTAGGCAAAATAGTCCTGTAACTTAGGGATAAAGGACGCTCTAGTGATAGAGCCGCAGAGAAATGGCCCAGGCGACTGTTTACCAAAAACACATGGCTATGCCAAATCGAAAGATGACGTATATGGCCTGACACCTGCCCGGTGCCGGAAGGTTAAGAGGAGAGCTCATCCGCGAGGAGAAGGTTTGAATTGAAGCCCCGGTAAACGGCGGCCGTAACTATAACGGTCCTAAGGTAGCGAAATTCCTTGTCGGGTAAGTTCCGACCTGCACGAATGGTGTAACGATCTGGGCACTGTCTCGACCATGAGACCCGGTGAAATTGTAGTTCCGGTGAAGATGCCGGGTACCCGCGACGGGACGGAAAGACCCCGTGAACCTTTACTGCAGCTTCGCGTTGTTCCCGGGCACGGGACGTGTAGGATAGGTCGGAGGCTACGATTCGGTTTCGCCAGGAATCGAGGAGCCATCGTTGAAATACGACCCTTCTCTTGTCTGGGATCTAACCCTGGTTATAGGGGACACCGCGTGGTGGGTAGTTTGACTGGGGGTGGTCGCCTCCAAAAAGCGTAACGGAGGCTTCCAAAGGTACCCTCGGGTTGGTTGGTAATCAACCTCAGAGTGCAATGGCACAAGGGTGCTTGACCGGGAGACCAACGAGTCGATCGGGTGCGAAAGCAGGGCATAGTGATCCGGTGATCCCGCGTGGAAGGGTCATCGCTCAAAGGATAAAAGGTACTCCGGGGATAACAGGCTGATCGCCCCCAAGAGCTCATATCGACGGGGCGGTTTGGCACCTCGATGTCGGCTCGTCACATCCTGGGGCTGGAGAAGGTCCCAAGGGTTCGGCTGTTCGCCGATTAAAGTGGCACGCGAGCCTGGGTTCAGAACGTCGTGAGACAGTTCGGTCCCTATCTGTCGTGGGCGCTGGAGATTCGAGAGGTCCTGACTCTAGTACGAGAGGACCGGGTTGGACGCGCCCGCTAGTGAACCTGTTATGACGCCAGTCGTACGGCAGGGTAGCCACGCGCCGGACGGGATAAGCGCTGAAAGCATCTAAGCGCGAAGCCTGCCTCGAGATGAGATCTCCTTACAGGGTCGTCGTAGACGACGACGTTGATAGGCCACAGGTGTAAAGCCGGTGACGGCAAAGCCGAGTGGTACTAATCACCCGAAAG
```
Saved to `lr.flye.barrnap.fasta`.
## BLASTn search in local rRNA references
To confirm that this rRNA gene is not in the reference FASTA files used to remove rRNA reads:
```bash
# conda env. containing blast
conda create -n blast && conda activate blast && conda install -c conda-forge -c bioconda blast=2.10.1
# blast database
cat /scratch/users/vgalata/ONT_pilot_DBs/sortmerna/*.fasta > sortmerna.fasta
makeblastdb -in sortmerna.fasta -title "Rfam and SILVA" -dbtype nucl
# blast search
blastn -query lr.flye.barrnap.fasta -task blastn -db sortmerna.fasta -out lr.flye.barrnap.localblast.asn -outfmt 11 -perc_identity 50
header='qseqid qstart qend qlen sseqid sstart send slen bitscore evalue pident mismatch gaps qcovs qcovhsp qcovus'
blast_formatter -archive lr.flye.barrnap.localblast.asn -out lr.flye.barrnap.localblast.tsv -outfmt "6 $header"
sed -i "1 i\\$(echo -e $header | sed 's/\s/\t/g')" lr.flye.barrnap.localblast.tsv
```
Created `sortmerna.fasta*` files were removed afterwards.
- Results:
- best identity: 84.936% to (AnpTher5, from `sortmerna`'s file `silva-bac-23s-id98.fasta`)
## Binning analysis
File: `ONTP_flye_contig_4826_1.0-36062.0_SB_Analyses.zip`
## BLASTn search online
- refseq_rna|blastn
- DB: Reference RNA sequences (refseq_rna)
- Task: Somewhat similar sequences (blastn)
- Result:
- top hit: 99% query cov., 82.04% identiy to NR_076850.1 (Odoribacter splanchnicus 23S ribosomal RNA gene)
- best identity: 92% query cov., 84.32% identity to NR_121995.1 (Draconibacterium orientale strain FH5 23S ribosomal RNA gene)
- nr/nt|megablast
- DB: Nucleotide collection (nr/nt)
- Task: Highly similar sequences (megablast)
- Results:
- top hit: 100% 0.0 95.88% 4925405 CP032819.1 (Butyricimonas faecalis strain H184)
- aligned to positions 86074..88943
- genome's 23S annotations: 86075..88952, complement(1863290..1866167), complement(2360577..2363454), 3992520..3995397, 4268734..4271611
**Conclusions**
- the found rRNA gene is not really "novel"
- hits to rRNA gene reference databases can be misleading as they seem to be still rather incomplete
*UPDATE everything below this line*
# All rRNA genes
## Flye, bacteral rRNA genes
## Flye, bacterial rRNA genes
```bash
tab="lr.flye.barrnap.metat.tsv"
......@@ -167,24 +73,39 @@ done | sort > "${cand_ids}"
seqkit grep -f "${cand_ids}" "${fa}" -o "${cand_fa}"
```
Online BLASTn search:
- DB: Nucleotide collection (nr/nt)
- Task: Highly similar sequences (megablast)
[Online BLASTn search](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch):
- Input: `lr.flye.barrnap.bac.candidates.fa`
- Dadatabase: `Nucleotide collection (nr/nt)`
- Organism: `bacteria (taxid:2)`
- Task: `Highly similar sequences (megablast)` (to limit the search)
- saved to `lr.flye.barrnap.bac.candidates.onlineblast.*`
- had to limit the search because of failed jobs
```python
import pandas
df = pandas.read_csv("lr.flye.barrnap.bac.candidates.onlineblastn.csv", sep=",", header=None, names=["qseqid", "sseqid", "pident", "XXX1", "XXX2", "XXX3", "qstart", "qend", "sstart", "send", "evalue", "score"])
for gr_name, gr_df in df.groupby(by=["qseqid"], axis=0):
gr_df.sort_values(by=["pident"], ascending=[False], axis=0, inplace=True)
if all(gr_df.head(1)["pident"] < 90):
print(gr_df.head(1)[["qseqid", "sseqid", "pident"]])
```
For some queries, the top 3 hits had less than 90% identity:
- 23S_rRNA::contig_5059:1.0-21953.0:10957-13837(-)
- 23S_rRNA::contig_1516:1.0-109591.0:20346-23183(-)
- 23S_rRNA::contig_2177:1.0-26740.0:14426-17286(+)
- 5S_rRNA::contig_135:6.0-295493.0:5190-5294(+)
- 5S_rRNA::contig_273:1.0-213900.0:199799-199903(-)
- 5S_rRNA::contig_87:1.0-404736.0:402545-402649(-)
- 23S_rRNA::contig_3372:1.0-107743.0:105359-108115(-) LR699004.1 85.714
- 23S_rRNA::contig_4970:1.0-23257.0:10967-13847(-) CP003259.1 88.789
- 23S_rRNA::contig_701:1.0-109585.0:20807-23643(-) LR698981.1 89.297
```bash
# fasta file from contig IDs of above rRNA genes
for cid in "contig_5059:1.0-21953.0" "contig_1516:1.0-109591.0" "contig_2177:1.0-26740.0" "contig_135:6.0-295493.0" "contig_273:1.0-213900.0" "contig_87:1.0-404736.0"; do
for cid in "contig_3372:1.0-107743.0" "contig_4970:1.0-23257.0" "contig_701:1.0-109585.0"; do
grep -A 1 -F ">${cid}" /scratch/users/vgalata/gdb/results/assembly/lr/flye/ASSEMBLY.POLISHED.fasta
done > lr.flye.barrnap.bac.candidates.contigs.fasta
done > lr.flye.barrnap.bac.candidates.subset.contigs.fasta
```
Online BLASTn search (nr/nt|blastn): "Your total query length is greater than allowed on the BLAST webserver [...]"
[Online BLASTn search](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch) for contigs:
- Input: `lr.flye.barrnap.bac.candidates.subset.contigs.fasta`
- Dadatabase: `Nucleotide collection (nr/nt)`
- Organism: `bacteria (taxid:2)`
- Task: `Highly similar sequences (megablast)`
- Saved to `lr.flye.barrnap.bac.candidates.subset.contigs.onlineblast.*`
- Error: "Error: Process size limit exceeded, resulting in SIGXFSZ (25)."
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment