Commit 712d2fea authored by Valentina Galata's avatar Valentina Galata
Browse files

notes: added notes for GDB and barrnap/metat analysis (incl a script)

parent 43d6b1c4
#!/usr/bin/env python
# Given GFF file from barrnap and per-base coverage file, compute the average coverage for the features in the GFF file
# ARGS:
# 1) GFF file
# 2) cov. file
# 3) output file
import os
import sys
import pandas
def parse_barrnap_gff(ifile_path):
"""
Parse GFF3 files created by barrnap (rRNA gene prediction from given contigs)
GFF format: https://www.ensembl.org/info/website/upload/gff.html
"""
# import pandas
genes = pandas.read_csv(ifile_path, sep="\t", comment="#", header=None, names=["contig_id", "source", "feature", "start", "end", "score", "strand", "frame", "info"])
genes = genes[["contig_id", "start", "end", "strand", "info"]]
genes["prot_num"] = genes["start"].astype(str) + "_" + genes["end"].astype(str)
genes["prot_id"] = genes["contig_id"] + "__" + genes["prot_num"]
assert (genes.start < genes.end).all() # TODO: TEST
return genes
def ave_gene_cov(cov_file, genes):
"""
Compute average coverage from given per-base coverage file and a list of genes
Genes must be in a table as created by parse_prodigal_faa_headers()
"""
genes = genes.sort_values(by=["contig_id", "prot_num"])
genes = genes.assign(ave_cov=None)
with open(cov_file, "r") as ifile:
last_contig_id = ""
last_contig_cov = [] # array of cov. values for bases 1..N (idx 0..N-1)
for line in ifile:
contig_id, base, coverage = line.strip().split("\t")
if contig_id != last_contig_id:
# ave. cov. for genes from this contig
contig_genes = genes.contig_id == last_contig_id
if contig_id != "":
genes.loc[contig_genes, "ave_cov"] = genes.loc[contig_genes,:].apply(lambda x: sum(last_contig_cov[(x.start-1):x.end]) / (x.end-x.start+1), axis=1)
# reset
last_contig_id = contig_id
last_contig_cov = []
last_contig_cov.append(float(coverage))
return genes
if __name__ == "__main__":
print("GFF: %s, COV: %s, OUTPUT: %s" % (sys.argv[1], sys.argv[2], sys.argv[3]))
genes = parse_barrnap_gff(sys.argv[1])
genes.drop_duplicates(keep='first', inplace=True)
genes = ave_gene_cov(sys.argv[2], genes)
genes.to_csv(sys.argv[3], sep="\t", header=True, index=False, index_label=False)
# About
Analysis of some rRNA genes detected by `barrnap` in GDB.
One of the preprocessing steps in `workflow` was to remove rRNA reads using `bbduk` from `bbmap`.
However, the used (default) parameters are sensitive to sequence variation,
reads which were somewhat different from the used references were not removed.
This can be exploited to find "novel" rRNA genes not represented by the used references.
Based on files generated by `workflow`.
# Average metaT coverage of rRNA gene hits
Help file `tools.txt`:
```text
lr flye
sr megahit
sr metaspades
hy metaspadeshybrid
hy operamsmegahit
hy operamsmetaspades
lr raven
```
Script to compute average coverage from per-base coevarge file and GFF file generated by `barrnap`: `notes/barrnap_rrna_avecov.py`
For each assembly, get the metaT coverage and GFF files and compute average metaT coverage:
```bash
cat "tools.txt" | while read rtype tool; do
cov="/scratch/users/vgalata/gdb/results/mapping/metat/${rtype}/${tool}/ASSEMBLY.POLISHED.sr.cov.perbase"
gff="${rtype}.${tool}.barrnap.gff"
cat $(find "../../annotation/barrnap/${rtype}/${tool}/" -type f -name "*.gff") | grep -v "^#" > ${gff}
echo "${gff}, ${cov}"
python ~/projects/ONT_pilot/notes/barrnap_rrna_avecov.py ${gff} ${cov} "${rtype}.${tool}.barrnap.metat.tsv"
done
```
Printing number of number of rRNA hits (+ header) and those with ave. metaT cov. above 50:
```bash
find . -type f -name "*.metat.tsv" | sort | while read f; do
echo "$(wc -l ${f}), $(awk -F'\t' '$8 > 50 {print $0}' ${f} | wc -l)"
done
# 712 ./hy.metaspadeshybrid.barrnap.metat.tsv, 74
# 743 ./hy.operamsmegahit.barrnap.metat.tsv, 88
# 633 ./hy.operamsmetaspades.barrnap.metat.tsv, 75
# 952 ./lr.flye.barrnap.metat.tsv, 117
# 1399 ./lr.raven.barrnap.metat.tsv, 120
# 540 ./sr.megahit.barrnap.metat.tsv, 74
# 413 ./sr.metaspades.barrnap.metat.tsv, 64
```
Note that
- duplicate/identical rRNA gene hits were removed
- i.e. those appearing in multiple GFF files with identical information (start/end, strand, attributes etc.)
- number of rows includes also the header
# Bacterial 23S rRNA gene on contig contig_4826:1.0 (LR, Flye assembly)
**NOTE: Looked promising but seems to be not very "novel" after all: see below.**
As an example of a rRNA gene with high metaT coverage.
Average metaT coverage for rRNA genes found on contig `contig_4826:1.0` in `Flye` assembly:
```bash
# metaT coverage
grep -F "contig_4826:1.0-36062.0" lr.flye.barrnap.metat.tsv
# contig_4826:1.0-36062.0 31057 31160 - Name=5S_rRNA;product=5S ribosomal RNA 31057_31160 contig_4826:1.0-36062.0__31057_31160 0.0
# contig_4826:1.0-36062.0 31374 34249 - Name=23S_rRNA;product=23S ribosomal RNA 31374_34249 contig_4826:1.0-36062.0__31374_34249 8817.937413073714
# contig_4826:1.0-36062.0 31379 34238 - Name=23S_rRNA;product=23S ribosomal RNA 31379_34238 contig_4826:1.0-36062.0__31379_34238 8867.245454545455
# contig_4826:1.0-36062.0 31468 32477 - Name=28S_rRNA;product=28S ribosomal RNA (partial);note=aligned only 34 percent of the 28S ribosomal RNA 31468_32477 contig_4826:1.0-36062.0__31468_32477 234.7792079207921
# contig_4826:1.0-36062.0 31580 32477 - Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 56 percent of the 16S ribosomal RNA 31580_32477 contig_4826:1.0-36062.0__31580_32477 264.0612472160356
# contig_4826:1.0-36062.0 32498 33547 - Name=28S_rRNA;product=28S ribosomal RNA (partial);note=aligned only 36 percent of the 28S ribosomal RNA 32498_33547 contig_4826:1.0-36062.0__32498_33547 23926.06
# contig_4826:1.0-36062.0 33136 33794 - Name=16S_rRNA;product=16S ribosomal RNA (partial);note=aligned only 41 percent of the 16S ribosomal RNA 33136_33794 contig_4826:1.0-36062.0__33136_33794 0.8209408194233687
# contig_4826:1.0-36062.0 34563 36108 - Name=16S_rRNA;product=16S ribosomal RNA 34563_36108 contig_4826:1.0-36062.0__34563_36108 0.0
# contig_4826:1.0-36062.0 34564 36100 - Name=16S_rRNA;product=16S ribosomal RNA 34564_36100 contig_4826:1.0-36062.0__34564_36100 0.0
# contig_4826:1.0-36062.0 34570 35425 - Name=18S_rRNA;product=18S ribosomal RNA (partial);note=aligned only 45 percent of the 18S ribosomal RNA 34570_35425 contig_4826:1.0-36062.0__34570_35425 0.0
# contig_4826:1.0-36062.0 34571 35605 - Name=12S_rRNA;product=12S ribosomal RNA 34571_35605 contig_4826:1.0-36062.0__34571_35605 0.0
```
- some hits are only partial
- some hits have high ave. metaT coverage
If ignoring the partial hits there are two similar 23S rRNA gene hits with high ave. metaT cov.:
- contig_4826:1.0-36062.0, 31374-34249 <-- hit to bacterial database
- contig_4826:1.0-36062.0, 31379-34238 <-- hit to archeal database
```bash
grep -F "contig_4826:1.0-36062.0" /scratch/users/vgalata/gdb/results/annotation/barrnap/lr/flye/bac.gff
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 31057 31160 8.9e-10 - . Name=5S_rRNA;product=5S ribosomal RNA
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 31374 34249 0 - . Name=23S_rRNA;product=23S ribosomal RNA
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 34563 36108 2.5e-290 - . Name=16S_rRNA;product=16S ribosomal RNA
grep -F "contig_4826:1.0-36062.0" /scratch/users/vgalata/gdb/results/annotation/barrnap/lr/flye/arc.gff
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 31057 31160 8.9e-10 - . Name=5S_rRNA;product=5S ribosomal RNA
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 31379 34238 0 - . Name=23S_rRNA;product=23S ribosomal RNA
# contig_4826:1.0-36062.0 barrnap:0.9 rRNA 34564 36100 9.3e-163 - . Name=16S_rRNA;product=16S ribosomal RNA
```
Focusing on the bacterial hit, i.e. `contig_4826:1.0-36062.0:31373-34249(-)`:
Nucl. sequence from `/scratch/users/vgalata/gdb/results/annotation/barrnap/lr/flye/bac.fa`:
```fasta
>23S_rRNA::contig_4826:1.0-36062.0:31373-34249(-)
ACGATTAAGGGCGCGCGGTGGATGCCTAGGCCGGAGGCGAAGAAGGACGCGATAAGCCGCGATAGTCCACGGCGAGGTGCAAAACCCTTGACCCGTGGGTTTCCGAATGGGGGCAACCCGGCCGTCTAGATGACGGTCATGCATTATCATGCAGGCGAACGCGGGGAACTGAAACATCTCATTACCCGCAGGAGAAGAAAACAAAATGTGATTCCCCCAGTAGTGGCGAGCGAACGGGGAAGAGCCCAAACCGCTGCCGTTTCGGCGACGGCGGGGTCATGGGACCGCGACATTCGACGGAAGACGAAGTGCAATTACCTGGAAAGGTAAGTCGTGGAGGGTGACAGCCCCGTGCACGACACCTCTTCTTAGATAGCGGAACTCGAGTAGGGCGGGACACGAGAAATCCTGCCTGAATTCGCCAGGACCTCCTGGCAAGGCTAAATACTCCTGAGAGACCGATAGTGAACCAGTACCGCGAGGGAAAGGTGAAAAGTACCCCGAGCAGGGGGTGAAATAGACCCTGAACCCGCGCGCCTACAAGCGGTCGGAGCCACGCGATGTGGTGACGGCGTGCCTTTGCATAATGAGCCTACGAGTTAGTCGTCACCAGCGAGGTTAAGGGCTTCAGGCCCGTCAGCCGAAGTGAAAGCGAGCCTTAACAGGGCGTAAAGTTGGTGGCGCTAGACGCTAAACCTTGTGGATCTACCCACGAGCAGGTTGAAGTCGCGGTAACACGTGATGGAGGACCGAACCGGTAAACGTTGAAAAGTTTTCGGATGACTCGCGGGTAGGGGTGAAAGGCCAATCAAACTGGGAAATAGCTCGTACCTCCGAAATGCATTTCGGTGCAGCCTGTGATGTTCTGTTCCAGAGGTAGAGCTACTGGTTGGACGCGAGGGCTTCACCGCCTATCAAATCCGGATAAACCCGAATGCTGGAACACGAAATCATGGAGTGAGCCCGCGGGTGCTAAGGTCCGCGGACGAGAGGGAAGAACCCGGACCACCGGCTAAGGTCCCGGATGGACAGTTAAGTTGATCAAACGAGGTGGGATCGCGGAGACAGCTAGGATGTTGGCTTGGAAGCAGCCATTCATTTAAAGAGTGCGTAATGTTACTCACTAGTCGAGCGATCCCGCTGTGGGATAATACACGGGCATAAAACTGTCAACCGAAGCCGTGGGGTTAGCTTTATTGCTAACCGGTAGGAGAGCATTCCTGCCAGCTTTGAAGGTCACCCGCGAGAGTGACTGGAGCGGCATGAAAAGCAAATGTAGGCATAAGTAACGATAATGGGGGCGAGAACCCCCCACGCCGCAAGACCAAGGCTTCCCCGGCAATGTCAATCAGCCGGGGGTCAGCCGGCCTCTAAGGCTCACCCGAATGGGGATGCCGACGAGAAACGGGTTAACATTCCCGTGCTTCTCGTCACCGTGACGCGGTGACGGGGTGATGAATGGACCGCGCGCTGACGGAATAGCGCGTTGAAGGCCGTACCCTTTGAGGGTGGTAGTCAAGCACGCCACCCGAGGCGACAGCCGATAGTACCTCGAGGCCTCGGCCGAGGGGATAGCGTCCAGGCAATTAACCCCCTAGAAAACCCGCTAAACTTCAAGTGACGAGGAACCGTACTGTAAACGGACACACGTGGTCGGGTAGAACATACCAAGGCGCTCTGAGTGATTCATGGTCAAGGAACTAGGCAAAATAGTCCTGTAACTTAGGGATAAAGGACGCTCTAGTGATAGAGCCGCAGAGAAATGGCCCAGGCGACTGTTTACCAAAAACACATGGCTATGCCAAATCGAAAGATGACGTATATGGCCTGACACCTGCCCGGTGCCGGAAGGTTAAGAGGAGAGCTCATCCGCGAGGAGAAGGTTTGAATTGAAGCCCCGGTAAACGGCGGCCGTAACTATAACGGTCCTAAGGTAGCGAAATTCCTTGTCGGGTAAGTTCCGACCTGCACGAATGGTGTAACGATCTGGGCACTGTCTCGACCATGAGACCCGGTGAAATTGTAGTTCCGGTGAAGATGCCGGGTACCCGCGACGGGACGGAAAGACCCCGTGAACCTTTACTGCAGCTTCGCGTTGTTCCCGGGCACGGGACGTGTAGGATAGGTCGGAGGCTACGATTCGGTTTCGCCAGGAATCGAGGAGCCATCGTTGAAATACGACCCTTCTCTTGTCTGGGATCTAACCCTGGTTATAGGGGACACCGCGTGGTGGGTAGTTTGACTGGGGGTGGTCGCCTCCAAAAAGCGTAACGGAGGCTTCCAAAGGTACCCTCGGGTTGGTTGGTAATCAACCTCAGAGTGCAATGGCACAAGGGTGCTTGACCGGGAGACCAACGAGTCGATCGGGTGCGAAAGCAGGGCATAGTGATCCGGTGATCCCGCGTGGAAGGGTCATCGCTCAAAGGATAAAAGGTACTCCGGGGATAACAGGCTGATCGCCCCCAAGAGCTCATATCGACGGGGCGGTTTGGCACCTCGATGTCGGCTCGTCACATCCTGGGGCTGGAGAAGGTCCCAAGGGTTCGGCTGTTCGCCGATTAAAGTGGCACGCGAGCCTGGGTTCAGAACGTCGTGAGACAGTTCGGTCCCTATCTGTCGTGGGCGCTGGAGATTCGAGAGGTCCTGACTCTAGTACGAGAGGACCGGGTTGGACGCGCCCGCTAGTGAACCTGTTATGACGCCAGTCGTACGGCAGGGTAGCCACGCGCCGGACGGGATAAGCGCTGAAAGCATCTAAGCGCGAAGCCTGCCTCGAGATGAGATCTCCTTACAGGGTCGTCGTAGACGACGACGTTGATAGGCCACAGGTGTAAAGCCGGTGACGGCAAAGCCGAGTGGTACTAATCACCCGAAAG
```
Saved to `lr.flye.barrnap.fasta`.
## BLASTn search in local rRNA references
To confirm that this rRNA gene is not in the reference FASTA files used to remove rRNA reads:
```bash
# conda env. containing blast
conda create -n blast && conda activate blast && conda install -c conda-forge -c bioconda blast=2.10.1
# blast database
cat /scratch/users/vgalata/ONT_pilot_DBs/sortmerna/*.fasta > sortmerna.fasta
makeblastdb -in sortmerna.fasta -title "Rfam and SILVA" -dbtype nucl
# blast search
blastn -query lr.flye.barrnap.fasta -task blastn -db sortmerna.fasta -out lr.flye.barrnap.localblast.asn -outfmt 11 -perc_identity 50
header='qseqid qstart qend qlen sseqid sstart send slen bitscore evalue pident mismatch gaps qcovs qcovhsp qcovus'
blast_formatter -archive lr.flye.barrnap.localblast.asn -out lr.flye.barrnap.localblast.tsv -outfmt "6 $header"
sed -i "1 i\\$(echo -e $header | sed 's/\s/\t/g')" lr.flye.barrnap.localblast.tsv
```
Created `sortmerna.fasta*` files were removed afterwards.
- Results:
- best identity: 84.936% to (AnpTher5, from `sortmerna`'s file `silva-bac-23s-id98.fasta`)
## Binning analysis
File: `ONTP_flye_contig_4826_1.0-36062.0_SB_Analyses.zip`
## BLASTn search online
- refseq_rna|blastn
- DB: Reference RNA sequences (refseq_rna)
- Task: Somewhat similar sequences (blastn)
- Result:
- top hit: 99% query cov., 82.04% identiy to NR_076850.1 (Odoribacter splanchnicus 23S ribosomal RNA gene)
- best identity: 92% query cov., 84.32% identity to NR_121995.1 (Draconibacterium orientale strain FH5 23S ribosomal RNA gene)
- nr/nt|megablast
- DB: Nucleotide collection (nr/nt)
- Task: Highly similar sequences (megablast)
- Results:
- top hit: 100% 0.0 95.88% 4925405 CP032819.1 (Butyricimonas faecalis strain H184)
- aligned to positions 86074..88943
- genome's 23S annotations: 86075..88952, complement(1863290..1866167), complement(2360577..2363454), 3992520..3995397, 4268734..4271611
**Conclusions**
- the found rRNA gene is not really "novel"
- hits to rRNA gene reference databases can be misleading as they seem to be still rather incomplete
# All rRNA genes
## Flye, bacteral rRNA genes
```bash
tab="lr.flye.barrnap.metat.tsv"
fa="/scratch/users/vgalata/gdb/results/annotation/barrnap/lr/flye/bac.fa"
cand_ids="lr.flye.barrnap.bac.candidates.txt"
cand_fa="lr.flye.barrnap.bac.candidates.fa"
# ignore partial rRNA genes | filter by ave. metaT cov | grep for IDs in FASTA files
grep -v "partial" "${tab}" | awk -F'\t' '$8 > 50 {print $1":"$2-1"-"$3"("$4")"}' | while read sid; do
grep -F "${sid}" "${fa}" | sed 's/^>//'
done | sort > "${cand_ids}"
# using seqkit=0.15.0
seqkit grep -f "${cand_ids}" "${fa}" -o "${cand_fa}"
```
Online BLASTn search:
- DB: Nucleotide collection (nr/nt)
- Task: Highly similar sequences (megablast)
- saved to `lr.flye.barrnap.bac.candidates.onlineblast.*`
For some queries, the top 3 hits had less than 90% identity:
- 23S_rRNA::contig_5059:1.0-21953.0:10957-13837(-)
- 23S_rRNA::contig_1516:1.0-109591.0:20346-23183(-)
- 23S_rRNA::contig_2177:1.0-26740.0:14426-17286(+)
- 5S_rRNA::contig_135:6.0-295493.0:5190-5294(+)
- 5S_rRNA::contig_273:1.0-213900.0:199799-199903(-)
- 5S_rRNA::contig_87:1.0-404736.0:402545-402649(-)
```bash
# fasta file from contig IDs of above rRNA genes
for cid in "contig_5059:1.0-21953.0" "contig_1516:1.0-109591.0" "contig_2177:1.0-26740.0" "contig_135:6.0-295493.0" "contig_273:1.0-213900.0" "contig_87:1.0-404736.0"; do
grep -A 1 -F ">${cid}" /scratch/users/vgalata/gdb/results/assembly/lr/flye/ASSEMBLY.POLISHED.fasta
done > lr.flye.barrnap.bac.candidates.contigs.fasta
```
Online BLASTn search (nr/nt|blastn): "Your total query length is greater than allowed on the BLAST webserver [...]"
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment