Commit f0ae27df authored by Valentina Galata's avatar Valentina Galata
Browse files

notes/figures: updated rgi/card example for gdb/flye (issue #124)

parent 807358cf
# About # About
Notes for sample `GDB` and its RGI (CARD) results. Notes for sample `GDB` and its RGI (CARD) results.
Based on file `report/amr.tsv` created by `workflow_report`. Based on results produced by `workflow` and the file `report/amr.tsv` created by `workflow_report`.
# Hit contig_4930:1.0-5516.0_1 (LR, Flye), ARO:3004454 # Hits in contig_5026:14.0-10236.0 (LR, Flye) to ARO:3004454
Hit of protein `contig_4930:1.0-5516.0_1` from `GDB`'s LR-assembly constructed with `Flye` to `ARO:3004454`. Two hits in contig `contig_5026:14.0-10236.0` from `GDB`'s LR-assembly constructed with `Flye` to `ARO:3004454`:
- `contig_5026:14.0-10236.0_8`
- `contig_5026:14.0-10236.0_9`
## ARO:3004454 ## ARO:3004454
...@@ -27,163 +29,167 @@ TTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTA ...@@ -27,163 +29,167 @@ TTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTA
TCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA TCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA
``` ```
## Matched protein ## Matched proteins
Protein `contig_4930:1.0-5516.0_1` the first protein-coding gene on contig `contig_4930:1.0-5516.0` (LR, `Flye`). Protein `contig_4930:1.0-5516.0_1` the first protein-coding gene on contig `contig_4930:1.0-5516.0` (LR, `Flye`).
The first two genes from the same contig: Protein information from `Prodigal`'s FAA file:
``` ```bash
>contig_4930:1.0-5516.0_1 # 182 # 448 # 1 # ID=4060_1;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.356 grep "^>contig_5026:14.0-10236.0_" /scratch/users/vgalata/gdb/results/annotation/prodigal/lr/flye/proteins.faa
>contig_4930:1.0-5516.0_2 # 1073 # 2317 # 1 # ID=4060_2;partial=00;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.553 # >contig_5026:14.0-10236.0_8 # 5547 # 5816 # 1 # ID=4198_8;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=3-4bp;gc_cont=0.356
# >contig_5026:14.0-10236.0_9 # 5788 # 6174 # 1 # ID=4198_9;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.377
``` ```
For `contig_4930:1.0-5516.0_1`, protein (from `Prodigal`) and gene (extracted from contig) sequences: - consecutive CDSs
``` - both CDS overlap
>contig_4930:1.0-5516.0_1 # 182 # 448 # 1 # ID=4060_1;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.356
MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS*
>contig_4930:1.0-5516.0_1 # 182 # 448 Nucleotide and protein sequences of both CDS/proteins:
ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAA
``` ```
>contig_5026:14.0-10236.0_8_nucl
ATGATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCAC
TATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTG
AAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAAT
CGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGA
AATGCTGCCTTGCTACACAATCTTTCATAA
Protein `contig_4930:1.0-5516.0_1` clusters only with itself (based on `mmseqs2` output, created by `Workflow`): >contig_5026:14.0-10236.0_9_nucl
```bash ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACT
grep "contig_4930:1\.0-5516\.0_1" clusters.tsv GAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTAT
# flye__contig_4930:1.0-5516.0_1 flye__contig_4930:1.0-5516.0_1 GGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCT
``` ATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTAT
CTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCC
TTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGAT
GAATTACAAGACTTGCTGAATAAATAA
## Looking for alternative protein coding sequences >contig_5026:14.0-10236.0_8_prot
MMQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILN
RHEEFRTALDKNGQVGVFFRNAALLHNLS*
First 1000 bases of the contig `contig_4930:1.0-5516.0`: >contig_5026:14.0-10236.0_9_prot
``` MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYGERKGMFAKPNPPENTFPVS
>contig_4930:1.0-5516.0 1-1000 MIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPLSIQVHHAVCDGFHVCRFLD
CATTGGTTGCCGCCCAGCAGACGGCACAGGCTGCCCTAGGCACTTCACGCGGGAACAATACAGCAGCAGCTTACCTCCGGTACTGATCAGAGCGAGGGAAGTGGAGACGAGCCGAAAAAACAGGGATGGTTTAGTCGGTTGTTTCGTGGTTAGATAAATCAGGATTTGCGGAGGATAAATGATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTATGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAGGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAAAATCCAAGTTTGTCGTGGAAACGGGGAGCCGCCCGCAGGCGGCAGGGGCAGCGCCCCTCCGGAGCAGTCCGGCTAATGGCGCATAAGCGTCGAGGCAGGCTGCTCCGGAAACCCTCCCCTTTAGGGGCGCAAAGCACTTTCCTACTGCTGCACCAACGGTGCTGCGGTGGGAAAGTGCTTTTGCGTTACTTTCT ELQDLLNK*
``` ```
extracted using Nucleotide sequences were extracted from the relevant part of the contig:
```bash ```bash
grep -A1 ">contig_4930:1.0-5516.0" assembly/lr/flye/ASSEMBLY.POLISHED.fasta | sed -r 's/(.{100})/\1\n/g' # grep for the contig ID + next line (seq) | get seq |
# cut into 100bp chunks | extract relevant rows
grep -A 1 "contig_5026:14.0-10236.0$" /scratch/users/vgalata/gdb/results/assembly/lr/flye/ASSEMBLY.POLISHED.fasta | tail -n 1 | sed -r 's/(.{100})/\1\n/g' | awk 'NR > 55 && NR < 63 {print $0}'
``` ```
and checked by translating them using the [ExPASy tool](https://web.expasy.org/translate/).
Using a translation tool [ExPASy](https://web.expasy.org/translate/), two longest translations with a start and stop codon: These proteins clustered only with one other protein, also from `Flye` assembly:
```bash
grep -A 1 "contig_5026:14.0-10236.0_8" /scratch/users/vgalata/gdb/results/analysis/mmseqs2/clusters.tsv
# flye__contig_5026:14.0-10236.0_8 flye__contig_5026:14.0-10236.0_8
# flye__contig_5027:1.0-43229.0_23 flye__contig_5027:1.0-43229.0_23
grep -A 1 "contig_5026:14.0-10236.0_9" /scratch/users/vgalata/gdb/results/analysis/mmseqs2/clusters.tsv
# flye__contig_5026:14.0-10236.0_9 flye__contig_5026:14.0-10236.0_9
# flye__contig_5027:1.0-43229.0_24 flye__contig_5027:1.0-43229.0_24
``` ```
> contig_4930:1.0-5516.0_1 (VIRT-56253:5'3' Frame 2, start_pos=60) - both clustered proteins are neighbors from the same contig
MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS* - both clustered proteins have no `RGI` hit
> contig_4930:1.0-5516.0_1NEW (VIRT-7066:5'3' Frame 3, start_pos=139) MetaT coverage for the contig region containing both genes:
MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYGERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK* ```bash
grep -P "^contig_5026:14\.0-10236\.0\t" /scratch/users/vgalata/gdb/results/mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase | awk -F'\t' '$2 >= 5547 && $2 <= 6174 {print $0}' | less
``` ```
The first protein is the same as the protein predicted by `Prodigal`, i.e. it is `contig_4930:1.0-5516.0_1`. - both genes are covered by metaT
The second protein was **not** predicted by `Prodigal` - we will label it as `contig_4930:1.0-5516.0_1NEW`.
Complete output of `ExPASy` for these two translations is in `lr_flye_contig_4930:1.0-5516.0_1-1000_translated.txt`.
Nucleotide (gene) sequence of the "new" protein `contig_4930:1.0-5516.0_1NEW`:
```
>contig_4930:1.0-5516.0_1NEW # 420 # 806
atgctgccttgctacacaatctttcataaggaaactgaaaccttttcgagtatttggactgagtttacagcagactatactgagtttcttcagaactatcaaaaggatatagacgcttatggtgaacgaaagggaatgttcgcaaagcctaatcctccggaaaacactttccctgtttctatgataccgtggacaagctttgaaggctttaacttaaatctaaaaaagggatatgactatctactgccgatatttacgtttgggaagtattatgaggatggcggaaaatactatattcccttatcgattcaagtgcatcatgccgtttgtgatggctttcatgtttgccgttttttggatgaattacaagacttgctgaataaataa
```
## MSA w/ ARO:3004454 ## MSA w/ ARO:3004454
Multiple-sequence alignment using [EMBOSS Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/). Multiple-sequence alignment using [EMBOSS Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/) (`CLUSTAL O(1.2.4)`).
``` ```
VIRT-56253:5'3' MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNR 60 Protein alignment:
gb|AAA23018.1|+|Campylobacter MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIINR 60
VIRT-7066:5'3' ------------------------------------------------------------ 0
contig_5026:14.0-10236.0_8_prot MMQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILN 60
gb|AAA23018.1|+|Campylobacter -MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIIN 59
contig_5026:14.0-10236.0_9_prot ------------------------------------------------------------ 0
VIRT-56253:5'3' HEEFRTALDKNGQVGVFFRNAALLHNLS*------------------------------- 88 contig_5026:14.0-10236.0_8_prot RHEEFRTALDKNGQVGVFFRNAALLHNLS*------------------------------ 89
gb|AAA23018.1|+|Campylobacter HEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAFG 120 gb|AAA23018.1|+|Campylobacter RHEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAF 119
VIRT-7066:5'3' -------------------MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYG 41 contig_5026:14.0-10236.0_9_prot --------------------MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAY 40
:
contig_5026:14.0-10236.0_8_prot ------------------------------------------------------------ 89
gb|AAA23018.1|+|Campylobacter GERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIP 179
contig_5026:14.0-10236.0_9_prot GERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIP 100
VIRT-56253:5'3' ------------------------------------------------------------ 88 contig_5026:14.0-10236.0_8_prot ----------------------------- 89
gb|AAA23018.1|+|Campylobacter ERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIPL 180 gb|AAA23018.1|+|Campylobacter LSIQVHHAVCDGFHVCRFLDELQDLLNK- 207
VIRT-7066:5'3' ERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPL 101 contig_5026:14.0-10236.0_9_prot LSIQVHHAVCDGFHVCRFLDELQDLLNK* 128
VIRT-56253:5'3' ---------------------------- 88 DNA alignment:
gb|AAA23018.1|+|Campylobacter SIQVHHAVCDGFHVCRFLDELQDLLNK- 207
VIRT-7066:5'3' SIQVHHAVCDGFHVCRFLDELQDLLNK* 128
```
``` contig_5026:14.0-10236.0_8_nucl ATGATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCAC 60
contig_4930:1.0-5516.0_1 ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTAT 60 gb|M35190.1|+|309-932|Campylobacter ---ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCAC 57
gb|M35190.1|+|309-932|Campylobacter ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTAT 60 contig_5026:14.0-10236.0_9_nucl ------------------------------------------------------------ 0
contig_4930:1.0-5516.0_1NEW ------------------------------------------------------------ 0
contig_4930:1.0-5516.0_1 TTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAA 120 contig_5026:14.0-10236.0_8_nucl TATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTG 120
gb|M35190.1|+|309-932|Campylobacter TTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAA 120 gb|M35190.1|+|309-932|Campylobacter TATTTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTG 117
contig_4930:1.0-5516.0_1NEW ------------------------------------------------------------ 0 contig_5026:14.0-10236.0_9_nucl ------------------------------------------------------------ 0
contig_4930:1.0-5516.0_1 AAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGA 180 contig_5026:14.0-10236.0_8_nucl AAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAAT 180
gb|M35190.1|+|309-932|Campylobacter AAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAATCGA 180 gb|M35190.1|+|309-932|Campylobacter AAAAAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAAT 177
contig_4930:1.0-5516.0_1NEW ------------------------------------------------------------ 0 contig_5026:14.0-10236.0_9_nucl ------------------------------------------------------------ 0
contig_4930:1.0-5516.0_1 CACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAAT 240 contig_5026:14.0-10236.0_8_nucl CGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGA 240
gb|M35190.1|+|309-932|Campylobacter CATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGAAAT 239 gb|M35190.1|+|309-932|Campylobacter CGACATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGA 236
contig_4930:1.0-5516.0_1NEW ----------------------------------------------------------at 2 contig_5026:14.0-10236.0_9_nucl ------------------------------------------------------------ 0
**
contig_4930:1.0-5516.0_1 GCTGCCTTGCTACACAATCTTTCATAA--------------------------------- 267 contig_5026:14.0-10236.0_8_nucl AATGCTGCCTTGCTACACAATCTTTCATAA------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter GCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGA 299 gb|M35190.1|+|309-932|Campylobacter AATGCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGAC 296
contig_4930:1.0-5516.0_1NEW gctgccttgctacacaatctttcataaggaaactgaaaccttttcgagtatttggactga 62 contig_5026:14.0-10236.0_9_nucl -ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGAC 59
**************** * ******** ****************** * ********
contig_4930:1.0-5516.0_1 ------------------------------------------------------------ 267 contig_5026:14.0-10236.0_8_nucl ------------------------------------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter GTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTTTGG 359 gb|M35190.1|+|309-932|Campylobacter TGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTT 356
contig_4930:1.0-5516.0_1NEW gtttacagcagactatactgagtttcttcagaactatcaaaaggatatagacgcttatgg 122 contig_5026:14.0-10236.0_9_nucl TGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTA 119
contig_4930:1.0-5516.0_1 ------------------------------------------------------------ 267 contig_5026:14.0-10236.0_8_nucl ------------------------------------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter TGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTAT 419 gb|M35190.1|+|309-932|Campylobacter TGGTGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTC 416
contig_4930:1.0-5516.0_1NEW tgaacgaaagggaatgttcgcaaagcctaatcctccggaaaacactttccctgtttctat 182 contig_5026:14.0-10236.0_9_nucl TGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTC 179
contig_4930:1.0-5516.0_1 ------------------------------------------------------------ 267 contig_5026:14.0-10236.0_8_nucl ------------------------------------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter GATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTATCT 479 gb|M35190.1|+|309-932|Campylobacter TATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTA 476
contig_4930:1.0-5516.0_1NEW gataccgtggacaagctttgaaggctttaacttaaatctaaaaaagggatatgactatct 242 contig_5026:14.0-10236.0_9_nucl TATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTA 239
contig_4930:1.0-5516.0_1 ------------------------------------------------------------ 267 contig_5026:14.0-10236.0_8_nucl ------------------------------------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter ACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCCCTT 539 gb|M35190.1|+|309-932|Campylobacter TCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCC 536
contig_4930:1.0-5516.0_1NEW actgccgatatttacgtttgggaagtattatgaggatggcggaaaatactatattccctt 302 contig_5026:14.0-10236.0_9_nucl TCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCC 299
contig_4930:1.0-5516.0_1 ------------------------------------------------------------ 267 contig_5026:14.0-10236.0_8_nucl ------------------------------------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter ATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGA 599 gb|M35190.1|+|309-932|Campylobacter CTTATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGA 596
contig_4930:1.0-5516.0_1NEW atcgattcaagtgcatcatgccgtttgtgatggctttcatgtttgccgttttttggatga 362 contig_5026:14.0-10236.0_9_nucl CTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGA 359
contig_4930:1.0-5516.0_1 ------------------------- 267 contig_5026:14.0-10236.0_8_nucl ---------------------------- 270
gb|M35190.1|+|309-932|Campylobacter ATTACAAGACTTGCTGAATAAATAA 624 gb|M35190.1|+|309-932|Campylobacter TGAATTACAAGACTTGCTGAATAAATAA 624
contig_4930:1.0-5516.0_1NEW attacaagacttgctgaataaataa 387 contig_5026:14.0-10236.0_9_nucl TGAATTACAAGACTTGCTGAATAAATAA 387
``` ```
Complete alignment output is in `lr_flye_contig_4930:1.0-5516.0_1-1000_translated_alignment.txt`. - also saved in `lr_flye__ARO3004454_contig_5026:14.0-10236.0__clustalo_prot.txt`
A plot for protein sequences created using `BioJS-MSA` was saved to `lr_flye_contig_4930:1.0-5516.0_1-1000_translated_alignment_biojs-msa.png`. - alignments: almost perfect, only few mismatches/gaps
- both CDS/protein sequences cover the complete ARO sequence
The alignments are almost perfect, with only a few mismatches, and both proteins cover the complete ARO sequence.
`contig_4930:1.0-5516.0_1` and `contig_4930:1.0-5516.0_1NEW` have an overlap of 29 bps: `atgctgccttgctacacaatctttcataa`: `contig_5026:14.0-10236.0_8` and `contig_5026:14.0-10236.0_9` have an overlap of 29 bps: `ATGCTGCCTTGCTACACAATCTTTCATAA`:
``` ```
atgctgccttgctacacaatctttcataa AATGCTGCCTTGCTACACAATCTTTCATAA.
.A..A..L..L..H..N..L..S..*. .N..A..A..L..L..H..N..L..S..*.
atgctgccttgctacacaatctttcataa .ATGCTGCCTTGCTACACAATCTTTCATAAG
.M..L..P..C..Y..T..I..F..H. ..M..L..P..C..Y..T..I..F..H..K.
```
## metaT coverage
Using file `mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase` created by `workflow`.
MetaT coverage of first 1000 bases of contig `contig_4930:1.0-5516.0` (i.e. including `contig_4930:1.0-5516.0_1` and `contig_4930:1.0-5516.0_1NEW`):
```bash
# grep for contig id and get first 1000 bases
grep -F "contig_4930:1.0-5516.0" mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase | head -n 1000 | less
``` ```
\ No newline at end of file
...@@ -187,10 +187,10 @@ rule fig_gdb_rgi_aro3004454_flye_metaT: ...@@ -187,10 +187,10 @@ rule fig_gdb_rgi_aro3004454_flye_metaT:
input: input:
os.path.join(config["samples"]["gdb"], "mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase") os.path.join(config["samples"]["gdb"], "mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase")
output: output:
temp("figures/fig_gdb_lr_flye_contig_4930:1.0-5516.0_1-1000_metatcov.tsv") temp("figures/fig_gdb_lr_flye_contig_5026:14.0-10236.0_metatcov.tsv")
shell: shell:
# metaT cov. of contig contig_4930:1.0-5516.0 for bases 1-1000 # metaT cov. of contig contig_4930:1.0-5516.0 for bases 1-1000
"grep -F \"contig_4930:1.0-5516.0\" {input} | awk -F'\\t' '$1 == \"contig_4930:1.0-5516.0\" && $2 >= 1 && $2 <= 1000 {{print $0}}' > {output}" "grep -F \"contig_5026:14.0-10236.0\" {input} | awk -F'\\t' '$1 == \"contig_5026:14.0-10236.0\" && $2 >= 1 && $2 <= 6200 {{print $0}}' > {output}"
rule fig_gdb_rgi_aro3004454_flye: rule fig_gdb_rgi_aro3004454_flye:
input: input:
......
...@@ -20,22 +20,23 @@ suppressMessages(library(viridis)) # color palettes ...@@ -20,22 +20,23 @@ suppressMessages(library(viridis)) # color palettes
############################## DATA ############################## DATA
# metaT coverage: bases 1-1000 of contig contig_4930:1.0 from Flye assembly from GDB sample # metaT coverage from base 1 to XXX of contig contig_5026:14.0-10236.0 from Flye assembly from GDB sample
metat_cov <- read.csv(snakemake@input$metat_cov, header=FALSE, sep="\t", col.names=c("contigID", "pos", "cov"), stringsAsFactors=FALSE) metat_cov <- read.csv(snakemake@input$metat_cov, header=FALSE, sep="\t", col.names=c("contigID", "pos", "cov"), stringsAsFactors=FALSE)
print(sprintf("metaT cov: from %d to %d", min(metat_cov$pos), max(metat_cov$pos)))
# gene/protein sequences to be plotted # gene/protein sequences to be plotted
seqs <- data.frame( seqs <- data.frame(
sid=c("contig_4930:1.0-5516.0_1", "ARO:30004454", "contig_4930:1.0-5516.0_1NEW"), sid=c("contig_5026:14.0-10236.0_8", "ARO:30004454", "contig_5026:14.0-10236.0_9"),
start=c(182,182,420), start=c(5547,5547,5788),
end=c(448,806,806), end=c(5816,6174,6174),
nucl=c( nucl=c(
"ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAA", "ATGATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAA",
"ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAATCGACATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGAAATGCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTTTGGTGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA", "---ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAATCGACATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGAAATGCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTTTGGTGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA",
"ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTATGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAGGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA" "ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTATGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA"
), ),
prot=c( prot=c(
"MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS*", "MMQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS*",
"MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIINRHEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAFGERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*", "-MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIINRHEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAFGERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*",
"MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYGERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*" "MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYGERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*"
), ),
stringsAsFactors=FALSE stringsAsFactors=FALSE
...@@ -49,11 +50,16 @@ for(sid in seqs$sid){ ...@@ -49,11 +50,16 @@ for(sid in seqs$sid){
sid_prot <- unlist(strsplit(seqs[sid,"prot"], split="")) sid_prot <- unlist(strsplit(seqs[sid,"prot"], split=""))
sid_start <- seqs[sid,"start"] sid_start <- seqs[sid,"start"]
sid_end <- seqs[sid,"end"] sid_end <- seqs[sid,"end"]
# print(sprintf("%s: %d, %d, %d", sid, sid_start, sid_end, length(sid_nucl))) print(sprintf("%s: %d, %d, %d", sid, sid_start, sid_end, length(sid_nucl)))
metat_cov[,sid] <- NA metat_cov[,sid] <- NA
# metat_cov[sid_start:sid_end,sid] <- sid_nucl # nucl # metat_cov[sid_start:sid_end,sid] <- sid_nucl # nucl
metat_cov[seq(sid_start+ifelse(sid=="contig_4930:1.0-5516.0_1NEW",0,1), sid_end, 3),sid] <- sid_prot # prot # AA seq.: start + 1 covering each 3rd base to align AA with the 2nd base of its codon
# because two assembly proteins have different frames, to align both prot. seq.s in the plot
# one of the seq.s needs to be "shifted" (i.e. start + 0 instead of start + 1)
# --> for the sake of readability only
metat_cov[seq(sid_start+ifelse(sid=="contig_5026:14.0-10236.0_9",0,1), sid_end, 3),sid] <- sid_prot # prot
} }
print(head(metat_cov[min(seqs$start):nrow(metat_cov),]))
# data to plot seq.s # data to plot seq.s
metat_cov_seqs <- reshape2::melt( metat_cov_seqs <- reshape2::melt(
...@@ -62,6 +68,7 @@ metat_cov_seqs <- reshape2::melt( ...@@ -62,6 +68,7 @@ metat_cov_seqs <- reshape2::melt(
na.rm=TRUE na.rm=TRUE
) )
metat_cov_seqs$y <- as.numeric(metat_cov_seqs$variable) * -1 metat_cov_seqs$y <- as.numeric(metat_cov_seqs$variable) * -1
print(head(metat_cov_seqs))
############################## PLOT ############################## PLOT
# custom theme # custom theme
...@@ -74,18 +81,19 @@ custom_theme <- ...@@ -74,18 +81,19 @@ custom_theme <-
) )
# AA colors # AA colors
aa_colors <- c(pal_d3(palette="category20")(20), "black") aa_colors <- c(pal_d3(palette="category20")(20), "black", "black")
names(aa_colors) <- c(setdiff(sort(unique(metat_cov_seqs$value)), "*"),"*") names(aa_colors) <- c(setdiff(sort(unique(metat_cov_seqs$value)), c("*", "-")),"*", "-")
# Nucl. colors # Nucl. colors
# nc_colors <- c(pal_d3(palette="category20")(4), "black") # nc_colors <- c(pal_d3(palette="category20")(4), "black")
# names(nc_colors) <- c("A", "T", "C", "G", "-") # names(nc_colors) <- c("A", "T", "C", "G", "-")
# Plot # Plot
xlims <- c(min(seqs$start)-10, max(seqs$end)+10)
p_cov <- p_cov <-
ggplot(data=metat_cov %>% filter(pos >= 160, pos <= 830), aes(x=pos, y=cov)) + ggplot(data=metat_cov %>% filter(pos >= min(xlims), pos <= max(xlims)), aes(x=pos, y=cov)) +
# overlap area # overlap area: start s1 + overlap start s1 (+ overlap length)
geom_polygon(data=data.frame(x=c(182+238+29,182+238,182+238,182+238+29), y=c(-6,-6,30,30)), aes(x=x, y=y), fill="gray", alpha=0.3) + geom_polygon(data=data.frame(x=c(5547+240+29, 5547+240 ,5547+240 ,5547+240+29), y=c(-6,-6,30,30)), aes(x=x, y=y), fill="gray", alpha=0.3) +
# cov. baseline # cov. baseline
geom_hline(yintercept=0, linetype="dotted") + geom_hline(yintercept=0, linetype="dotted") +
# cov. # cov.
...@@ -93,20 +101,20 @@ p_cov <- ...@@ -93,20 +101,20 @@ p_cov <-
# seq.s # seq.s
geom_label( geom_label(
data=metat_cov_seqs, data=metat_cov_seqs,
aes(y=y, label=value, fill=value), aes(x=pos, y=y, label=value, fill=value),
color="white", size=2, fontface="bold", color="white", size=2, fontface="bold",
label.padding=unit(0.1, "lines"), label.size=0, label.r=unit(0, "lines") label.padding=unit(0.05, "lines"), label.size=0, label.r=unit(0, "lines")
) + ) +
# seq. annotations # seq. annotations
annotate( annotate( # arrow
geom="segment", geom="segment",
x=c(182, 182, 404), y=c(1.5, -4.5, -4.5), x=c(5547, 5547+3, 5780), y=c(1.5, -4.5, -4.5),
xend=c(182, 182, 417), yend=c(-0.5, -2.5, -3), xend=c(5547, 5547+3, 5786), yend=c(-0.5, -2.5, -3),
arrow=arrow(length=unit(2, "mm")) arrow=arrow(length=unit(2, "mm"))
) + ) +
annotate( annotate( # text
geom="text", geom="text",
x=c(182, 182, 402), y=c(2, -5, -5), x=c(5547, 5547, 5780), y=c(2, -5, -5),
label=levels(metat_cov_seqs$variable), label=levels(metat_cov_seqs$variable),
hjust=c("left", "left", "right"), hjust=c("left", "left", "right"),
size=4 size=4
...@@ -114,11 +122,11 @@ p_cov <- ...@@ -114,11 +122,11 @@ p_cov <-
# colors # colors
scale_fill_manual(values=aa_colors) + scale_fill_manual(values=aa_colors) +
# axis range # axis range
scale_x_continuous(limits=c(160, 830), expand=c(0, 0)) + scale_x_continuous(limits=xlims, expand=c(0, 0)) +
scale_y_continuous(limits=c(-6,30), expand=c(0, 0)) + scale_y_continuous(limits=c(-6,30), expand=c(0, 0)) +
# titles # titles
labs( labs(
x="Contig position (LR, Flye, contig_4930:1.0-5516.0)", x="Contig position (LR, Flye)",
y="metaT coverage" y="metaT coverage"
) + ) +
# theme # theme
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment