Commit f0ae27df authored by Valentina Galata's avatar Valentina Galata
Browse files

notes/figures: updated rgi/card example for gdb/flye (issue #124)

parent 807358cf
# About
Notes for sample `GDB` and its RGI (CARD) results.
Based on file `report/amr.tsv` created by `workflow_report`.
Based on results produced by `workflow` and the file `report/amr.tsv` created by `workflow_report`.
# Hit contig_4930:1.0-5516.0_1 (LR, Flye), ARO:3004454
# Hits in contig_5026:14.0-10236.0 (LR, Flye) to ARO:3004454
Hit of protein `contig_4930:1.0-5516.0_1` from `GDB`'s LR-assembly constructed with `Flye` to `ARO:3004454`.
Two hits in contig `contig_5026:14.0-10236.0` from `GDB`'s LR-assembly constructed with `Flye` to `ARO:3004454`:
- `contig_5026:14.0-10236.0_8`
- `contig_5026:14.0-10236.0_9`
## ARO:3004454
......@@ -27,163 +29,167 @@ TTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTA
TCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA
```
## Matched protein
## Matched proteins
Protein `contig_4930:1.0-5516.0_1` the first protein-coding gene on contig `contig_4930:1.0-5516.0` (LR, `Flye`).
The first two genes from the same contig:
```
>contig_4930:1.0-5516.0_1 # 182 # 448 # 1 # ID=4060_1;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.356
>contig_4930:1.0-5516.0_2 # 1073 # 2317 # 1 # ID=4060_2;partial=00;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.553
Protein information from `Prodigal`'s FAA file:
```bash
grep "^>contig_5026:14.0-10236.0_" /scratch/users/vgalata/gdb/results/annotation/prodigal/lr/flye/proteins.faa
# >contig_5026:14.0-10236.0_8 # 5547 # 5816 # 1 # ID=4198_8;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=3-4bp;gc_cont=0.356
# >contig_5026:14.0-10236.0_9 # 5788 # 6174 # 1 # ID=4198_9;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.377
```
For `contig_4930:1.0-5516.0_1`, protein (from `Prodigal`) and gene (extracted from contig) sequences:
```
>contig_4930:1.0-5516.0_1 # 182 # 448 # 1 # ID=4060_1;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.356
MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS*
- consecutive CDSs
- both CDS overlap
>contig_4930:1.0-5516.0_1 # 182 # 448
ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAA
Nucleotide and protein sequences of both CDS/proteins:
```
>contig_5026:14.0-10236.0_8_nucl
ATGATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCAC
TATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTG
AAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAAT
CGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGA
AATGCTGCCTTGCTACACAATCTTTCATAA
Protein `contig_4930:1.0-5516.0_1` clusters only with itself (based on `mmseqs2` output, created by `Workflow`):
```bash
grep "contig_4930:1\.0-5516\.0_1" clusters.tsv
# flye__contig_4930:1.0-5516.0_1 flye__contig_4930:1.0-5516.0_1
```
>contig_5026:14.0-10236.0_9_nucl
ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACT
GAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTAT
GGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCT
ATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTAT
CTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCC
TTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGAT
GAATTACAAGACTTGCTGAATAAATAA
## Looking for alternative protein coding sequences
>contig_5026:14.0-10236.0_8_prot
MMQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILN
RHEEFRTALDKNGQVGVFFRNAALLHNLS*
First 1000 bases of the contig `contig_4930:1.0-5516.0`:
```
>contig_4930:1.0-5516.0 1-1000
CATTGGTTGCCGCCCAGCAGACGGCACAGGCTGCCCTAGGCACTTCACGCGGGAACAATACAGCAGCAGCTTACCTCCGGTACTGATCAGAGCGAGGGAAGTGGAGACGAGCCGAAAAAACAGGGATGGTTTAGTCGGTTGTTTCGTGGTTAGATAAATCAGGATTTGCGGAGGATAAATGATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTATGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAGGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAAAATCCAAGTTTGTCGTGGAAACGGGGAGCCGCCCGCAGGCGGCAGGGGCAGCGCCCCTCCGGAGCAGTCCGGCTAATGGCGCATAAGCGTCGAGGCAGGCTGCTCCGGAAACCCTCCCCTTTAGGGGCGCAAAGCACTTTCCTACTGCTGCACCAACGGTGCTGCGGTGGGAAAGTGCTTTTGCGTTACTTTCT
>contig_5026:14.0-10236.0_9_prot
MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYGERKGMFAKPNPPENTFPVS
MIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPLSIQVHHAVCDGFHVCRFLD
ELQDLLNK*
```
extracted using
Nucleotide sequences were extracted from the relevant part of the contig:
```bash
grep -A1 ">contig_4930:1.0-5516.0" assembly/lr/flye/ASSEMBLY.POLISHED.fasta | sed -r 's/(.{100})/\1\n/g'
# grep for the contig ID + next line (seq) | get seq |
# cut into 100bp chunks | extract relevant rows
grep -A 1 "contig_5026:14.0-10236.0$" /scratch/users/vgalata/gdb/results/assembly/lr/flye/ASSEMBLY.POLISHED.fasta | tail -n 1 | sed -r 's/(.{100})/\1\n/g' | awk 'NR > 55 && NR < 63 {print $0}'
```
and checked by translating them using the [ExPASy tool](https://web.expasy.org/translate/).
Using a translation tool [ExPASy](https://web.expasy.org/translate/), two longest translations with a start and stop codon:
These proteins clustered only with one other protein, also from `Flye` assembly:
```bash
grep -A 1 "contig_5026:14.0-10236.0_8" /scratch/users/vgalata/gdb/results/analysis/mmseqs2/clusters.tsv
# flye__contig_5026:14.0-10236.0_8 flye__contig_5026:14.0-10236.0_8
# flye__contig_5027:1.0-43229.0_23 flye__contig_5027:1.0-43229.0_23
grep -A 1 "contig_5026:14.0-10236.0_9" /scratch/users/vgalata/gdb/results/analysis/mmseqs2/clusters.tsv
# flye__contig_5026:14.0-10236.0_9 flye__contig_5026:14.0-10236.0_9
# flye__contig_5027:1.0-43229.0_24 flye__contig_5027:1.0-43229.0_24
```
> contig_4930:1.0-5516.0_1 (VIRT-56253:5'3' Frame 2, start_pos=60)
MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS*
- both clustered proteins are neighbors from the same contig
- both clustered proteins have no `RGI` hit
> contig_4930:1.0-5516.0_1NEW (VIRT-7066:5'3' Frame 3, start_pos=139)
MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYGERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*
MetaT coverage for the contig region containing both genes:
```bash
grep -P "^contig_5026:14\.0-10236\.0\t" /scratch/users/vgalata/gdb/results/mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase | awk -F'\t' '$2 >= 5547 && $2 <= 6174 {print $0}' | less
```
The first protein is the same as the protein predicted by `Prodigal`, i.e. it is `contig_4930:1.0-5516.0_1`.
The second protein was **not** predicted by `Prodigal` - we will label it as `contig_4930:1.0-5516.0_1NEW`.
Complete output of `ExPASy` for these two translations is in `lr_flye_contig_4930:1.0-5516.0_1-1000_translated.txt`.
Nucleotide (gene) sequence of the "new" protein `contig_4930:1.0-5516.0_1NEW`:
```
>contig_4930:1.0-5516.0_1NEW # 420 # 806
atgctgccttgctacacaatctttcataaggaaactgaaaccttttcgagtatttggactgagtttacagcagactatactgagtttcttcagaactatcaaaaggatatagacgcttatggtgaacgaaagggaatgttcgcaaagcctaatcctccggaaaacactttccctgtttctatgataccgtggacaagctttgaaggctttaacttaaatctaaaaaagggatatgactatctactgccgatatttacgtttgggaagtattatgaggatggcggaaaatactatattcccttatcgattcaagtgcatcatgccgtttgtgatggctttcatgtttgccgttttttggatgaattacaagacttgctgaataaataa
```
- both genes are covered by metaT
## MSA w/ ARO:3004454
Multiple-sequence alignment using [EMBOSS Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/).
Multiple-sequence alignment using [EMBOSS Clustal Omega](https://www.ebi.ac.uk/Tools/msa/clustalo/) (`CLUSTAL O(1.2.4)`).
```
VIRT-56253:5'3' MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNR 60
gb|AAA23018.1|+|Campylobacter MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIINR 60
VIRT-7066:5'3' ------------------------------------------------------------ 0
Protein alignment:
contig_5026:14.0-10236.0_8_prot MMQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILN 60
gb|AAA23018.1|+|Campylobacter -MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIIN 59
contig_5026:14.0-10236.0_9_prot ------------------------------------------------------------ 0
VIRT-56253:5'3' HEEFRTALDKNGQVGVFFRNAALLHNLS*------------------------------- 88
gb|AAA23018.1|+|Campylobacter HEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAFG 120
VIRT-7066:5'3' -------------------MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYG 41
contig_5026:14.0-10236.0_8_prot RHEEFRTALDKNGQVGVFFRNAALLHNLS*------------------------------ 89
gb|AAA23018.1|+|Campylobacter RHEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAF 119
contig_5026:14.0-10236.0_9_prot --------------------MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAY 40
:
contig_5026:14.0-10236.0_8_prot ------------------------------------------------------------ 89
gb|AAA23018.1|+|Campylobacter GERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIP 179
contig_5026:14.0-10236.0_9_prot GERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIP 100
VIRT-56253:5'3' ------------------------------------------------------------ 88
gb|AAA23018.1|+|Campylobacter ERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIPL 180
VIRT-7066:5'3' ERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPL 101
contig_5026:14.0-10236.0_8_prot ----------------------------- 89
gb|AAA23018.1|+|Campylobacter LSIQVHHAVCDGFHVCRFLDELQDLLNK- 207
contig_5026:14.0-10236.0_9_prot LSIQVHHAVCDGFHVCRFLDELQDLLNK* 128
VIRT-56253:5'3' ---------------------------- 88
gb|AAA23018.1|+|Campylobacter SIQVHHAVCDGFHVCRFLDELQDLLNK- 207
VIRT-7066:5'3' SIQVHHAVCDGFHVCRFLDELQDLLNK* 128
```
DNA alignment:
```
contig_4930:1.0-5516.0_1 ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTAT 60
gb|M35190.1|+|309-932|Campylobacter ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTAT 60
contig_4930:1.0-5516.0_1NEW ------------------------------------------------------------ 0
contig_5026:14.0-10236.0_8_nucl ATGATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCAC 60
gb|M35190.1|+|309-932|Campylobacter ---ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCAC 57
contig_5026:14.0-10236.0_9_nucl ------------------------------------------------------------ 0
contig_4930:1.0-5516.0_1 TTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAA 120
gb|M35190.1|+|309-932|Campylobacter TTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAA 120
contig_4930:1.0-5516.0_1NEW ------------------------------------------------------------ 0
contig_5026:14.0-10236.0_8_nucl TATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTG 120
gb|M35190.1|+|309-932|Campylobacter TATTTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTG 117
contig_5026:14.0-10236.0_9_nucl ------------------------------------------------------------ 0
contig_4930:1.0-5516.0_1 AAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGA 180
gb|M35190.1|+|309-932|Campylobacter AAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAATCGA 180
contig_4930:1.0-5516.0_1NEW ------------------------------------------------------------ 0
contig_5026:14.0-10236.0_8_nucl AAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAAT 180
gb|M35190.1|+|309-932|Campylobacter AAAAAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAAT 177
contig_5026:14.0-10236.0_9_nucl ------------------------------------------------------------ 0
contig_4930:1.0-5516.0_1 CACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAAT 240
gb|M35190.1|+|309-932|Campylobacter CATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGAAAT 239
contig_4930:1.0-5516.0_1NEW ----------------------------------------------------------at 2
**
contig_5026:14.0-10236.0_8_nucl CGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGA 240
gb|M35190.1|+|309-932|Campylobacter CGACATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGA 236
contig_5026:14.0-10236.0_9_nucl ------------------------------------------------------------ 0
contig_4930:1.0-5516.0_1 GCTGCCTTGCTACACAATCTTTCATAA--------------------------------- 267
gb|M35190.1|+|309-932|Campylobacter GCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGA 299
contig_4930:1.0-5516.0_1NEW gctgccttgctacacaatctttcataaggaaactgaaaccttttcgagtatttggactga 62
**************** * ********
contig_5026:14.0-10236.0_8_nucl AATGCTGCCTTGCTACACAATCTTTCATAA------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter AATGCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGAC 296
contig_5026:14.0-10236.0_9_nucl -ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGAC 59
****************** * ********
contig_4930:1.0-5516.0_1 ------------------------------------------------------------ 267
gb|M35190.1|+|309-932|Campylobacter GTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTTTGG 359
contig_4930:1.0-5516.0_1NEW gtttacagcagactatactgagtttcttcagaactatcaaaaggatatagacgcttatgg 122
contig_5026:14.0-10236.0_8_nucl ------------------------------------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter TGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTT 356
contig_5026:14.0-10236.0_9_nucl TGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTA 119
contig_4930:1.0-5516.0_1 ------------------------------------------------------------ 267
gb|M35190.1|+|309-932|Campylobacter TGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTAT 419
contig_4930:1.0-5516.0_1NEW tgaacgaaagggaatgttcgcaaagcctaatcctccggaaaacactttccctgtttctat 182
contig_5026:14.0-10236.0_8_nucl ------------------------------------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter TGGTGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTC 416
contig_5026:14.0-10236.0_9_nucl TGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTC 179
contig_4930:1.0-5516.0_1 ------------------------------------------------------------ 267
gb|M35190.1|+|309-932|Campylobacter GATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTATCT 479
contig_4930:1.0-5516.0_1NEW gataccgtggacaagctttgaaggctttaacttaaatctaaaaaagggatatgactatct 242
contig_5026:14.0-10236.0_8_nucl ------------------------------------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter TATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTA 476
contig_5026:14.0-10236.0_9_nucl TATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTA 239
contig_4930:1.0-5516.0_1 ------------------------------------------------------------ 267
gb|M35190.1|+|309-932|Campylobacter ACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCCCTT 539
contig_4930:1.0-5516.0_1NEW actgccgatatttacgtttgggaagtattatgaggatggcggaaaatactatattccctt 302
contig_5026:14.0-10236.0_8_nucl ------------------------------------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter TCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCC 536
contig_5026:14.0-10236.0_9_nucl TCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCC 299
contig_4930:1.0-5516.0_1 ------------------------------------------------------------ 267
gb|M35190.1|+|309-932|Campylobacter ATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGA 599
contig_4930:1.0-5516.0_1NEW atcgattcaagtgcatcatgccgtttgtgatggctttcatgtttgccgttttttggatga 362
contig_5026:14.0-10236.0_8_nucl ------------------------------------------------------------ 270
gb|M35190.1|+|309-932|Campylobacter CTTATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGA 596
contig_5026:14.0-10236.0_9_nucl CTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGA 359
contig_4930:1.0-5516.0_1 ------------------------- 267
gb|M35190.1|+|309-932|Campylobacter ATTACAAGACTTGCTGAATAAATAA 624
contig_4930:1.0-5516.0_1NEW attacaagacttgctgaataaataa 387
contig_5026:14.0-10236.0_8_nucl ---------------------------- 270
gb|M35190.1|+|309-932|Campylobacter TGAATTACAAGACTTGCTGAATAAATAA 624
contig_5026:14.0-10236.0_9_nucl TGAATTACAAGACTTGCTGAATAAATAA 387
```
Complete alignment output is in `lr_flye_contig_4930:1.0-5516.0_1-1000_translated_alignment.txt`.
A plot for protein sequences created using `BioJS-MSA` was saved to `lr_flye_contig_4930:1.0-5516.0_1-1000_translated_alignment_biojs-msa.png`.
The alignments are almost perfect, with only a few mismatches, and both proteins cover the complete ARO sequence.
- also saved in `lr_flye__ARO3004454_contig_5026:14.0-10236.0__clustalo_prot.txt`
- alignments: almost perfect, only few mismatches/gaps
- both CDS/protein sequences cover the complete ARO sequence
`contig_4930:1.0-5516.0_1` and `contig_4930:1.0-5516.0_1NEW` have an overlap of 29 bps: `atgctgccttgctacacaatctttcataa`:
`contig_5026:14.0-10236.0_8` and `contig_5026:14.0-10236.0_9` have an overlap of 29 bps: `ATGCTGCCTTGCTACACAATCTTTCATAA`:
```
atgctgccttgctacacaatctttcataa
.A..A..L..L..H..N..L..S..*.
atgctgccttgctacacaatctttcataa
.M..L..P..C..Y..T..I..F..H.
```
## metaT coverage
Using file `mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase` created by `workflow`.
MetaT coverage of first 1000 bases of contig `contig_4930:1.0-5516.0` (i.e. including `contig_4930:1.0-5516.0_1` and `contig_4930:1.0-5516.0_1NEW`):
```bash
# grep for contig id and get first 1000 bases
grep -F "contig_4930:1.0-5516.0" mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase | head -n 1000 | less
AATGCTGCCTTGCTACACAATCTTTCATAA.
.N..A..A..L..L..H..N..L..S..*.
.ATGCTGCCTTGCTACACAATCTTTCATAAG
..M..L..P..C..Y..T..I..F..H..K.
```
\ No newline at end of file
......@@ -187,10 +187,10 @@ rule fig_gdb_rgi_aro3004454_flye_metaT:
input:
os.path.join(config["samples"]["gdb"], "mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase")
output:
temp("figures/fig_gdb_lr_flye_contig_4930:1.0-5516.0_1-1000_metatcov.tsv")
temp("figures/fig_gdb_lr_flye_contig_5026:14.0-10236.0_metatcov.tsv")
shell:
# metaT cov. of contig contig_4930:1.0-5516.0 for bases 1-1000
"grep -F \"contig_4930:1.0-5516.0\" {input} | awk -F'\\t' '$1 == \"contig_4930:1.0-5516.0\" && $2 >= 1 && $2 <= 1000 {{print $0}}' > {output}"
"grep -F \"contig_5026:14.0-10236.0\" {input} | awk -F'\\t' '$1 == \"contig_5026:14.0-10236.0\" && $2 >= 1 && $2 <= 6200 {{print $0}}' > {output}"
rule fig_gdb_rgi_aro3004454_flye:
input:
......
......@@ -20,22 +20,23 @@ suppressMessages(library(viridis)) # color palettes
############################## DATA
# metaT coverage: bases 1-1000 of contig contig_4930:1.0 from Flye assembly from GDB sample
# metaT coverage from base 1 to XXX of contig contig_5026:14.0-10236.0 from Flye assembly from GDB sample
metat_cov <- read.csv(snakemake@input$metat_cov, header=FALSE, sep="\t", col.names=c("contigID", "pos", "cov"), stringsAsFactors=FALSE)
print(sprintf("metaT cov: from %d to %d", min(metat_cov$pos), max(metat_cov$pos)))
# gene/protein sequences to be plotted
seqs <- data.frame(
sid=c("contig_4930:1.0-5516.0_1", "ARO:30004454", "contig_4930:1.0-5516.0_1NEW"),
start=c(182,182,420),
end=c(448,806,806),
sid=c("contig_5026:14.0-10236.0_8", "ARO:30004454", "contig_5026:14.0-10236.0_9"),
start=c(5547,5547,5788),
end=c(5816,6174,6174),
nucl=c(
"ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAA",
"ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAATCGACATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGAAATGCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTTTGGTGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA",
"ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTATGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAGGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA"
"ATGATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAA",
"---ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAATCGACATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGAAATGCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTTTGGTGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA",
"ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTATGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA"
),
prot=c(
"MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS*",
"MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIINRHEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAFGERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*",
"MMQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS*",
"-MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIINRHEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAFGERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*",
"MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYGERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*"
),
stringsAsFactors=FALSE
......@@ -49,11 +50,16 @@ for(sid in seqs$sid){
sid_prot <- unlist(strsplit(seqs[sid,"prot"], split=""))
sid_start <- seqs[sid,"start"]
sid_end <- seqs[sid,"end"]
# print(sprintf("%s: %d, %d, %d", sid, sid_start, sid_end, length(sid_nucl)))
print(sprintf("%s: %d, %d, %d", sid, sid_start, sid_end, length(sid_nucl)))
metat_cov[,sid] <- NA
# metat_cov[sid_start:sid_end,sid] <- sid_nucl # nucl
metat_cov[seq(sid_start+ifelse(sid=="contig_4930:1.0-5516.0_1NEW",0,1), sid_end, 3),sid] <- sid_prot # prot
# AA seq.: start + 1 covering each 3rd base to align AA with the 2nd base of its codon
# because two assembly proteins have different frames, to align both prot. seq.s in the plot
# one of the seq.s needs to be "shifted" (i.e. start + 0 instead of start + 1)
# --> for the sake of readability only
metat_cov[seq(sid_start+ifelse(sid=="contig_5026:14.0-10236.0_9",0,1), sid_end, 3),sid] <- sid_prot # prot
}
print(head(metat_cov[min(seqs$start):nrow(metat_cov),]))
# data to plot seq.s
metat_cov_seqs <- reshape2::melt(
......@@ -62,6 +68,7 @@ metat_cov_seqs <- reshape2::melt(
na.rm=TRUE
)
metat_cov_seqs$y <- as.numeric(metat_cov_seqs$variable) * -1
print(head(metat_cov_seqs))
############################## PLOT
# custom theme
......@@ -74,18 +81,19 @@ custom_theme <-
)
# AA colors
aa_colors <- c(pal_d3(palette="category20")(20), "black")
names(aa_colors) <- c(setdiff(sort(unique(metat_cov_seqs$value)), "*"),"*")
aa_colors <- c(pal_d3(palette="category20")(20), "black", "black")
names(aa_colors) <- c(setdiff(sort(unique(metat_cov_seqs$value)), c("*", "-")),"*", "-")
# Nucl. colors
# nc_colors <- c(pal_d3(palette="category20")(4), "black")
# names(nc_colors) <- c("A", "T", "C", "G", "-")
# Plot
xlims <- c(min(seqs$start)-10, max(seqs$end)+10)
p_cov <-
ggplot(data=metat_cov %>% filter(pos >= 160, pos <= 830), aes(x=pos, y=cov)) +
# overlap area
geom_polygon(data=data.frame(x=c(182+238+29,182+238,182+238,182+238+29), y=c(-6,-6,30,30)), aes(x=x, y=y), fill="gray", alpha=0.3) +
ggplot(data=metat_cov %>% filter(pos >= min(xlims), pos <= max(xlims)), aes(x=pos, y=cov)) +
# overlap area: start s1 + overlap start s1 (+ overlap length)
geom_polygon(data=data.frame(x=c(5547+240+29, 5547+240 ,5547+240 ,5547+240+29), y=c(-6,-6,30,30)), aes(x=x, y=y), fill="gray", alpha=0.3) +
# cov. baseline
geom_hline(yintercept=0, linetype="dotted") +
# cov.
......@@ -93,20 +101,20 @@ p_cov <-
# seq.s
geom_label(
data=metat_cov_seqs,
aes(y=y, label=value, fill=value),
aes(x=pos, y=y, label=value, fill=value),
color="white", size=2, fontface="bold",
label.padding=unit(0.1, "lines"), label.size=0, label.r=unit(0, "lines")
label.padding=unit(0.05, "lines"), label.size=0, label.r=unit(0, "lines")
) +
# seq. annotations
annotate(
annotate( # arrow
geom="segment",
x=c(182, 182, 404), y=c(1.5, -4.5, -4.5),
xend=c(182, 182, 417), yend=c(-0.5, -2.5, -3),
x=c(5547, 5547+3, 5780), y=c(1.5, -4.5, -4.5),
xend=c(5547, 5547+3, 5786), yend=c(-0.5, -2.5, -3),
arrow=arrow(length=unit(2, "mm"))
) +
annotate(
annotate( # text
geom="text",
x=c(182, 182, 402), y=c(2, -5, -5),
x=c(5547, 5547, 5780), y=c(2, -5, -5),
label=levels(metat_cov_seqs$variable),
hjust=c("left", "left", "right"),
size=4
......@@ -114,11 +122,11 @@ p_cov <-
# colors
scale_fill_manual(values=aa_colors) +
# axis range
scale_x_continuous(limits=c(160, 830), expand=c(0, 0)) +
scale_x_continuous(limits=xlims, expand=c(0, 0)) +
scale_y_continuous(limits=c(-6,30), expand=c(0, 0)) +
# titles
labs(
x="Contig position (LR, Flye, contig_4930:1.0-5516.0)",
x="Contig position (LR, Flye)",
y="metaT coverage"
) +
# theme
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment