Skip to content
GitLab
Projects
Groups
Snippets
/
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Sign in
Toggle navigation
Menu
Open sidebar
ESB
ONT_pilot_gitlab
Commits
f0ae27df
Commit
f0ae27df
authored
Mar 10, 2021
by
Valentina Galata
Browse files
notes/figures: updated rgi/card example for gdb/flye (issue
#124
)
parent
807358cf
Changes
3
Hide whitespace changes
Inline
Side-by-side
notes/gdb_rgi_aro3004454_flye.md
View file @
f0ae27df
# About
Notes for sample
`GDB`
and its RGI (CARD) results.
Based on file
`report/amr.tsv`
created by
`workflow_report`
.
Based on
results produced by
`workflow`
and the
file
`report/amr.tsv`
created by
`workflow_report`
.
# Hit contig_
4930
:1.0-
551
6.0
_1
(LR, Flye)
,
ARO:3004454
# Hit
s in
contig_
5026
:1
4
.0-
1023
6.0 (LR, Flye)
to
ARO:3004454
Hit of protein
`contig_4930:1.0-5516.0_1`
from
`GDB`
's LR-assembly constructed with
`Flye`
to
`ARO:3004454`
.
Two hits in contig
`contig_5026:14.0-10236.0`
from
`GDB`
's LR-assembly constructed with
`Flye`
to
`ARO:3004454`
:
-
`contig_5026:14.0-10236.0_8`
-
`contig_5026:14.0-10236.0_9`
## ARO:3004454
...
...
@@ -27,163 +29,167 @@ TTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTA
TCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA
```
## Matched protein
## Matched protein
s
Protein
`contig_4930:1.0-5516.0_1`
the first protein-coding gene on contig
`contig_4930:1.0-5516.0`
(LR,
`Flye`
).
The first two genes from the same contig:
```
>contig_4930:1.0-5516.0_1 # 182 # 448 # 1 # ID=4060_1;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.356
>contig_4930:1.0-5516.0_2 # 1073 # 2317 # 1 # ID=4060_2;partial=00;start_type=ATG;rbs_motif=AGGAGG;rbs_spacer=5-10bp;gc_cont=0.553
Protein information from
`Prodigal`
's FAA file:
```
bash
grep
"^>contig_5026:14.0-10236.0_"
/scratch/users/vgalata/gdb/results/annotation/prodigal/lr/flye/proteins.faa
# >contig_5026:14.0-10236.0_8 # 5547 # 5816 # 1 # ID=4198_8;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=3-4bp;gc_cont=0.356
# >contig_5026:14.0-10236.0_9 # 5788 # 6174 # 1 # ID=4198_9;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.377
```
For
`contig_4930:1.0-5516.0_1`
, protein (from
`Prodigal`
) and gene (extracted from contig) sequences:
```
>contig_4930:1.0-5516.0_1 # 182 # 448 # 1 # ID=4060_1;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.356
MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS*
-
consecutive CDSs
-
both CDS overlap
>contig_4930:1.0-5516.0_1 # 182 # 448
ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAA
Nucleotide and protein sequences of both CDS/proteins:
```
>contig_5026:14.0-10236.0_8_nucl
ATGATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCAC
TATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTG
AAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAAT
CGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGA
AATGCTGCCTTGCTACACAATCTTTCATAA
Protein
`contig_4930:1.0-5516.0_1`
clusters only with itself (based on
`mmseqs2`
output, created by
`Workflow`
):
```
bash
grep
"contig_4930:1
\.
0-5516
\.
0_1"
clusters.tsv
# flye__contig_4930:1.0-5516.0_1 flye__contig_4930:1.0-5516.0_1
```
>contig_5026:14.0-10236.0_9_nucl
ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACT
GAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTAT
GGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCT
ATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTAT
CTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCC
TTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGAT
GAATTACAAGACTTGCTGAATAAATAA
## Looking for alternative protein coding sequences
>contig_5026:14.0-10236.0_8_prot
MMQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILN
RHEEFRTALDKNGQVGVFFRNAALLHNLS*
First 1000 bases of the contig
`
contig_
4930
:1.0-
5516.0`
:
```
>contig_4930:1.0-5516.0 1-1000
CATTGGTTGCCGCCCAGCAGACGGCACAGGCTGCCCTAGGCACTTCACGCGGGAACAATACAGCAGCAGCTTACCTCCGGTACTGATCAGAGCGAGGGAAGTGGAGACGAGCCGAAAAAACAGGGATGGTTTAGTCGGTTGTTTCGTGGTTAGATAAATCAGGATTTGCGGAGGATAAATGATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTATGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAGGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAAAATCCAAGTTTGTCGTGGAAACGGGGAGCCGCCCGCAGGCGGCAGGGGCAGCGCCCCTCCGGAGCAGTCCGGCTAATGGCGCATAAGCGTCGAGGCAGGCTGCTCCGGAAACCCTCCCCTTTAGGGGCGCAAAGCACTTTCCTACTGCTGCACCAACGGTGCTGCGGTGGGAAAGTGCTTTTGCGTTACTTTCT
>
contig_
5026
:1
4
.0-
10236.0_9_prot
MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYGERKGMFAKPNPPENTFPVS
MIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPLSIQVHHAVCDGFHVCRFLD
ELQDLLNK*
```
extracted using
Nucleotide sequences were extracted from the relevant part of the contig:
```
bash
grep
-A1
">contig_4930:1.0-5516.0"
assembly/lr/flye/ASSEMBLY.POLISHED.fasta |
sed
-r
's/(.{100})/\1\n/g'
# grep for the contig ID + next line (seq) | get seq |
# cut into 100bp chunks | extract relevant rows
grep
-A
1
"contig_5026:14.0-10236.0$"
/scratch/users/vgalata/gdb/results/assembly/lr/flye/ASSEMBLY.POLISHED.fasta |
tail
-n
1 |
sed
-r
's/(.{100})/\1\n/g'
|
awk
'NR > 55 && NR < 63 {print $0}'
```
and checked by translating them using the
[
ExPASy tool
](
https://web.expasy.org/translate/
)
.
Using a translation tool
[
ExPASy
](
https://web.expasy.org/translate/
)
, two longest translations with a start and stop codon:
These proteins clustered only with one other protein, also from
`Flye`
assembly:
```
bash
grep
-A
1
"contig_5026:14.0-10236.0_8"
/scratch/users/vgalata/gdb/results/analysis/mmseqs2/clusters.tsv
# flye__contig_5026:14.0-10236.0_8 flye__contig_5026:14.0-10236.0_8
# flye__contig_5027:1.0-43229.0_23 flye__contig_5027:1.0-43229.0_23
grep
-A
1
"contig_5026:14.0-10236.0_9"
/scratch/users/vgalata/gdb/results/analysis/mmseqs2/clusters.tsv
# flye__contig_5026:14.0-10236.0_9 flye__contig_5026:14.0-10236.0_9
# flye__contig_5027:1.0-43229.0_24 flye__contig_5027:1.0-43229.0_24
```
> contig_4930:1.0-5516.0_1 (VIRT-56253:5'3' Frame 2, start_pos=60)
MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS*
-
both clustered proteins are neighbors from the same contig
-
both clustered proteins have no
`RGI`
hit
> contig_4930:1.0-5516.0_1NEW (VIRT-7066:5'3' Frame 3, start_pos=139)
MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYGERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*
MetaT coverage for the contig region containing both genes:
```
bash
grep
-P
"^contig_5026:14
\.
0-10236
\.
0
\t
"
/scratch/users/vgalata/gdb/results/mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase |
awk
-F
'\t'
'$2 >= 5547 && $2 <= 6174 {print $0}'
| less
```
The first protein is the same as the protein predicted by
`Prodigal`
, i.e. it is
`contig_4930:1.0-5516.0_1`
.
The second protein was
**not**
predicted by
`Prodigal`
- we will label it as
`contig_4930:1.0-5516.0_1NEW`
.
Complete output of
`ExPASy`
for these two translations is in
`lr_flye_contig_4930:1.0-5516.0_1-1000_translated.txt`
.
Nucleotide (gene) sequence of the "new" protein
`contig_4930:1.0-5516.0_1NEW`
:
```
>contig_4930:1.0-5516.0_1NEW # 420 # 806
atgctgccttgctacacaatctttcataaggaaactgaaaccttttcgagtatttggactgagtttacagcagactatactgagtttcttcagaactatcaaaaggatatagacgcttatggtgaacgaaagggaatgttcgcaaagcctaatcctccggaaaacactttccctgtttctatgataccgtggacaagctttgaaggctttaacttaaatctaaaaaagggatatgactatctactgccgatatttacgtttgggaagtattatgaggatggcggaaaatactatattcccttatcgattcaagtgcatcatgccgtttgtgatggctttcatgtttgccgttttttggatgaattacaagacttgctgaataaataa
```
-
both genes are covered by metaT
## MSA w/ ARO:3004454
Multiple-sequence alignment using
[
EMBOSS Clustal Omega
](
https://www.ebi.ac.uk/Tools/msa/clustalo/
)
.
Multiple-sequence alignment using
[
EMBOSS Clustal Omega
](
https://www.ebi.ac.uk/Tools/msa/clustalo/
)
(
`CLUSTAL
O(1.2.4)
`)
.
```
VIRT-56253:5'3' MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNR 60
gb|AAA23018.1|+|Campylobacter MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIINR 60
VIRT-7066:5'3' ------------------------------------------------------------ 0
Protein alignment:
contig_5026:14.0-10236.0_8_prot MMQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILN 60
gb|AAA23018.1|+|Campylobacter -MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIIN 59
contig_5026:14.0-10236.0_9_prot ------------------------------------------------------------ 0
VIRT-56253:5'3' HEEFRTALDKNGQVGVFFRNAALLHNLS*------------------------------- 88
gb|AAA23018.1|+|Campylobacter HEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAFG 120
VIRT-7066:5'3' -------------------MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYG 41
contig_5026:14.0-10236.0_8_prot RHEEFRTALDKNGQVGVFFRNAALLHNLS*------------------------------ 89
gb|AAA23018.1|+|Campylobacter RHEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAF 119
contig_5026:14.0-10236.0_9_prot --------------------MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAY 40
:
contig_5026:14.0-10236.0_8_prot ------------------------------------------------------------ 89
gb|AAA23018.1|+|Campylobacter GERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIP 179
contig_5026:14.0-10236.0_9_prot GERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIP 100
VIRT-56253:5'3' ------------
-----------------------------
-------------------
8
8
gb|AAA23018.1|+|Campylobacter
ERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIPL 180
VIRT-7066:5'3' ERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPL 101
contig_5026:14.0-10236.0_8_prot
----------------------------- 8
9
gb|AAA23018.1|+|Campylobacter
LSIQVHHAVCDGFHVCRFLDELQDLLNK- 207
contig_5026:14.0-10236.0_9_prot LSIQVHHAVCDGFHVCRFLDELQDLLNK* 128
VIRT-56253:5'3' ---------------------------- 88
gb|AAA23018.1|+|Campylobacter SIQVHHAVCDGFHVCRFLDELQDLLNK- 207
VIRT-7066:5'3' SIQVHHAVCDGFHVCRFLDELQDLLNK* 128
```
DNA alignment:
```
contig_4930:1.0-5516.0_1 ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTAT 60
gb|M35190.1|+|309-932|Campylobacter ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTAT 60
contig_4930:1.0-5516.0_1NEW ------------------------------------------------------------ 0
contig_5026:14.0-10236.0_8_nucl ATGATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCAC 60
gb|M35190.1|+|309-932|Campylobacter ---ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCAC 57
contig_5026:14.0-10236.0_9_nucl ------------------------------------------------------------ 0
contig_
4930
:1.0-
5516.0_1
TTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTG
AAA
120
gb|M35190.1|+|309-932|Campylobacter TTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTG
AAA 120
contig_
4930
:1.0-
5516.0_1NEW
------------------------------------------------------------ 0
contig_
5026
:1
4
.0-
10236.0_8_nucl
TAT
TTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTG 120
gb|M35190.1|+|309-932|Campylobacter
TAT
TTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTG
117
contig_
5026
:1
4
.0-
10236.0_9_nucl
------------------------------------------------------------ 0
contig_
4930
:1.0-
5516.0_1
AAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAAT
CGA
180
gb|M35190.1|+|309-932|Campylobacter AAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAAT
CGA 180
contig_
4930
:1.0-
5516.0_1NEW
------------------------------------------------------------ 0
contig_
5026
:1
4
.0-
10236.0_8_nucl
AAA
AAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAAT 180
gb|M35190.1|+|309-932|Campylobacter
AAA
AAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAAT
177
contig_
5026
:1
4
.0-
10236.0_9_nucl
------------------------------------------------------------ 0
contig_
4930
:1.0-
5516.0_1
CACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGA
AAT
240
gb|M35190.1|+|309-932|Campylobacter CATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGA
AAT
23
9
contig_
4930
:1.0-
5516.0_1NEW
----------------------------------------------------------
at 2
**
contig_
5026
:1
4
.0-
10236.0_8_nucl
CGA
CACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGA 240
gb|M35190.1|+|309-932|Campylobacter
CGA
CATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGA 23
6
contig_
5026
:1
4
.0-
10236.0_9_nucl
----------------------------------------------------------
-- 0
contig_
4930
:1.0-
5516.0_1
GCTGCCTTGCTACACAATCTTTCATAA------------------------------
--- 267
gb|M35190.1|+|309-932|Campylobacter GCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGAC
TGA
29
9
contig_
4930
:1.0-
5516.0_1NEW gctgccttgctacacaatctttcataaggaaactgaaaccttttcgagtatttggactga 62
**************** * ********
contig_
5026
:1
4
.0-
10236.0_8_nucl
AAT
GCTGCCTTGCTACACAATCTTTCATAA------------------------------
270
gb|M35190.1|+|309-932|Campylobacter
AAT
GCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGAC 29
6
contig_
5026
:1
4
.0-
10236.0_9_nucl -ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGAC 59
**
**************** * ********
contig_
4930
:1.0-
5516.0_1
------------------------------------------------------------ 2
6
7
gb|M35190.1|+|309-932|Campylobacter GTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTT
TGG
35
9
contig_
4930
:1.0-
5516.0_1NEW gtttacagcagactatactgagtttcttcagaactatcaaaaggatatagacgcttatgg 122
contig_
5026
:1
4
.0-
10236.0_8_nucl
------------------------------------------------------------ 27
0
gb|M35190.1|+|309-932|Campylobacter
TGA
GTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTT 35
6
contig_
5026
:1
4
.0-
10236.0_9_nucl TGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTA 119
contig_
4930
:1.0-
5516.0_1
------------------------------------------------------------ 2
6
7
gb|M35190.1|+|309-932|Campylobacter TGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTC
TAT
41
9
contig_
4930
:1.0-
5516.0_1NEW tgaacgaaagggaatgttcgcaaagcctaatcctccggaaaacactttccctgtttctat 182
contig_
5026
:1
4
.0-
10236.0_8_nucl
------------------------------------------------------------ 27
0
gb|M35190.1|+|309-932|Campylobacter
TGG
TGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTC 41
6
contig_
5026
:1
4
.0-
10236.0_9_nucl TGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTC 179
contig_
4930
:1.0-
5516.0_1
------------------------------------------------------------ 2
6
7
gb|M35190.1|+|309-932|Campylobacter GATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTA
TCT
47
9
contig_
4930
:1.0-
5516.0_1NEW gataccgtggacaagctttgaaggctttaacttaaatctaaaaaagggatatgactatct 242
contig_
5026
:1
4
.0-
10236.0_8_nucl
------------------------------------------------------------ 27
0
gb|M35190.1|+|309-932|Campylobacter
TAT
GATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTA 47
6
contig_
5026
:1
4
.0-
10236.0_9_nucl TATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTA 239
contig_
4930
:1.0-
5516.0_1
------------------------------------------------------------ 2
6
7
gb|M35190.1|+|309-932|Campylobacter ACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCC
CTT
53
9
contig_
4930
:1.0-
5516.0_1NEW actgccgatatttacgtttgggaagtattatgaggatggcggaaaatactatattccctt 302
contig_
5026
:1
4
.0-
10236.0_8_nucl
------------------------------------------------------------ 27
0
gb|M35190.1|+|309-932|Campylobacter
TCT
ACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCC 53
6
contig_
5026
:1
4
.0-
10236.0_9_nucl TCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCC 299
contig_
4930
:1.0-
5516.0_1
------------------------------------------------------------ 2
6
7
gb|M35190.1|+|309-932|Campylobacter ATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGA
TGA
59
9
contig_
4930
:1.0-
5516.0_1NEW atcgattcaagtgcatcatgccgtttgtgatggctttcatgtttgccgttttttggatga 362
contig_
5026
:1
4
.0-
10236.0_8_nucl
------------------------------------------------------------ 27
0
gb|M35190.1|+|309-932|Campylobacter
CTT
ATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGA 59
6
contig_
5026
:1
4
.0-
10236.0_9_nucl CTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGA 359
contig_
4930
:1.0-
5516.0_1
-------------------------
267
gb|M35190.1|+|309-932|Campylobacter ATTACAAGACTTGCTGAATAAATAA 624
contig_
4930
:1.0-
5516.0_1NEW attacaagacttgctgaataaataa
387
contig_
5026
:1
4
.0-
10236.0_8_nucl
-------------------------
--- 270
gb|M35190.1|+|309-932|Campylobacter
TGA
ATTACAAGACTTGCTGAATAAATAA 624
contig_
5026
:1
4
.0-
10236.0_9_nucl TGAATTACAAGACTTGCTGAATAAATAA
387
```
Complete alignment output is in
`lr_flye_contig_4930:1.0-5516.0_1-1000_translated_alignment.txt`
.
A plot for protein sequences created using
`BioJS-MSA`
was saved to
`lr_flye_contig_4930:1.0-5516.0_1-1000_translated_alignment_biojs-msa.png`
.
The alignments are almost perfect, with only a few mismatches, and both proteins cover the complete ARO sequence.
- also saved in `
lr_flye__ARO3004454_contig_5026:14.0-10236.0__clustalo_prot.txt
`
- alignments: almost perfect, only few mismatches/gaps
- both CDS/protein sequences cover the complete ARO sequence
`contig_
4930
:1.0-
551
6.0_
1
`
and
`contig_
4930
:1.0-
5516.0_1NEW
`
have an overlap of 29 bps:
`
atgctgccttgctacacaatctttcataa
`
:
`
contig_
5026
:1
4
.0-
1023
6.0_
8
` and `
contig_
5026
:1
4
.0-
10236.0_9
` have an overlap of 29 bps: `
ATGCTGCCTTGCTACACAATCTTTCATAA
`:
```
atgctgccttgctacacaatctttcataa
.A..A..L..L..H..N..L..S..*.
atgctgccttgctacacaatctttcataa
.M..L..P..C..Y..T..I..F..H.
```
## metaT coverage
Using file
`mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase`
created by
`workflow`
.
MetaT coverage of first 1000 bases of contig
`contig_4930:1.0-5516.0`
(i.e. including
`contig_4930:1.0-5516.0_1`
and
`contig_4930:1.0-5516.0_1NEW`
):
```
bash
# grep for contig id and get first 1000 bases
grep
-F
"contig_4930:1.0-5516.0"
mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase |
head
-n
1000 | less
AATGCTGCCTTGCTACACAATCTTTCATAA.
.N..A..A..L..L..H..N..L..S..*.
.ATGCTGCCTTGCTACACAATCTTTCATAAG
..M..L..P..C..Y..T..I..F..H..K.
``
`
\ No newline at end of file
workflow_figures/Snakefile
View file @
f0ae27df
...
...
@@ -187,10 +187,10 @@ rule fig_gdb_rgi_aro3004454_flye_metaT:
input:
os.path.join(config["samples"]["gdb"], "mapping/metat/lr/flye/ASSEMBLY.POLISHED.sr.cov.perbase")
output:
temp("figures/fig_gdb_lr_flye_contig_
4930
:1.0-
5516.0_1-100
0_metatcov.tsv")
temp("figures/fig_gdb_lr_flye_contig_
5026
:1
4
.0-
10236.
0_metatcov.tsv")
shell:
# metaT cov. of contig contig_4930:1.0-5516.0 for bases 1-1000
"grep -F \"contig_
4930
:1.0-
551
6.0\" {input} | awk -F'\\t' '$1 == \"contig_
4930
:1.0-
551
6.0\" && $2 >= 1 && $2 <=
10
00 {{print $0}}' > {output}"
"grep -F \"contig_
5026
:1
4
.0-
1023
6.0\" {input} | awk -F'\\t' '$1 == \"contig_
5026
:1
4
.0-
1023
6.0\" && $2 >= 1 && $2 <=
62
00 {{print $0}}' > {output}"
rule fig_gdb_rgi_aro3004454_flye:
input:
...
...
workflow_figures/scripts/fig_gdb_rgi_aro3004454_flye.R
View file @
f0ae27df
...
...
@@ -20,22 +20,23 @@ suppressMessages(library(viridis)) # color palettes
############################## DATA
# metaT coverage
:
base
s
1
-1000
of contig contig_
4930:1
.0 from Flye assembly from GDB sample
# metaT coverage
from
base 1
to XXX
of contig contig_
5026:14.0-10236
.0 from Flye assembly from GDB sample
metat_cov
<-
read.csv
(
snakemake
@
input
$
metat_cov
,
header
=
FALSE
,
sep
=
"\t"
,
col.names
=
c
(
"contigID"
,
"pos"
,
"cov"
),
stringsAsFactors
=
FALSE
)
print
(
sprintf
(
"metaT cov: from %d to %d"
,
min
(
metat_cov
$
pos
),
max
(
metat_cov
$
pos
)))
# gene/protein sequences to be plotted
seqs
<-
data.frame
(
sid
=
c
(
"contig_
4930
:1.0-
551
6.0_
1
"
,
"ARO:30004454"
,
"contig_
4930
:1.0-
5516.0_1NEW
"
),
start
=
c
(
182
,
182
,
420
),
end
=
c
(
448
,
806
,
806
),
sid
=
c
(
"contig_
5026
:1
4
.0-
1023
6.0_
8
"
,
"ARO:30004454"
,
"contig_
5026
:1
4
.0-
10236.0_9
"
),
start
=
c
(
5547
,
5547
,
5788
),
end
=
c
(
5816
,
6174
,
6174
),
nucl
=
c
(
"ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAA"
,
"ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAATCGACATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGAAATGCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTTTGGTGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA"
,
"ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTATGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAA
G
GGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA"
"
ATG
ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGACAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAATTATATCCGACTCTCTTATATGGGGTTACAACAATCCTCAATCGACACGAAGAGTTCAGGACGGCATTAGATAAAAACGGACAGGTAGGTGTTTTTTTCAGAAATGCTGCCTTGCTACACAATCTTTCATAA"
,
"
---
ATGCAATTCACAAAGATTGATATAAATAATTGGACACGAAAAGAGTATTTCGACCACTATTTTGGCAATACGCCCTGCACATATAGTATGACGGTAAAACTCGATATTTCTAAGTTGAAAAAGGATGGAAAAAAGTTATACCCAACTCTTTTATATGGAGTTACAACGATCATCAATCGACATGAAGAGTTCAGGACCGCATTAGATGAAAACGGACAGGTAGGCGTTTTT-TCAGAAATGCTGCCTTGCTACACAGTTTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTTTGGTGAACGAATGGGAATGTCCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAAAGGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGAGGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGACGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA"
,
"ATGCTGCCTTGCTACACAATCTTTCATAAGGAAACTGAAACCTTTTCGAGTATTTGGACTGAGTTTACAGCAGACTATACTGAGTTTCTTCAGAACTATCAAAAGGATATAGACGCTTATGGTGAACGAAAGGGAATGTTCGCAAAGCCTAATCCTCCGGAAAACACTTTCCCTGTTTCTATGATACCGTGGACAAGCTTTGAAGGCTTTAACTTAAATCTAAAAAA
A
GGATATGACTATCTACTGCCGATATTTACGTTTGGGAAGTATTATGAGGATGGCGGAAAATACTATATTCCCTTATCGATTCAAGTGCATCATGCCGTTTGTGATGGCTTTCATGTTTGCCGTTTTTTGGATGAATTACAAGACTTGCTGAATAAATAA"
),
prot
=
c
(
"MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS*"
,
"MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIINRHEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAFGERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*"
,
"
M
MQFTKIDINNWTRKEYFDHYFDNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTILNRHEEFRTALDKNGQVGVFFRNAALLHNLS*"
,
"
-
MQFTKIDINNWTRKEYFDHYFGNTPCTYSMTVKLDISKLKKDGKKLYPTLLYGVTTIINRHEEFRTALDENGQVGVFSEMLPCYTVFHKETETFSSIWTEFTADYTEFLQNYQKDIDAFGERMGMSAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEEGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*"
,
"MLPCYTIFHKETETFSSIWTEFTADYTEFLQNYQKDIDAYGERKGMFAKPNPPENTFPVSMIPWTSFEGFNLNLKKGYDYLLPIFTFGKYYEDGGKYYIPLSIQVHHAVCDGFHVCRFLDELQDLLNK*"
),
stringsAsFactors
=
FALSE
...
...
@@ -49,11 +50,16 @@ for(sid in seqs$sid){
sid_prot
<-
unlist
(
strsplit
(
seqs
[
sid
,
"prot"
],
split
=
""
))
sid_start
<-
seqs
[
sid
,
"start"
]
sid_end
<-
seqs
[
sid
,
"end"
]
#
print(sprintf("%s: %d, %d, %d", sid, sid_start, sid_end, length(sid_nucl)))
print
(
sprintf
(
"%s: %d, %d, %d"
,
sid
,
sid_start
,
sid_end
,
length
(
sid_nucl
)))
metat_cov
[,
sid
]
<-
NA
# metat_cov[sid_start:sid_end,sid] <- sid_nucl # nucl
metat_cov
[
seq
(
sid_start
+
ifelse
(
sid
==
"contig_4930:1.0-5516.0_1NEW"
,
0
,
1
),
sid_end
,
3
),
sid
]
<-
sid_prot
# prot
# AA seq.: start + 1 covering each 3rd base to align AA with the 2nd base of its codon
# because two assembly proteins have different frames, to align both prot. seq.s in the plot
# one of the seq.s needs to be "shifted" (i.e. start + 0 instead of start + 1)
# --> for the sake of readability only
metat_cov
[
seq
(
sid_start
+
ifelse
(
sid
==
"contig_5026:14.0-10236.0_9"
,
0
,
1
),
sid_end
,
3
),
sid
]
<-
sid_prot
# prot
}
print
(
head
(
metat_cov
[
min
(
seqs
$
start
)
:
nrow
(
metat_cov
),]))
# data to plot seq.s
metat_cov_seqs
<-
reshape2
::
melt
(
...
...
@@ -62,6 +68,7 @@ metat_cov_seqs <- reshape2::melt(
na.rm
=
TRUE
)
metat_cov_seqs
$
y
<-
as.numeric
(
metat_cov_seqs
$
variable
)
*
-1
print
(
head
(
metat_cov_seqs
))
############################## PLOT
# custom theme
...
...
@@ -74,18 +81,19 @@ custom_theme <-
)
# AA colors
aa_colors
<-
c
(
pal_d3
(
palette
=
"category20"
)(
20
),
"black"
)
names
(
aa_colors
)
<-
c
(
setdiff
(
sort
(
unique
(
metat_cov_seqs
$
value
)),
"*
"
),
"*"
)
aa_colors
<-
c
(
pal_d3
(
palette
=
"category20"
)(
20
),
"black"
,
"black"
)
names
(
aa_colors
)
<-
c
(
setdiff
(
sort
(
unique
(
metat_cov_seqs
$
value
)),
c
(
"*"
,
"-
"
)
)
,
"*"
,
"-"
)
# Nucl. colors
# nc_colors <- c(pal_d3(palette="category20")(4), "black")
# names(nc_colors) <- c("A", "T", "C", "G", "-")
# Plot
xlims
<-
c
(
min
(
seqs
$
start
)
-10
,
max
(
seqs
$
end
)
+10
)
p_cov
<-
ggplot
(
data
=
metat_cov
%>%
filter
(
pos
>=
160
,
pos
<=
830
),
aes
(
x
=
pos
,
y
=
cov
))
+
# overlap area
geom_polygon
(
data
=
data.frame
(
x
=
c
(
182+238+29
,
182+238
,
182+238
,
182+238
+29
),
y
=
c
(
-6
,
-6
,
30
,
30
)),
aes
(
x
=
x
,
y
=
y
),
fill
=
"gray"
,
alpha
=
0.3
)
+
ggplot
(
data
=
metat_cov
%>%
filter
(
pos
>=
min
(
xlims
),
pos
<=
max
(
xlims
)
),
aes
(
x
=
pos
,
y
=
cov
))
+
# overlap area
: start s1 + overlap start s1 (+ overlap length)
geom_polygon
(
data
=
data.frame
(
x
=
c
(
5547+240+29
,
5547+240
,
5547+240
,
5547+240
+29
),
y
=
c
(
-6
,
-6
,
30
,
30
)),
aes
(
x
=
x
,
y
=
y
),
fill
=
"gray"
,
alpha
=
0.3
)
+
# cov. baseline
geom_hline
(
yintercept
=
0
,
linetype
=
"dotted"
)
+
# cov.
...
...
@@ -93,20 +101,20 @@ p_cov <-
# seq.s
geom_label
(
data
=
metat_cov_seqs
,
aes
(
y
=
y
,
label
=
value
,
fill
=
value
),
aes
(
x
=
pos
,
y
=
y
,
label
=
value
,
fill
=
value
),
color
=
"white"
,
size
=
2
,
fontface
=
"bold"
,
label.padding
=
unit
(
0.
1
,
"lines"
),
label.size
=
0
,
label.r
=
unit
(
0
,
"lines"
)
label.padding
=
unit
(
0.
05
,
"lines"
),
label.size
=
0
,
label.r
=
unit
(
0
,
"lines"
)
)
+
# seq. annotations
annotate
(
annotate
(
# arrow
geom
=
"segment"
,
x
=
c
(
182
,
182
,
404
),
y
=
c
(
1.5
,
-4.5
,
-4.5
),
xend
=
c
(
182
,
182
,
417
),
yend
=
c
(
-0.5
,
-2.5
,
-3
),
x
=
c
(
5547
,
5547+3
,
5780
),
y
=
c
(
1.5
,
-4.5
,
-4.5
),
xend
=
c
(
5547
,
5547+3
,
5786
),
yend
=
c
(
-0.5
,
-2.5
,
-3
),
arrow
=
arrow
(
length
=
unit
(
2
,
"mm"
))
)
+
annotate
(
annotate
(
# text
geom
=
"text"
,
x
=
c
(
182
,
182
,
402
),
y
=
c
(
2
,
-5
,
-5
),
x
=
c
(
5547
,
5547
,
5780
),
y
=
c
(
2
,
-5
,
-5
),
label
=
levels
(
metat_cov_seqs
$
variable
),
hjust
=
c
(
"left"
,
"left"
,
"right"
),
size
=
4
...
...
@@ -114,11 +122,11 @@ p_cov <-
# colors
scale_fill_manual
(
values
=
aa_colors
)
+
# axis range
scale_x_continuous
(
limits
=
c
(
160
,
830
)
,
expand
=
c
(
0
,
0
))
+
scale_x_continuous
(
limits
=
xlims
,
expand
=
c
(
0
,
0
))
+
scale_y_continuous
(
limits
=
c
(
-6
,
30
),
expand
=
c
(
0
,
0
))
+
# titles
labs
(
x
=
"Contig position (LR, Flye
, contig_4930:1.0-5516.0
)"
,
x
=
"Contig position (LR, Flye)"
,
y
=
"metaT coverage"
)
+
# theme
...
...
Write
Preview
Supports
Markdown
0%
Try again
or
attach a new file
.
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment