Commit 202bec39 authored by Anna Buschart's avatar Anna Buschart
Browse files

add functional-annotation (try again)

parent 9d5b429f
To annotate the predictions of protein coding genes with functions, we used different sets of HMM databases (for KEGG KOs, enzymes from MetaCyc and Uniprot, PFAM families, and TIGRFAM families). The databases for Pfam were downloaded from (INSERT LINK) and for TIGRFAM from (INSERT LINK). The other databases were compiled by ???
HMMER 3.1 (http://hmmer.janelia.org/) was run on each database to find hits ( _contig.Prodigal.faa_ is the file with amino acid sequences of the protein coding gene predictions).
```
hmmsearch --cpu 12 --noali --notextw --tblout contig.Prodigal.faa.kegg.hmmscan KO.hmm contig.Prodigal.faa >/dev/null
hmmsearch --cpu 12 --noali --notextw --tblout contig.Prodigal.faa.metacyc.hmmscan metacyc.hmm contig.Prodigal.faa >/dev/null
hmmsearch --cpu 12 --noali --notextw --tblout contig.Prodigal.faa.Pfam-A.hmmscan Pfam-A.hmm contig.Prodigal.faa >/dev/null
hmmsearch --cpu 12 --noali --notextw --tblout contig.Prodigal.faa.tigrpfam.hmmscan tigrpfam.hmm contig.Prodigal.faa >/dev/null
hmmsearch --cpu 12 --noali --notextw --tblout contig.Prodigal.faa.swissprot.hmmscan swissprot.hmm contig.Prodigal.faa >/dev/null
```
The output from HMMER was parsed using a script ???written by Venkata? which returns all hits in a better format (tab-separated).
```
perl consolidate_hmmscan_results.pl contig.Prodigal.faa contig.Prodigal.faa.kegg.hmmscan contig.Prodigal.faa.metacyc.hmmscan contig.Prodigal.faa.Pfam-A.hmmscan contig.Prodigal.faa.swissprot.hmmscan contig.Prodigal.faa.tigrpfam.hmmscan
```
There is a second version for this script which does exactly the same, but only for the KO annotations (`consolidate_hmmscan_results_justKEGG.pl`).
The formatted output is then parsed once more to retain only the best annotation out of all databases, or to get the top or top x percent of best hits from any database. To get the best annotation per gene, we can use `150310_MUST_hmmBestAll.py`.
```
python 150310_MUST_hmmBestAll.py contig.Prodigal.faakegg.tsv contig.Prodigal.faametacyc.tsv contig.Prodigal.faaswissprot.tsv contig.Prodigal.faapfam.tsv contig.Prodigal.faatigrpfam.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
```
Alternatively, or additionally (in the case of MuSt), we can get the best annotation per database:
```
python 150705_MUST_hmmParsePfam.py contig.Prodigal.faapfam.tsv pfamID -g $(grep ">" contig.Prodigal.faa | wc -l) -k
python 150705_MUST_hmmParse.py contig.Prodigal.faametacyc.tsv metaCycID -g $(grep ">" contig.Prodigal.faa | wc -l)
python 150705_MUST_hmmParse.py contig.Prodigal.faaswissprot.tsv swissprotEC -g $(grep ">" contig.Prodigal.faa | wc -l)
python 150705_MUST_hmmParse.py contig.Prodigal.faatigrpfam.tsv tigrID -g $(grep ">" contig.Prodigal.faa | wc -l)
python 150705_MUST_hmmParse.py contig.Prodigal.faakegg.tsv KO -g $(grep ">" contig.Prodigal.faa | wc -l)
```
As a last alternative, we can annotate genes with the nodes of a [reconstructed metabolic network](reconstructed-KO-network) based on the KO annotations. The script for this uses a table (provided here in `150705_KOs_in_NW.tsv`) with the link between nodes and KOs (which node contains which KOs) which is built by the network reconstruction script.
```
python 150705_MUST_keggParseNW.py contig.Prodigal.faakegg.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
```
As you can see, I always use the -g option of these scripts, which sets a sample-specific threshold for annotations to be accepted. In accordance with the HMMer manual, this is log2 of the number of predicted genes.
Now we have the annotations for each gene, but we might also like to know how many reads (representing how many fragments) map to the genes with a functional annotation to do some expression analysis. This can be done using very similar a family of scripts, which also take a gene-wise read [coverage](calculating-coverage) table as input. In the MuSt, the gene expression analysis was based on the best annotation out of all databases:
```
python 150322_bestHmmReadParse.py contig.Prodigal.faakegg.tsv contig.Prodigal.faametacyc.tsv contig.Prodigal.faaswissprot.tsv contig.Prodigal.faapfam.tsv contig.Prodigal.faatigrpfam.tsv DNAonGenesrRNA.cov.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
python $calcScript contig.Prodigal.faakegg.tsv contig.Prodigal.faametacyc.tsv contig.Prodigal.faaswissprot.tsv contig.Prodigal.faapfam.tsv contig.Prodigal.faatigrpfam.tsv RNAonGenesrRNA.cov.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
```
There is also a version of this script which calculates the average coverage depth of each function from all genes annotated with this function (`150415_bestHmmAveCovParse.py`). The results of this were not used within the MuSt.
```
#python 150415_bestHmmAveCovParse.py contig.Prodigal.faakegg.tsv contig.Prodigal.faametacyc.tsv contig.Prodigal.faaswissprot.tsv contig.Prodigal.faapfam.tsv contig.Prodigal.faatigrpfam.tsv DNAonGenesrRNA.cov.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
#python $calcScript contig.Prodigal.faakegg.tsv contig.Prodigal.faametacyc.tsv contig.Prodigal.faaswissprot.tsv contig.Prodigal.faapfam.tsv contig.Prodigal.faatigrpfam.tsv RNAonGenesrRNA.cov.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
```
Alternatively, only single databases can be used. In the case of MuSt, we used the KO database.
```
python 150630_keggReadParse.py contig.Prodigal.faakegg.tsv DNAonGenesrRNA.mapped.tsv -g $(grep ">" contig.Prodigal.faa | wc -l) -p 100.0
python 150630_keggReadParse.py contig.Prodigal.faakegg.tsv RNAonGenesrRNA.mapped.tsv -g $(grep ">" contig.Prodigal.faa | wc -l) -p 100.0
```
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment