# Files to download to this directory:
All files necessary to run the paralog annotation tool locally can be downloaded from the Zenodo repository [Paralog variant classification and scoring](https://zenodo.org/record/802891#.WTacLcmkJTY):
```
wget https://zenodo.org/record/803358/files/CCDS_IDS.txt
wget https://zenodo.org/record/803358/files/CCDS2Ensembl2HGNC.txt
wget https://zenodo.org/record/803358/files/refSeqEnsCCDS.tsv
wget https://zenodo.org/record/803358/files/paraScores.tar.gz
tar -zxvf paraScores.tar.gz
```
## Parasub scores per gene
```
wget https://zenodo.org/record/803358/files/parasubScores.tar.gz
tar -zxvf parasubScores.tar.gz
```
## Para homology scores per gene
```
wget https://zenodo.org/record/803358/files/parahomoScores.tar.gz
tar -zxvf parahomoScores.tar.gz
```
After downloading and decompressing, one directory for each of the three score types will be available in the `data` directory:
```
paraScores # Directory with gene-specific files with para score
parasubScores # Directory with gene-specific files with parasub score
parahomoScores # Directory with gene-specific files with parahomo score
```
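As a quick sanity check after unpacking, a short script along these lines could confirm that the three score directories are present. This is only a minimal sketch and not part of the tool; the helper name `missing_score_dirs` and the default `data` path are assumptions made for illustration.

```python
from pathlib import Path

# Directory names as listed above.
SCORE_DIRS = ("paraScores", "parasubScores", "parahomoScores")


def missing_score_dirs(data_dir="data"):
    """Return the names of score directories not found under data_dir.

    Hypothetical helper: the default "data" location is an assumption.
    """
    base = Path(data_dir)
    return [name for name in SCORE_DIRS if not (base / name).is_dir()]


if __name__ == "__main__":
    missing = missing_score_dirs()
    if missing:
        print("Missing score directories:", ", ".join(missing))
    else:
        print("All score directories found.")
```

An empty list means all three archives were unpacked in the expected place.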
# Directory and subdirectory structures
# 1. paraScores
Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `paraGeneIds.txt`)
* fasta - contains for each gene family a fasta file (family numbers and descriptions can be found in the files `paraFamilyIds.txt` and `paragroups.detailed.tsv`)
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](http://weizhongli-lab.org/cd-hit/)
* parafamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per paralog family
Files
* paraGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `para_scores`
* paraFamilyIds.txt - file with paralog family ids
* paragroups.detailed.tsv - file with detailed description of each paralog family including family-id, number of genes per family and gene names
* parascore.genes.tsv - file with `para_score` statistics per gene
* parascores.families.tsv - file with `para_score` statistics per paralog family
# 2. parasubScores
Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parasubGeneIds.txt`)
* fasta - contains for each paralog subfamily a fasta file (subfamily numbers and descriptions can be found in the files `parasubFamilyIds.txt` and `parasubgroups.detailed.tsv`)
* parasubfamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per paralog subfamily
Files
* parasubGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `parasub_scores`
* parasubFamilyIds.txt - file with paralog subfamily ids
* parasubgroups.detailed.tsv - file with detailed description of each paralog subfamily including subfamily-id, number of genes per subfamily and gene names
* parasubscore.genes.tsv - file with `parasub_score` statistics per gene
* parasubscores.subfamilies.tsv - file with `parasub_score` statistics per paralog subfamily
* parasubfam.probs.tsv - file with [ExAC](http://exac.broadinstitute.org) probabilities for missense, LoF and missense.plus.LoF variants per paralog subfamily and number of genes per subfamily
# 3. parahomoScores
Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parahomoGeneIds.txt`)
* fasta - contains for each paralog homology family a fasta file (homology family numbers and descriptions can be found in the files `parahomoFamilyIds.txt` and `parahomogroups.detailed.tsv`)
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](http://weizhongli-lab.org/cd-hit/)
* parahomofamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per homology family
Files
* parahomoGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `parahomo_scores`
* parahomoFamilyIds.txt - file with homology family ids
* parahomogroups.detailed.tsv - file with detailed description of each homology family including family-id, number of genes per family and gene names
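
The layout described above is regular across the three score types, so the path to a gene's files can be derived from the score type and the gene name. The following is a minimal sketch, assuming the archives were unpacked into `data` and that each `*GeneIds.txt` file lists one gene name per line (that file's format is not documented here, so treat it as an assumption). The function names `gene_file` and `read_gene_ids` are hypothetical.

```python
from pathlib import Path

SCORE_TYPES = ("para", "parasub", "parahomo")


def gene_file(score_type, gene, data_dir="data", with_stats=True):
    """Build the path to a gene's score file under <score_type>Scores/genes."""
    if score_type not in SCORE_TYPES:
        raise ValueError(f"unknown score type: {score_type!r}")
    suffix = ".withStats.txt" if with_stats else ".txt"
    return Path(data_dir) / f"{score_type}Scores" / "genes" / f"{gene}{suffix}"


def read_gene_ids(score_type, data_dir="data"):
    """Read gene names from <score_type>GeneIds.txt.

    Assumption: one gene name per line.
    """
    if score_type not in SCORE_TYPES:
        raise ValueError(f"unknown score type: {score_type!r}")
    path = Path(data_dir) / f"{score_type}Scores" / f"{score_type}GeneIds.txt"
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


print(gene_file("para", "KRIT1").as_posix())
```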

For each of the three scores (para, parasub, parahomo), the corresponding `*GeneIds.txt` file (`paraGeneIds.txt`, `parasubGeneIds.txt`, `parahomoGeneIds.txt`) lists the gene names for which that score is available.

Per gene, two different files are available in the corresponding score directory: `genename`.txt and `genename`.withStats.txt.

The `genename`.withStats.txt file contains the score per amino acid plus gene-specific metrics such as mean/median/standard deviation (STD) in the header, and z-scores per position.

Format: AA position, AA, score, score-median, (score-median)/STD, score-mean, (score-mean)/STD
```
# GENE: KRIT1
# TOTALSCORE=2883
# LEN=736
# MEAN(SCORE)=3.92
# STD(SCORE)=4.42
# MEDIAN(SCORE)=0
# SCORE>0=0.47
# MEAN(GREATER>0)=8.28
# STD(SCORE>0)=2.26
# MEDIAN(SCORE>0)=8
# MAXSCORES=0.14
#POS AA SCORE SCORE-MEDIAN (SCORE-MEDIAN)/STD SCORE-MEAN (SCORE-MEAN)/STD
1 M 0 0 0.00 -3.92 -0.89
2 G 0 0 0.00 -3.92 -0.89
3 N 0 0 0.00 -3.92 -0.89
4 P 0 0 0.00 -3.92 -0.89
5 E 0 0 0.00 -3.92 -0.89
..
82 A 10 10 2.26 6.08 1.38
83 N 10 10 2.26 6.08 1.38
84 Q 8 8 1.81 4.08 0.92
85 G 11 11 2.49 7.08 1.60
86 I 5 5 1.13 1.08 0.24
..
```
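To illustrate the format, the following sketch parses such a file into its header metrics and per-position rows, then recomputes the mean-based z-score from the header values. It assumes whitespace-separated columns and `# KEY=VALUE` header lines exactly as in the excerpt above. `parse_with_stats` and the dictionary keys are hypothetical names, not part of the tool.

```python
def parse_with_stats(lines):
    """Split the lines of a withStats file into (header dict, list of row dicts)."""
    header, rows = {}, []
    for line in lines:
        line = line.strip()
        if not line or line == "..":
            continue  # skip blank lines and the elision marker used in the excerpt
        if line.startswith("#"):
            body = line.lstrip("#").strip()
            if "=" in body:
                key, value = body.rsplit("=", 1)
                header[key.strip()] = float(value)
            continue  # "# GENE: ..." and the column header line carry no KEY=VALUE
        pos, aa, score, minus_median, z_median, minus_mean, z_mean = line.split()
        rows.append({
            "pos": int(pos),
            "aa": aa,
            "score": float(score),
            "score_minus_median": float(minus_median),
            "z_median": float(z_median),
            "score_minus_mean": float(minus_mean),
            "z_mean": float(z_mean),
        })
    return header, rows


# Lines copied from the KRIT1 excerpt above.
sample = """\
# GENE: KRIT1
# MEAN(SCORE)=3.92
# STD(SCORE)=4.42
# MEDIAN(SCORE)=0
#POS AA SCORE SCORE-MEDIAN (SCORE-MEDIAN)/STD SCORE-MEAN (SCORE-MEAN)/STD
1 M 0 0 0.00 -3.92 -0.89
82 A 10 10 2.26 6.08 1.38
""".splitlines()

header, rows = parse_with_stats(sample)
row = rows[1]
# Recompute (score - mean) / STD from the header values.
z = (row["score"] - header["MEAN(SCORE)"]) / header["STD(SCORE)"]
print(row["pos"], row["aa"], round(z, 2))  # prints: 82 A 1.38
```

The recomputed value matches the last column of the excerpt for position 82, which suggests the z-scores are derived from the rounded header statistics.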