# Files to download to this directory:
All files necessary to run the paralog annotation tool locally can be downloaded from the Zenodo repository [Paralog variant classification and scoring](https://zenodo.org/record/802891#.WTacLcmkJTY)
```
wget https://zenodo.org/record/803358/files/CCDS_IDS.txt
wget https://zenodo.org/record/803358/files/CCDS2Ensembl2HGNC.txt
wget https://zenodo.org/record/803358/files/refSeqEnsCCDS.tsv
wget https://zenodo.org/record/803358/files/paraScores.tar.gz
tar -zxvf paraScores.tar.gz
```
## Parasub scores per gene
```
wget https://zenodo.org/record/803358/files/parasubScores.tar.gz
tar -zxvf parasubScores.tar.gz
```
## Para homology scores per gene
```
wget https://zenodo.org/record/803358/files/parahomoScores.tar.gz
tar -zxvf parahomoScores.tar.gz
```
After downloading and decompressing, one directory for each of the three score types will be available in the `data` directory:
```
paraScores # Directory with gene-specific files with para score
parasubScores # Directory with gene-specific files with parasub score
parahomoScores # Directory with gene-specific files with parahomo score
```
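Once the archives are unpacked, the per-gene files can be located programmatically. The following is a minimal Python sketch, assuming the three directories sit directly under `data` and that per-gene files live in the `genes` subdirectory with the `genename`.txt / `genename`.withStats.txt naming described in the next section. The helper names `gene_file` and `available_score_types` are hypothetical and not part of the annotation tool.

```python
from pathlib import Path

# Directory names as listed above; the mapping keys are illustrative labels.
SCORE_DIRS = {
    "para": "paraScores",
    "parasub": "parasubScores",
    "parahomo": "parahomoScores",
}


def gene_file(data_dir, score_type, gene, with_stats=True):
    """Build the expected path of a per-gene score file (hypothetical helper)."""
    suffix = ".withStats.txt" if with_stats else ".txt"
    return Path(data_dir) / SCORE_DIRS[score_type] / "genes" / f"{gene}{suffix}"


def available_score_types(data_dir):
    """Return the score types whose directory exists under data_dir."""
    return [
        name
        for name, subdir in SCORE_DIRS.items()
        if (Path(data_dir) / subdir).is_dir()
    ]


if __name__ == "__main__":
    print(available_score_types("data"))
    print(gene_file("data", "para", "KRIT1"))
```

Running the script from the directory that contains `data` prints which score types were unpacked and the path where the KRIT1 para score file would be expected.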
# Directory and subdirectory structures
# 1. paraScores
Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `paraGeneIds.txt`)
* fasta - contains for each gene family a fasta file (family numbers and descriptions can be found in the files `paraFamilyIds.txt` and `paragroups.detailed.tsv`)
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](http://weizhongli-lab.org/cd-hit/)
* parafamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per paralog family
Files
* paraGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `para_scores`
* paraFamilyIds.txt - file with paralog family ids
* paragroups.detailed.tsv - file with detailed description of each paralog family including family-id, number of genes per family and gene names
* parascore.genes.tsv - file with `para_score` statistics per gene
* parascores.families.tsv - file with `para_score` statistics per paralog family
# 2. parasubScores
Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parasubGeneIds.txt`)
* fasta - contains for each paralog subfamily a fasta file (subfamily numbers and descriptions can be found in the files `parasubFamilyIds.txt` and `parasubgroups.detailed.tsv`)
* parasubfamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per paralog subfamily
Files
* parasubGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `parasub_scores`
* parasubFamilyIds.txt - file with paralog subfamily ids
* parasubgroups.detailed.tsv - file with detailed description of each paralog subfamily including subfamily-id, number of genes per subfamily and gene names
* parasubscore.genes.tsv - file with `parasub_score` statistics per gene
* parasubscores.subfamilies.tsv - file with `parasub_score` statistics per paralog subfamily
* parasubfam.probs.tsv - file with [ExAC](http://exac.broadinstitute.org) probabilities for missense, LoF and missense.plus.LoF variants per para subfamily and number of genes per family
# 3. parahomoScores
Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parahomoGeneIds.txt`)
* fasta - contains for each paralog homology family a fasta file (homology family numbers and descriptions can be found in the files `parahomoFamilyIds.txt` and `parahomogroups.detailed.tsv`)
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](http://weizhongli-lab.org/cd-hit/)
* parahomofamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per homology family
Files
* parahomoGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `parahomo_scores`
* parahomoFamilyIds.txt - file with homology subfamily ids
* parahomogroups.detailed.tsv - file with detailed description of each homology subfamily including subfamily-id, number of genes per subfamily and gene names
For each of the three scores (para, parasub, parahomo) the corresponding `*GeneIds.txt` file (`paraGeneIds.txt`, `parasubGeneIds.txt`, `parahomoGeneIds.txt`) lists the gene names for which that score is available.
Per gene there are two different files in the corresponding score directory: `genename`.txt and `genename`.withStats.txt.
The `genename`.withStats.txt file contains the score per amino acid plus gene-specific metrics such as mean/median/standard deviation (STD) in the header and z-scores per position.
Format: AA position, AA, score, score-median, (score-median)/STD, score-mean, (score-mean)/STD
```
# GENE: KRIT1
# TOTALSCORE=2883
# LEN=736
# MEAN(SCORE)=3.92
# STD(SCORE)=4.42
# MEDIAN(SCORE)=0
# SCORE>0=0.47
# MEAN(GREATER>0)=8.28
# STD(SCORE>0)=2.26
# MEDIAN(SCORE>0)=8
# MAXSCORES=0.14
#POS AA SCORE SCORE-MEDIAN (SCORE-MEDIAN)/STD SCORE-MEAN (SCORE-MEAN)/STD
1 M 0 0 0.00 -3.92 -0.89
2 G 0 0 0.00 -3.92 -0.89
3 N 0 0 0.00 -3.92 -0.89
4 P 0 0 0.00 -3.92 -0.89
5 E 0 0 0.00 -3.92 -0.89
..
82 A 10 10 2.26 6.08 1.38
83 N 10 10 2.26 6.08 1.38
84 Q 8 8 1.81 4.08 0.92
85 G 11 11 2.49 7.08 1.60
86 I 5 5 1.13 1.08 0.24
..
```
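The layout above can be read with a few lines of Python. This is a minimal sketch, assuming the columns are whitespace-separated and that every header line other than `# GENE:` has the form `# KEY=value`. The function names `parse_with_stats` and `recompute_z` are hypothetical. The sketch also recomputes `(SCORE-MEAN)/STD` from the header values to illustrate how the z-score columns relate to the header.

```python
import io

# Excerpt of the KRIT1 example shown above (whitespace-separated columns).
SAMPLE = """\
# GENE: KRIT1
# TOTALSCORE=2883
# LEN=736
# MEAN(SCORE)=3.92
# STD(SCORE)=4.42
# MEDIAN(SCORE)=0
#POS AA SCORE SCORE-MEDIAN (SCORE-MEDIAN)/STD SCORE-MEAN (SCORE-MEAN)/STD
1 M 0 0 0.00 -3.92 -0.89
..
82 A 10 10 2.26 6.08 1.38
84 Q 8 8 1.81 4.08 0.92
"""


def parse_with_stats(handle):
    """Split a withStats file into header metrics and per-position rows."""
    header, rows = {}, []
    for line in handle:
        line = line.strip()
        if not line or line == "..":
            continue  # skip blank lines and the ".." elision used in the excerpt
        if line.startswith("#"):
            body = line.lstrip("#").strip()
            if body.startswith("GENE:"):
                header["GENE"] = body.split(":", 1)[1].strip()
            elif "=" in body:
                key, value = body.rsplit("=", 1)
                header[key] = float(value)
            continue  # the "#POS AA ..." column line has no "=" and is ignored
        pos, aa, score, _d_med, z_med, _d_mean, z_mean = line.split()
        rows.append({
            "pos": int(pos),
            "aa": aa,
            "score": float(score),
            "z_median": float(z_med),
            "z_mean": float(z_mean),
        })
    return header, rows


def recompute_z(score, center, std):
    """Z-score of one position relative to a gene-wide center (mean or median)."""
    return round((score - center) / std, 2)


if __name__ == "__main__":
    header, rows = parse_with_stats(io.StringIO(SAMPLE))
    for row in rows:
        z = recompute_z(row["score"], header["MEAN(SCORE)"], header["STD(SCORE)"])
        print(row["pos"], row["aa"], "file:", row["z_mean"], "recomputed:", z)
```

For a real file, replace `io.StringIO(SAMPLE)` with an opened file handle for the gene of interest.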
## Para statistics files
# Para score statistics
* parascore.genes.tsv (with header, 9 columns)
1. gene name
2. total para score over all positions
3. gene length
4. mean para score
5. std para score
6. median para score
7. fraction of positions with para score greater than 0 (0 is the minimum score)
8. fraction of positions with para score equal to 11 (11 is the maximum score)
9. fraction of positions with para_zscore greater than zero (=conserved)
```
GENE TOTALPARASCORE LENGTH MEAN STD MEDIAN PARASCORE_GREATER_0 PARASCORE_EQUAL_11 PARASCORE_ZSCORE_GREATER0
A1BG 669 495 1.32 2.49 0 0.34 0.02 0.27
A1CF 2410 602 3.93 3.83 3 0.66 0.09 0.42
```
* parascores.families.tsv (with header, 10 columns)
1. family id as number
2. gene names per family as csv list
3. sum of all para scores over all genes within the family
4. sum of lengths of all genes within a family
5. mean para score over all positions over all genes within the family
6. standard deviation para score over all positions over all genes within the family
7. median para score over all positions over all genes within the family
8. fraction of positions with para score greater than 0 (0 is the minimum score)
9. fraction of positions with para score equal to 11 (11 is the maximum score)
10. fraction of positions with para_zscore per family greater than 0
```
FAMILY GENES TOTALPARASCORE TOTALLENGTH MEAN STD MEDIAN PARASCORE_GREATER_0 PARASCORE_11 PARASCORE_FAMILY_ZSCORE_GREATER0
2 SNX6,SNX5,SNX32 11199 1225 8.88 2.92 11 0.98 0.52 0.60
3 KLHDC2,HCFC1,KLHDC1,RABEPK,LZTR1,HCFC2,KLHDC3,KLHDC10 11442 5675 1.98 2.89 0 0.43 0.03 0.31
```
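Both statistics tables can be loaded with Python's standard `csv` module. The sketch below assumes the files are tab-separated, as the `.tsv` extension suggests, and uses the sample rows shown above as inline data. The names `read_tsv` and `gene_to_family` are hypothetical helpers for illustration.

```python
import csv
import io

# Sample rows from above, rebuilt with tab separators (an assumption).
GENES_TSV = "\n".join("\t".join(r) for r in [
    ["GENE", "TOTALPARASCORE", "LENGTH", "MEAN", "STD", "MEDIAN",
     "PARASCORE_GREATER_0", "PARASCORE_EQUAL_11", "PARASCORE_ZSCORE_GREATER0"],
    ["A1BG", "669", "495", "1.32", "2.49", "0", "0.34", "0.02", "0.27"],
    ["A1CF", "2410", "602", "3.93", "3.83", "3", "0.66", "0.09", "0.42"],
])

FAMILIES_TSV = "\n".join("\t".join(r) for r in [
    ["FAMILY", "GENES", "TOTALPARASCORE", "TOTALLENGTH", "MEAN", "STD", "MEDIAN",
     "PARASCORE_GREATER_0", "PARASCORE_11", "PARASCORE_FAMILY_ZSCORE_GREATER0"],
    ["2", "SNX6,SNX5,SNX32", "11199", "1225", "8.88", "2.92", "11",
     "0.98", "0.52", "0.60"],
    ["3", "KLHDC2,HCFC1,KLHDC1,RABEPK,LZTR1,HCFC2,KLHDC3,KLHDC10", "11442",
     "5675", "1.98", "2.89", "0", "0.43", "0.03", "0.31"],
])


def read_tsv(handle):
    """Read a tab-separated file with a header line into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def gene_to_family(family_rows):
    """Map each gene name to its family id using the comma-separated GENES column."""
    lookup = {}
    for row in family_rows:
        for gene in row["GENES"].split(","):
            lookup[gene] = row["FAMILY"]
    return lookup


if __name__ == "__main__":
    genes = {row["GENE"]: row for row in read_tsv(io.StringIO(GENES_TSV))}
    families = read_tsv(io.StringIO(FAMILIES_TSV))
    print(genes["A1CF"]["MEAN"])
    print(gene_to_family(families)["HCFC1"])
```

All values are kept as strings here; convert with `int()` or `float()` as needed for the numeric columns.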
# Parasub score statistics
* parasubscore.genes.tsv (with header, 9 columns)
1. gene name
2. total parasub score over all positions
3. gene length
4. mean parasub score
5. std parasub score
6. median parasub score
7. fraction of positions with parasub score greater than 0 (0 is the minimum score)
8. fraction of positions with parasub score equal to 11 (11 is the maximum score)
9. fraction of positions with parasub_zscore greater than zero (=para conserved)
```
GENE TOTALPARASUBSCORE LENGTH MEAN STD MEDIAN PARASUBSCORE_GREATER_0 PARASUBSCORE_EQUAL_11 PARASUBSCORE_ZSCORE_GREATER0
PNCK 3036 426 6.93 3.70 8 0.88 0.29 0.52
CLMP 1279 373 3.32 3.23 3 0.75 0.06 0.40
```
* parasubscores.subfamilies.tsv (with header, 10 columns)
1. subfamily id given as family id dot para dot cluster id, e.g. 2.para.1
2. gene names per subfamily as csv list
3. sum of all parasub scores over all genes within the subfamily
4. sum of lengths of all genes within a subfamily
5. mean parasub score over all positions over all genes within the subfamily
6. standard deviation of the parasub score over all positions over all genes within the subfamily
7. median parasub score over all positions over all genes within the subfamily
8. fraction of positions with parasub score greater than 0 (0 is the minimum score) per subfamily
9. fraction of positions with parasub score equal to 11 (11 is the maximum score) per subfamily
10. fraction of positions with parasub_zscore per subfamily greater than 0
```
SUBFAMILY GENES TOTALSUBPARASCORE TOTALLENGTH MEAN STD MEDIAN PARASUBSCORE_GREATER_0 PARASUBSCORE_11 PARASUBSCORE_SUBFAMILY_ZSCORE_GREATER0
2.para.1 SNX6,SNX32,SNX5 11199 1225 8.88 2.92 11 0.98 0.52 0.60
14.para.1 USP44,USP49 13016 1400 9.14 2.84 11 0.97 0.60 0.61
14.para.2 USP45,USP16 13724 1637 8.26 2.96 9 0.96 0.39 0.56
```
* parasubfam.probs.tsv (with header, 7 columns)
1. subfamily name
2. number of genes per subfamily
3. list of gene names
4. missing genes
5. probability for missense variants
6. probability for loss-of-function (lof) variants
7. probability for missense plus lof variants
```
subfamname number_of_genes genes tmissing_genes p_mis p_lof p_mislof
1000.para.1 2 TARBP2,PRKRA 1.911761986731e-05 4.21231818355e-06 2.332993805086e-05
1000.para.2 2 STAU2,STAU1 2.831356316e-05 4.7126497699e-06 3.30262129299e-05
```