# Files to download to this directory:
All files necessary to run the paralog annotation tool locally can be downloaded from the Zenodo repository [Paralog variant classification and scoring](https://zenodo.org/record/802891#.WTacLcmkJTY).
```
wget https://zenodo.org/record/803358/files/CCDS_IDS.txt
wget https://zenodo.org/record/803358/files/CCDS2Ensembl2HGNC.txt
wget https://zenodo.org/record/803358/files/refSeqEnsCCDS.tsv
wget https://zenodo.org/record/803358/files/paraScores.tar.gz
tar -zxvf paraScores.tar.gz
```
## Parasub scores per gene
```
wget https://zenodo.org/record/803358/files/parasubScores.tar.gz
tar -zxvf parasubScores.tar.gz
```
## Para homology scores per gene
```
wget https://zenodo.org/record/803358/files/parahomoScores.tar.gz
tar -zxvf parahomoScores.tar.gz
```
After downloading and decompressing, one directory for each of the three score types will be available in the `data` directory:
```
paraScores # Directory with gene-specific files with para score
parasubScores # Directory with gene-specific files with parasub score
parahomoScores # Directory with gene-specific files with parahomo score
```
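As a quick sanity check after unpacking, a short script can confirm that the three directories listed above are present. This is only a minimal sketch and not part of the tool itself; the function name `missing_score_dirs` and the default `data` path are illustrative assumptions.

```python
from pathlib import Path

# Directory names as listed above.
SCORE_DIRS = ("paraScores", "parasubScores", "parahomoScores")


def missing_score_dirs(data_dir="data"):
    """Return the expected score directories that are absent from data_dir.

    Hypothetical helper: it only checks that the directories exist,
    not that their contents are complete.
    """
    root = Path(data_dir)
    return [name for name in SCORE_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_score_dirs("data")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All score directories are present.")
```

An empty list means every archive was unpacked into the expected place; any name in the list points at an archive that still needs to be downloaded or extracted.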
# Directory and subdirectory structures
# 1. paraScores
Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `paraGeneIds.txt`)
* fasta - contains for each gene family a fasta file (family numbers and descriptions can be found in the files `paraFamilyIds.txt` and `paragroups.detailed.tsv`)
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](http://weizhongli-lab.org/cd-hit/)
* parafamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per paralog family
Files
* paraGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `para_scores`
* paraFamilyIds.txt - file with paralog family ids
* paragroups.detailed.tsv - file with detailed description of each paralog family including family-id, number of genes per family and gene names
* parascore.genes.tsv - file with `para_score` statistics per gene
* parascores.families.tsv - file with `para_score` statistics per paralog family
# 2. parasubScores
Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parasubGeneIds.txt`)
* fasta - contains for each paralog subfamily a fasta file (subfamily numbers and descriptions can be found in the files `parasubFamilyIds.txt` and `parasubgroups.detailed.tsv`)
* parasubfamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per paralog subfamily
Files
* parasubGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `parasub_scores`
* parasubFamilyIds.txt - file with paralog subfamily ids
* parasubgroups.detailed.tsv - file with detailed description of each paralog subfamily including subfamily-id, number of genes per subfamily and gene names
* parasubscore.genes.tsv - file with `parasub_score` statistics per gene
* parasubscores.subfamilies.tsv - file with `parasub_score` statistics per paralog subfamily
* parasubfam.probs.tsv - file with [ExAC](http://exac.broadinstitute.org) probabilities for missense, LoF and missense.plus.LoF variants per paralog subfamily and number of genes per family
# 3. parahomoScores
Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parahomoGeneIds.txt`)
* fasta - contains for each paralog homology family a fasta file (homology family numbers and descriptions can be found in the files `parahomoFamilyIds.txt` and `parahomogroups.detailed.tsv`)
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](http://weizhongli-lab.org/cd-hit/)
* parahomofamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per homology family
Files
* parahomoGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `parahomo_scores`
* parahomoFamilyIds.txt - file with homology family ids
* parahomogroups.detailed.tsv - file with detailed description of each homology family including family-id, number of genes per family and gene names
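
The three score directories share the same layout, so the location of the per-gene files can be derived from the score type and the gene name. The following is a minimal sketch under that assumption; the function `gene_files` and its dictionary keys are hypothetical names chosen for illustration, not part of the tool.

```python
from pathlib import Path

# Score types as used in the directory names above.
SCORE_TYPES = ("para", "parasub", "parahomo")


def gene_files(data_dir, score_type, gene):
    """Build the expected file paths for one gene and one score type.

    Assumes the layout described above:
    <data_dir>/<score_type>Scores/genes/<gene>.txt
    """
    if score_type not in SCORE_TYPES:
        raise ValueError(f"unknown score type: {score_type!r}")
    score_dir = Path(data_dir) / f"{score_type}Scores"
    return {
        # List of gene names for which this score is available.
        "gene_ids": score_dir / f"{score_type}GeneIds.txt",
        # Per-gene score file and its variant with statistics.
        "scores": score_dir / "genes" / f"{gene}.txt",
        "with_stats": score_dir / "genes" / f"{gene}.withStats.txt",
    }


paths = gene_files("data", "para", "KRIT1")
print(paths["scores"].as_posix())
print(paths["with_stats"].as_posix())
```

Whether a given gene actually has a file depends on whether its name appears in the corresponding `GeneIds.txt` list, so the returned paths should be checked for existence before use.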
For each of the three scores (para, parasub, parahomo), the corresponding `GeneIds.txt` file (`paraGeneIds.txt`, `parasubGeneIds.txt`, `parahomoGeneIds.txt`) lists the gene names for which that score is available.
Per gene, two different files are available in the corresponding score directory:
1. `genename`.txt, e.g. KRIT1.txt
(para_score per amino acid, given in the format: AA position<tab>AA<tab>para_score)
```
1 M 0
2 G 0
3 N 0
4 P 0
..
```
2. `genename`.withStats.txt, e.g. KRIT1.withStats.txt
(header lines with per-gene summary statistics, followed by one line per amino acid in the format: AA position<tab>AA<tab>para_score<tab>score-median<tab>(score-median)/STD<tab>score-mean<tab>(score-mean)/STD)
```
# GENE: KRIT1
# TOTALSCORE=2883
# LEN=736
# MEAN(SCORE)=3.92
# STD(SCORE)=4.42
# MEDIAN(SCORE)=0
# SCORE>0=0.47
# MEAN(GREATER>0)=8.28
# STD(SCORE>0)=2.26
# MEDIAN(SCORE>0)=8
# MAXSCORES=0.14
#POS AA SCORE SCORE-MEDIAN (SCORE-MEDIAN)/STD SCORE-MEAN (SCORE-MEAN)/STD
1 M 0 0 0.00 -3.92 -0.89
2 G 0 0 0.00 -3.92 -0.89
3 N 0 0 0.00 -3.92 -0.89
4 P 0 0 0.00 -3.92 -0.89
5 E 0 0 0.00 -3.92 -0.89
..
82 A 10 10 2.26 6.08 1.38
83 N 10 10 2.26 6.08 1.38
84 Q 8 8 1.81 4.08 0.92
85 G 11 11 2.49 7.08 1.60
86 I 5 5 1.13 1.08 0.24
..
```
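
The `.withStats.txt` layout shown above can be read with a few lines of Python. The sketch below is an illustration only, not the tool's own parser: the function name `parse_with_stats` is hypothetical, and splitting on any whitespace is an assumption that also covers the tab-separated format. It parses the header statistics and the per-residue scores from a shortened copy of the KRIT1 excerpt, then recomputes the two normalised columns from `MEAN(SCORE)`, `MEDIAN(SCORE)` and `STD(SCORE)`.

```python
def parse_with_stats(text):
    """Parse the content of a `genename`.withStats.txt file.

    Returns (stats, rows): stats maps header keys such as "MEAN(SCORE)"
    to strings; rows is a list of (position, amino_acid, score) tuples.
    """
    stats, rows = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "..":
            continue  # skip blank lines and the ".." elision in the excerpt
        if line.startswith("#"):
            body = line.lstrip("#").strip()
            if body.startswith("GENE:"):
                stats["GENE"] = body.split(":", 1)[1].strip()
            elif "=" in body:
                key, value = body.rsplit("=", 1)
                stats[key] = value
            continue  # the "#POS AA SCORE ..." column header is ignored
        fields = line.split()
        rows.append((int(fields[0]), fields[1], float(fields[2])))
    return stats, rows


# Shortened copy of the KRIT1 excerpt shown above.
sample = """\
# GENE: KRIT1
# MEAN(SCORE)=3.92
# STD(SCORE)=4.42
# MEDIAN(SCORE)=0
#POS AA SCORE SCORE-MEDIAN (SCORE-MEDIAN)/STD SCORE-MEAN (SCORE-MEAN)/STD
1 M 0 0 0.00 -3.92 -0.89
..
85 G 11 11 2.49 7.08 1.60
"""

stats, rows = parse_with_stats(sample)
mean = float(stats["MEAN(SCORE)"])
std = float(stats["STD(SCORE)"])
median = float(stats["MEDIAN(SCORE)"])

for pos, aa, score in rows:
    # Recompute (score-median)/STD and (score-mean)/STD from the header values.
    print(pos, aa, round((score - median) / std, 2), round((score - mean) / std, 2))
```

The plain `genename`.txt file contains only the first three columns and no header lines, so the same function can read it and simply returns an empty `stats` dictionary.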