Patrick May's avatar
Patrick May committed
# Files to download to this directory:
All files necessary to run the paralog annotation tool locally can be downloaded from the Zenodo repository [Paralog variant classification and scoring](
## Annotation files
# Download scores
## Para scores per gene 
tar -zxvf paraScores.tar.gz  

## Parasub scores per gene
tar -zxvf parasubScores.tar.gz  

## Para homology scores per gene
tar -zxvf parahomoScores.tar.gz   
After downloading and decompressing, one directory for each of the three score types will be available in the ```data``` directory:
paraScores              # Directory with gene-specific files with para score
parasubScores           # Directory with gene-specific files with parasub score
parahomoScores          # Directory with gene-specific files with parahomo score
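
A quick way to confirm that all three archives were unpacked is to check for the expected directories. The following is a minimal sketch; the helper name `missing_score_dirs` and the default location `data` relative to the current working directory are assumptions made for illustration, not part of the tool.

```python
from pathlib import Path

# Directory names as listed above.
SCORE_DIRS = ("paraScores", "parasubScores", "parahomoScores")


def missing_score_dirs(data_dir="data"):
    """Return the expected score directories that are absent from data_dir.

    ASSUMPTION: data_dir is the directory into which the archives were
    unpacked; the default "data" is relative to the current directory.
    """
    root = Path(data_dir)
    return [name for name in SCORE_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_score_dirs()
    if missing:
        print("Missing score directories:", ", ".join(missing))
    else:
        print("All three score directories are present.")
```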

# Directory and subdirectory structures

# 1. paraScores  

* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `paraGeneIds.txt`)  
* fasta - contains for each gene family a fasta file (family numbers and descriptions can be found in the files `paraFamilyIds.txt` and `paragroups.detailed.tsv`)  
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](  
* parafamilies - contains the alignment files from [MUSCLE]( and the conservation scoring from [JalView]( per paralog family   

* paraGeneIds.txt - file with [HGNC]( gene names with `para_scores`
* paraFamilyIds.txt - file with paralog family ids
* paragroups.detailed.tsv - file with detailed description of each paralog family including family-id, number of genes per family and gene names
* parascore.genes.tsv - file with `para_score` statistics per gene
* parascores.families.tsv - file with `para_score` statistics per paralog family

# 2. parasubScores  

* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parasubGeneIds.txt`)  
* fasta - contains for each paralog subfamily a fasta file (subfamily numbers and descriptions can be found in the files `parasubFamilyIds.txt` and `parasubgroups.detailed.tsv`)  
* parasubfamilies - contains the alignment files from [MUSCLE]( and the conservation scoring from [JalView]( per paralog subfamily  

* parasubGeneIds.txt - file with [HGNC]( gene names with `parasub_scores`
* parasubFamilyIds.txt - file with paralog subfamily ids
* parasubgroups.detailed.tsv - file with detailed description of each paralog subfamily including subfamily-id, number of genes per subfamily and gene names
* parasubscore.genes.tsv - file with `parasub_score` statistics per gene
* parasubscores.subfamilies.tsv - file with `parasub_score` statistics per paralog subfamily
* parasubfam.probs.tsv - file with [ExAC]( probabilities for missense, LoF and variants per para subfamily and number of genes per family

# 3. parahomoScores

* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parahomoGeneIds.txt`)   
* fasta - contains for each paralog homology family a fasta file (homology family numbers and descriptions can be found in the files `parahomoFamilyIds.txt` and `parahomogroups.detailed.tsv`)    
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](   
* parahomofamilies - contains the alignment files from [MUSCLE]( and the conservation scoring from [JalView]( per homology family  

* parahomoGeneIds.txt - file with [HGNC]( gene names with `parahomo_scores`
* parahomoFamilyIds.txt - file with homology subfamily ids
* parahomogroups.detailed.tsv - file with detailed description of each homology subfamily including subfamily-id, number of genes per subfamily and gene names
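
Given the layout described above, the per-gene files for any of the three score types can be located from the gene name alone. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the score directories sit under `data`, that each `GeneIds.txt` file lies directly inside its score directory, and that it lists one gene name per line.

```python
from pathlib import Path

# Score type -> top-level directory, as described above.
SCORE_DIRS = {
    "para": "paraScores",
    "parasub": "parasubScores",
    "parahomo": "parahomoScores",
}


def gene_file_paths(gene, score_type, data_dir="data"):
    """Return the paths of <gene>.txt and <gene>.withStats.txt."""
    if score_type not in SCORE_DIRS:
        raise ValueError(f"unknown score type: {score_type!r}")
    genes_dir = Path(data_dir) / SCORE_DIRS[score_type] / "genes"
    return genes_dir / f"{gene}.txt", genes_dir / f"{gene}.withStats.txt"


def read_gene_ids(score_type, data_dir="data"):
    """Read the gene names for which a score type is available.

    ASSUMPTION: the file (e.g. paraGeneIds.txt) is located directly in the
    score directory and contains one gene name per line.
    """
    if score_type not in SCORE_DIRS:
        raise ValueError(f"unknown score type: {score_type!r}")
    path = Path(data_dir) / SCORE_DIRS[score_type] / f"{score_type}GeneIds.txt"
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

For example, `gene_file_paths("KRIT1", "para")` yields `data/paraScores/genes/KRIT1.txt` and `data/paraScores/genes/KRIT1.withStats.txt`.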

# File formats

## Gene files
For each of the three scores (para, parasub, parahomo), the corresponding `GeneIds.txt` file (`paraGeneIds.txt`, `parasubGeneIds.txt` or `parahomoGeneIds.txt`) lists the gene names for which that score is available.  

Per gene, two different files are available in the corresponding score directory:

* `genename`.txt, e.g. KRIT1.txt
score per amino acid  
format: AA position, AA, score
1       M       0
2       G       0
3       N       0
4       P       0
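
A file in this three-column format can be read with a few lines of Python. This is a minimal sketch rather than part of the annotation tool: the function names are hypothetical, columns are assumed to be separated by whitespace (tabs or spaces), and scores are converted to `float` in case non-integer values occur.

```python
def parse_score_lines(lines):
    """Parse 'position  amino_acid  score' lines into a list of tuples.

    ASSUMPTION: columns are whitespace-separated; blank lines are skipped.
    """
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        position, amino_acid, score = fields[:3]
        records.append((int(position), amino_acid, float(score)))
    return records


def read_score_file(path):
    """Read a <genename>.txt file, e.g. KRIT1.txt."""
    with open(path) as handle:
        return parse_score_lines(handle)
```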

* `genename`.withStats.txt, e.g. KRIT1.withStats.txt
score per amino acid plus gene-specific metrics such as mean/median/standard deviation (STD) in the header and z-scores per position  
format: AA position, AA, score, score-median, (score-median)/STD, score-mean, (score-mean)/STD
# LEN=736
# MEAN(SCORE)=3.92
# STD(SCORE)=4.42
# SCORE>0=0.47
# MEAN(GREATER>0)=8.28 
# STD(SCORE>0)=2.26
1       M       0       0       0.00    -3.92   -0.89
2       G       0       0       0.00    -3.92   -0.89
3       N       0       0       0.00    -3.92   -0.89
4       P       0       0       0.00    -3.92   -0.89
5       E       0       0       0.00    -3.92   -0.89
82      A       10      10      2.26    6.08    1.38
83      N       10      10      2.26    6.08    1.38
84      Q       8       8       1.81    4.08    0.92
85      G       11      11      2.49    7.08    1.60
86      I       5       5       1.13    1.08    0.24
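
The header and the per-position columns of a `withStats` file can be separated as in the following sketch. The function names and dictionary keys are hypothetical; the sketch assumes that header lines start with `#` and have the form `KEY=value`, and that data columns are whitespace-separated in the order given above. Because the header values are rounded to two decimals, a recomputed z-score may differ slightly from the stored one.

```python
def parse_with_stats(lines):
    """Split a withStats file into (header dict, list of row dicts)."""
    header, rows = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # e.g. "# MEAN(SCORE)=3.92" -> {"MEAN(SCORE)": 3.92}
            key, sep, value = line.lstrip("#").strip().partition("=")
            if sep:
                header[key.strip()] = float(value)
            continue
        f = line.split()
        rows.append({
            "position": int(f[0]),
            "aa": f[1],
            "score": float(f[2]),
            "score_minus_median": float(f[3]),
            "z_median": float(f[4]),   # (score-median)/STD
            "score_minus_mean": float(f[5]),
            "z_mean": float(f[6]),     # (score-mean)/STD
        })
    return header, rows


def recompute_mean_z(header, row):
    """Recompute (score-mean)/STD from the header statistics."""
    return (row["score"] - header["MEAN(SCORE)"]) / header["STD(SCORE)"]
```

For position 82 in the example above, the recomputed value is (10 - 3.92) / 4.42, which agrees with the stored 1.38 after rounding.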