
# Files to download to this directory:
All files necessary to run the paralog annotation tool locally can be downloaded from the Zenodo repository [Paralog variant classification and scoring](https://zenodo.org/record/802891#.WTacLcmkJTY).

## Annotation files
```
wget https://zenodo.org/record/803358/files/CCDS_IDS.txt  
wget https://zenodo.org/record/803358/files/CCDS2Ensembl2HGNC.txt  
wget https://zenodo.org/record/803358/files/refSeqEnsCCDS.tsv  
```

# Download scores
## Para scores per gene 
```
wget https://zenodo.org/record/803358/files/paraScores.tar.gz  
tar -zxvf paraScores.tar.gz  
```

## Parasub scores per gene
```
wget https://zenodo.org/record/803358/files/parasubScores.tar.gz  
tar -zxvf parasubScores.tar.gz  
```

## Para homology scores per gene
```
wget https://zenodo.org/record/803358/files/parahomoScores.tar.gz  
tar -zxvf parahomoScores.tar.gz   
```

After downloading and decompressing, one directory for each of the three score types will be available in the `data` directory:
```
paraScores              # Directory with gene-specific files with para score
parasubScores           # Directory with gene-specific files with parasub score
parahomoScores          # Directory with gene-specific files with parahomo score
```

# Directory and subdirectory structures

# 1. paraScores  

Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `paraGeneIds.txt`)  
* fasta - contains for each gene family a fasta file (family numbers and descriptions can be found in the files `paraFamilyIds.txt` and `paragroups.detailed.tsv`)  
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](http://weizhongli-lab.org/cd-hit/)  
* parafamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per paralog family   

Files
* paraGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `para_scores`
* paraFamilyIds.txt - file with paralog family ids
* paragroups.detailed.tsv - file with detailed description of each paralog family including family-id, number of genes per family and gene names
* parascore.genes.tsv - file with `para_score` statistics per gene
* parascores.families.tsv - file with `para_score` statistics per paralog family

# 2. parasubScores  

Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parasubGeneIds.txt`)  
* fasta - contains for each paralog subfamily a fasta file (subfamily numbers and descriptions can be found in the files `parasubFamilyIds.txt` and `parasubgroups.detailed.tsv`)  
* parasubfamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per paralog subfamily  

Files
* parasubGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `parasub_scores`
* parasubFamilyIds.txt - file with paralog subfamily ids
* parasubgroups.detailed.tsv - file with detailed description of each paralog subfamily including subfamily-id, number of genes per subfamily and gene names
* parasubscore.genes.tsv - file with `parasub_score` statistics per gene
* parasubscores.subfamilies.tsv - file with `parasub_score` statistics per paralog subfamily
* parasubfam.probs.tsv - file with [ExAC](http://exac.broadinstitute.org) probabilities for missense, LoF and missense.plus.LoF variants per paralog subfamily and number of genes per family

# 3. parahomoScores

Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parahomoGeneIds.txt`)   
* fasta - contains for each paralog homology family a fasta file (homology family numbers and descriptions can be found in the files `parahomoFamilyIds.txt` and `parahomogroups.detailed.tsv`)    
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](http://weizhongli-lab.org/cd-hit/)   
* parahomofamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per homology family  

Files
* parahomoGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `parahomo_scores`
* parahomoFamilyIds.txt - file with homology family ids
* parahomogroups.detailed.tsv - file with detailed description of each homology family including family-id, number of genes per family and gene names


# File formats

## Gene files
For each of the three scores (para, parasub, parahomo), the corresponding gene id file (`paraGeneIds.txt`, `parasubGeneIds.txt` or `parahomoGeneIds.txt`) lists the gene names for which that score is available.  
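Checking whether a score exists for a given gene could then look roughly like the sketch below. It assumes that each gene id file lists one HGNC gene name per line, which is not stated above, and `has_score` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the tool).
# Assumption: the gene id file contains one HGNC gene name per line.
has_score() {
    gene="$1"        # gene name to look up, e.g. KRIT1
    ids_file="$2"    # e.g. paraScores/paraGeneIds.txt
    # -x: match the whole line, -F: treat the name as a fixed string,
    # -q: print nothing, only set the exit status
    grep -qxF -- "$gene" "$ids_file"
}
```

For example, `has_score KRIT1 paraScores/paraGeneIds.txt && echo "para score available"` prints the message only when the gene name appears as a full line in the file.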

Per gene, two different files are available in the corresponding score directory:

1. `genename`.txt, e.g. KRIT1.txt

(para_score per amino acid, given in the format: AA position<tab>AA<tab>para_score)
```
1       M       0
2       G       0
3       N       0
4       P       0
..
```

2. `genename`.withStats.txt, e.g. KRIT1.withStats.txt

(format: AA position<tab>AA<tab>para_score<tab>score-median<tab>(score-median)/STD<tab>score-mean<tab>(score-mean)/STD; lines starting with `#` hold per-gene summary statistics and the column header)
```                 
# GENE: KRIT1
# TOTALSCORE=2883
# LEN=736
# MEAN(SCORE)=3.92
# STD(SCORE)=4.42
# MEDIAN(SCORE)=0
# SCORE>0=0.47
# MEAN(GREATER>0)=8.28 
# STD(SCORE>0)=2.26
# MEDIAN(SCORE>0)=8
# MAXSCORES=0.14
#POS    AA      SCORE   SCORE-MEDIAN    (SCORE-MEDIAN)/STD      SCORE-MEAN      (SCORE-MEAN)/STD
1       M       0       0       0.00    -3.92   -0.89
2       G       0       0       0.00    -3.92   -0.89
3       N       0       0       0.00    -3.92   -0.89
4       P       0       0       0.00    -3.92   -0.89
5       E       0       0       0.00    -3.92   -0.89
..
82      A       10      10      2.26    6.08    1.38
83      N       10      10      2.26    6.08    1.38
84      Q       8       8       1.81    4.08    0.92
85      G       11      11      2.49    7.08    1.60
86      I       5       5       1.13    1.08    0.24
..
```
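Both file types can be read with standard tools. The following is a minimal sketch using `awk`, assuming whitespace-separated columns in the order described above and `#` as the prefix of all header lines. The function names `score_at` and `high_scoring` are hypothetical and not part of the tool.

```shell
# Hypothetical helpers (not part of the tool).

# Print the amino acid and score at one position.
# Works on both `genename`.txt and `genename`.withStats.txt,
# because header lines starting with '#' are skipped.
score_at() {
    file="$1"
    pos="$2"
    awk -v pos="$pos" '!/^#/ && $1 == pos { print $2, $3; exit }' "$file"
}

# List positions of a `genename`.withStats.txt file whose
# (SCORE-MEAN)/STD value (column 7) is at least a given threshold.
# Output columns: position, amino acid, score, (SCORE-MEAN)/STD.
high_scoring() {
    file="$1"
    min="$2"
    awk -v min="$min" '!/^#/ && NF >= 7 && $7 >= min { print $1, $2, $3, $7 }' "$file"
}
```

With the KRIT1 example above, `score_at KRIT1.withStats.txt 84` would print `Q 8`, and `high_scoring KRIT1.withStats.txt 1.0` would include positions 82, 83 and 85 among its output while leaving out 84 and 86.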