# Files to download to this directory:
All files necessary to run the paralog annotation tool locally can be downloaded from the Zenodo repository [Paralog variant classification and scoring](https://zenodo.org/record/802891#.WTacLcmkJTY).

## Annotation files
```
wget https://zenodo.org/record/803358/files/CCDS_IDS.txt  
wget https://zenodo.org/record/803358/files/CCDS2Ensembl2HGNC.txt  
wget https://zenodo.org/record/803358/files/refSeqEnsCCDS.tsv  
```

# Download scores
## Para scores per gene 
```
wget https://zenodo.org/record/803358/files/paraScores.tar.gz  
tar -zxvf paraScores.tar.gz  
```

## Parasub scores per gene
```
wget https://zenodo.org/record/803358/files/parasubScores.tar.gz  
tar -zxvf parasubScores.tar.gz  
```

## Para homology scores per gene
```
wget https://zenodo.org/record/803358/files/parahomoScores.tar.gz  
tar -zxvf parahomoScores.tar.gz   
```

After downloading and decompressing, one directory for each of the three score types will be available in the `data` directory:
```
paraScores              # Directory with gene-specific files with para score
parasubScores           # Directory with gene-specific files with parasub score
parahomoScores          # Directory with gene-specific files with parahomo score
```
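To confirm that all three archives were extracted, a small check like the following can help. This is a minimal sketch: the helper name `missing_score_dirs` is hypothetical, and the directory names are taken from the listing above.

```python
from pathlib import Path

# Directory names as given in the listing above.
SCORE_DIRS = ("paraScores", "parasubScores", "parahomoScores")


def missing_score_dirs(data_dir):
    """Return the expected score directories that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in SCORE_DIRS if not (root / name).is_dir()]
```

Calling `missing_score_dirs("data")` returns an empty list once all three directories are in place.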

# Directory and subdirectory structures

# 1. paraScores  

Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `paraGeneIds.txt`)  
* fasta - contains for each gene family a fasta file (family numbers and descriptions can be found in the files `paraFamilyIds.txt` and `paragroups.detailed.tsv`)  
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](http://weizhongli-lab.org/cd-hit/)  
* parafamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per paralog family   

Files
* paraGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `para_scores`
* paraFamilyIds.txt - file with paralog family ids
* paragroups.detailed.tsv - file with detailed description of each paralog family including family-id, number of genes per family and gene names
* parascore.genes.tsv - file with `para_score` statistics per gene
* parascores.families.tsv - file with `para_score` statistics per paralog family

# 2. parasubScores  

Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parasubGeneIds.txt`)  
* fasta - contains for each paralog subfamily a fasta file (subfamily numbers and descriptions can be found in the files `parasubFamilyIds.txt` and `parasubgroups.detailed.tsv`)  
* parasubfamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per paralog subfamily  

Files
* parasubGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `parasub_scores`
* parasubFamilyIds.txt - file with paralog subfamily ids
* parasubgroups.detailed.tsv - file with detailed description of each paralog subfamily including subfamily-id, number of genes per subfamily and gene names
* parasubscore.genes.tsv - file with `parasub_score` statistics per gene
* parasubscores.subfamilies.tsv - file with `parasub_score` statistics per paralog subfamily
* parasubfam.probs.tsv - file with [ExAC](http://exac.broadinstitute.org) probabilities for missense, LoF and missense plus LoF variants per paralog subfamily, and the number of genes per subfamily

# 3. parahomoScores

Subdirectories
* genes - contains for each gene a `genename`.txt and `genename`.withStats.txt file (genenames can be found in `parahomoGeneIds.txt`)   
* fasta - contains for each paralog homology family a fasta file (homology family numbers and descriptions can be found in the files `parahomoFamilyIds.txt` and `parahomogroups.detailed.tsv`)    
* cluster - contains for each gene family the cluster file (`.clstr`) generated with [cd-hit](http://weizhongli-lab.org/cd-hit/)   
* parahomofamilies - contains the alignment files from [MUSCLE](http://www.drive5.com/muscle/) and the conservation scoring from [JalView](http://www.jalview.org/) per homology family  

Files
* parahomoGeneIds.txt - file with [HGNC](http://www.genenames.org/) gene names with `parahomo_scores`
* parahomoFamilyIds.txt - file with homology family ids
* parahomogroups.detailed.tsv - file with detailed description of each homology family including family-id, number of genes per family and gene names


# File formats

## Gene files
For each of the three scores (para, parasub, parahomo), the corresponding gene ID file (`paraGeneIds.txt`, `parasubGeneIds.txt` or `parahomoGeneIds.txt`) lists the gene names for which that score is available.  
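Checking whether a score exists for a gene, and locating its per-gene files, could look roughly like this. The sketch assumes that each gene ID file holds one gene name per line, which is not stated here; both function names are hypothetical.

```python
from pathlib import Path


def genes_with_score(gene_ids_file):
    """Read a gene ID file into a set of gene names.

    Assumption: one HGNC gene name per line.
    """
    text = Path(gene_ids_file).read_text()
    return {line.strip() for line in text.splitlines() if line.strip()}


def gene_score_files(score_dir, gene):
    """Build the paths of the two per-gene files inside <score_dir>/genes."""
    genes_dir = Path(score_dir) / "genes"
    return genes_dir / f"{gene}.txt", genes_dir / f"{gene}.withStats.txt"
```

For example, `gene_score_files("paraScores", "KRIT1")` yields the paths `paraScores/genes/KRIT1.txt` and `paraScores/genes/KRIT1.withStats.txt`.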

Per gene, two different files are available in the corresponding score directory:

* `genename`.txt, e.g. KRIT1.txt 

Score per amino acid.  
Format: amino acid position, amino acid, score
 ```
1       M       0
2       G       0
3       N       0
4       P       0
..
```
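A per-gene score file in this layout can be read with a few lines of Python. This is a sketch that assumes whitespace-separated columns, as in the excerpt above; the function name `parse_gene_scores` is hypothetical.

```python
def parse_gene_scores(lines):
    """Parse lines of a `genename`.txt file into (position, amino_acid, score) tuples.

    Assumption: the three columns are separated by whitespace.
    """
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and the ".." used above to elide rows
        position, amino_acid, score = fields
        records.append((int(position), amino_acid, float(score)))
    return records
```

With a real file this would be called as `parse_gene_scores(open("paraScores/genes/KRIT1.txt"))`, where the path follows the directory layout described earlier.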

* `genename`.withStats.txt, e.g. KRIT1.withStats.txt 

Score per amino acid plus gene-specific metrics such as mean, median and standard deviation (STD) in the header, and z-scores per position.  
Format: amino acid position, amino acid, score, score minus median, (score - median)/STD, score minus mean, (score - mean)/STD
```                 
# GENE: KRIT1
# TOTALSCORE=2883
# LEN=736
# MEAN(SCORE)=3.92
# STD(SCORE)=4.42
# MEDIAN(SCORE)=0
# SCORE>0=0.47
# MEAN(GREATER>0)=8.28 
# STD(SCORE>0)=2.26
# MEDIAN(SCORE>0)=8
# MAXSCORES=0.14
#POS    AA      SCORE   SCORE-MEDIAN    (SCORE-MEDIAN)/STD      SCORE-MEAN      (SCORE-MEAN)/STD
1       M       0       0       0.00    -3.92   -0.89
2       G       0       0       0.00    -3.92   -0.89
3       N       0       0       0.00    -3.92   -0.89
4       P       0       0       0.00    -3.92   -0.89
5       E       0       0       0.00    -3.92   -0.89
..
82      A       10      10      2.26    6.08    1.38
83      N       10      10      2.26    6.08    1.38
84      Q       8       8       1.81    4.08    0.92
85      G       11      11      2.49    7.08    1.60
86      I       5       5       1.13    1.08    0.24
..
```
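The header and the derived columns can be read back and cross-checked as sketched below. The parsing rules (header lines starting with `#`, `KEY=VALUE` pairs, whitespace-separated data columns) and the rounding to two decimals are inferred from the KRIT1 excerpt above rather than from a format specification; the function names are hypothetical.

```python
def parse_with_stats(lines):
    """Parse a `genename`.withStats.txt file into (header dict, list of row tuples)."""
    header, rows = {}, []
    for raw in lines:
        line = raw.strip()
        if not line or line == "..":
            continue  # blank lines and the ".." used above to elide rows
        if line.startswith("#"):
            body = line[1:].strip()
            if "=" in body:
                # Keys such as "SCORE>0" contain no "=", so split at the last one.
                key, _, value = body.rpartition("=")
                header[key.strip()] = float(value)
            elif ":" in body:
                key, _, value = body.partition(":")
                header[key.strip()] = value.strip()
            continue  # this also skips the "#POS ..." column-name line
        fields = line.split()
        rows.append((int(fields[0]), fields[1], *map(float, fields[2:7])))
    return header, rows


def z_score(score, center, std):
    """(score - center) / std, rounded to two decimals as in the excerpt."""
    return round((score - center) / std, 2)


sample = """\
# GENE: KRIT1
# MEAN(SCORE)=3.92
# STD(SCORE)=4.42
# MEDIAN(SCORE)=0
#POS    AA      SCORE   SCORE-MEDIAN    (SCORE-MEDIAN)/STD      SCORE-MEAN      (SCORE-MEAN)/STD
82      A       10      10      2.26    6.08    1.38
""".splitlines()

header, rows = parse_with_stats(sample)
position, amino_acid, score, *_ = rows[0]
print(z_score(score, header["MEDIAN(SCORE)"], header["STD(SCORE)"]))  # 2.26
print(z_score(score, header["MEAN(SCORE)"], header["STD(SCORE)"]))    # 1.38
```

Both printed values match columns 5 and 7 of position 82 in the excerpt, which suggests that `STD(SCORE)` is the divisor for both z-scores.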

## Para statistics files

# Para score statistics

* parascore.genes.tsv (with header, 9 columns)  
        1. gene name  
        2. total para score over all positions  
        3. gene length  
        4. mean para score  
        5. std para score  
        6. median para score  
        7. percentage para score greater than 0 (minimum)  
        8. percentage para score equal 11 (maximum)  
        9. percentage para_zscore greater than zero (= conserved)
```
GENE    TOTALPARASCORE  LENGTH  MEAN    STD     MEDIAN  PARASCORE_GREATER_0     PARASCORE_EQUAL_11      PARASCORE_ZSCORE_GREATER0
A1BG    669     495     1.32    2.49    0       0.34    0.02    0.27
A1CF    2410    602     3.93    3.83    3       0.66    0.09    0.42
```
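As an illustration, the per-gene statistics can be loaded and ranked by mean para score. The sketch assumes tab-separated columns, as the `.tsv` extension suggests; the function name `read_gene_stats` is hypothetical.

```python
import csv
import io


def read_gene_stats(handle):
    """Read a per-gene statistics table into a list of dicts.

    Assumption: tab-separated columns with a header row whose first column is GENE.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        gene = row.pop("GENE")
        rows.append({"GENE": gene, **{key: float(value) for key, value in row.items()}})
    return rows


# The two example rows shown above, rebuilt with explicit tabs.
sample = io.StringIO(
    "GENE\tTOTALPARASCORE\tLENGTH\tMEAN\tSTD\tMEDIAN\t"
    "PARASCORE_GREATER_0\tPARASCORE_EQUAL_11\tPARASCORE_ZSCORE_GREATER0\n"
    "A1BG\t669\t495\t1.32\t2.49\t0\t0.34\t0.02\t0.27\n"
    "A1CF\t2410\t602\t3.93\t3.83\t3\t0.66\t0.09\t0.42\n"
)
ranked = sorted(read_gene_stats(sample), key=lambda row: row["MEAN"], reverse=True)
print([row["GENE"] for row in ranked])  # ['A1CF', 'A1BG']
```

Because the function only relies on the `GENE` column name, it should also work for `parasubscore.genes.tsv`, which uses the same layout.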

* parascores.families.tsv (with header, 10 columns)  
        1. family id as number  
        2. gene names per family as csv list  
        3. sum of all para scores over all genes within the family  
        4. sum of lengths of all genes within a family  
        5. mean para score over all positions over all genes within the family  
        6. standard deviation para score over all positions over all genes within the family  
        7. median para score over all positions over all genes within the family  
        8. percentage of positions with para score greater than 0 (minimum)  
        9. percentage of positions with para score equal 11 (maximum)  
        10. percentage of positions with para_zscore per family greater than 0    
        
```
FAMILY  GENES   TOTALPARASCORE  TOTALLENGTH     MEAN    STD     MEDIAN  PARASCORE_GREATER_0     PARASCORE_11    PARASCORE_FAMILY_ZSCORE_GREATER0
2       SNX6,SNX5,SNX32 11199   1225    8.88    2.92    11      0.98    0.52    0.60
3       KLHDC2,HCFC1,KLHDC1,RABEPK,LZTR1,HCFC2,KLHDC3,KLHDC10   11442   5675    1.98    2.89    0       0.43    0.03    0.31
```
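The family table also makes it easy to look up which paralog family a gene belongs to. A minimal sketch, assuming tab-separated columns and a comma-separated `GENES` field as in the rows above; the function name `gene_to_family` is hypothetical.

```python
import csv
import io


def gene_to_family(handle):
    """Map each gene name to its family id (kept as a string).

    Assumptions: tab-separated columns; GENES is a comma-separated list.
    """
    lookup = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        for gene in row["GENES"].split(","):
            lookup[gene.strip()] = row["FAMILY"]
    return lookup


# First example row from above, rebuilt with explicit tabs.
sample = io.StringIO(
    "FAMILY\tGENES\tTOTALPARASCORE\tTOTALLENGTH\tMEAN\tSTD\tMEDIAN\t"
    "PARASCORE_GREATER_0\tPARASCORE_11\tPARASCORE_FAMILY_ZSCORE_GREATER0\n"
    "2\tSNX6,SNX5,SNX32\t11199\t1225\t8.88\t2.92\t11\t0.98\t0.52\t0.60\n"
)
print(gene_to_family(sample)["SNX5"])  # 2
```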

# Parasub score statistics

* parasubscore.genes.tsv (with header, 9 columns)  
        1. gene name
        2. total parasub score over all positions  
        3. gene length  
        4. mean parasub score  
        5. std parasub score  
        6. median parasub score  
        7. percentage parasub score greater than 0 (minimum)  
        8. percentage parasub score equal 11 (maximum)  
        9. percentage parasub_zscore greater than zero (= para conserved)

```
GENE    TOTALPARASUBSCORE       LENGTH  MEAN    STD     MEDIAN  PARASUBSCORE_GREATER_0  PARASUBSCORE_EQUAL_11   PARASUBSCORE_ZSCORE_GREATER0
PNCK    3036    426     6.93    3.70    8       0.88    0.29    0.52
CLMP    1279    373     3.32    3.23    3       0.75    0.06    0.40
```

* parasubscores.subfamilies.tsv (with header, 10 columns)   
        1. subfamily id given as family id dot para dot cluster id, e.g. 2.para.1   
        2. gene names per subfamily as csv list  
        3. sum of all parasub scores over all genes within the subfamily  
        4. sum of lengths of all genes within a subfamily  
        5. mean parasub score over all positions over all genes within the subfamily  
        6. standard deviation parasub score over all positions over all genes within the subfamily  
        7. median parasub score over all positions over all genes within the subfamily  
        8. percentage of positions with parasub score greater than 0 (minimum) per subfamily  
        9. percentage of positions with parasub score equal 11 (maximum) per subfamily  
        10. percentage of positions with parasub_zscore per subfamily greater than 0 

```
SUBFAMILY       GENES   TOTALSUBPARASCORE       TOTALLENGTH     MEAN    STD     MEDIAN  PARASUBSCORE_GREATER_0  PARASUBSCORE_11 PARASUBSCORE_SUBFAMILY_ZSCORE_GREATER0
2.para.1        SNX6,SNX32,SNX5 11199   1225    8.88    2.92    11      0.98    0.52    0.60
14.para.1       USP44,USP49     13016   1400    9.14    2.84    11      0.97    0.60    0.61
14.para.2       USP45,USP16     13724   1637    8.26    2.96    9       0.96    0.39    0.56
```
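Since a subfamily id combines the family id and the cluster id, it can be split back into its parts. A small sketch based on the `family.para.cluster` pattern described above; the function name `parse_subfamily_id` is hypothetical.

```python
def parse_subfamily_id(subfamily_id):
    """Split an id such as '2.para.1' into (family_id, cluster_id)."""
    family, label, cluster = subfamily_id.split(".")
    if label != "para":
        raise ValueError(f"unexpected subfamily id: {subfamily_id!r}")
    return int(family), int(cluster)


print(parse_subfamily_id("14.para.2"))  # (14, 2)
```

The returned family id can then be matched against the `FAMILY` column of `parascores.families.tsv`.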

* parasubfam.probs.tsv (with header, 7 columns)   
        1. subfamily name  
        2. number of genes per subfamily  
        3. list of gene names  
        4. missing genes  
        5. probability for missense variants   
        6. probability for loss-of-function (lof) variants  
        7. probability for missense plus lof variants  

```
subfamname      number_of_genes genes   tmissing_genes  p_mis   p_lof   p_mislof
1000.para.1     2       TARBP2,PRKRA            1.911761986731e-05      4.21231818355e-06       2.332993805086e-05
1000.para.2     2       STAU2,STAU1             2.831356316e-05 4.7126497699e-06        3.30262129299e-05
```
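The probability table can be read by column position, which avoids depending on the exact header spelling. The sketch below assumes tab-separated columns, an empty `missing genes` field when no gene is missing, and comma-separated gene lists; the function name `read_subfamily_probs` is hypothetical. In the two example rows, the last column equals the sum of the missense and lof probabilities; whether this holds for every row is an assumption.

```python
import csv
import io
import math


def read_subfamily_probs(handle):
    """Read parasubfam.probs.tsv by column position (assumes tab-separated columns)."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    records = []
    for fields in reader:
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        name, n_genes, genes, missing, p_mis, p_lof, p_mislof = fields
        records.append({
            "subfamily": name,
            "number_of_genes": int(n_genes),
            "genes": genes.split(","),
            "missing_genes": missing.split(",") if missing else [],
            "p_mis": float(p_mis),
            "p_lof": float(p_lof),
            "p_mislof": float(p_mislof),
        })
    return records


# First example row from above, rebuilt with explicit tabs.
sample = io.StringIO(
    "subfamname\tnumber_of_genes\tgenes\tmissing_genes\tp_mis\tp_lof\tp_mislof\n"
    "1000.para.1\t2\tTARBP2,PRKRA\t\t1.911761986731e-05\t4.21231818355e-06\t2.332993805086e-05\n"
)
record = read_subfamily_probs(sample)[0]
print(math.isclose(record["p_mis"] + record["p_lof"], record["p_mislof"], rel_tol=1e-9))  # True
```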