Commit a7adcd24 authored by David Hoksza's avatar David Hoksza
Browse files

documentation

parent 17b4348a
# Biohackathon 2019: Rare disease maps # Biohackathon 2019: Rare disease maps
Code and comments to [https://r3.pages.uni.lu/biocuration/resources/biohackathon2019/rdmaps/](https://r3.pages.uni.lu/biocuration/resources/biohackathon2019/rdmaps/). Documentation and code to [https://r3.pages.uni.lu/biocuration/resources/biohackathon2019/rdmaps/](https://r3.pages.uni.lu/biocuration/resources/biohackathon2019/rdmaps/).
#### Requirements ### Dependencies
- Java Runtime
- Python 3.x - Python 3.x
- R - R
- zip utility - Java 8
(- Maven 3.6)
##### Python
Python needs to have packages decribed in `associations/requirement.txt` which can be installed Python needs to have packages decribed in `associations/requirement.txt` which can be installed
via via
```commandline ```commandline
pip3 install -r associations/requirement.txt pip3 install -r dependencies/python_requirement.txt
``` ```
##### Java
Maven needs to be installed if you want to compile the `map_generator` from the code. This might not be
necesarry, since we provide a compiled version with the code as well. By default the compilation is not
carried out. This can be changed by setting the value `BUILD_MAP_GENERATOR` in `parameters.sh` to 1.
The maven can be installed on Ubuntu via apt (`sudo apt install maven`).
##### R
The R scripts in the pipeline should be able to install their dependencies on the first run. However,
it might happen that elevated privileges will be needed. If that is the case and the pipeline failes
on the R scripts, you can install the dependencies by running:
```bash
sudo Rscript --vanilla dependencies/dependencies.R
```
##### System
If the pipeline is run at clean Linux installation you might need to install the following libraries If the pipeline is run at clean Linux installation you might need to install the following libraries
(`sudo apt-get install` on Ubuntu) prior to running the code: (`sudo apt-get install` on Ubuntu) prior to running the code:
- libcurl - libcurl
- libssl-dev - libssl-dev
- libxml2-dev - libxml2-dev
### Execution ## Pipeline execution
To execute the pipeline, set the values of parameters in the `PARAMETERS TO SET` To execute the pipeline, set the values of parameters in the `parameters.sh`, mainly
section of the `assamble.sh` script file. The main parameter here is the list of the list of [Orphanet](https://www.orpha.net/consor/cgi-bin/index.php?lng=EN) disease numbers.
[Orphanet](https://www.orpha.net/consor/cgi-bin/index.php?lng=EN) disease numbers.
When the parameters are set, run the pipeline: When the parameters are set, run the pipeline:
``` ```
bash assemble.sh bash assemble.sh
``` ```
You can also provide additional parameter, being the parameters file if it differs from `parameters.sh`.
This will run the pipeline described in the next section and eventually output a This will run the pipeline described in the next section and eventually output a
ZIP file with a map which can then be imported in [MINERVA](https://minerva.pages.uni.lu/doc/) ZIP file with a map which can then be imported in [MINERVA](https://minerva.pages.uni.lu/doc/)
as a disease maps integrating all the found enriched pathways together with as a disease maps integrating all the found enriched pathways together with
genetic an variant [overlays](https://minerva.pages.uni.lu/doc/user_manual/v14.0/index/#overlays-tab). genetic an variant [overlays](https://minerva.pages.uni.lu/doc/user_manual/v14.0/index/#overlays-tab).
### Pipeline ## Pipeline description
The pipeline consists of a bunch of tools which can be divided into three broad categories:
1. [Retrieval of gene-disease mappging and variants](associations/README.md). 1. [Retrieval of gene-disease mappging and variants](associations/README.md).
2. [Enrichment](enrichment/README.md) 2. [Enrichment](enrichment/README.md)
\ No newline at end of file 2. [Map assembly](map_generator/README.md)
The tools are glued together by the `assembly.sh` script resulting in the following pipeline:
1. Obtain gene-disease and variant-disease mapping from DisGeNET.
2. Obtain gene-disease and variant-disease mapping from OpenTargets.
3. Obtain possibly pathogenic ClinVar variants and genes pertinent to given disease.
4. Compile list of of genes associated with disease from all the input sources.
5. Extend the list of genes by going to other resources such as OmniPath or text mining.
6. Compile list of of variants associated with disease from all the input sources.
7. Filter out variants with high allele frequency using Ensemble.
8. Obtain variant information (position, protein-level mapping) and store it for MINERVA genetic variant overlay.
9. From resources such as existing disease maps or WikiPathways obtain enriched pathways
with respect to the disease-associated genes obtained from previous step.
10. Compile the obtained pathways into a single disease map.
11. Bundle the disease map with genetic and variant overlays into a single archive to be then uploaded to [MINERVA](https://minerva.pages.uni.lu/).
## License
GNU Affero General Public License v3.0
\ No newline at end of file
...@@ -89,14 +89,11 @@ check_exit_code ...@@ -89,14 +89,11 @@ check_exit_code
cp ${EXTEND_DIR}/output.txt ${text_mining_out_path} cp ${EXTEND_DIR}/output.txt ${text_mining_out_path}
log "Genes list extended" log "Genes list extended"
minerva_genes_out_path=${RES_DIR}/04-minerva-genes.txt minerva_genes_out_path=${RES_DIR}/04-minerva-genes.txt
$PYTHON_BIN $ASSOCIATIONS_DIR/minerva_genes.py -f ${genes_out_path} > ${minerva_genes_out_path} $PYTHON_BIN $ASSOCIATIONS_DIR/minerva_genes.py -f ${genes_out_path} > ${minerva_genes_out_path}
check_exit_code check_exit_code
log "Genes stored in ${minerva_genes_out_path}" log "Genes stored in ${minerva_genes_out_path}"
var_line=`cat ${genes_variants_out_path} | grep "variants in total"` var_line=`cat ${genes_variants_out_path} | grep "variants in total"`
variants_out_path=${genes_variants_out_path/02-genes_variants/03-variants} variants_out_path=${genes_variants_out_path/02-genes_variants/03-variants}
#echo ${var_line#*:} | sed 's/\,/\n/g' > ${variants_out_path} #echo ${var_line#*:} | sed 's/\,/\n/g' > ${variants_out_path}
......
...@@ -9,6 +9,7 @@ variants. ...@@ -9,6 +9,7 @@ variants.
2. Extend the genes list by looking into additional resources such as OmniPath. 2. Extend the genes list by looking into additional resources such as OmniPath.
3. Use Ensembl to assesses allele frequencies of the identified variants to 3. Use Ensembl to assesses allele frequencies of the identified variants to
filter out possibly non-rare variants. filter out possibly non-rare variants.
4. Get detailed variant information and compile it into [variant file](https://minerva.pages.uni.lu/doc/user_manual/v14.0/index/#genetic-variant-format) for MINERVA.
## 1. Gene-disease associations and variants ## 1. Gene-disease associations and variants
...@@ -90,7 +91,7 @@ The OrphaHPO_clinvar_variants_summary is a file obtained from ClinVar FTP on 12t ...@@ -90,7 +91,7 @@ The OrphaHPO_clinvar_variants_summary is a file obtained from ClinVar FTP on 12t
It was created by filtering ClinVar variants to keep only those having Orphanet identifier, It was created by filtering ClinVar variants to keep only those having Orphanet identifier,
and reducing its contents to GRCh37 (hg19) variants. and reducing its contents to GRCh37 (hg19) variants.
The script gets genes and variants for given disease and carries out pairwise comparison The script gets genes and variants for given disease, excludes non-pathogenic variants and carries out pairwise comparison
of genes and variants for ClinVar and all other input resources which are provided as JSON files (see above). The of genes and variants for ClinVar and all other input resources which are provided as JSON files (see above). The
output is a report on gene and variation level, including list of all genes and variants pertinent to given output is a report on gene and variation level, including list of all genes and variants pertinent to given
disease. disease.
...@@ -184,6 +185,36 @@ It takes as *input* a file with a single column of rs# and give as *output* in t ...@@ -184,6 +185,36 @@ It takes as *input* a file with a single column of rs# and give as *output* in t
In the output there are the variants from the input that are either not described in the genomic databses contained in Ensembl or for which there is at least one population with allele frequency >= the threshold indicated in the command line. As an example, using a threshold of 0.10 will filter out form the input file all variants that have minor allele frequency eqaul to or greater than 10\% in at least one population among those described in Ensembl. Also variant not present in Ensembl are filtered out under the assumption that they have never been described so far. In the output there are the variants from the input that are either not described in the genomic databses contained in Ensembl or for which there is at least one population with allele frequency >= the threshold indicated in the command line. As an example, using a threshold of 0.10 will filter out form the input file all variants that have minor allele frequency eqaul to or greater than 10\% in at least one population among those described in Ensembl. Also variant not present in Ensembl are filtered out under the assumption that they have never been described so far.
###### Comment on efficiency ###### Comment on efficiency
## 3. Obtaining variant information
The script `minerva_variants.py` connects to [http://myvariant.info/](MyVariant.info) API and for each of the provided
variant identifiers (provided as a list of in the [dbSNP Reference numbers](dbSNP Reference SNP) ) gets genetic location
and also protein-level mapping. The protein-level mapping in MyVariant.info comes from the [dbNSFP database](https://sites.google.com/site/jpopgen/dbNSFP).
Usage:
```bash
python3 minerva_variants.py -f variant_file_with_line_delimited_dbSNPRN
```
The resulting output is formatted as a MINERVA's [genetic variant overlay file](https://minerva.pages.uni.lu/doc/user_manual/v14.0/index/#genetic-variant-format). An example follows:
```text
#NAME=DISEASE_ASSOCIATED_VARIANTS
#TYPE=GENETIC_VARIANT
#GENOME_TYPE=UCSC
#GENOME_VERSION=hg19
position original_dna alternative_dna gene_name description color contig allele_frequency amino_acid_change identifier_uniprot
38592933 G A SCN5A snv #ff0000 3 0.8 SCN5A:XXX:g.38592933G>A:p.R1626C H9KVD2
38592729 C T SCN5A snv #ff0000 3 0.8 SCN5A:XXX:g.38592729C>T:p.G1694S H9KVD2
38627199 C T SCN5A snv #ff0000 3 0.8 SCN5A:XXX:g.38627199C>T:p.V924I H9KVD2
38627199 C A SCN5A snv #ff0000 3 0.8 SCN5A:XXX:g.38627199C>A:p.V924F H9KVD2
108867945 T G KCNE5 snv #ff0000 X 0.8 KCNE5:XXX:g.108867945T>G:p.E102A Q9UJ90
81667462 A T CACNA2D1 snv #ff0000 7 0.8 CACNA2D1:XXX:g.81667462A>T:p.N323K P54289-2
...
```
required.packages <- c("jsonlite", "httr", "enrichR", "BiocManager", "RCurl")
new.packages <- required.packages[!(required.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos='http://cran.us.r-project.org')
if ("biomaRt" %in% installed.packages()[,1] == FALSE){
BiocManager::install("biomaRt", update=TRUE, ask=FALSE)
}
# Map generator # Map generator
This code assemble resources into a single map. This code assemble resources into a single map. And consists of:
## Description 1. Assembling the input pathways into a SBML-formatted map file.
2. Postprocessing of the map file.
The code is based on the [minerva source code](https://git-r3lab.uni.lu/minerva/core/tree/master). It downloads list of pathways provided as an input (from wikipathways or minerva instances) and merges them into single file. Additionally, data from text mining can be taken as an input and this information is incorporated into the whole map. Exported file is SBML file with layout, render and multi packages. ## 1. Assembling the input pathways
## License #### Description
GNU Affero General Public License v3.0 The code is based on the [minerva source code](https://git-r3lab.uni.lu/minerva/core/tree/master). It downloads list
of pathways provided as an input (from wikipathways or minerva instances) and merges them into single file.
Additionally, data from text mining can be taken as an input and this information is incorporated into the whole map.
Exported file is SBML file with layout, render and multi packages.
## Requirements
* Java 8 #### Compilation
* Maven 3.6
## Compilation
``` ```
mvn -DskipTests=true clean install -pl biohackathon -am mvn -DskipTests=true clean install -pl biohackathon -am
...@@ -23,7 +23,7 @@ mvn -DskipTests=true clean install -pl biohackathon -am ...@@ -23,7 +23,7 @@ mvn -DskipTests=true clean install -pl biohackathon -am
The compiled runnable file will be located in biohackathon/target/biohackathon-1.0-jar-with-dependencies.jar. The compiled runnable file will be located in biohackathon/target/biohackathon-1.0-jar-with-dependencies.jar.
## Execution #### Execution
To get information about parameters just run: To get information about parameters just run:
...@@ -36,3 +36,34 @@ Sample usage: ...@@ -36,3 +36,34 @@ Sample usage:
``` ```
java -jar biohackathon/target/biohackathon-1.0-jar-with-dependencies.jar --enrichr-file biohackathon/data/enrichr_output.txt --minerva-file biohackathon/data/example_pathway_to_pull.txt --text-mining-file biohackathon/data/brugada_output_file_omnipath.tsv --output-file output.xml java -jar biohackathon/target/biohackathon-1.0-jar-with-dependencies.jar --enrichr-file biohackathon/data/enrichr_output.txt --minerva-file biohackathon/data/example_pathway_to_pull.txt --text-mining-file biohackathon/data/brugada_output_file_omnipath.tsv --output-file output.xml
``` ```
## 2. Map file post-processing
The resulting file needs to be modified in order to:
- Attach UniProt identifiers to the species.
- Trim values whose length exceed 255 characters.
### UniProt identifiers
Attaching UniProt identifiers is necessary in order for the proteins to be UniProt annotated. The reason is that the genetic
variant overlay (generated in the previous step of the pipeline) is mapped on the proteins based on both gene name
and UniProt annotation. So if a variant maps on a protein only based on gene level but not on UniProt level,
it won't show in MINERVA.
Usage:
```bash
python3 utils/implant_annotations.py -m map_file -v minerva_variants_file
```
### Trimming long strings
Since the size of some of the strings in MINERVA (such as names or RDF resources) is limited by 255 characters,
longer string in the input map file need to be trimmed to meet the criteria. This is, for example, the case with
some RDF values (such as list of pubmed ids) coming from WikiPathways. In case of RDF resources which contain
multiple values for one identifier (e.g. list of pubmeds in places where there should be only one), the scripts
retains just the first identifier. In the rest of the cases, the string is simply trimmed to 255 characters (this
behaviour is not tested and can lead to invalid values).
```bash
python3 utils/implant_annotations.py -m map_file
```
\ No newline at end of file
...@@ -21,7 +21,7 @@ def trim(path: str) -> str: ...@@ -21,7 +21,7 @@ def trim(path: str) -> str:
# We want to keep the RDF resource still a valid resource otherwise MINERVA will fail during import # We want to keep the RDF resource still a valid resource otherwise MINERVA will fail during import
atts[k] = atts[k].split(";")[0] atts[k] = atts[k].split(";")[0]
else: else:
atts[k] = "" atts[k] = atts[k][:255]
return ET.tostring(root, encoding='utf8').decode('utf8') return ET.tostring(root, encoding='utf8').decode('utf8')
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment