Commit 2bdfba95 authored by David Hoksza's avatar David Hoksza

Merge branch 'master' into vep

parents 9fa8bf5e a7adcd24
# Biohackathon 2019: Rare disease maps
Documentation and code to [https://r3.pages.uni.lu/biocuration/resources/biohackathon2019/rdmaps/](https://r3.pages.uni.lu/biocuration/resources/biohackathon2019/rdmaps/).
### Dependencies
- Python 3.x
- R
- zip utility
- Java 8
- Maven 3.6 (optional; only needed to build `map_generator` from source)
##### Python
Python needs the packages described in `dependencies/python_requirement.txt`, which can be installed
via
```commandline
pip3 install -r dependencies/python_requirement.txt
```
##### Java
Maven needs to be installed if you want to compile the `map_generator` from the code. This might not be
necessary, since we provide a compiled version with the code as well. By default the compilation is not
carried out; this can be changed by setting the value of `BUILD_MAP_GENERATOR` in `parameters.sh` to 1.
Maven can be installed on Ubuntu via apt (`sudo apt install maven`).
##### R
The R scripts in the pipeline should be able to install their dependencies on the first run. However,
elevated privileges might be needed. If that is the case and the pipeline fails
on the R scripts, you can install the dependencies by running:
```bash
sudo Rscript --vanilla dependencies/dependencies.R
```
##### System
If the pipeline is run on a clean Linux installation, you might need to install the following libraries
(`sudo apt-get install` on Ubuntu) prior to running the code:
- libcurl
- libssl-dev
- libxml2-dev
## Pipeline execution
To execute the pipeline, set the parameters in `parameters.sh`, mainly
the list of [Orphanet](https://www.orpha.net/consor/cgi-bin/index.php?lng=EN) disease numbers.
When the parameters are set, run the pipeline:
```bash
bash assemble.sh
```
You can also pass an optional argument: the path to a parameters file, if it differs from `parameters.sh` (e.g. `bash assemble.sh my_parameters.sh`).
This will run the pipeline described in the next section and eventually output a
ZIP file with a map which can then be imported into [MINERVA](https://minerva.pages.uni.lu/doc/)
as a disease map integrating all the found enriched pathways together with
genetic and variant [overlays](https://minerva.pages.uni.lu/doc/user_manual/v14.0/index/#overlays-tab).
## Pipeline description
The pipeline consists of several tools which can be divided into three broad categories:
1. [Retrieval of gene-disease mapping and variants](associations/README.md)
2. [Enrichment](enrichment/README.md)
3. [Map assembly](map_generator/README.md)
The tools are glued together by the `assemble.sh` script, resulting in the following pipeline:
1. Obtain gene-disease and variant-disease mapping from DisGeNET.
2. Obtain gene-disease and variant-disease mapping from OpenTargets.
3. Obtain possibly pathogenic ClinVar variants and genes pertinent to the given disease.
4. Compile a list of genes associated with the disease from all the input sources.
5. Extend the list of genes using other resources such as OmniPath or text mining.
6. Compile a list of variants associated with the disease from all the input sources.
7. Filter out variants with high allele frequency using Ensembl.
8. Obtain variant information (position, protein-level mapping) and store it for MINERVA genetic variant overlay.
9. From resources such as existing disease maps or WikiPathways, obtain pathways enriched
with respect to the disease-associated genes obtained in the previous steps.
10. Compile the obtained pathways into a single disease map.
11. Bundle the disease map with genetic and variant overlays into a single archive to be then uploaded to [MINERVA](https://minerva.pages.uni.lu/).
## License
GNU Affero General Public License v3.0
#!/usr/bin/env bash
# ------------------------- PARAMETERS TO SET -------------------------
# Load parameters either from parameters.sh, or from a file provided as the first parameter of the script
if [[ "$1" == "" ]]; then
    source parameters.sh
else
    source "$1"
fi
# ------------------------- PARAMETERS TO SET -------------------------
ASSOCIATIONS_DIR=associations/
mkdir $RES_DIR
# First we copy the parameters into the result directory
cp parameters.sh ${RES_DIR}
check_exit_code(){
    if [ $? -ne 0 ]; then
        tput setaf 1; echo "The last command failed. Can't continue..."; tput sgr0
        exit 1
    fi
}
log() {
    echo "$(date +"%T"): $1"
}
out_paths=""
if [ ${USE_DISGENET} = 1 ]; then
log "Querying DisGeNET..."
disgenet_out_path=${RES_DIR}/01-disgenet-n_${DISGENET_CNT_THRESHOLD}-s_${DISGENET_ASSOCIATION_SCORE_THRESHOLD_STR}.json
$PYTHON_BIN $ASSOCIATIONS_DIR/disgenet.py -o ${ORPHANET_IDS} -n ${DISGENET_CNT_THRESHOLD} -s ${DISGENET_ASSOCIATION_SCORE_THRESHOLD} > ${disgenet_out_path}
check_exit_code
log "DisGeNET gene and variant associations stored in ${disgenet_out_path}"
out_paths="${out_paths},${disgenet_out_path}"
fi
if [ ${USE_OPENTARGETS} = 1 ]; then
# Get associations from OpenTargets
log "Querying OpenTargets..."
opentargets_out_path=${RES_DIR}/01-opentargets-n_${OPENTARGETS_CNT_THRESHOLD}-s_${OPENTARGETS_ASSOCIATION_SCORE_THRESHOLD_STR}.json
$PYTHON_BIN $ASSOCIATIONS_DIR/opentargets.py -o ${ORPHANET_IDS} -n ${OPENTARGETS_CNT_THRESHOLD} -s ${OPENTARGETS_ASSOCIATION_SCORE_THRESHOLD} > ${opentargets_out_path}
check_exit_code
log "Opentargets gene and variant associations stored in ${opentargets_out_path}"
out_paths="${out_paths},${opentargets_out_path}"
fi
# Merge with ClinVar
log "Merging with ClinVar..."
genes_variants_out_path=${RES_DIR}/02-genes_variants.log
out_paths=${out_paths#,}
$PYTHON_BIN $ASSOCIATIONS_DIR/merge_with_clinvar.py -v $out_paths -c ${ASSOCIATIONS_DATA_DIR}/OrphaHPO_clinvar_variants_summary.tsv -oid ${ORPHANET_IDS} > ${genes_variants_out_path}
check_exit_code
log "Integration with ClinVar stored in ${genes_variants_out_path}"
genes_line=`cat ${genes_variants_out_path} | grep "genes in total"`
genes_out_path=${genes_variants_out_path/02-genes_variants/03-genes}
#echo ${genes_line#*:} | sed 's/\,/\n/g' > ${genes_out_path}
echo ${genes_line#*:} | tr ',' '\n' > ${genes_out_path}
log "Genes stored in ${genes_out_path}"
log "Extending genes list..."
text_mining_out_path=${genes_out_path/03-genes/03-text-mining}
echo Rscript --vanilla ${EXTEND_DIR}/get_extended_network.R ${genes_out_path} ${EXTEND_CONFIG}
Rscript --vanilla ${EXTEND_DIR}/get_extended_network.R ${genes_out_path} ${EXTEND_CONFIG}
check_exit_code
cp ${EXTEND_DIR}/output.txt ${text_mining_out_path}
log "Genes list extended"
minerva_genes_out_path=${RES_DIR}/04-minerva-genes.txt
$PYTHON_BIN $ASSOCIATIONS_DIR/minerva_genes.py -f ${genes_out_path} > ${minerva_genes_out_path}
check_exit_code
log "Genes stored in ${minerva_genes_out_path}"
var_line=`cat ${genes_variants_out_path} | grep "variants in total"`
variants_out_path=${genes_variants_out_path/02-genes_variants/03-variants}
#echo ${var_line#*:} | sed 's/\,/\n/g' > ${variants_out_path}
echo ${var_line#*:} | tr ',' '\n' > ${variants_out_path}
log "Variants stored in ${variants_out_path}"
log "Getting detailed variants information..."
minerva_variants_out_path=${RES_DIR}/04-minerva-variants.txt
$PYTHON_BIN $ASSOCIATIONS_DIR/minerva_variants.py -f ${variants_out_path} > ${minerva_variants_out_path}
check_exit_code
log "Detailed variants information obtained"
if [ ${STOP_AFTER_STAGE} = 1 ]; then
echo "Exiting after stage ${STOP_AFTER_STAGE}"
exit 0
fi
# ------------------------------ 2. Obtain pathways ------------------------------
log "Retrieving enriched pathways..."
tr '\r' '\n' < ${ENRICHMENT_CONFIG} > ${ENRICHMENT_CONFIG_TMP}
sed -E -i "s/(max_areas_per_map)(.*)/\1\t${ENRICH_MAX_AREAS_PER_MAP}/g" ${ENRICHMENT_CONFIG_TMP}
sed -E -i "s/(max_areas_per_pathway_db)(.*)/\1\t${MAX_AREAS_PER_PATHWAY_DB}/g" ${ENRICHMENT_CONFIG_TMP}
Rscript --vanilla ${ENRICHMENT_DIR}/enrich_maps.R ${genes_out_path} ${ENRICHMENT_CONFIG_TMP}
check_exit_code
enriched_maps_out_path=$RES_DIR/05-enriched_disease_maps.txt
enriched_paths_out_path=$RES_DIR/05-enriched_pathways.txt
mv enriched_disease_maps.txt ${enriched_maps_out_path}
mv enriched_pathways.txt ${enriched_paths_out_path}
rm ${ENRICHMENT_CONFIG_TMP}
log "Enriched pathways retrieved"
if [ ${STOP_AFTER_STAGE} = 2 ]; then
if [ ${BUILD_MAP_GENERATOR} = 1 ]; then
mvn -DskipTests=true clean install -pl biohackathon -am
cd ..
fi
log "Map generator built"
log "Assembling the map from pathways ..."
map_out_path=${RES_DIR}/06-minerva_map.xml
java -Xmx4g -jar ${MAP_GENERATOR_DIR}/biohackathon/target/biohackathon-1.0-jar-with-dependencies.jar --enrichr-file ${enriched_paths_out_path} --minerva-file ${enriched_maps_out_path} --text-mining-file ${text_mining_out_path} --output-file ${map_out_path}
#java -Xmx4g -jar ${MAP_GENERATOR_DIR}/biohackathon/target/biohackathon-1.0-jar-with-dependencies.jar --enrichr-file ${enriched_paths_out_path} --minerva-file ${enriched_maps_out_path} --output-file ${map_out_path}
check_exit_code
log "Pathways assembled into ${map_out_path}"
log "Implanting UniProt annotations..."
map_out_path_uniprot=${map_out_path/.xml/_unp.xml}
$PYTHON_BIN ${MAP_GENERATOR_DIR}/utils/implant_annotations.py -m ${map_out_path} -v ${minerva_variants_out_path} > ${map_out_path_uniprot}
check_exit_code
log "UniProt annotations implanted"
log "Trimming long strings..."
map_out_path_trimmed=${map_out_path_uniprot/.xml/_trim.xml}
$PYTHON_BIN ${MAP_GENERATOR_DIR}/utils/trim_long_strings.py -m ${map_out_path_uniprot} > ${map_out_path_trimmed}
check_exit_code
log "Long strings trimmed"
log "Combining the map with overlays"
tmp_dir=${RES_DIR}/tmp/
tmp_dir_layouts=${tmp_dir}/layouts/
mkdir ${tmp_dir}
cp ${map_out_path_trimmed} ${tmp_dir}
mkdir ${tmp_dir_layouts}
cp ${minerva_genes_out_path} ${tmp_dir_layouts}
cp ${minerva_variants_out_path} ${tmp_dir_layouts}
map_zip_out_path=${map_out_path_trimmed/.xml/.zip}
rm -f ${map_zip_out_path}
cd ${tmp_dir}
zip -r tmp.zip .
cd -
mv ${tmp_dir}/tmp.zip ${map_zip_out_path}
rm -rf ${tmp_dir}
log "Final map with overlays stored in ${map_zip_out_path}"
2. Extend the genes list by looking into additional resources such as OmniPath.
3. Use Ensembl to assess allele frequencies of the identified variants and
filter out possibly non-rare variants.
4. Get detailed variant information and compile it into [variant file](https://minerva.pages.uni.lu/doc/user_manual/v14.0/index/#genetic-variant-format) for MINERVA.
## 1. Gene-disease associations and variants
The OrphaHPO_clinvar_variants_summary is a file obtained from ClinVar FTP on 12th
It was created by filtering ClinVar variants to keep only those having Orphanet identifier,
and reducing its contents to GRCh37 (hg19) variants.
The script gets genes and variants for the given disease, excludes non-pathogenic variants, and carries out a pairwise comparison
of genes and variants between ClinVar and all other input resources, which are provided as JSON files (see above). The
output is a report at the gene and variant level, including a list of all genes and variants pertinent to the given
disease.
It takes as *input* a file with a single column of rs# and gives as *output* in t
In the output there are the variants from the input that are either not described in the genomic databases contained in Ensembl or for which there is at least one population with allele frequency >= the threshold indicated in the command line. As an example, using a threshold of 0.10 will filter out from the input file all variants that have a minor allele frequency equal to or greater than 10\% in at least one population among those described in Ensembl. Also, variants not present in Ensembl are filtered out under the assumption that they have never been described so far.
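The filtering rule described above can be sketched as a pure function (a hypothetical helper operating on allele frequencies already retrieved from Ensembl; the real script additionally handles file I/O and the API access):

```python
def is_rare(pop_allele_freqs, threshold=0.10):
    """Decide whether a variant passes the rare-variant filter.

    pop_allele_freqs: mapping of population name -> minor allele frequency,
    as could be obtained from Ensembl; None or an empty mapping means the
    variant is not described in Ensembl.
    A variant is kept only if it is described in Ensembl and no population
    reaches the frequency threshold.
    """
    if not pop_allele_freqs:
        return False  # unknown to Ensembl: filtered out
    return all(freq < threshold for freq in pop_allele_freqs.values())
```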
###### Comment on efficiency
## 3. Obtaining variant information
The script `minerva_variants.py` connects to [http://myvariant.info/](MyVariant.info) API and for each of the provided
variant identifiers (provided as a list of in the [dbSNP Reference numbers](dbSNP Reference SNP) ) gets genetic location
and also protein-level mapping. The protein-level mapping in MyVariant.info comes from the [dbNSFP database](https://sites.google.com/site/jpopgen/dbNSFP).
Usage:
```bash
python3 minerva_variants.py -f variant_file_with_line_delimited_dbSNPRN
```
The resulting output is formatted as MINERVA's [genetic variant overlay file](https://minerva.pages.uni.lu/doc/user_manual/v14.0/index/#genetic-variant-format). An example follows:
```text
#NAME=DISEASE_ASSOCIATED_VARIANTS
#TYPE=GENETIC_VARIANT
#GENOME_TYPE=UCSC
#GENOME_VERSION=hg19
position original_dna alternative_dna gene_name description color contig allele_frequency amino_acid_change identifier_uniprot
38592933 G A SCN5A snv #ff0000 3 0.8 SCN5A:XXX:g.38592933G>A:p.R1626C H9KVD2
38592729 C T SCN5A snv #ff0000 3 0.8 SCN5A:XXX:g.38592729C>T:p.G1694S H9KVD2
38627199 C T SCN5A snv #ff0000 3 0.8 SCN5A:XXX:g.38627199C>T:p.V924I H9KVD2
38627199 C A SCN5A snv #ff0000 3 0.8 SCN5A:XXX:g.38627199C>A:p.V924F H9KVD2
108867945 T G KCNE5 snv #ff0000 X 0.8 KCNE5:XXX:g.108867945T>G:p.E102A Q9UJ90
81667462 A T CACNA2D1 snv #ff0000 7 0.8 CACNA2D1:XXX:g.81667462A>T:p.N323K P54289-2
...
```
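Producing this format can be sketched as follows (a minimal illustration assuming pre-collected variant records; only the header lines and column order follow the example above, the real `minerva_variants.py` does more):

```python
# Column order as in the overlay example above.
HEADER = ["position", "original_dna", "alternative_dna", "gene_name",
          "description", "color", "contig", "allele_frequency",
          "amino_acid_change", "identifier_uniprot"]

def write_variant_overlay(records, name="DISEASE_ASSOCIATED_VARIANTS"):
    """Render variant records (dicts keyed by HEADER) as a MINERVA
    genetic variant overlay string."""
    lines = [
        f"#NAME={name}",
        "#TYPE=GENETIC_VARIANT",
        "#GENOME_TYPE=UCSC",
        "#GENOME_VERSION=hg19",
        "\t".join(HEADER),
    ]
    for rec in records:
        lines.append("\t".join(str(rec[col]) for col in HEADER))
    return "\n".join(lines) + "\n"
```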
param value
output_file output.txt
n 50
source stringdb
score 0.3
......@@ -7,7 +7,7 @@
# n: the number of new proteins in the extended network. By default 50.
# score: the STRING score (between 0 and 1). By default 0. Filtering is done a posteriori because
# no documentation was found on how to pass the score in the API URL request.
# dbsource: the source of the protein-protein interaction data. Currently, text_mining, stringdb or omnipathdb.
# By default, stringdb, and then overlaps omnipathdb data to provide information about directionality.
# When source = "omnipathdb" it will retrieve all information for the genes from omnipath
# the output file name. It will contain the pairs gene A, gene B with their corresponding HGNC identifiers,
if(length(args) > 0) { infile <- args[1] }
if(!file.exists(infile)) { message("No correct input given!")}
input <- readLines(infile)
config <- "config.txt"
if(length(args) > 1) { config <- args[2] }
if(!file.exists(config)) { message("No correct config given!")}
n <- config[config$param == "n", "value"]
score <-config[config$param == "score", "value"]
dbsource <-config[config$param == "source", "value"]
if( !dbsource %in% c("stringdb", "omnipathdb", "text_mining")) {
stop("Invalid argument 'dbsource'. It should be omnipathdb, stringdb or text_mining")
}
if( ! ( score >= 0 & score <= 1 )) {
if ("biomaRt" %in% installed.packages()[,1] == FALSE){
library(biomaRt)
message(paste0("there are ", length(unique(input))), " genes in the file")
### Function to retrieve StringDB interactors
retrieve_stringdb_net <- function(finput, fn, fscore) {
message(paste0("doing the query to stringdb and retrieving n ", fn, " neighbors"))
entities <- paste0("9606.", finput)
entities <- paste(entities, collapse = "%0A" )
resp <- httr::POST(url = "api.jensenlab.org/network",
httr::add_headers('Content-Type' = "application/x-www-form-urlencoded"),
body = paste0("entities=", entities, "&additional=", fn))
json <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"), flatten = FALSE)
### May be important for later, when node-level info can be used for filtering
# protein_info <- json[[2]]
# protein_info$protein <- gsub('stringdb:', "", protein_info$`@id`)
# protein_info$protein <- gsub('9606.', "", protein_info$protein )
# protein_info <- protein_info$protein
ppi_network <- data.frame(source = json[[1]]$source,
target = json[[1]]$target,
score = json[[1]]$scores$`stringdb::score`,
stringsAsFactors = F)
ppi_network <- ppi_network[ppi_network$score >= fscore,]
ppi_network$source <- gsub('9606.', "", ppi_network$source)
ppi_network$target <- gsub('9606.', "", ppi_network$target)
ppi_network <- unique(ppi_network)
ppi_network
}
### Function to retrieve OmniPathDB interactors, by default only between input genes
retrieve_omnipath <- function(finput, only_between_inputs = T) {
message("Retrieving data from Omnipath")
url <- paste0("http://omnipathdb.org/interactions/?genesymbols=1&partners=",
paste(finput, collapse = ","),"&fields=sources,references")
resp <- httr::GET(url = url,
httr::add_headers('Content-Type' = "application/x-www-form-urlencoded"))
result <- read.table(text = httr::content(resp, as = "text", encoding = "UTF-8"),
sep = "\t", header = T)
result <- result[result$is_directed == 1,]
colnames(result)[which(names(result) == "source_genesymbol")] <- "source_hgnc_symbol"
colnames(result)[which(names(result) == "target_genesymbol")] <- "target_hgnc_symbol"
if(only_between_inputs) {
result <- result[(result$source_hgnc_symbol %in% finput) & (result$target_hgnc_symbol %in% finput),]
}
### Restrict the number of columns, sign still missing
result <- result[,c("source_hgnc_symbol","target_hgnc_symbol",
"is_directed", "consensus_direction", "references")]
return(result)
}
### Load ensembl mart, will be used in id translation
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
ppi_network <- NULL
if (dbsource == "stringdb" | dbsource == "text_mining") {
message( "mapping the hgnc identifiers to Ensembl proteins")
ensprotein_idest <- getBM(attributes=c('ensembl_peptide_id', 'hgnc_symbol'),
filters = 'hgnc_symbol',
values = input,
mart = ensembl)
ensprotein_idest <- subset(ensprotein_idest, ensembl_peptide_id != "")
message(paste0("there are ", length(unique(ensprotein_idest$hgnc_symbol))), " genes in the input file with ensembl protein ids")
if (dbsource == "stringdb"){
ppi_network <- retrieve_stringdb_net(ensprotein_idest$ensembl_peptide_id, n, score)
} else {
### This will remain a stub while no bulk request to text mining is possible
}
message("the network contains ", dim(ppi_network)[1], " interactions ")
if (length(ppi_network) > 0) {
all_proteins <- grep("^ENSP", unique(c(ppi_network$source, ppi_network$target)), value = T)
all_proteins_ids <- getBM(attributes=c('ensembl_peptide_id', "hgnc_symbol"),
filters = 'ensembl_peptide_id',
values = all_proteins,
mart = ensembl)
all_proteins_ids <- subset(all_proteins_ids, hgnc_symbol != "")
rownames(all_proteins_ids) <- all_proteins_ids[,1]
ppi_network <- data.frame(all_proteins_ids[ppi_network$source, c("hgnc_symbol")],
all_proteins_ids[ppi_network$target, c("hgnc_symbol")],
ppi_network$score,
stringsAsFactors = F)
colnames(ppi_network) <- c("source_hgnc_symbol", "target_hgnc_symbol", "score")
ppi_network <- unique(ppi_network)
### Remove HGNC NAs
ppi_network <- ppi_network[!(is.na(ppi_network$source_hgnc_symbol) | is.na(ppi_network$target_hgnc_symbol)),]
message("after converting to gene symbols, the network contains ", dim(ppi_network)[1], " interactions ")
}
}
if(length(ppi_network) == 0) {
message("No text mining network, relying only on OmniPathDB")
ppi_network <- retrieve_omnipath(input)
} else {
omnipath_network <- retrieve_omnipath(unique(c(ppi_network$source_hgnc_symbol,ppi_network$target_hgnc_symbol)))
### Check for Omnipath interactions that are uniquely between the text mining pairs
### Combine source-target and target source identifiers (text mining interactions are reversible/non-directional)
### and check if Omnipath interactions are within this set
ppi_pairs <- c(paste(ppi_network$source_hgnc_symbol,",",ppi_network$target_hgnc_symbol),
paste(ppi_network$target_hgnc_symbol,",",ppi_network$source_hgnc_symbol))
omnipath_pairs <- paste(omnipath_network$source_hgnc_symbol,",",omnipath_network$target_hgnc_symbol)
ppi_network <- omnipath_network[omnipath_pairs %in% ppi_pairs,]
message("after OmniPathDB filtering, the network contains ", dim(ppi_network)[1], " interactions ")
}
if(nrow(ppi_network) == 0) {