Commit f3440aad authored by David Hoksza's avatar David Hoksza
Browse files

vep included in the pipeline

parent 2bdfba95
......@@ -77,7 +77,7 @@ The tools are glued together by the `assembly.sh` script resulting in the follow
4. Compile list of of genes associated with disease from all the input sources.
5. Extend the list of genes by going to other resources such as OmniPath or text mining.
6. Compile list of of variants associated with disease from all the input sources.
7. Filter out variants with high allele frequency using Ensemble.
7. Filter out variants with high allele frequency using Ensemble's VEP (Variant Effect Predictor) service.
8. Obtain variant information (position, protein-level mapping) and store it for MINERVA genetic variant overlay.
9. From resources such as existing disease maps or WikiPathways obtain enriched pathways
with respect to the disease-associated genes obtained from previous step.
......
......@@ -100,6 +100,17 @@ variants_out_path=${genes_variants_out_path/02-genes_variants/03-variants}
echo ${var_line#*:} | tr ',' '\n' > ${variants_out_path}
log "Variants stored in ${variants_out_path}"
if [ ${USE_VEP} = 1 ]; then
# Get associations from OpenTargets
log "Filtering by allele frequency via Ensemble VEP service..."
variants_vep_out_path=${variants_out_path/-variants/-varaints_vep}
$PYTHON_BIN $ASSOCIATIONS_DIR/vepmining/vepAllInOne.py -f ${variants_out_path} -t ${VEP_THRESHOLD}> ${variants_vep_out_path}
check_exit_code
log "Fitlered variants stored in ${variants_vep_out_path}"
variants_out_path=${variants_vep_out_path}
fi
log "Getting detailed variants information..."
minerva_variants_out_path=${RES_DIR}/04-minerva-variants.txt
$PYTHON_BIN $ASSOCIATIONS_DIR/minerva_variants.py -f ${variants_out_path} > ${minerva_variants_out_path}
......
......@@ -185,6 +185,10 @@ It takes as *input* a file with a single column of rs# and give as *output* in t
In the output there are the variants from the input that are either not described in the genomic databses contained in Ensembl or for which there is at least one population with allele frequency >= the threshold indicated in the command line. As an example, using a threshold of 0.10 will filter out form the input file all variants that have minor allele frequency eqaul to or greater than 10\% in at least one population among those described in Ensembl. Also variant not present in Ensembl are filtered out under the assumption that they have never been described so far.
###### Comment on efficiency
The VEP service is relatively slow and takes about 2,5 or more minutes per 200 variants ([maximum POST size](https://rest.ensembl.org/documentation/info/vep_id_post)).
If the allele frequency is not needed, the filtering step can be turned off in the `parameters.sh` file passed to the
assembly script.
## 3. Obtaining variant information
......
import requests, sys, json , argparse
import logging
"""
send a request to https://rest.ensembl.org/documentation/info/vep_id_post to retrive information about allele frequencies in the general population (if available) for a list of rsid
......@@ -24,19 +25,25 @@ def getfreqfromVEPbulck (listOfrsid):
server = "https://rest.ensembl.org"
ext = "/vep/human/id"
headers={ "Content-Type" : "application/json", "Accept" : "application/json"}
strOfDataDictionary=json.dumps({"ids" : listOfrsid })
res = requests.post(server+ext, headers=headers, data=strOfDataDictionary)
decoded = json.loads(res.text)# a python dictionary
for elem in decoded:
rsid=elem["input"]
freq_dict={}
for var in elem["colocated_variants"]:
#rsid=var["id"]
if 'frequencies' in var: freq_dict=var["frequencies"]
listOfDict.append((rsid, freq_dict))
return listOfDict
#return decoded
strOfDataDictionary=json.dumps({"ids": listOfrsid })
while True:
res = requests.post(server+ext, headers=headers, data=strOfDataDictionary)
if res.status_code == 503 and 'No server is available to handle this request.' in res.text:
logging.info('No server is available to handle this request. Repeating...')
continue
decoded = json.loads(res.text)# a python dictionary
for elem in decoded:
rsid=elem["input"]
freq_dict={}
for var in elem["colocated_variants"]:
#rsid=var["id"]
if 'frequencies' in var: freq_dict=var["frequencies"]
listOfDict.append((rsid, freq_dict))
return listOfDict
def getfreqfromVEP (rsid):
freq_dict={}
......@@ -53,6 +60,12 @@ def getfreqfromVEP (rsid):
def main():
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s [%(levelname)s] %(module)s - %(message)s',
datefmt='%H:%M:%S')
parser = argparse.ArgumentParser()
parser.add_argument("-f", help="path to input file ",type=str,required=True)
parser.add_argument("-t", help="threshold for filtering allele frequencies ", type=float,required=True)
......@@ -63,7 +76,7 @@ def main():
#thislist=["rs12720452", "rs1060501130"]
maxNumbRsIdForRequest=200 # there is a limit in the VEP rest
while len(allrsId) > 0 :
print("{} remaining".format(len(allrsId)))
logging.info("{} remaining".format(len(allrsId)))
sublist=allrsId[0:maxNumbRsIdForRequest]
allrsId=allrsId[maxNumbRsIdForRequest:]
result = getfreqfromVEPbulck(sublist)
......
......@@ -12,6 +12,9 @@ USE_OPENTARGETS=1
OPENTARGETS_CNT_THRESHOLD=50
OPENTARGETS_ASSOCIATION_SCORE_THRESHOLD="0.3"
USE_VEP=1
VEP_THRESHOLD=0.001
ENRICH_MAX_AREAS_PER_MAP=5
MAX_AREAS_PER_PATHWAY_DB=5
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment