diff --git a/Resources/Text mining/Readme.md b/Resources/Text mining/Readme.md
index 2b170a1e3ea1da0dbcb1e10740b44128344f6154..993426f19f0c521a24797073de825450d98414a3 100644
--- a/Resources/Text mining/Readme.md
+++ b/Resources/Text mining/Readme.md
@@ -1,3 +1,3 @@
-#OpenNLP-based text mining workflow (by Miguel Vázquez and Arnau Montagud)
+# OpenNLP-based text mining workflow (by Miguel Vázquez and Arnau Montagud)
 
 Text mining and natural language processing (NLP) were used to help curators and modellers complete their networks. The same pipeline was applied to two different corpora: One is the CORD-19 [PMID: 32510522], the other is the collection of MEDLINE abstracts associated to the genes in the PPI network from Gordon et al [PMID: 32353859] using the Entrez Gene reference into function (GeneRIF). The text-mining pipeline consisted in identifying sentences using OpenNLP (https://opennlp.apache.org/) and mentions of genes using GNormPlus [DOI: dx.doi.org/10.1155/2015/918710]. Sentences with mentions to at least two different genes were reported. Each of these sentences thus contains one or more pairs of genes that might have a protein-protein interaction (PPI) between them, for each of the corpora. Additionally, a tentative network consisting of all the potential PPIs is derived from each corpora but restricted to only genes in the Gordon et al. publication. These 4 resources (2 for each corpora) contain the literal and normalized gene mentions, the text of the corresponding sentence, and the coordinates of those sentences in the original document.
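The pairing step described in the README text above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes sentence splitting (OpenNLP) and gene normalization (GNormPlus) have already been run, and the `Sentence` record, the `candidate_pairs` and `tentative_network` functions, and the rule that both genes of a pair must belong to the reference gene set are all hypothetical choices made for illustration.

```python
from itertools import combinations
from typing import NamedTuple


class Sentence(NamedTuple):
    """Hypothetical record for one sentence (not the authors' actual format)."""
    doc_id: str    # identifier of the source document
    start: int     # character offset where the sentence starts
    end: int       # character offset where the sentence ends
    text: str      # text of the sentence
    mentions: dict # literal gene mention -> normalized gene identifier


def candidate_pairs(sentence):
    """Return every unordered pair of distinct normalized genes in a sentence."""
    genes = sorted(set(sentence.mentions.values()))
    # A sentence with fewer than two distinct genes yields no pairs.
    return list(combinations(genes, 2))


def tentative_network(sentences, allowed_genes):
    """Map each candidate gene pair to the sentences that support it.

    Assumption: a pair is kept only when BOTH genes are in allowed_genes.
    """
    edges = {}
    for sentence in sentences:
        for a, b in candidate_pairs(sentence):
            if a in allowed_genes and b in allowed_genes:
                edges.setdefault((a, b), []).append(sentence)
    return edges
```

Calling `tentative_network(sentences, allowed_genes)` with a set of normalized gene identifiers returns a dictionary keyed by gene pair. Each value keeps the full sentence records, so the literal mentions, the sentence text, and its coordinates stay attached to every potential interaction.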