Synonyms (II)
We had a lot of synonym issues in the ENTACT dataset. The lesson learnt from manual corrections and from chatting with PubChem is that we could revamp the synonym handling in RMassBank to do the following:
Take:
- User-contributed name
- PubChem Title
- PubChem IUPAC Name
- PubChem Top 3 synonyms
Remove if:
- CAS number (we have a separate field for those)
- CID XXXXX
- Duplicated
- (maybe some other text patterns we can add over time... e.g. "nocas_*")
…and add the first 3-5 of whatever remains.
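As a rough illustration of the take/remove/keep steps above, here is a minimal Python sketch (not RMassBank code). The function name `select_synonyms`, the regular expressions and the case-insensitive duplicate check are all assumptions made for the example; the CAS pattern only checks the digit layout, not the check digit.

```python
import re

# Assumed text patterns for names we do not want as synonyms.
CAS_RE = re.compile(r"^\d{2,7}-\d{2}-\d$")          # CAS-like layout, e.g. 1912-24-9
CID_RE = re.compile(r"^CID\s*\d+$", re.IGNORECASE)  # "CID XXXXX"
NOCAS_RE = re.compile(r"^nocas_", re.IGNORECASE)    # "nocas_*" placeholders


def select_synonyms(candidates, max_names=5):
    """Filter an ordered list of candidate names and keep the first few.

    `candidates` is assumed to be ordered as in the list above: user name,
    PubChem Title, IUPAC name, then the top PubChem synonyms.
    """
    kept = []
    seen = set()
    for name in candidates:
        name = name.strip()
        if not name:
            continue
        if CAS_RE.match(name) or CID_RE.match(name) or NOCAS_RE.match(name):
            continue
        key = name.casefold()  # assumption: duplicates are compared ignoring case
        if key in seen:
            continue
        seen.add(key)
        kept.append(name)
        if len(kept) == max_names:
            break
    return kept
```

New patterns (the "other text patterns we can add over time") would just be extra regular expressions in the same check.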
RMassBank already uses the first 3 bullet points above; we can add the synonyms via URLs like these:
- https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2256/synonyms/TXT
- https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/61938/synonyms/TXT
- https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/10201497/synonyms/TXT
Documentation:
https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest$_Toc494865568
Explanation: PubChem have an automatic algorithm that decides the ordering of the depositor-supplied synonyms for each compound, with the “best” first, so taking the top 3 should give the most useful names.
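For trying this out locally, a minimal Python sketch of fetching the top synonyms from the URLs above could look like the following. The helper names (`synonyms_url`, `parse_synonyms_txt`, `fetch_top_synonyms`) are hypothetical, and the sketch assumes the TXT output lists one synonym per line in PubChem's ranked order.

```python
from urllib.request import urlopen

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid"


def synonyms_url(cid):
    """Build the PUG REST synonyms URL for a CID (same form as the examples above)."""
    return f"{BASE}/{int(cid)}/synonyms/TXT"


def parse_synonyms_txt(text, top_n=3):
    """Return the first `top_n` non-empty lines, assuming one synonym per line."""
    lines = [line.strip() for line in text.splitlines()]
    return [line for line in lines if line][:top_n]


def fetch_top_synonyms(cid, top_n=3, timeout=30):
    """Download the synonym list for one CID and keep the top entries."""
    with urlopen(synonyms_url(cid), timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return parse_synonyms_txt(text, top_n)
```

Something like `fetch_top_synonyms(2256)` would then return up to three names, which still need to go through the CAS/CID/duplicate filtering before being added.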
@todor.kondic @anjana.elapavalore this could be a nice exercise to do another update to RMassBank eventually after trying out something locally ...
Btw, the other lesson learnt from ENTACT is not to take the name from the original DTXSID if MS-ready SMILES are used, because those names often describe the salt/mixture and are not the proper name for the MS-ready form.
NOTE this is a repetition of another issue, but with broader access rights.