Commit cedc423b authored by Emma Schymanski
Create C_elegans_metabolites.Rmd

parent 952771c9
---
title: "Finding _C. elegans_ Metabolites on PubChem"
author: "Emma SCHYMANSKI"
date: "19/01/2022"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Background
### WormJam
In 2019, Michael Witting provided a copy of 1203 WormJam
(Worm Jamboree) metabolites from the multi-author WormJam study.
This was turned into a MetFrag database and put on Zenodo by the
Environmental Cheminformatics group (ECI) at LCSB
for integration into MetFrag Web
and for download by MetFragCL users.
The subset of WormJam that was within the CompTox Chemicals Dashboard
was mapped to DTXSIDs and uploaded to CompTox as a _partial_ set
(currently 416 of the 1203 entries).

NOTE: The WormJam files on Zenodo should be used as the primary reference.
### _C. elegans_ on PubChem
PubChem has a taxonomy page for _Caenorhabditis elegans_.
Other, related taxonomies can be found in the
"Related Taxonomies" subsection; currently this lists only
the genus (parent) level.
Associated chemicals of interest can be found under the
"Chemicals and Bioactivities" section,
but note that not all of its subsections will contain
chemicals of interest for metabolomics!
For _C. elegans_, the subsections
"Pathway Compounds" and "Natural Products"
appear to be of most interest for metabolomics (see Figure 1).
The WormJam data mentioned above also appears in this section.
![_Associated chemicals of interest for C. elegans on PubChem_](fig/Celegans_ChemBioAct.png)
## Chemical Curation
### WormJam Dataset (from Zenodo)
First, we can start with the known data from WormJam.
Read the file in, and view it:
```{r load wormjam}
wormjam_url <- ""
wormjam_csv <- "WormJam_10Sept19.csv"
download.file(wormjam_url, wormjam_csv)
wormjam <- read.csv(wormjam_csv, stringsAsFactors = FALSE)
# view the first rows
head(wormjam)
```
Now let's do some basic stats like number of entries, InChIKeys (IK),
PubChem CIDs ...
```{r wormjam stats}
n_entries <- length(wormjam$Identifier)
IK <- unique(wormjam$InChIKey)
n_IK <- length(IK)
CIDs <- unique(wormjam$PubChem_neutral)
# remove NAs
CIDs <- na.exclude(CIDs)
n_CIDs <- length(CIDs)
```
For the record, the 1203 entries contain 1196 unique InChIKeys but only
861 unique CIDs, whereas the PubChem integration of WormJam has 1165 entries
(more than the number of CIDs in the file, and only slightly fewer than the
number of InChIKeys).
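
One thing to watch in counts like these is that `unique()` treats `NA` and empty strings as values of their own, which can inflate the totals. A minimal sketch of a stricter count, using a made-up data frame (`toy` and `count_ids()` are illustrative names, not part of the WormJam file or the script above):

```{r toy identifier counts}
# Hypothetical helper: count distinct identifiers, ignoring NA and blank entries
count_ids <- function(x) {
  x <- trimws(as.character(x))
  x <- x[!is.na(x) & nzchar(x)]
  length(unique(x))
}

# Toy data standing in for the WormJam table (values are invented)
toy <- data.frame(
  InChIKey = c("AAA", "BBB", "", "AAA", NA),
  PubChem_neutral = c(1, NA, 3, 1, NA),
  stringsAsFactors = FALSE
)

n_toy_IK <- count_ids(toy$InChIKey)          # blanks and NA are not counted
n_toy_CID <- count_ids(toy$PubChem_neutral)  # NA is not counted
```

Whether blank InChIKeys actually occur in the WormJam file would need to be checked on the real data.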
### CIDs from PubChem Taxonomy pages
There are several sources of CIDs on the taxonomy pages but, as mentioned
above, probably not all are relevant to metabolism. For those sections
that are relevant, the data can be downloaded programmatically
(or downloaded manually from the PubChem pages and saved):
```{r download chemicals}
#url_path <- ""
# temporary fix, download an archived version
url_path <- ""
url_wormjam <- "{%22download%22:%22*%22,%22collection%22:%22norman_wormjam%22,%22where%22:{%22ands%22:[{%22taxid%22:%226239%22}]},%22order%22:[%22relevancescore,desc%22],%22start%22:1,%22limit%22:10000000,%22downloadfilename%22:%22TaxID_6239_norman_wormjam%22}"
url_lotus <- "{%22download%22:%22*%22,%22collection%22:%22lotus%22,%22where%22:{%22ands%22:[{%22taxid%22:%226239%22}]},%22order%22:[%22relevancescore,desc%22],%22start%22:1,%22limit%22:10000000,%22downloadfilename%22:%22TaxID_6239_lotus%22}"
download.file(url_path, "TaxID_6239_pcget_pathway_chemical.csv")
download.file(url_wormjam, "TaxID_6239_eci_wormjam.csv")
download.file(url_lotus, "TaxID_6239_lotus.csv")
```
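
The WormJam and LOTUS query strings differ only in the collection name and the download filename, so they could be generated rather than pasted. A minimal sketch, assuming the query layout seen in `url_wormjam` and `url_lotus`; `build_taxonomy_query()` is a hypothetical helper, and the base URL that the query would be appended to is deliberately not included here:

```{r build taxonomy query}
# Hypothetical helper: build the percent-encoded query part for one collection
build_taxonomy_query <- function(collection, taxid = "6239") {
  query <- sprintf(
    paste0('{"download":"*","collection":"%s",',
           '"where":{"ands":[{"taxid":"%s"}]},',
           '"order":["relevancescore,desc"],"start":1,"limit":10000000,',
           '"downloadfilename":"TaxID_%s_%s"}'),
    collection, taxid, taxid, collection
  )
  # encode double quotes as %22, matching the hand-written query strings
  gsub('"', "%22", query, fixed = TRUE)
}

query_wormjam <- build_taxonomy_query("norman_wormjam")
query_lotus <- build_taxonomy_query("lotus")
```

Changing `taxid` would, under the same assumption, give the equivalent query for another organism.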
Now take a look, and extract the CIDs.
```{r Pathway CIDs}
# Pathway CIDs
path_entries <- read.csv("TaxID_6239_pcget_pathway_chemical.csv", stringsAsFactors = FALSE)
# display first row:
path_entries[1, ]
# extract the CIDs (first column)
path_cids <- na.exclude(unique(path_entries[, 1]))
```
Now for WormJam via PubChem:
```{r WormJam in PubChem CIDs}
# WormJam Metabolites in PubChem
metab_entries <- read.csv("TaxID_6239_eci_wormjam.csv", stringsAsFactors = FALSE)
# display first row:
metab_entries[1, ]
# extract the CIDs (first column)
metab_cids <- na.exclude(unique(metab_entries[, 1]))
```
Now for LOTUS:
```{r LOTUS CIDs}
lotus_entries <- read.csv("TaxID_6239_lotus.csv", stringsAsFactors = FALSE)
# view first row
lotus_entries[1, ]
# extract CIDs (first column)
lotus_cids <- na.exclude(unique(lotus_entries[, 1]))
```
Now do some stats:
```{r CID counts}
# CIDs from WormJam
length(CIDs)
# CIDs from Pathways
length(path_cids)
# CIDs from WormJam Metabolites in PubChem
length(metab_cids)
# CIDs from LOTUS
length(lotus_cids)
# Merged, unique CIDs
# (note, to select only some, just delete undesired entries from the vector below)
#celegans_CIDs <- na.exclude(unique(c(CIDs,path_cids,metab_cids,lotus_cids)))
## note: the WormJam file CIDs (`CIDs`) are left out due to formatting issues,
## and because WormJam is covered within PubChem
celegans_CIDs <- na.exclude(unique(c(path_cids, metab_cids, lotus_cids)))
length(celegans_CIDs)
```
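
To see how much the sources overlap before merging, base R set operations are enough. A small sketch with invented CID vectors (the `toy_*` names and values are illustrative only, not results from PubChem):

```{r toy CID overlap}
# Invented CID vectors standing in for path_cids, metab_cids and lotus_cids
toy_path  <- c(1, 2, 3, 4)
toy_metab <- c(3, 4, 5)
toy_lotus <- c(4, 6)

# CIDs present in all three sources
shared_all <- Reduce(intersect, list(toy_path, toy_metab, toy_lotus))
# CIDs that only the LOTUS set contributes
only_lotus <- setdiff(toy_lotus, union(toy_path, toy_metab))
# merged, unique CIDs (same approach as for celegans_CIDs)
merged <- unique(c(toy_path, toy_metab, toy_lotus))
```

Replacing the toy vectors with the real ones shows which source contributes CIDs that no other source has.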
### Getting Parent CIDs (neutral form)
Next step: write out the CIDs so that we can map to parent CIDs with ID Exchange.
```{r C elegans CIDs}
write.table(celegans_CIDs, file = "celegans_all_cids.txt", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```
Next, go to the PubChem ID Exchange
and map the CIDs to parent CIDs (see Figure 2). Upload the file from above
with the Browse option (or copy all the CIDs into the box).
Note: it's possible to do this in R, but it's a bit slower ;-)
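
For completeness, a rough sketch of the R route. It assumes the PUG REST `cids_type=parent` option and sends one request per CID, which is why it is slow for long lists. `parent_cid_url()` and `get_parent_cid()` are hypothetical helper names, and the chunk is not evaluated when knitting:

```{r parent CIDs via PUG REST, eval=FALSE}
# Hypothetical helper: PUG REST URL asking for the parent of one CID
# (the cids_type=parent option is an assumption to verify against the PUG REST docs)
parent_cid_url <- function(cid) {
  paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/",
         cid, "/cids/TXT?cids_type=parent")
}

# Hypothetical helper: returns the parent CID, or NA if the request fails
get_parent_cid <- function(cid) {
  res <- tryCatch(
    suppressWarnings(readLines(parent_cid_url(cid), warn = FALSE)),
    error = function(e) character(0)
  )
  Sys.sleep(0.2)  # stay below PubChem's request rate limit
  if (length(res) == 0) NA_integer_ else suppressWarnings(as.integer(res[1]))
}

# Usage (needs internet access):
# parent_cids <- vapply(celegans_CIDs, get_parent_cid, integer(1))
```

CIDs without a parent come back as `NA` here, so they can be counted or dropped in the same way as in the ID Exchange output.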
![_Converting CIDs to parent CIDs with ID Exchange_](fig/Celegans_IDExchange.png)
The results were saved in the following file (replace the filename with the
cache code of the ID Exchange run or adjusted filename). As seen from the stats,
a lot of CIDs didn't have parent CIDs.
This could be due to presence of metals, or several other factors.
It's up to the user whether to continue with all CIDs, or just the parent ones.
The code for both options is below.
```{r celegans CID stats}
celegans_CID_mapping <- read.delim("3592661906267113687.txt",
                                   header = FALSE, stringsAsFactors = FALSE)
# number of CIDs
length(unique(celegans_CID_mapping$V1))
# number of parent CIDs
celegans_parent_CIDs <- na.exclude(unique(celegans_CID_mapping$V2))
length(celegans_parent_CIDs)
### Include all CIDs
#all_CIDs <- na.exclude(unique(c(celegans_CIDs,celegans_parent_CIDs)))
## or just parent CIDs
all_CIDs <- na.exclude(celegans_parent_CIDs)
n_all_CIDs <- length(all_CIDs)
```
To continue with all CIDs instead, uncomment the `#all_CIDs <- ...` line under
"Include all CIDs" and comment out the
`all_CIDs <- na.exclude(celegans_parent_CIDs)` line.
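
To check which CIDs came back without a parent before choosing between the two options, the mapping table can be inspected directly. A sketch using an inline stand-in for the ID Exchange output, assuming two tab-separated columns (input CID, parent CID) and no header row; the values are invented:

```{r toy parent mapping}
# Inline stand-in for the ID Exchange result file (values are invented)
toy_mapping <- read.delim(text = "1\t1\n2\t10\n3\t\n4\t10",
                          header = FALSE, stringsAsFactors = FALSE)

# input CIDs for which no parent CID was returned
no_parent <- toy_mapping$V1[is.na(toy_mapping$V2)]
# unique parent CIDs
toy_parents <- unique(toy_mapping$V2[!is.na(toy_mapping$V2)])
```

Here two input CIDs share the same parent, which is why the number of unique parent CIDs can be smaller than the number of mapped inputs.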
### Gathering Chemical Information
Next, get all information we need via _webchem_.
```{r load pkgs}
library(webchem)
```
Then run the following, which returns the CID plus the selected properties
(e.g. name, exact mass, molecular formula and XlogP):
```{r webchem}
# further PubChem properties can be appended to this vector
selected_properties <- c("Title", "ExactMass", "MolecularFormula", "XlogP")
# retrieve info with webchem
CID_info_all <- webchem::pc_prop(all_CIDs, properties = selected_properties)
# output
write.csv(CID_info_all, "celegans_all_Struct_Info.csv", row.names = FALSE)
```
One last issue: sometimes charged formulas are generated (despite the entries
being parent CIDs), and this can cause issues for downstream applications.
An optional step removes these charges.
```{r remove charges}
plus_formulas <- grep("+", CID_info_all$MolecularFormula, fixed = TRUE)
# remove everything including and after the +
# (some are multi-charged, then it's +2, +3, ... )
if (length(plus_formulas) > 0) {
  for (i in seq_along(plus_formulas)) {
    mf <- CID_info_all$MolecularFormula[plus_formulas[i]]
    new_mf <- strsplit(mf, "+", fixed = TRUE)[[1]][1]
    if (is.character(new_mf)) {
      CID_info_all$MolecularFormula[plus_formulas[i]] <- new_mf
    }
  }
}
# same for negative
neg_formulas <- grep("-", CID_info_all$MolecularFormula, fixed = TRUE)
# remove everything including and after the -
if (length(neg_formulas) > 0) {
  for (i in seq_along(neg_formulas)) {
    mf <- CID_info_all$MolecularFormula[neg_formulas[i]]
    new_mf <- strsplit(mf, "-", fixed = TRUE)[[1]][1]
    if (is.character(new_mf)) {
      CID_info_all$MolecularFormula[neg_formulas[i]] <- new_mf
    }
  }
}
# output
write.csv(CID_info_all, "celegans_all_Struct_Info.csv", row.names = FALSE)
```
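
The same clean-up can also be written without loops. This is an alternative sketch rather than the code used above: a single regular expression drops the first `+` or `-` and everything after it. `strip_charge()` is a hypothetical helper name and the formulas are illustrative:

```{r strip charge sketch}
# Hypothetical helper: remove a trailing charge such as "+", "-", "+2" or "-4"
strip_charge <- function(mf) {
  sub("[+-].*$", "", mf)
}

toy_formulas <- c("C5H14NO+", "C6H12O6", "C10H12N5O13P3-4", "Ca+2")
cleaned <- strip_charge(toy_formulas)
```

Applied to the real table, this would be `CID_info_all$MolecularFormula <- strip_charge(CID_info_all$MolecularFormula)`. Like the loop version, it only edits the formula string and does not adjust hydrogen counts or the exact mass.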
## Wrap Up
Hopefully this helps find lots of exciting _C. elegans_ metabolites!
## Acknowledgements
Thanks to Tiejun Cheng and Evan Bolton for the taxonomy-based discussions ;-)