Commit ae50bc23 authored by Emma Schymanski

Updated docs

...many changes due to new section to be added soon in next release.
parent 650024eb
---
title: "PFAS and Fluorinated Organic Compounds in PubChem Tree"
title: "PFAS and Fluorinated Compounds in PubChem Tree"
author:
- "Emma L. Schymanski^1^*, Parviel Chirsir^1^, Todor Kondic^1^,"
- "Paul A. Thiessen^2^, Jian Zhang^2^ and Evan E. Bolton^2^*"
date: "27/03/2022"
date: "28/05/2022"
output: pdf_document
csl: journal-of-cheminformatics.csl
bibliography: refs.bib
......@@ -34,32 +34,33 @@ EEB: [0000-0002-5959-6190](http://orcid.org/0000-0002-5959-6190).
## Preamble
This document describes the "[PFAS and Fluorinated Organic Compounds in
This document describes the "[PFAS and Fluorinated Compounds in
PubChem Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)"
(see Figure 1) on the
[Classification Browser](https://pubchem.ncbi.nlm.nih.gov/classification/)
(hereafter
"[PubChem PFAS Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)")
<!-- on the -->
<!-- [Classification Brower](https://pubchem.ncbi.nlm.nih.gov/classification/) -->
in [PubChem](https://pubchem.ncbi.nlm.nih.gov/) [@kim_pubchem_2021],
developed in collaboration between PubChem (NCBI/NLM/NIH) and the
developed jointly between PubChem (NCBI/NLM/NIH) and the
Environmental Cheminformatics group
([ECI](https://wwwen.uni.lu/lcsb/research/environmental_cheminformatics))
at the [LCSB](https://wwwen.uni.lu/lcsb/),
[University of Luxembourg](https://wwwen.uni.lu/) in consultation with
[University of Luxembourg](https://wwwen.uni.lu/), in consultation with
several community representatives (see [Contributions](#contrib)
and [Acknowledgements](#ack)).
The
[PFAS Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)
(see [Figure 1](##treenodes))
[PubChem PFAS Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)
(see [Figure 1](#treenodes) and [Contents listing](#cont))
includes all compounds in [PubChem](https://pubchem.ncbi.nlm.nih.gov/)
that satisfy various definitions, as explained later in this document.
satisfying various definitions, as explained later in this document.
Each compound in PubChem has a PubChem Compound Identifier (CID), and the
blue numbers next to each node header reflect the number of
compounds (_i.e._ CIDs) in that node.
To become more familiar with the PubChem Classification Browser features
in general before embarking on content specific to the PFAS tree,
see the Section [Navigating the Tree](#search).
There is also extensive documentation on the PubChem website
(links below) or reach out to
More details on the general
[PubChem Classification Browser](https://pubchem.ncbi.nlm.nih.gov/classification/)
features are given in the Section [Navigating the Tree](#search), at
the links below, or by reaching out to
[pubchem-help@ncbi.nlm.nih.gov](mailto:pubchem-help@ncbi.nlm.nih.gov)
for more information:
......@@ -68,24 +69,25 @@ for more information:
- https://pubchem.ncbi.nlm.nih.gov/classification/docs/classification_help.html
## Contents
## Contents {#cont}
<!-- This document is organised into several sections, as follows: -->
Table: _Contents page for this documentation._
Table: _Contents list for the PubChem PFAS Tree documentation._
| Section | Navigation | PDF Page |
|-----------|---------|:----:|
|PubChem PFAS Tree Nodes | [Go to heading](#treenodes) | 2 |
|_OECD PFAS Definition_ | [Go to heading](#oecddef) | 2 |
|_Organofluorine Compounds_ | [Go to heading](#orgf) | 5 |
|_PFAS and Fluorinated Organic Compound Collections_ | [Go to heading](#lists) | 5 |
| - _OECD PFAS Definition_ | [Go to heading](#oecddef) | 2 |
| - _Organofluorine Compounds_ | [Go to heading](#orgf) | 5 |
| - _Other Diverse Fluorinated Compounds_ | [Go to heading](#divf) | 6 |
| - _PFAS and Fluorinated Compound Collections_ | [Go to heading](#lists) | 7 |
|Navigating the Tree | [Go to heading](#search) | 7 |
|_Search via PubChem Search_ | [Go to heading](#pc-search) | 7 |
|_Interactions via Entrez_ | [Go to heading](#entrez) | 9 |
|_Interactions via PUG REST_ | [Go to heading](#pugrest) | 10 |
|Further Details | [Go to heading](#details) | 12 |
|Statements | [Go to heading](#statements) | 12 |
|References | [Go to heading](#refs) | 13 |
| - _Search via PubChem Search_ | [Go to heading](#pc-search) | 8 |
| - _Interactions via Entrez_ | [Go to heading](#entrez) | 9 |
| - _Interactions via PUG REST_ | [Go to heading](#pugrest) | 12 |
|Further Details | [Go to heading](#details) | 13 |
|Statements | [Go to heading](#statements) | 14 |
|References | [Go to heading](#refs) | 14 |
<!-- To become more familiar with the PubChem Classification Browser features -->
......@@ -101,14 +103,15 @@ Table: _Contents page for this documentation._
## PubChem PFAS Tree Nodes {#treenodes}
The tree is currently split into three main nodes that are constructed and
compiled separately (see Figure 1).
compiled separately (see [Figure 1](#treenodes)).
More nodes are under development and will be released as they are ready.
Further details are given below.
<!-- To become more familiar with the PubChem Classification Browser features, -->
<!-- see Section [Navigating the Tree](#search). -->
![_The "[PFAS and Fluorinated Organic Compounds in PubChem Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)" Landing Page._](fig/PFAS_Tree_Landing.png)
![_The "[PFAS and Fluorinated Compounds in PubChem Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)" Landing Page._](fig/PFAS_Tree_Landing.png)
<!-- TODO: Update Figure 1 -->
### OECD PFAS Definition {#oecddef}
......@@ -121,8 +124,8 @@ CF~3~ part) in the 2021 OECD Report
Note that here, "**PFAS part**" is used to describe a connected portion of
the molecule that satisfies the OECD PFAS definition. A given molecule may have
more than one PFAS part present; some examples are given in Figure 2,
along with the count of parts. For more information, see
[Further Details](#details).
along with the count of parts. For more information, see section
"[Further Details](#details)".
Browsing the 6 million entries in this node (see Figure 3) is challenging.
Since most of these PFAS contain isolated CF~2~ (600 K entries) or
......@@ -137,11 +140,12 @@ CF~3~ groups (5.4 M entries), these were separated into individual sections
![_Examples of molecules with varying PFAS parts highlighted, drawn using [CDK Depict](https://www.simolecule.com/cdkdepict/depict.html) [@mayfield_cdk]._](fig/PFAS_parts_CDK.png)
The _OECD PFAS Definition_ node
The _OECD PFAS Definition_ node,
with its top two levels of subnodes, is shown in Figure 3.
![_The OECD PFAS Definition part of the PFAS tree, with top two subnodes (24 March 2022)._](fig/OECDPFAS_TopTwoSubnodes_v3.png)
![_The OECD PFAS Definition part of the PFAS tree, with top two subnodes (24 March 2022)._](fig/OECDPFAS_TopTwoSubnodes_v4.png)
<!-- TODO: update Figure 3 -->
### OECD PFAS - Isolated CF~2~ and CF~3~ Nodes {#isonodes}
......@@ -165,7 +169,6 @@ The "_Contains only isolated CF~2~_" (or, for the CF~3~ node, only isolated
CF~3~) is broken down by the number of isolated groups (CF~2~ or,
for the CF~3~ node, by CF~3~ groups) - see Figure 4, middle panel. In both
cases, the vast majority of molecules have only one isolated group.
The "_Contains only isolated CF~2~/CF~3~_" is also broken down by
the number of groups, sorted by increasing number of CF~2~ groups
(for both nodes). See Figure 4, right panel.
......@@ -286,18 +289,55 @@ if there are CIDs within this range.
<!-- - Exact mass range >1000 -->
### PFAS and Fluorinated Organic Compound Collections {#lists}
### Other Diverse Fluorinated Compounds {#divf}
The "_Other Diverse Fluorinated Compounds_" section of the
PubChem PFAS Tree is designed to help users explore various
cases of fluorine chemistry not necessarily covered in the OECD PFAS
or Organofluorine compound sections above. The navigation in this
section helps explore fluorinated compound chemistry by various
fluorine-heteroatom bonds and the occurrence of different elements
(see Figure 8).
Many of the compounds present in this section are also present
in the other sections of the PubChem PFAS Tree - the overlap
can be investigated in Entrez (see section
[Interactions via Entrez](#entrez) below).
![_The "Other diverse fluorinated compounds" part of the PubChem PFAS Tree, showing the breakdown by fluorine bonded to non-carbon elements and by non-organic element (interim numbers from 27 May 2022)._](fig/DiverseFcmpds_v2.png)
#### The "Contains fluorine bond to non-carbon element"
section (Figure 8, middle panel) is broken down first by the
count of molecules present in the given category, then by the
non-carbon element present in the F-element bond (sorted alphabetically).
For the sections with counts above 100, there is an extra breakdown
by the number of fluorine atoms present overall.
The "_PFAS and Fluorinated Organic Compound Collections_"
section of the PFAS tree contains various lists gathered
across PubChem content (see Figure 8). Additional community-based PFAS lists may
also be added here. The mapping files to construct this are kept
#### The "Contains non-organic element"
section (Figure 8, right panel) is likewise broken down first by the
count of molecules present in the given category, then by the
non-organic element present (sorted alphabetically).
In this section, non-organic refers to any element that is not
C, H, N, O, P, S, Si, F, Cl, Br or I.
As above, there is an extra breakdown by the number of fluorine
atoms present overall for the sections with counts above 100.
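
The "non-organic" rule described above can be illustrated with a minimal Python sketch. This is not the code used to build the tree; the function names are hypothetical, and the parser assumes a plain molecular formula without parentheses, charges or isotope labels.

```python
import re

# Elements treated as "organic" in this section of the tree (from the text above).
ORGANIC_ELEMENTS = {"C", "H", "N", "O", "P", "S", "Si", "F", "Cl", "Br", "I"}

# One element symbol (capital letter, optional lowercase letter) plus an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula):
    """Return {element symbol: atom count} for a simple formula such as 'C8HF15O2'."""
    counts = {}
    for symbol, number in _TOKEN.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def non_organic_elements(formula):
    """Return the alphabetically sorted elements that fall outside the organic set."""
    return sorted(set(element_counts(formula)) - ORGANIC_ELEMENTS)


def fluorine_count(formula):
    """Return the total number of fluorine atoms in the formula."""
    return element_counts(formula).get("F", 0)
```

For example, `non_organic_elements("NaBF4")` yields `["B", "Na"]`, whereas a formula such as `C8HF15O2` contains no non-organic element under this definition.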
### PFAS and Fluorinated Compound Collections {#lists}
The "_PFAS and Fluorinated Compound Collections_"
section of the PubChem PFAS tree contains various lists gathered
across PubChem content (see Figure 9).
The mapping files used to construct this section are kept
on the [eci/pubchem](https://gitlab.lcsb.uni.lu/eci/pubchem/)
repository on GitLab.
![_The "PFAS and Fluorinated Organic Compound Collections" node, with all major collections shown (CompTox as inset). Numbers and content listing from 24 March 2022._](fig/PFAS_list_of_lists.png)
Currently, the content displayed in Figure 8 comes from:
Currently, the content displayed in Figure 9 comes from:
- All [PFAS lists](https://comptox.epa.gov/dashboard/chemical-lists?filtered=&search=PFAS)
from the
......@@ -310,13 +350,19 @@ from the NORMAN Suspect List Exchange
([NORMAN-SLE](https://www.norman-network.com/nds/SLE/)) via the
[NORMAN-SLE Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=101)
in PubChem;
- The CORE PFAS lists from OntoChem [@barnabas_extracting_2022];
- The CORE and Patent PFAS lists from OntoChem [@barnabas_extracting_2022];
- Other collections from within PubChem Classification Trees, including
[Cameo](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=86),
[ChEBI](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=2) and
[MeSH](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=1).
Additional community-based PFAS lists can also be added to this section.
We will be happy to add new collections where feasible.
If you have any suggestions, please email
[pubchem-help@ncbi.nlm.nih.gov](mailto:pubchem-help@ncbi.nlm.nih.gov) or
[normansle@uni.lu](mailto:normansle@uni.lu).
## Navigating the Tree {#search}
......@@ -328,18 +374,18 @@ sections.
### Search via PubChem Search {#pc-search}
Perhaps the most intuitive interaction is directly through
clicking on the numbers beside each node (see Figure 9). This sends a query
clicking on the numbers beside each node (see Figure 10). This sends a query
directly to the PubChem Search interface and displays the
entire node contents, as shown in Figure 9. This query follows
entire node contents, as shown in Figure 10. This query follows
"_OECD PFAS Definition_" > "_Molecule contains PFAS parts larger than
CF~2~/CF~3~_" > "_Breakdown by isolated PFAS part count_" >
"_Contains 01 isolated PFAS part_" > "_Count of molecules 10001-100000_" >
"_Contains 01xC04F09-linear_" and returns the 10,555 CIDs containing
only one single linear C~4~F~9~ PFAS part.
This query can then be downloaded (Figure 9, inset),
This query can then be downloaded (Figure 10, inset),
or sent to Entrez for advanced querying (see [next section](#entrez)).
Note that clicking on the "**?**" beside a node (where present) will open a
tool tip explaining the node contents (Figure 9, bottom left).
tool tip explaining the node contents (Figure 10, bottom left).
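
The contents of a node can also be retrieved programmatically. The sketch below assumes the PUG REST classification endpoint, which takes a hierarchy node identifier (HNID) as input; the HNIDs of individual PFAS Tree nodes are not given in this document, so any value passed in is a placeholder, and the function names are hypothetical.

```python
from urllib.request import urlopen

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def node_cids_url(hnid):
    """Build the PUG REST URL listing the CIDs under one classification node."""
    return f"{PUG_REST_BASE}/classification/hnid/{hnid}/cids/TXT"


def fetch_node_cids(hnid, timeout=30.0):
    """Download the CID list for a node (one CID per line) as a list of integers."""
    with urlopen(node_cids_url(hnid), timeout=timeout) as response:
        text = response.read().decode("utf-8")
    # Keep only purely numeric tokens, in case the response carries other text.
    return [int(token) for token in text.split() if token.isdigit()]
```

The returned list can then be intersected with other CID lists locally, which mirrors the Entrez-based combinations in a scriptable form.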
![_Querying node contents in PubChem Search. When clicking on the blue numbers (left), a search window will open in a new tab (right, main image). This collection can be browsed, downloaded (see inset) or sent to Entrez (see next section). Clicking on the "**?**" sign next to a node name will open a tool tip (left panel, bottom, see yellow blurb)._](fig/Tree_PubChemSearch.png)
......@@ -350,7 +396,7 @@ as well as several metadata entries. These metadata entries contain
valuable information about the evidence contributing to the presence
of that structure in PubChem (_e.g.,_ contribution source(s) and date,
annotation information). Relevant fields are explained in Table 2
and shown in Figure 10.
and shown in Figure 11.
Table: _Relevant metadata files in the PubChem Download files._
......@@ -397,7 +443,7 @@ as explained in the next section.
It is possible to build more extensive queries via the
[Entrez](https://pubchemdocs.ncbi.nlm.nih.gov/advanced-search-entrez)
interface, which is accessible through the button below
the download button (see Figure 9) or by clicking the "Use Entrez"
the download button (see Figure 10) or by clicking the "Use Entrez"
option on the PubChem landing page. More documentation on Entrez is given
[here](https://pubchemdocs.ncbi.nlm.nih.gov/advanced-search-entrez).
This section steps through a few interactive examples.
......@@ -405,14 +451,14 @@ This section steps through a few interactive examples.
#### Example 1: Find all PFAS containing one linear C~4~F~9~ part with use information:
To find all molecules from the query in Figure 10 that also have
use information in PubChem, the first step is to send the 10,555 CIDs
from the query above to Entrez via the "Push to Entrez" option (Figure 9,
from the query above to Entrez via the "Push to Entrez" option (Figure 10,
second box encircled in red on the right). This opens a new page in the
Entrez interface (not shown).
Next, go to the "Use and Manufacturing" section of the
[PubChem TOC Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72),
send this to PubChem Search via the numbers next to the node (Figure 11,
red circle on left), and push to Entrez (Figure 11, top right). By
selecting the "Advanced" option under the search bar (Figure 11, top),
send this to PubChem Search via the numbers next to the node (Figure 12,
red circle on left), and push to Entrez (Figure 12, top right). By
selecting the "Advanced" option under the search bar (Figure 12, top),
the Advanced Search builder is opened and further queries can be built.
By selecting "#2 AND #6", only the 436 chemicals with a single
C~4~F~9~ linear PFAS part (query #2) that also have use and manufacturing
......@@ -426,14 +472,14 @@ Analytical chemists may, for instance, be particularly keen on finding
out which PFAS or organofluorine compounds have mass spectrometry information
available in PubChem (or in resources integrated within PubChem). It is
also possible to use the Entrez functionality to subset the tree
contents according to other available information - shown in Figure 12
contents according to other available information - shown in Figure 13
for this example. First, go to the "Mass Spectrometry" section of the
[PubChem TOC Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72),
which is under the "Spectral Information" heading, and send this query
to Entrez (see Figure 12 left and top right). Then, go back to the
to Entrez (see Figure 13 left and top right). Then, go back to the
[PubChem PFAS Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)
and ***refresh*** the contents. A new dropdown menu will appear
(if not already present) called "Filter by Entrez History" (Figure 12,
(if not already present) called "Filter by Entrez History" (Figure 13,
bottom right). By selecting the chosen query in this dropdown menu,
the tree will then be subset by the contents within that query, such
that only CIDs that are in the tree _and_ in the query will show
......@@ -459,7 +505,7 @@ by uploading this information via the
![_Subsetting Tree Contents via Entrez. Left: [PubChem TOC Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72), "Mass Spectrometry" subsection. Top right: the "Mass Spectrometry" query in PubChem Search (to be sent to Entrez). Bottom right: the [PubChem PFAS Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120) subset by Mass Spectrometry, now only displaying CIDs where mass spectrometry information is available in PubChem. Queries run on 27 March 2022._](fig/Entrez_MSandPFAS.png)
More examples coming soon ...
<!-- More examples coming soon ... -->
### Interactions via PUG REST {#pugrest}
......@@ -543,6 +589,8 @@ Nonetheless, some
technical details are necessary and are contained in this section, which
will be expanded as further questions arise.
### Compounds Excluded from the PubChem PFAS Tree
The [PubChem PFAS Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)
currently excludes molecules (compounds) from consideration if they:
......@@ -551,13 +599,14 @@ currently excludes molecules (compounds) from consideration if they:
Since the entire tree is constructed on CIDs (_i.e._, compounds), substance
entries (denoted by substance identifiers, SID) are also not included. Thus,
undefined or poorly defined entities are also not included.
undefined or poorly defined entities are also not included. Polymer entries
are also not included.
More information about the difference between compounds and substances
on PubChem is available
[here](https://pubchemblog.ncbi.nlm.nih.gov/2014/06/19/what-is-the-difference-between-a-substance-and-a-compound-in-pubchem/).
#### PFAS Test set:
### PFAS Test Set
A test set of PFAS and non-PFAS from the OECD Report
[@oecd_reconciling_2021] has been compiled to check the
performance of the
......@@ -572,11 +621,11 @@ The current approach still has room for improvement; the following are being add
in future developments (and will be released as ready). These include:
- Handling of ethers and other connecting atoms;
- Handling of unsaturated PFAS;
- Better browseability of special cases.
- Handling of unsaturated PFAS.
<!-- - Better browseability of special cases. -->
### Contact Details
## Contact Details
User feedback is extremely valuable to help improve this tree further.
Please reach out to either contact author (details on first page,
......@@ -584,10 +633,17 @@ or email [Evan](mailto:evan.bolton@nih.gov) and
[Emma](mailto:emma.schymanski@uni.lu) directly)
with feedback and comments!
If you have any suggestions for PFAS or fluorinated compound collections to
include in the "_PFAS and Fluorinated Compound Collections_" part
of the PubChem PFAS Tree, please email
[pubchem-help@ncbi.nlm.nih.gov](mailto:pubchem-help@ncbi.nlm.nih.gov) or
[normansle@uni.lu](mailto:normansle@uni.lu).
For general questions about PubChem and the functionality
described here, please reach out to the
[PubChem Help](mailto:pubchem-help@ncbi.nlm.nih.gov)
mailing list for further support.
[PubChem Help mailing list](mailto:pubchem-help@ncbi.nlm.nih.gov)
for further support.
<!-- ## Closing -->
......
......@@ -70,7 +70,7 @@
pages = {689059}
}
@techreport{oecd_reconciling_2021,
@electronic{oecd_reconciling_2021,
address = {Paris},
title = {Reconciling {Terminology} of the {Universe} of {Per}- and {Polyfluoroalkyl} {Substances}: {Recommendations} and {Practical} {Guidance}},
url = {https://www.oecd.org/chemicalsafety/portal-perfluorinated-chemicals/terminology-per-and-polyfluoroalkyl-substances.pdf},
......@@ -82,7 +82,7 @@
pages = {45}
}
@misc{mayfield_cdk,
@electronic{mayfield_cdk,
title = {{CDK} {Depict} {Web} {Interface}},
url = {http://simolecule.com/cdkdepict/depict.html},
urldate = {2022-03-24},
......@@ -90,14 +90,15 @@
author = {Mayfield, John}
}
@misc{pubchem_pfas_metfrag_2022,
@article{pubchem_pfas_metfrag_2022,
title = {{PubChem} {OECD} {PFAS} {Larger} {PFAS} {Parts} file for {MetFrag}},
copyright = {Creative Commons Attribution 4.0 International, Open Access},
url = {https://zenodo.org/record/6385954},
abstract = {This is a MetFrag database file constructed from the "Molecule contains PFAS parts larger than CF$_{\textrm{2}}$/CF$_{\textrm{3}}$" subnode of the OECD PFAS Definition node in the PFAS and Fluorinated Organic Compounds in PubChem Tree on the Classification Browser in PubChem. This file was constructed by downloading the node contents, selecting the columns of interest, changing the headers to MetFrag-compatible headers and adding exact mass, PubMed ID (PMID) counts and patent counts to the file (via this package). Entries containing Xe and Pr were removed. The construction of the tree is documented here (work in progress).},
urldate = {2022-03-26},
publisher = {Zenodo},
author = {{Schymanski, Emma} and {Bolton, Evan} and {Chirsir, Parviel} and {Kondic, Todor} and {Thiessen, Paul} and {Zhang, Jian}},
author = {Schymanski, Emma and Bolton, Evan and Chirsir, Parviel and Kondic, Todor and Thiessen, Paul and Zhang, Jian},
journal = {Zenodo},
month = mar,
year = {2022},
doi = {10.5281/zenodo.6385954},
......@@ -123,7 +124,7 @@ Type: dataset},
pages = {61}
}
@techreport{barnabas_extracting_2022,
@article{barnabas_extracting_2022,
type = {preprint},
title = {Extracting and {Comparing} {PFAS} from {Literature} and {Patent} {Documents} using {Open} {Access} {Chemistry} {Toolkits}},
url = {https://chemrxiv.org/engage/chemrxiv/article-details/6235c30d13d4787cda941789},
......@@ -131,6 +132,7 @@ Type: dataset},
urldate = {2022-03-25},
institution = {Chemistry},
author = {Barnabas, Shadrack and Böhme, Timo and Boyer, Stephen and Irmer, Matthias and Ruttkies, Christoph and Kondic, Todor and Schymanski, Emma and Weber, Lutz},
journal = {ChemRxiv - preprint},
month = mar,
year = {2022},
doi = {10.26434/chemrxiv-2022-nmnnd}
......