Commit 7a2e7e92 authored by Emma Schymanski

proofing edits

many small changes for readability and alignment
parent 6818fb4d
---
title: "PFAS and Fluorinated Organic Compounds in PubChem Tree"
author:
- "Emma L. Schymanski^1^, Parviel Chirsir^1^, Todor Kondic^1^"
- "Paul A. Thiessen^2^, Jian Zhang^2^ and Evan E. Bolton^2^"
- "Emma L. Schymanski^1^*, Parviel Chirsir^1^, Todor Kondic^1^,"
- "Paul A. Thiessen^2^, Jian Zhang^2^ and Evan E. Bolton^2^*"
date: "24/03/2022"
output: pdf_document
csl: journal-of-cheminformatics.csl
......@@ -18,15 +18,18 @@ knitr::opts_chunk$set(warning = FALSE, message = FALSE)
^1^ Luxembourg Centre for Systems Biomedicine (LCSB),
University of Luxembourg, 6 avenue du Swing, 4367, Belvaux, Luxembourg.
ELS: ORCID [0000-0001-6868-8145](http://orcid.org/0000-0001-6868-8145),
PC: ORCID [0000-0002-9932-8609](http://orcid.org/0000-0002-9932-8609),
TK: ORCID [0000-0001-6662-4375](https://orcid.org/0000-0001-6662-4375)
*ELS: [emma.schymanski@uni.lu](mailto:emma.schymanski@uni.lu).
ORCIDs: ELS: [0000-0001-6868-8145](http://orcid.org/0000-0001-6868-8145),
PC: [0000-0002-9932-8609](http://orcid.org/0000-0002-9932-8609),
TK: [0000-0001-6662-4375](https://orcid.org/0000-0001-6662-4375).
^2^ National Center for Biotechnology Information, National
Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
PAT: ORCID [0000-0002-1992-2086](https://orcid.org/0000-0002-1992-2086),
JZ: ORCID [0000-0002-6192-4632](https://orcid.org/0000-0002-6192-4632),
EEB: ORCID [0000-0002-5959-6190](http://orcid.org/0000-0002-5959-6190)
^2^ National Center for Biotechnology Information (NCBI), National
Library of Medicine (NLM), National Institutes of Health (NIH),
Bethesda, MD, 20894, USA.
*EEB: [evan.bolton@nih.gov](mailto:evan.bolton@nih.gov).
ORCIDs: PAT: [0000-0002-1992-2086](https://orcid.org/0000-0002-1992-2086),
JZ: [0000-0002-6192-4632](https://orcid.org/0000-0002-6192-4632),
EEB: [0000-0002-5959-6190](http://orcid.org/0000-0002-5959-6190)
## Preamble
......@@ -36,7 +39,7 @@ PubChem Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)"
(see Figure 1) on the
[Classification Browser](https://pubchem.ncbi.nlm.nih.gov/classification/)
in [PubChem](https://pubchem.ncbi.nlm.nih.gov/) [@kim_pubchem_2021],
developed in collaboration with the
developed in collaboration between PubChem (NCBI/NLM/NIH) and the
Environmental Cheminformatics group
([ECI](https://wwwen.uni.lu/lcsb/research/environmental_cheminformatics))
at the [LCSB](https://wwwen.uni.lu/lcsb/),
......@@ -46,6 +49,7 @@ and [Acknowledgements](#ack)).
The
[PFAS Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)
(see [Figure 1](#treenodes))
includes all compounds in [PubChem](https://pubchem.ncbi.nlm.nih.gov/)
that satisfy various definitions, as explained later in this document.
Each compound in PubChem has a PubChem Compound Identifier (CID), and the
......@@ -54,6 +58,16 @@ compounds (_i.e._ CIDs) in that node.
<!-- ![The "[PFAS and Fluorinated Organic Compounds in PubChem Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)" Landing Page.](fig/PFAS_Tree_Landing.png) -->
To become more familiar with the PubChem Classification Browser features
in general before embarking on content specific to the PFAS tree,
see Section [Navigating the Tree](#search).
There is also extensive documentation on the PubChem website, see:
- https://pubchem.ncbi.nlm.nih.gov/classification/
- https://pubchemdocs.ncbi.nlm.nih.gov/classification-browser
- https://pubchem.ncbi.nlm.nih.gov/classification/docs/classification_help.html
## Contents
......@@ -74,15 +88,15 @@ compounds (_i.e._ CIDs) in that node.
<!-- |References | [Go to heading](#statements) | 6 | -->
To become more familiar with the PubChem Classification Browser features
in general before embarking on content specific to the PFAS tree,
see Section [Navigating the Tree](#search).
<!-- To become more familiar with the PubChem Classification Browser features -->
<!-- in general before embarking on content specific to the PFAS tree, -->
<!-- see Section [Navigating the Tree](#search). -->
There is also extensive documentation on the PubChem website, see:
<!-- There is also extensive documentation on the PubChem website, see: -->
- https://pubchem.ncbi.nlm.nih.gov/classification/
- https://pubchemdocs.ncbi.nlm.nih.gov/classification-browser
- https://pubchem.ncbi.nlm.nih.gov/classification/docs/classification_help.html
<!-- - https://pubchem.ncbi.nlm.nih.gov/classification/ -->
<!-- - https://pubchemdocs.ncbi.nlm.nih.gov/classification-browser -->
<!-- - https://pubchem.ncbi.nlm.nih.gov/classification/docs/classification_help.html -->
## PubChem PFAS Tree Nodes {#treenodes}
......@@ -104,17 +118,21 @@ This node is constructed out of per- and polyfluoroalkyl substances
CF~3~ part) in the 2021 OECD Report
[ENV/CBC/MONO(2021)25](https://www.oecd.org/officialdocuments/publicdisplaydocumentpdf/?cote=ENV/CBC/MONO(2021)25&docLanguage=En)
[@oecd_reconciling_2021].
Browsing the 6 million entries in this node is a challenge.
Since the majority of these PFAS contained isolated CF~2~ (600 K) or
CF~3~ groups (5.4 M), these were separated into individual sections.
188 K compounds remain that contain PFAS parts larger than CF~2~/CF~3~.
<!-- #### PFAS "parts": -->
Note that here, "**PFAS part**" is used to describe a connected portion of
the molecule that satisfies the OECD PFAS definition. A given molecule may have
more than one PFAS part present; some examples are given in Figure 2,
along with the count of parts. For further information, see [Details](#details).
Browsing the 6 million entries in this node (see Figure 3) is a challenge.
Since the majority of these PFAS contained isolated CF~2~ (600 K) or
CF~3~ groups (5.4 M), these were separated into individual sections
<!-- (see "[_isolated CF~2~ and CF~3~_](#isonodes)"). -->
(see [next section](#isonodes)).
188 K compounds contain PFAS parts larger than CF~2~/CF~3~
(see "[larger PFAS parts](#largerparts)").
<!-- #### PFAS "parts": -->
![Examples of molecules with varying PFAS parts highlighted, drawn using CDK Depict [@mayfield_cdk].](fig/PFAS_parts_CDK.png)
......@@ -125,7 +143,7 @@ with the top two level subnodes, is shown in Figure 3.
### OECD PFAS - Isolated CF~2~ and CF~3~ Nodes:
### OECD PFAS - Isolated CF~2~ and CF~3~ Nodes {#isonodes}
The _Isolated CF~2~ and CF~3~_ subnodes of the _OECD PFAS Definition_ node
allow the browsing of all PFAS
......@@ -133,7 +151,7 @@ molecules in PubChem containing at least one isolated CF~2~ (top subnode)
or one isolated CF~3~ (next subnode). These are broken down similarly,
as shown in Figure 4 for CF~2~.
![The isolated CF~2~ section of the OECD PFAS Definition node, with breakdown of the major parts (24 March 2022).](fig/OECDPFAS_CF2combi.png)
![The isolated CF~2~ section of the OECD PFAS Definition node, with breakdown of the major parts (numbers as of 24 March 2022).](fig/OECDPFAS_CF2combi.png)
The larger PFAS parts (left) are broken down by part type (linear, branched,
_etc._). Within these subcategories, dynamic construction is used.
......@@ -153,7 +171,7 @@ the number of groups, sorted by increasing number of CF~2~ groups
### OECD PFAS - PFAS Parts Larger than CF~2~/CF~3~
### OECD PFAS - PFAS Parts Larger than CF~2~/CF~3~ {#largerparts}
The "_Molecule contains PFAS parts larger than CF~2~/CF~3~_" part of the
OECD PFAS node includes approx. 188K molecules, which can be browsed
......@@ -172,7 +190,8 @@ Should there be fewer than ~20 categories,
the immediate breakdown is by the formula of the parts (see
Figure 5, bottom right, "_Contains 11 isolated PFAS parts_").
Should there be more than 20 entries, an extra layer is added,
to sort by the type of PFAS part (see Figure 5, top right).
to sort by the type of PFAS part (see Figure 5, top right,
"_Contains 10 isolated PFAS parts_").
For categories with very large numbers of entries, an additional
initial breakdown by the count of molecules is added (Figure 5,
middle panel). This is again broken down dynamically. If only a
......@@ -192,7 +211,7 @@ part type (linear, cyclic, _etc._) (Figure 6, left panel). These are
again split dynamically. With fewer than 20 entries, the list split
according to PFAS part formulas appears. If a greater breakdown is needed,
an extra layer of "_Also contains ..._" or "_Only contains ..._" is
added for extra navigation (Figure 6, top left, "_Contains isolated
added for extra navigation (Figure 6, mid left, "_Contains isolated
branched PFAS part_"). For entries containing many CIDs, a breakdown
by count of molecules is added (Figure 6, mid left, "_Contains isolated
linear PFAS part_"). Generally, the linear entries contain more entries
......@@ -206,31 +225,37 @@ breakdown by "_Also contains..._".
The dynamic navigation approach reduces the scrolling by users and
also helps reduce the data loading time, when many entries are
also helps reduce the data loading time when many entries are
present within a node. It is possible to use some
advanced search and querying capabilities to improve the interaction
with the tree, see [Navigating the Tree](#search) below.
The _PFAS Parts Larger than CF~2~/CF~3~_ section is available as
a [MetFrag](https://msbi.ipb-halle.de/MetFrag/) file for further use
a [MetFrag](https://msbi.ipb-halle.de/MetFrag/)
[@ruttkies_metfrag_2016] file for further use
[@pubchem_pfas_metfrag_2022]. The CSV can be downloaded from Zenodo
(DOI:[10.5281/zenodo.6385954](https://doi.org/10.5281/zenodo.6385954))
for use in MetFragCL and will be made available from the MetFragWeb
drop down menu. See the description on the Zenodo record for
more details.
(DOI: [10.5281/zenodo.6385954](https://doi.org/10.5281/zenodo.6385954))
for use in
[MetFragCL](https://ipb-halle.github.io/MetFrag/projects/metfragcl/)
and will be made available from the
[MetFragWeb](https://msbi.ipb-halle.de/MetFrag/)
drop down menu. See the description on the Zenodo record
[@pubchem_pfas_metfrag_2022] for more details.
### Organofluorine Compounds {#orgf}
This node contains _organofluorine compounds_ as defined in Figure 8 in
This node contains _Organofluorine compounds_ as defined in Figure 8 in
the 2021 OECD PFAS Report
[ENV/CBC/MONO(2021)25](https://www.oecd.org/officialdocuments/publicdisplaydocumentpdf/?cote=ENV/CBC/MONO(2021)25&docLanguage=En)
[@oecd_reconciling_2021].
Figure 7 in the current report shows an extract from Figure 8 of
the OECD report on the left panel,
and the corresponding node breakdown in the _Organofluorine compounds_
section of the PubChem PFAS Tree to the right. Note that one additional
section of the
[PubChem PFAS Tree](https://pubchem.ncbi.nlm.nih.gov/classification/#hid=120)
to the right. Note that one additional
category was added ("_Other fluorinated substances_") to capture content
that did not fit into any other category defined in the OECD figure.
......@@ -245,8 +270,9 @@ present. For instance, the "_Fluorinated aliphatic substances that have a
fully fluorinated methyl or methylene carbon atom_" category starts at
"_Contains 02 Fluorine atoms_" as no entries in this category could contain
only one F.
The exact mass is split into the ranges 1-250, 250-500, 500-750, 750-1000
and >1000.
The exact mass subcategories are split into the ranges
1-250, 250-500, 500-750, 750-1000 and >1000, and again are only displayed
if there are CIDs within the range.
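
The exact mass breakdown described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the tree. The names `mass_bin` and `non_empty_bins` are hypothetical. The text does not say which range a boundary value such as exactly 250 falls into, so placing it in the lower range is an assumption.

```python
from bisect import bisect_left
from collections import Counter

# Upper edges of the bounded ranges named in the text; ">1000" is open-ended.
EDGES = [250, 500, 750, 1000]
LABELS = ["1-250", "250-500", "500-750", "750-1000", ">1000"]


def mass_bin(exact_mass):
    """Return the range label for one exact mass.

    Assumption: a value equal to an edge (e.g. 250) goes to the lower range.
    """
    return LABELS[bisect_left(EDGES, exact_mass)]


def non_empty_bins(masses):
    """Count masses per range and keep only ranges that contain entries,
    mirroring the 'only displayed if there are CIDs' behaviour."""
    counts = Counter(mass_bin(m) for m in masses)
    return [(label, counts[label]) for label in LABELS if counts[label] > 0]
```

Calling `non_empty_bins` on a list of exact masses returns the populated ranges in display order, so empty ranges never appear in the result.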
<!-- The exact mass is split as follows: -->
......@@ -259,13 +285,14 @@ and >1000.
### PFAS and Fluorinated Organic Compound Collections {#lists}
This section of the PFAS tree contains various lists gathered
across PubChem content. Additional community-based PFAS lists may
The "_PFAS and Fluorinated Organic Compound Collections_"
section of the PFAS tree contains various lists gathered
across PubChem content (see Figure 8). Additional community-based PFAS lists may
also be added here. The mapping files to construct this are kept
on the [eci/pubchem](https://gitlab.lcsb.uni.lu/eci/pubchem/)
repository on GitLab.
![The "PFAS and Fluorinated Organic Compound Collections" node, with all major collections shown (as of 24 March 2022).](fig/PFAS_list_of_lists.png)
![The "PFAS and Fluorinated Organic Compound Collections" node, with all major collections shown (CompTox as inset). Numbers and content listing from 24 March 2022.](fig/PFAS_list_of_lists.png)
Currently, the content (see Figure 8) comes from:
......@@ -298,26 +325,32 @@ sections.
### Search via PubChem Search {#pc-search}
Perhaps the most intuitive interaction is directly through
clicking on the numbers beside each node. This sends a query
clicking on the numbers beside each node (see Figure 9). This sends a query
directly to the PubChem Search interface and displays the
entire node contents, as shown in Figure 9. This query follows
"OECD PFAS Definition" > "Molecule contains PFAS parts larger than
CF~2~/CF~3~" > "Breakdown by isolated PFAS part count" >
"Contains 01 isolated PFAS part" > "Count of molecules 10001-100000" >
"Contains 01xC04F09-linear" and returns the 10,555 CIDs with
a C~4~F~9~ PFAS part. This query could then be downloaded,
or sent to Entrez (see next section).
"_OECD PFAS Definition_" > "_Molecule contains PFAS parts larger than
CF~2~/CF~3~_" > "_Breakdown by isolated PFAS part count_" >
"_Contains 01 isolated PFAS part_" > "_Count of molecules 10001-100000_" >
"_Contains 01xC04F09-linear_" and returns the 10,555 CIDs with
a C~4~F~9~ PFAS part. This query can then be downloaded (Figure 9, inset),
or sent to Entrez for advanced querying (see next section).
Note that clicking on the "?" beside a node (where present) will open a tool tip
explaining the node contents (Figure 9, bottom left).
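
Node labels such as "_Contains 01xC04F09-linear_" pack the part count, the part formula and the part type into one string. A minimal Python sketch for unpacking such a label follows. The pattern `<count>x<formula>-<type>` is inferred solely from the example above and may not cover every label in the tree. The name `parse_part_label` is hypothetical.

```python
import re

# Assumed label pattern, inferred from "Contains 01xC04F09-linear":
# zero-padded count, "x", zero-padded formula, "-", part type.
LABEL_RE = re.compile(
    r"(?P<count>\d+)x(?P<formula>(?:[A-Z][a-z]?\d*)+)-(?P<kind>[A-Za-z]+)"
)
ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_part_label(label):
    """Split a node label into (part count, element counts, part type)."""
    match = LABEL_RE.search(label)
    if match is None:
        raise ValueError(f"label does not follow the assumed pattern: {label!r}")
    # An element without an explicit number counts once.
    atoms = {
        element: int(number) if number else 1
        for element, number in ELEMENT_RE.findall(match["formula"])
    }
    return int(match["count"]), atoms, match["kind"]


print(parse_part_label("Contains 01xC04F09-linear"))
# prints (1, {'C': 4, 'F': 9}, 'linear')
```

The result for the example label corresponds to one linear C~4~F~9~ part, matching the query described above.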
![Querying node contents in PubChem Search. When clicking on the blue numbers (left), a search window will open in a new tab (right, main image). This collection can be browsed, downloaded (see inset) or sent to Entrez (see next section). Clicking on the "?" sign will open a tool tip (left panel, yellow blurb).](fig/Tree_PubChemSearch.png)
![Querying node contents in PubChem Search. When clicking on the blue numbers (left), a search window will open in a new tab (right, main image). This collection can be browsed, downloaded (see inset) or sent to Entrez (see next section). Clicking on the "?" sign will open a tool tip (left panel, bottom, see yellow blurb).](fig/Tree_PubChemSearch.png)
The download file contains a number of fields of interest,
including: CIDs, names and synonyms, several properties (_e.g._ XlogP),
structural information (molecular formula, SMILES, InChI, InChIKey)
as well as several metadata entries. These relevant ones are explained
as well as several metadata entries. The metadata fields that indicate
how much information is available to support the structures in the tree
are explained
in the following table:
Table: Relevant metadata fields in the PubChem download files
| Header | Description | Type |
|-----------|---------|:----:|
|-----------|---------|----|
|annothits| Annotation categories present for this CID | Text |
|annothitcount | Count of annotation categories for CID | Numeric |
|cidcdate | CID creation date | YYYYMMDD |
......
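
As a rough illustration of how the metadata fields listed above could be read from a downloaded CSV, the following Python sketch converts them into typed values. The header names `annothits`, `annothitcount` and `cidcdate` come from the table. The `cid` column name, the `|` separator inside `annothits` and the sample row are assumptions made for this example only, not real PubChem data.

```python
import csv
import io
from datetime import datetime


def read_metadata(csv_text, sep="|"):
    """Parse the metadata columns of a download file into typed values.

    Assumptions: a 'cid' column exists and annotation categories in
    'annothits' are separated by `sep`.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "cid": int(row["cid"]),
            "annothits": [hit for hit in row["annothits"].split(sep) if hit],
            "annothitcount": int(row["annothitcount"]),
            # cidcdate is documented as YYYYMMDD
            "cidcdate": datetime.strptime(row["cidcdate"], "%Y%m%d").date(),
        })
    return rows


# Invented placeholder row, for illustration only.
sample = (
    "cid,annothits,annothitcount,cidcdate\n"
    "1,Category A|Category B,2,20050326\n"
)
records = read_metadata(sample)
```

Sorting or filtering `records` by `annothitcount` would then give a quick view of which CIDs have the most supporting annotation.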
......@@ -135,3 +135,21 @@ Type: dataset},
year = {2022},
doi = {10.26434/chemrxiv-2022-nmnnd}
}
@article{ruttkies_metfrag_2016,
title = {{MetFrag} relaunched: incorporating strategies beyond in silico fragmentation},
volume = {8},
copyright = {All rights reserved},
issn = {1758-2946},
shorttitle = {{MetFrag} relaunched},
url = {http://www.jcheminf.com/content/8/1/3},
doi = {10.1186/s13321-016-0115-9},
language = {en},
number = {1},
urldate = {2018-11-01},
journal = {Journal of Cheminformatics},
author = {Ruttkies, Christoph and Schymanski, Emma L. and Wolf, Sebastian and Hollender, Juliane and Neumann, Steffen},
month = dec,
year = {2016},
pages = {3},
}