Commit b76d9ed2 authored by Emma Schymanski

add PFAS Tree docs

First version - still work in progress!
parent 78c66eb3
---
title: "PFAS and Fluorinated Organic Compounds in PubChem Tree"
author:
- "Emma L. Schymanski^1^, Parviel Chirsir^1^,"
- "Paul A. Thiessen^2^, Jian Zhang^2^ and Evan E. Bolton^2^"
date: "24/03/2022"
output: pdf_document
csl: journal-of-cheminformatics.csl
bibliography: refs.bib
urlcolor: blue
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#knitr::opts_chunk$set(fig.pos = "!H", out.extra = "")
```
^1^ Luxembourg Centre for Systems Biomedicine (LCSB),
University of Luxembourg, 6 avenue du Swing, 4367, Belvaux, Luxembourg.
ELS: ORCID 0000-0001-6868-8145,
PC: ORCID 0000-0002-9932-8609

^2^ National Center for Biotechnology Information, National
Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
PAT: ORCID 0000-0002-1992-2086,
JZ: ORCID 0000-0002-6192-4632,
EEB: ORCID 0000-0002-5959-6190
## Preamble
This document describes the "PFAS and Fluorinated Organic Compounds in
PubChem Tree" (see Figure 1) on the
Classification Browser
in PubChem [@kim_pubchem_2021],
developed in collaboration with the
Environmental Cheminformatics group
at the LCSB,
University of Luxembourg, in consultation with
several community representatives (see Acknowledgements).
![The "PFAS and Fluorinated Organic Compounds in PubChem Tree" Landing Page.](fig/PFAS_Tree_Landing.png)
## Contents
This document is organised as follows:
| Section | Navigation | PDF Page |
|---|---|---|
|PubChem PFAS Tree Nodes | [Go to heading](#treenodes) | 2 |
|_OECD PFAS Definition_ | [Go to heading](#oecddef) | 2 |
|_Organofluorine Compounds_ | [Go to heading](#orgf) | ... |
|_PFAS and Fluorinated Organic Compound Collections_ | [Go to heading](#lists) | ... |
|Searching the Tree | [Go to heading](#search) | ... |
|_Search via PubChem Search_ | [Go to heading](#pc-search) | ... |
|_Interactions via Entrez_ | [Go to heading](#entrez) | ... |
|Implementation | [Go to heading](#impl) | ... |
|Statements and References | [Go to heading](#statements) | ... |
## PubChem PFAS Tree Nodes {#treenodes}
The tree is currently split into three main nodes (OECD PFAS Definition,
Organofluorine Compounds, and PFAS and Fluorinated Organic Compound
Collections) that are constructed and compiled separately. More nodes are
under development and will be released as they become ready.
Further details are given in the sections below.
### OECD PFAS Definition {#oecddef}
This node is constructed out of per- and polyfluoroalkyl substances
(PFAS) satisfying the OECD 2021 definition (contains at least one saturated CF~2~ or
CF~3~ part), as given in the OECD Report ENV/CBC/MONO(2021)25 (9 July 2021).
As this node includes over 6 million entries, browseability is a challenge.
Since the majority of these PFAS contain isolated CF~2~ or CF~3~ groups,
these were separated into individual sections. There are approx.
600 K compounds with isolated CF~2~ groups, and approx. 5.4 M compounds
with isolated CF~3~ groups. The remaining 188 K compounds contain PFAS
parts larger than a single CF~2~ or CF~3~ group.
#### PFAS "parts":
Note that here, "part" is used to describe a connected portion of the molecule
that satisfies the OECD PFAS definition. A given molecule may contain
more than one PFAS part. Some examples are given in Figure 2,
along with the count of parts.
![Examples of molecules with varying PFAS parts highlighted, drawn using CDK Depict [@mayfield_cdk].](fig/PFAS_parts_CDK.png)
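
As a rough illustration of how parts can be counted, the Python sketch below works on a toy molecular graph. It is not the PubChem implementation. It assumes that a carbon qualifies when it has four single bonds, carries at least two fluorine atoms and has no H, Cl, Br or I attached, and that a part is a connected group of such carbons. The function name `count_pfas_parts` and the atom/bond encoding are hypothetical.

```python
from collections import defaultdict


def count_pfas_parts(atoms, bonds):
    """Count connected groups of qualifying CF2/CF3 carbons (toy model).

    atoms: list of element symbols, with explicit hydrogens.
    bonds: list of (i, j, order) tuples indexing into atoms.
    """
    neighbours = defaultdict(list)
    for i, j, order in bonds:
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))

    def qualifies(i):
        # Assumption: saturated carbon, >= 2 F, no H/Cl/Br/I attached.
        if atoms[i] != "C":
            return False
        attached = neighbours[i]
        if len(attached) != 4 or any(order != 1 for _, order in attached):
            return False
        elements = [atoms[j] for j, _ in attached]
        if any(e in ("H", "Cl", "Br", "I") for e in elements):
            return False
        return elements.count("F") >= 2

    qualifying = {i for i in range(len(atoms)) if qualifies(i)}

    # Depth-first search over qualifying carbons to count connected groups.
    seen, parts = set(), 0
    for start in qualifying:
        if start in seen:
            continue
        parts += 1
        seen.add(start)
        stack = [start]
        while stack:
            current = stack.pop()
            for j, _ in neighbours[current]:
                if j in qualifying and j not in seen:
                    seen.add(j)
                    stack.append(j)
    return parts


# 1,1,1,3,3,3-hexafluoropropane (CF3-CH2-CF3): two CF3 groups split by a CH2.
atoms = ["C", "F", "F", "F", "C", "H", "H", "C", "F", "F", "F"]
bonds = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1),
         (4, 5, 1), (4, 6, 1), (4, 7, 1),
         (7, 8, 1), (7, 9, 1), (7, 10, 1)]
print(count_pfas_parts(atoms, bonds))  # 2
```

Under these assumptions, hexafluoroethane (CF~3~-CF~3~) counts as a single part, because its two qualifying carbons are directly bonded.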
#### The OECD PFAS Definition node
The OECD PFAS Definition node, with its top two subnodes, is shown in Figure 3.
![The OECD PFAS Definition part of the PFAS tree, with top two subnodes (numbers from 24 March 2022).](fig/OECDPFAS_TopTwoSubnodes.png)
#### Isolated CF~2~ and CF~3~ Nodes
The top two subnodes of the OECD PFAS Definition node allow the browsing of all PFAS
molecules in PubChem containing at least one isolated CF~2~ part (top subnode)
or one isolated CF~3~ part (next subnode). These are broken down similarly,
as shown in Figure 4 for the CF~2~ case.
![The isolated CF~2~ section of the OECD PFAS Definition node, with breakdown of the major parts (numbers from 24 March 2022)](fig/OECDPFAS_CF2combi.png)
The larger PFAS parts (left) are broken down by part type (linear, branched,
_etc._). Within these subcategories, dynamic construction is used.
If many (>20) variants are present, a breakdown by number of PFAS parts
is added (bottom left, Figure 4, see "Contains isolated unsaturated-linear
PFAS part"); otherwise, the possibilities are listed directly
(middle left, Figure 4, see "Contains isolated unsaturated-cyclic part").
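
The dynamic construction rule can be sketched in Python as follows. This is a minimal illustration, not the PubChem implementation: the function name `build_subnodes`, the input shape (a mapping from variant label to its number of PFAS parts) and the output layout are assumptions. Only the threshold of 20 variants comes from the text.

```python
from collections import defaultdict

MAX_DIRECT_VARIANTS = 20  # threshold from the text (">20 variants")


def build_subnodes(variants):
    """Return child nodes for one subcategory (hypothetical layout).

    variants: dict mapping a variant label to its number of PFAS parts.
    With up to 20 variants, return a sorted list of labels.
    With more, return a dict keyed by the number of PFAS parts.
    """
    if len(variants) <= MAX_DIRECT_VARIANTS:
        return sorted(variants)
    by_count = defaultdict(list)
    for label, n_parts in variants.items():
        by_count[n_parts].append(label)
    return {n: sorted(labels) for n, labels in sorted(by_count.items())}
```

Calling `build_subnodes` with a small mapping returns a flat list of labels, while a mapping with 21 or more entries returns an extra level keyed by part count.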
The "Contains only isolated CF~2~" (or, for the CF~3~ node, only isolated
CF~3~) node (Figure 4, middle panel) is broken down by the number of
isolated groups (CF~2~ or, for the CF~3~ node, by CF~3~ groups). In both
cases the vast majority of molecules have only one isolated group.
The "Contains only isolated CF~2~/CF~3~" node is also broken down by
the number of groups, sorted by increasing number of CF~2~ groups
(for both nodes); see Figure 4, right panel.
### PFAS Parts Larger than CF~2~/CF~3~
This is the section for larger PFAS parts ...
![The OECD PFAS Definition part of the tree, with top two subnodes (numbers from 24 March 2022)](fig/OECDPFAS_largerPFASparts.png)
### Organofluorine Compounds {#orgf}
Content still to come ...
### PFAS and Fluorinated Organic Compound Collections {#lists}
Content still to come ...
## Searching the Tree {#search}
While the tree offers possibilities for browsing and searching
PFAS and other organofluorine content, there are several powerful
search capabilities that extend this further.
### Search via PubChem Search {#pc-search}
Content still to come ...
### Interactions via Entrez {#entrez}
Content still to come ...
## Implementation {#impl}
Content still to come ...
- test set
- notes on implementation
<!-- ## Closing -->
## Statements and References {#statements}
### Author Contributions {#contrib}
<!-- -->
ELS: Conceptualization (equal), data curation, methodology, software, validation, writing - original draft preparation, writing - review and editing.
PC: Validation (supporting).
PAT: Data curation, methodology, software.
JZ: Data curation, methodology, software.
EEB: Conceptualization (equal), data curation, methodology, software (lead), validation, writing - original draft preparation, writing - review and editing.
### Acknowledgements {#ack}
We would like to acknowledge discussions with Zhanyun Wang (EMPA, CH),
Hans Peter Arp (NGI, NO), Ian Cousins, Luc Miaz, Jon Martin (ACES, SE),
as well as other project members of ZeroPM.
We would also like to acknowledge discussions and contributions from
various members of the ECI and PubChem teams that were not directly
involved in these efforts but have likely contributed indirectly
through our many other collaborative efforts!
### References {#ref}
<?xml version="1.0" encoding="utf-8"?>
<style xmlns="" version="1.0" default-locale="en-US">
<!-- Generated with -->
<title>Journal of Cheminformatics</title>
<link href="" rel="self"/>
<link href="" rel="independent-parent"/>
<link href="" rel="documentation"/>
<link href="" rel="documentation"/>
<category citation-format="numeric"/>
<rights license="">This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License</rights>
title = {Empowering large chemical knowledge bases for exposomics: {PubChemLite} meets {MetFrag}},
volume = {13},
issn = {1758-2946},
shorttitle = {Empowering large chemical knowledge bases for exposomics},
url = {},
doi = {10.1186/s13321-021-00489-0},
abstract = {
Compound (or chemical) databases are an invaluable resource for many scientific disciplines. Exposomics researchers need to find and identify relevant chemicals that cover the entirety of potential (chemical and other) exposures over entire lifetimes. This daunting task, with over 100 million chemicals in the largest chemical databases, coupled with broadly acknowledged knowledge gaps in these resources, leaves researchers faced with too much—yet not enough—information at the same time to perform comprehensive exposomics research. Furthermore, the improvements in analytical technologies and computational mass spectrometry workflows coupled with the rapid growth in databases and increasing demand for high throughput “big data” services from the research community present significant challenges for both data hosts and workflow developers. This article explores how to reduce candidate search spaces in non-target small molecule identification workflows, while increasing content usability in the context of environmental and exposomics analyses, so as to profit from the increasing size and information content of large compound databases, while increasing efficiency at the same time. In this article, these methods are explored using PubChem, the NORMAN Network Suspect List Exchange and the in silico fragmentation approach MetFrag. A subset of the PubChem database relevant for exposomics, PubChemLite, is presented as a database resource that can be (and has been) integrated into current workflows for high resolution mass spectrometry. Benchmarking datasets from earlier publications are used to show how experimental knowledge and existing datasets can be used to detect and fill gaps in compound databases to progressively improve large resources such as PubChem, and topic-specific subsets such as PubChemLite. 
PubChemLite is a living collection, updating as annotation content in PubChem is updated, and exported to allow direct integration into existing workflows such as MetFrag. The source code and files necessary to recreate or adjust this are jointly hosted between the research parties (see data availability statement). This effort shows that enhancing the FAIRness (Findability, Accessibility, Interoperability and Reusability) of open resources can mutually enhance several resources for whole community benefit. The authors explicitly welcome additional community input on ideas for future developments.},
language = {en},
number = {1},
urldate = {2021-03-19},
journal = {Journal of Cheminformatics},
author = {Schymanski, Emma L. and Kondić, Todor and Neumann, Steffen and Thiessen, Paul A. and Zhang, Jian and Bolton, Evan E.},
month = dec,
year = {2021},
pages = {19},
note = {[cito:citesAsDataSource]},
addendum = {[cito:citesAsDataSource]}}
title = {{PubChem} in 2021: new data content and improved web interfaces},
volume = {49},
issn = {0305-1048, 1362-4962},
shorttitle = {{PubChem} in 2021},
url = {},
doi = {10.1093/nar/gkaa971},
abstract = {PubChem is a popular chemical information resource that serves the scientific community as well as the general public, with millions of unique users per month. In the past two years, PubChem made substantial improvements. Data from more than 100 new data sources were added to PubChem, including chemical-literature links from Thieme Chemistry, chemical and physical property links from SpringerMaterials, and patent links from the World Intellectual Properties Organization (WIPO). PubChem's homepage and individual record pages were updated to help users find desired information faster. This update involved a data model change for the data objects used by these pages as well as by programmatic users. Several new services were introduced, including the PubChem Periodic Table and Element pages, Pathway pages, and Knowledge panels. Additionally, in response to the coronavirus disease 2019 (COVID-19) outbreak, PubChem created a special data collection that contains PubChem data related to COVID-19 and the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).},
language = {en},
number = {D1},
urldate = {2021-01-05},
journal = {Nucleic Acids Research},
author = {Kim, Sunghwan and Chen, Jie and Cheng, Tiejun and Gindulyte, Asta and He, Jia and He, Siqian and Li, Qingliang and Shoemaker, Benjamin A and Thiessen, Paul A and Yu, Bo and Zaslavsky, Leonid and Zhang, Jian and Bolton, Evan E},
month = jan,
year = {2021},
pages = {D1388--D1395}
title = {{FAIR} chemical structures in the {Journal} of {Cheminformatics}},
volume = {13},
copyright = {All rights reserved},
issn = {1758-2946},
url = {},
doi = {10.1186/s13321-021-00520-4},
language = {en},
number = {1},
urldate = {2021-07-15},
journal = {Journal of Cheminformatics},
author = {Schymanski, Emma L. and Bolton, Evan E.},
month = dec,
year = {2021},
pages = {50}
title = {Discovering and {Summarizing} {Relationships} {Between} {Chemicals}, {Genes}, {Proteins}, and {Diseases} in {PubChem}},
volume = {6},
issn = {2504-0537},
url = {},
doi = {10.3389/frma.2021.689059},
abstract = {The literature knowledge panels developed and implemented in PubChem are described. These help to uncover and summarize important relationships between chemicals, genes, proteins, and diseases by analyzing co-occurrences of terms in biomedical literature abstracts. Named entities in PubMed records are matched with chemical names in PubChem, disease names in Medical Subject Headings (MeSH), and gene/protein names in popular gene/protein information resources, and the most closely related entities are identified using statistical analysis and relevance-based sampling. Knowledge panels for the co-occurrence of chemical, disease, and gene/protein entities are included in PubChem Compound, Protein, and Gene pages, summarizing these in a compact form. Statistical methods for removing redundancy and estimating relevance scores are discussed, along with benefits and pitfalls of relying on automated (i.e., not human-curated) methods operating on data from multiple heterogeneous sources.},
urldate = {2021-07-19},
journal = {Frontiers in Research Metrics and Analytics},
author = {Zaslavsky, Leonid and Cheng, Tiejun and Gindulyte, Asta and He, Siqian and Kim, Sunghwan and Li, Qingliang and Thiessen, Paul and Yu, Bo and Bolton, Evan E.},
month = jul,
year = {2021},
pages = {689059}
address = {Paris},
title = {Reconciling {Terminology} of the {Universe} of {Per}- and {Polyfluoroalkyl} {Substances}: {Recommendations} and {Practical} {Guidance}},
url = {},
number = {No. 61},
urldate = {2021-11-14},
institution = {OECD Publishing},
author = {{OECD}},
year = {2021},
pages = {45}
title = {{CDK} {Depict} {Web} {Interface}},
url = {},
urldate = {2022-03-24},
year = {2022},
author = {Mayfield, John}