Commit 80032215 authored by Todor Kondic, committed by Emma Schymanski

pubchemlite: Redesign input to pubchemlite

The input and build scripts have so far been tightly coupled. For
example, the filter list input had to follow a specific order, and each
time another PubChem category was added to PubChemLite, a matching
condition had to be created in rest_grab_props.pl.

This unnecessarily inflexible approach was replaced by two different
inputs -- a mapping file which relates the PubChem table of contents
to the MetFrag columns, and a manifest file in a human-readable YAML
format which allows the user to direct the build in a convenient
fashion.

Logging and error interception were further improved.

The most important changes are reflected in the top-level driver
script pb-lite-driver.sh and in rest_grab_props.pl. The new file
read_manifest.pl generates the filter bits and the legend file used
by rest_grab_props.pl to set up the MetFrag column headers.
parent 8c9b867b
@@ -2,6 +2,7 @@
A project for interactions with PubChem
* [PubChemLite Build Process Documentation](pubchemlite/README.org)
Content in this repository is shared under these license conditions:
* Code: Artistic-2.0
......
#+TITLE: PubChemLite Generation Scripts
* Overview
The entire process of building is controlled by the [[file:pb-lite.sh][pb-lite.sh]] driver
script. This is just an example script and should be adapted to the
platform it runs on. We run it periodically in a cron job.
The driver script has several inputs (these are controlled through
the variables inside the script):
- PB_REPO :: The address of this git repository.
- INPUTDIR :: Path to the directory containing the input
/filter_listN.lst/ files.
- OUTDIR :: Path to the directory which will contain the dump of the
build for later inspection.
- MF_DIR :: Path to the directory which contains the resulting
MetFrag compatible CSVs.
- MF_DIR_01 :: A backup of the previous directory (~MF_DIR~).
- TOPDIR :: The top-level build directory.
The files ~filter_listN.lst~, where N stands for the
index of the tier being built, contain the indices that
go into PubChemLite.
The process begins by cloning and updating this repository into a
folder in ~TOPDIR~. Then it continues in several stages:
1. If there was anything in the build folder, move it to a backup
location in the ~OUTDIR~.
2. The script [[file:fetch_sources]] downloads and verifies the files
using MD5 hashes. Logged in ~tiers~ under ~TOPDIR~.
3. The script [[file:fetch_scripts]] downloads the Perl builder scripts
from PubChem. Logged in ~tiers~ under ~TOPDIR~.
4. With the sources present, the [[file:sanity_prebuild]] checks for
The entire process of building is controlled by the [[file:pb-lite-driver.sh][pb-lite-driver.sh]]
driver script. We run it periodically in a cron job.
The driver script has a single input argument, ~INPUTDIR~, which
contains two input files:
- manifest file :: a YAML file (named ~manifest.yaml~) configuring the
parameters of the build process;
- bit/category/metfrag map file :: A CSV (with a /.map/ extension)
with three columns - *BIT*, *CATEGORY* and *METFRAGCOL*; each row
represents a mapping between an index bit, the corresponding PubChem
category and a MetFrag table column; the user can define which parts
of the entire PubChem [[https://ftp.ncbi.nlm.nih.gov/pubchem/.toc_fp/index-current.tab][Table of Contents]] will end up in the
PubChemLite database; for the moment, the order of columns matters:
first *BIT*, then *CATEGORY* and finally *METFRAGCOL*;
** Manifest file in more detail
#+BEGIN_SRC yaml
# The file containing index bits of pubchem, the pubchem category
# names and the corresponding metfrag column names.
map: PubChemLite_exposomics.map
# The top-level build directory.
topdir: tiers
# Where to store the entire build (for backup and forensics).
outdir: backup
# Directory, or directories where to store output MetFrag files.
mf_dirs:
- mf_dir1
- mf_dir2
#+END_SRC
- map :: the name of the bit/category/metfrag map file;
- topdir :: the path to the top-level build directory;
- outdir :: the path to the backup directory; if the path is not
absolute it is relative to the ~topdir~;
- mf_dirs :: the list of paths where to deposit the resulting MetFrag
CSVs; if a given path is not absolute, it will be relative to the
~topdir~;
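For illustration, the relative entries of the example manifest above
resolve against ~topdir~ as sketched below; this is not part of the
build scripts, just the stated path rule spelled out.
#+BEGIN_SRC sh
# Sketch of the path rule only -- the driver implements this itself.
topdir=tiers
outdir=backup                            # relative entry from the manifest
case "$outdir" in
    /*) resolved="$outdir" ;;            # absolute paths are used as given
    *)  resolved="$topdir/$outdir" ;;    # relative paths land under topdir
esac
echo "$resolved"                         # -> tiers/backup
#+END_SRC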
** Example of the map file
#+BEGIN_SRC csv
BIT,CATEGORY,METFRAGCOL
192,Agrochemical Information,AgroChemInfo
426,Biomolecular Interactions and Pathways,BioPathway
82,Drug and Medication Information,DrugMedicInfo
204,Food Additives and Ingredients,FoodRelated
344,Pharmacology and Biochemistry,PharmacoInfo
356,Safety and Hazards,SafetyInfo
396,Toxicity,ToxicityInfo
350,Use and Manufacturing,KnownUse
137,Associated Disorders and Diseases,DisorderDisease
171,Identification,Identification
#+END_SRC
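During the build the driver derives two intermediate files from this
map (via ~gen_filtfile~ and ~gen_legend~ in [[file:library.sh]]): a
/.filter/ file with the *BIT* and *CATEGORY* columns (space separated,
consumed by the ~filter_toc_info~ step) and a /.legend/ file with the
*CATEGORY* and *METFRAGCOL* columns (comma separated, consumed by
~rest_grab_props.pl~). For the first two rows of the example above they
would look roughly like this:
#+BEGIN_SRC
PubChemLite_exposomics.filter:
192 Agrochemical Information
426 Biomolecular Interactions and Pathways

PubChemLite_exposomics.legend:
Agrochemical Information,AgroChemInfo
Biomolecular Interactions and Pathways,BioPathway
#+END_SRC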
* The Build Process
The build process follows roughly this sequence:
1. if there was anything in the build folder, move it to a backup
location in the ~outdir~;
2. the script [[file:fetch_sources]] downloads and verifies the files
using MD5 hashes;
3. the script [[file:adapt_scripts]] adapts the Perl builder scripts
from PubChem to the platform they need to be executed on;
4. with the sources present, the [[file:sanity_prebuild]] checks for
possible strange changes in sizes of files if it has anything from
a previous build to compare with. Logged in ~tiers~ under ~TOPDIR~.
5. Generate the MetFrag CSVs from sources using the PubChem PERL
scripts, for each tier.
6. Transfer the build to its backup location in ~OUTDIR~.
7. Make dated copies of the current MetFrag CSVs (if any) and store
them in ~archive~ subdirectories of ~MF_DIR~ and ~MF_DIR_01~.
8. Transfer the resulting CSVs to ~MF_DIR~ and ~MF_DIR_01~.
* Example filter_tierN.lsts
These files should go into the ~INPUTDIR~.
- ~filter_tier0.lst~
#+BEGIN_SRC
280 Agrochemical Information
120 Drug and Medication Information
298 Food Additives and Ingredients
460 Pharmacology and Biochemistry
477 Safety and Hazards
532 Toxicity
468 Use and Manufacturing
#+END_SRC
- ~filter_tier1.lst~
#+BEGIN_SRC
280 Agrochemical Information
576 Biomolecular Interactions and Pathways
120 Drug and Medication Information
298 Food Additives and Ingredients
460 Pharmacology and Biochemistry
477 Safety and Hazards
532 Toxicity
468 Use and Manufacturing
#+END_SRC
a previous build to compare with;
5. generate the MetFrag CSVs from sources using the PubChem Perl
scripts;
6. transfer the build to its backup location in ~outdir~;
7. transfer the MetFrag CSVs (if any) to the ~mf_dirs~ listed in the /manifest/.
The build log is located in ~outdir~, unless the build was interrupted
with an error, in which case it will be found in the directory from
which the ~pb-lite-driver~ script was started.
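A typical run of the driver (with a made-up input path) then looks
like this:
#+BEGIN_SRC sh
# INPUTDIR must contain manifest.yaml and the map file it names.
INPUTDIR=/path/to/pubchemlite-input      # example path only
bash pb-lite-driver.sh "$INPUTDIR"
# If the run aborts early, look for build.log in the directory the
# script was started from.
#+END_SRC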
......
@@ -7,9 +7,15 @@ PDIR=$(dirname "$BASH_SOURCE")
say "(adapt_scripts):" Adapting paths in Perl scripts.
for fn in ${SCRIPTDIR}/*.pl;do
bfn=$(basename "$fn")
sed -e "s,/usr/bin/cat,$(which cat),g" -e "s,/usr/bin/gunzip,$(which gunzip),g" "$fn" > "${WORKDIR}/${bfn}"
sed -e "s,/usr/bin/cat,$(which cat),g" \
-e "s,/usr/bin/gunzip,$(which gunzip),g" \
"$fn" > "${WORKDIR}/${bfn}"
[ "$?" -ne 0 ] && fatal "(adapt_scripts): Error."
chmod u+x "${WORKDIR}/${bfn}"
say "(adapt_scripts): Adapted ${bfn}"
done
exit 0
@@ -4,17 +4,15 @@
PDIR=$(dirname "$BASH_SOURCE")
. "$PDIR/library.sh"
# If other destination directory given.
ADDIR=$1
if [[ "x${ADDIR}" != "x" ]]; then
DDIR=$ADDIR
fi
say "(fetch_sources)" "DDIR is $DDIR"
fetch "$PDIR/sources.txt" "$DDIR"
say -n "(fetch_sources):"
say "HUUUH????"
say "(fetch_sources):"
cp -v "$DDIR/index-current.tab" "$DDIR/index.tsv" 1>&2
[ ! "$?" ] && fatal "(fetch_sources): Error. Unable to copy index-current.tab to index.tsv. Abort."
say -n "(fetch_sources):"
[ "$?" -ne 0 ] && fatal "(fetch_sources): Error. Unable to copy index-current.tab to index.tsv. Abort."
say "(fetch_sources):"
cp -v "$DDIR/cid-bits-current.tab.gz" "$DDIR/cid-bits.tab.gz" 1>&2
[ ! "$?" ] && fatal "(fetch_sources): Error. Unable to copy cid-bits-current.tab.gz to cid-bits.tab.gz. Abort."
[ "$?" -ne 0 ] && fatal "(fetch_sources): Error. Unable to copy cid-bits-current.tab.gz to cid-bits.tab.gz. Abort."
exit 0
@@ -6,10 +6,6 @@ use warnings;
my $file = $ARGV[0];
# Commented this out: no defaults.
# if ( ! defined( $file ) || $file eq "" ) {
# $file = "filter_tier0.lst";
# }
print STDERR ":: Using file: \"$file\" ::\n";
my $use_counts = $ARGV[1];
......
@@ -5,12 +5,7 @@ PDIR=$(dirname "$BASH_SOURCE")
DDIR=$PWD
set -o pipefail # this will capture a problem that happens while piping
filterfile=$1
legendfile=$2
# say '>>>' TIER 0 : BEGIN '>>>'
# gen_tier cid-bits.tab.gz filter_tier0.lst 0 # tier 0
# say '<<<' TIER 0 : END '<<<'
say '>>>' TIER 1 : BEGIN '>>>'
gen_tier cid-bits.tab.gz filter_tier1.lst 1 # tier 1
say '<<<' TIER 1 : END '<<<'
gen_tier cid-bits.tab.gz "$filterfile" "$legendfile"
@@ -3,10 +3,16 @@ PDIR=$(dirname "$BASH_SOURCE")
function say {
echo $@ >&2
echo -e "LOG LADY>" $@ >&2
}
function header {
say "[* START *]" $@
}
function footer {
say "[* END *]" $@
}
function fetch {
@@ -16,13 +22,8 @@ function fetch {
DEST_DIR=$2
if [ "${status}" -ne 0 ]; then
say -e "Error. The file containing the list of files to be
downloaded does not exist. "
exit 1
fi
[ "${status}" -ne 0 ] && fatal "Error. The file containing the list of files to be
downloaded does not exist. "
cd "${DEST_DIR}"
say Downloading source files to "${DEST_DIR}"
@@ -31,9 +32,9 @@ function fetch {
bfn=$(basename "$fn")
mdfn="${bfn}.md5"
rm -f "${mdfn}"
say -e "(fetch): Downloading ${fn}.md5"
say "(fetch): Downloading ${fn}.md5"
! (curl -s -f -R -O "${fn}.md5") &&
say -e "(fetch): Warning: MD5 hash for file $fn not found."
say "(fetch): Warning: MD5 hash for file $fn not found."
newhash=""
if [ -e "$mdfn" ]; then
@@ -44,26 +45,24 @@ function fetch {
oldhash=$(md5sum "${bfn}"|awk '{print $1}')
[[ "$newhash" == "$oldhash" ]] && \
say -e "(fetch): Skipping unchanged file $fn ." && \
say "(fetch): Skipping unchanged file $fn ." && \
continue
else
say -e "(fetch): Downloading $fn"
say "(fetch): Downloading $fn"
! (curl -s -f -R -O "$fn") && \
say -e "(fetch): Error: File could not be downloaded from:" $fn ".Aborting" && \
exit -1
fatal "(fetch): Error: File could not be downloaded from:" $fn ".Aborting"
dwnhash=$(md5sum "${bfn}"|awk '{print $1}')
[[ "$newhash" != "$dwnhash" ]] && \
say -e "(fetch): Error: Download of $fn corrupted. Abort." && \
exit -2
fatal "(fetch): Error: Download of $fn corrupted. Abort."
fi
done < "$FN_SRC"
exit 0
# xargs -n1 curl -R -O < $FN_SRC
}
function fetch_wild {
url=$1
patt=$2
@@ -75,53 +74,109 @@ function fetch_wild {
function gen_tier_a {
cidbits=$1
filtertier=$2
tierno=$3
echo bits : "$cidbits" 1>&2
echo filter : "$filtertier" 1>&2
say 'o' stage filter_toc_info: START '(' $(date) ')'
( ${DCMPR} < "$cidbits" | "${SCRIPTS[filter_toc_info]}" "$filtertier" 1 | ${CMPR} > "./tier${tierno}_bits.tsv.gz" )
filterfile=$2
say CID bit file : "$cidbits"
say PubChemLite filter file : "$filterfile"
say "Script: ${SCRIPTS[filter_toc_info]}"
header BUILD filter_toc_info '(' $(date) ')'
( ${DCMPR} < "$cidbits" | "${SCRIPTS[filter_toc_info]}" "$filterfile" 1 | ${CMPR} > "./generated_bits.tsv.gz" )
status=$?
say 'o' stage filter_toc_info: END '(' $(date) ')'
[ ! "$status" ] && fatal "stage: filter_toc_info failed"
footer BUILD filter_toc_info '(' $(date) ')'
[ "$status" -ne 0 ] && fatal "BUILD: filter_toc_info failed"
return $status
}
function gen_tier_b {
tierno=$1
say 'o' stage pull_cid_content: START '(' $(date) ')'
( "${SCRIPTS[pull_cid_content]}" "./tier${tierno}_bits.tsv.gz" > "./tier${tierno}_data.out")
header BUILD pull_cid_content '(' $(date) ')'
( "${SCRIPTS[pull_cid_content]}" "./generated_bits.tsv.gz" > "./generated_data.out")
status=$?
say 'o' stage pull_cid_content: END '(' $(date) ')'
[ ! "$status" ] && fatal "stage: pull_cid_content failed"
footer BUILD pull_cid_content '(' $(date) ')'
[ "$status" -ne 0 ] && fatal "BUILD: pull_cid_content failed"
return $status
}
function gen_tier_c {
tierno=$1
say 'o' stage rest_grab_props: START '(' $(date) ')'
( "${SCRIPTS[rest_grab_props]}" < "./tier${tierno}_data.out" | ${CMPR} > "./tier${tierno}_data_complete.out.gz" )
header BUILD rest_grab_props '(' $(date) ')'
legendfile=$1
( "${SCRIPTS[rest_grab_props]}" "$legendfile" < "./generated_data.out" | ${CMPR} > "./generated_data_complete.out.gz" )
status=$?
say 'o' stage rest_grab_props: END '(' $(date) ')'
[ ! "$status" ] && fatal "stage: rest_grab_props failed"
footer BUILD rest_grab_props '(' $(date) ')'
[ "$status" -ne 0 ] && fatal "BUILD: rest_grab_props failed"
return $status
}
function gen_tier_d {
tierno=$1
say 'o' stage remove_unwanted_cases: START '(' $(date) ')'
( ${DCMPR} < "tier${tierno}_data_complete.out.gz" | "${SCRIPTS[remove_unwanted_cases]}" "tier${tierno}_" > "PubChemLite_tier${tierno}.csv" )
say 'o' stage remove_unwanted_cases: END '(' $(date) ')'
[ ! "$status" ] && fatal "stage: rest_unwanted_cases failed"
header BUILD remove_unwanted_cases '(' $(date) ')'
( ${DCMPR} < "generated_data_complete.out.gz" | "${SCRIPTS[remove_unwanted_cases]}" "generated_" > "$OUTMFFILE" )
status=$?
footer BUILD remove_unwanted_cases '(' $(date) ')'
[ "$status" -ne 0 ] && fatal "BUILD: rest_unwanted_cases failed"
return $status
}
function gen_tier {
set -o pipefail # this will capture a problem that
# happens while piping
cidbits=$1
filterfile=$2
legendfile=$3
gen_tier_a "$cidbits" "$filterfile" && \
gen_tier_b && \
gen_tier_c "$legendfile" && \
gen_tier_d
}
function gen_tier_test {
set -o pipefail # this will capture a problem that
# happens while piping
cidbits=$1
filtertier=$2
tierno=$3
gen_tier_a "$cidbits" "$filtertier" "$tierno" && gen_tier_b "$tierno" && gen_tier_c "$tierno" && gen_tier_d "$tierno"
filterfile=$2
legendfile=$3
# gen_tier_c "$legendfile" && \
# gen_tier_d
}
function stamp {
header STAMP
say "Current directory is $PWD"
say "Input files are located in $INPUTDIR"
say "The top-level directory used by PCL is $TOPDIR"
say "The build directory is $WORKDIR"
say "The filter file used is $FILTERFILE"
say "The MetFrag files are going to be written to the following dirs (MF_DIRS): ${MF_DIRS[@]}"
say "The output MetFrag file will be $OUTMFFILE in the MF_DIRS"
say "The output MetFrag file will also be saved as $OUTMFFILE the MF_DIRS"
say "The scripts have been located in $SCRIPTDIR"
say "The full build result is going to be stored in $OUTDIR"
say "Legend file is $LEGENDFILE"
footer STAMP
}
function gen_filtfile {
map=$1
loc=$2
fname="$loc/$3"
say "(gen_filtfile) Generating filter file: $fname"
awk 'BEGIN {FS=","}
NR==1 {next}
NR>1 {print($1,$2)}' \
"$map" > "$fname"
}
function gen_legend {
map=$1
loc=$2
fname="$loc/$3"
say "(gen_legend) Generating legend file: $fname"
awk 'BEGIN {FS=",";OFS=","}
NR==1 {next}
NR>1 {print($2,$3)}' \
"$map" > "$fname"
}
Agrochemical Information, AgroChemInfo
Biomolecular Interactions and Pathways, BioPathway
Drug and Medication Information, DrugMedicInfo
Food Additives and Ingredients , FoodRelated
Pharmacology and Biochemistry,PharmacoInfo
Safety and Hazards,SafetyInfo
Toxicity,ToxicityInfo
Use and Manufacturing,KnownUse
Associated Disorders and Diseases,
Identification,Identification
use strict;
use warnings;
my $mapfile = $ARGV[0];
print STDERR ":: Using mappings file: \"$mapfile\" ::\n";
# The list of mappings.
my %map = ();
my @hdrl = ();
open(FMAP,$mapfile) or die "Cannot open mappings file: $mapfile\n";
while ($_ = <FMAP>) {
chomp;
my ($cty,$mfcol) = split(/\s*,\s*/,$_,2);
push @hdrl,$mfcol;
}
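# Tab-separated header line of MetFrag column names, terminated by a newline.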
my $hdr = join("",join("\t",@hdrl),"\n");
close(FMAP);
# Define the following environment variables:
# INPUTDIR : Where are the input files located
# OUTDIR : Where to store the build after its completion
# MF_DIRS: Where to copy the MetFrag results
# TOPDIR: the "build" directory
# TIER_IDC: indices of tiers
# LOGFILE: Optional. If non-empty, write stderr to it.
# INITIALISATION
# Logging.
LOGFILE=$PWD/build.log
[ -n "$LOGFILE" ] && exec 2> "$LOGFILE" && echo "********** PubChem Lite LOG START $(date) **********" 1>&2
# Kill top-level process using fatal from a subshell.
trap "exit 1" SIGUSR1
TOPPID="$$"
fatal(){
echo "(FATAL!)" "$@" >&2
kill SIGUSR1 "$TOPPID"
kill -s SIGUSR1 "$TOPPID"
}
[ -n "${LOGFILE}" ] && exec 2> "$LOGFILE" && echo "********** PubChem Lite LOG START $(date) **********" 1>&2
# The subdir of the local clone of the pubchem repo where the scripts
# are.
# Scripts location.
SCRIPTDIR=$(dirname $(readlink -f "$BASH_SOURCE"))
source "${SCRIPTDIR}/library.sh"
[ ! -d "${INPUTDIR}" ] && fatal "(pb-lite-driver): Error. INPUTDIR does not exist: ${INPUTDIR}"
[ -z ${OUTDIR+x} ] && fatal "(pb-lite-driver): Error. OUTDIR not defined."
[ -z ${MF_DIRS+x} ] && fatal "(pb-lite-driver): Error. MF_DIRS not defined."
for mfd in "${MF_DIRS[@]}";do
[ ! -d "${mfd}" ] && \
fatal "(pb-lite-driver): Error. MetFrag CSV dir ${mfd} does not exist."
done
[ ! -d "${TOPDIR}" ] && fatal "(pb-lite-driver): Error. TOPDIR does not exist: ${TOPDIR}"
[ -z "${TIER_IDC+x}" ] && fatal "(pb-lite-driver): Error. TIER_IDC does not exist: ${TIER_IDC[@]}."
# INPUTS
# Path to input directory.
INPUTDIR="$1"
# Manifest file.
MANIF="$INPUTDIR/manifest.yaml"
# INITIALISE ENVIRONMENT FROM MANIFEST
# Read in and declare variables from the manifest.
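# Expected read_manifest.pl output: one "<type> <key> <value>" line per entry ("s" = scalar, "a" = array).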
while read -r type key val; do
[[ "$type" == "s" ]] && declare -x $key="$val" || \
[[ "$type" == "a" ]] && declare -x -a $key="$val"
done < <(perl "${SCRIPTDIR}/read_manifest.pl" "$MANIF")
# The map file (bits, categories, metfrag columns).
MAPFILE=$(readlink -f "$INPUTDIR")/$MAPFILE
# Filter filename.
FILTERFILE=$(basename "${MAPFILE%.map}.filter")
# Legend filename.
LEGENDFILE=$(basename "${MAPFILE%.map}.legend")
# Local build dir.
export WORKDIR=$TOPDIR/tiers
# MetFrag filename.
OUTMFFILE=$(basename "${MAPFILE%.map}_$(date +%Y%m%d).csv")
OUTMFCLONE=$(basename "${MAPFILE%.map}.csv")
# Does manifest exist?
[ ! -e "$MANIF" ] && fatal "(pb-lite-driver.sh): Manifest file $MANIF not found, or unreadable."
# What about the filter file?
[ ! -e "$MAPFILE" ] && fatal "(pb-lite-driver): Error, the map file $MAPFILE does not exist."
# Top level project directory.
TOPDIR=$(readlink -f "$TOPDIR")
mkdir -p "$TOPDIR"
[ ! -d "$TOPDIR" ] && fatal "(pb-lite-driver): Error. TOPDIR does not exist: $TOPDIR"
# Where to build.
WORKDIR="$TOPDIR/build"
mkdir -p "$WORKDIR"
bwd=$(basename "$WORKDIR")
OUTDIR="$OUTDIR/$bwd"
source "${SCRIPTDIR}/library.sh"
# GENERATE BIT AND LEGEND INPUTS
gen_filtfile "$MAPFILE" "$WORKDIR" "$FILTERFILE"
gen_legend "$MAPFILE" "$WORKDIR" "$LEGENDFILE"
FILTERFILE="$WORKDIR"/$(basename "$FILTERFILE")
LEGENDFILE="$WORKDIR"/$(basename "$LEGENDFILE")
# PREPARATION
cd "$TOPDIR"
# Write basic build info to the log file.
stamp
# Clean up.
rm -vf "$WORKDIR"/*.{log,out,rpt,csv}
@@ -46,62 +92,44 @@ rm -vf "$WORKDIR"/*.{log,out,rpt,csv}
[ -d "$OUTDIR" ] && rm -Rf "${OUTDIR}.bak"
[ -d "$OUTDIR" ] && mv "${OUTDIR}" "${OUTDIR}.bak"
# Download sources for the build.
(cd "$WORKDIR"; source "${SCRIPTDIR}/fetch_sources")
! [ $? ] && fatal "(pb-lite-driver.sh): Errors during fetch_sources."
# Adapt paths.
(cd "$WORKDIR"; source "${SCRIPTDIR}/adapt_scripts")
! [ $? ] && fatal "(pb-lite.sh): Errors during adapt_scripts."
# Is there anything obviously crazy with the sources?
(cd "$WORKDIR"; source "${SCRIPTDIR}/sanity_prebuild" "${OUTDIR}")
! [ $? ] && fatal "(pb-lite-driver.sh): Errors during sanity_prebuild."
# Build tiers.
( cd "$WORKDIR"
for num in ${TIER_IDC[@]};do
fn="$INPUTDIR/filter_tier${num}.lst"
[ ! -e "$fn" ] && fatal "Error! Input filter file $fn not found. Aborting."
cp "$fn" .
done