diff --git a/.gitignore b/.gitignore index cd6313679809314eec7e9efddf0382ad0c50fee9..d510d7780bd726ded74c347290dbdd4a5426273b 100644 --- a/.gitignore +++ b/.gitignore @@ -11,3 +11,4 @@ src/utils/version/version_git.h .project .cproject nbproject +sandbox diff --git a/docs/content/images/fisher/res-genes.png b/docs/content/images/fisher/res-genes.png new file mode 100644 index 0000000000000000000000000000000000000000..db3ba07893ccfc3c0dfb4092a63851373237836e Binary files /dev/null and b/docs/content/images/fisher/res-genes.png differ diff --git a/docs/content/tools/fisher.rst b/docs/content/tools/fisher.rst index aa771c07821cad0144acaa06628d4a7211d1843f..bd2f57f7b21a233153e9dab691e4d9f42b8c2373 100644 --- a/docs/content/tools/fisher.rst +++ b/docs/content/tools/fisher.rst @@ -4,7 +4,8 @@ *fisher* ######## -Perform fisher's exact test on the non/overlap between 2 files. +Perform fisher's exact test on the number of overlaps/unique intervals between +2 files. | @@ -14,7 +15,9 @@ spatially, we resort to shuffling the genome and checking the simulated scenarios using `Fisher's Exact Test`_ . - +This implementation can calculate the number of overlaps and the number +of intervals unique to each file and it infers (or accepts) the number +that are not present in each file. Given a pair of input files `-a` and `-b` in the usual BedTools parlance: @@ -23,9 +26,11 @@ Given a pair of input files `-a` and `-b` in the usual BedTools parlance: $ cat a.bed chr1 10 20 chr1 30 40 + chr1 51 52 $ cat b.bed chr1 15 25 + chr1 51 52 And a genome of 500 bases: @@ -40,44 +45,95 @@ can do this with ``fisher`` as: .. code-block:: bash $ bedtools fisher -a a.bed -b b.bed -g t.genome - # Contingency Table + # Number of query intervals: 3 + # Number of db intervals: 2 + # Number of overlaps: 2 + # Number of possible intervals (estimated): 37 + # phyper(2 - 1, 3, 37 - 3, 2, lower.tail=F) + # Contingency Table Of Counts #_________________________________________ - # | not in -b | in -b | - # not in -a | 475 | 5 | - # in -a | 15 | 5 | + # | in -b | not in -b | + # in -a | 2 | 1 | + # not in -a | 0 | 34 | #_________________________________________ # p-values for fisher's exact test left right two-tail ratio - 1 1.3466e-05 1.3466e-05 31.667 + 1 0.0045045 0.0045045 inf Where we can see the constructed contingency table and the pvalues for left, right and two-tail tests. -From here, we can say that given **500 bases** of genome, it is unlikely that a region of -20 bases total from `-a` and 10 bases total from `-b` would share 5 bases if the regions -were randomly distributed. *Consult your statistician for a more precise definition*. +From here, we can say that given **500 bases** of genome, it is unlikely that we +would see **as many** overlaps as we do if the intervals from `a` and `b` were not +related. + +.. note:: -The above was highly significant, but if our genome were only **50 bases**: + the total number of **possible** intervals in the above example was + estimated to be 37. This is based on a heuristic that uses the mean sizes of + intervals in the `a` and `b` sets and the size of the genome. The reported + p-value will depend greatly on this. Below, we show how well the reported + value matches with simulations. + +The above had a fairly low p-value (0.0045), but if our genome were only **60 bases**: .. code-block:: bash - $ echo -e "chr1\t50" > t.genome + $ echo -e "chr1\t60" > t.genome $ bedtools fisher -a a.bed -b b.bed -g t.genome - # Contingency Table + # Number of query intervals: 3 + # Number of db intervals: 2 + # Number of overlaps: 2 + # Number of possible intervals (estimated): 4 + # phyper(2 - 1, 3, 4 - 3, 2, lower.tail=F) + # Contingency Table Of Counts #_________________________________________ - # | not in -b | in -b | - # not in -a | 25 | 15 | - # in -a | 5 | 5 | + # | in -b | not in -b | + # in -a | 2 | 1 | + # not in -a | 0 | 1 | #_________________________________________ # p-values for fisher's exact test left right two-tail ratio - 0.86011 0.35497 0.49401 1.667 + 1 0.5 1 inf We can see that neither tail is significant. Intuitively, this makes sense; -if we randomly place 20 (from `-a`), and 10 (from `-b`) bases of intervals -within 50 bases, it doesn't seem unlikely that we'd see 5 bases of overlap. +if we randomly place 3 intervals (from `-a`), and 2 (from `-b`) intervals from +within 60 bases, it doesn't seem unlikely that we'd see 2 overlaps. + +Note also that since the genome size is much smaller, the number of possible +intervals is also decreased. + +========== +Evaluation +========== + +The p-value depends on knowing or inferring the total number of possible +intervals (to fill in the lower right corner of the contingency table). This +inference is not straightforward since we will most likely have variable sized +intervals within and between files. Below, we show the correspondence of +the p-value reported by `fisher` and one from simulated data. + +.. figure:: ../images/fisher/res-genes.png + :alt: plot of p-values of fisher vs. simulated + :figclass: align-center + + The comparison of the p-value from 'fisher' to one derived by simulation + (see tests/fisher/ for details). The top plot shows the p-value distribution. + Since we are most interested in extreme p-values, the middle plot shows + -log10(p-value). The bottom plot is the same as the middle except looking at + the other tail of the p-value distribution. + + Note that we do see inflation from the fisher test, but we do not see + 'false-negatives'--that is, bedtools fisher is less likely to miss 'true' + candidates, but it will give many candidates for further exploration. As + such we recommend validating low p-values from fisher using simulation. + + This evaluation used all known canonical genes on chromosome 1 and repeatedly + (1000 times) randomly generated 3000 intervals between 20 and 5250 bases. It + then calculated the p-value for each set using fisher and then using shuffled + data. .. note:: diff --git a/src/utils/Contexts/ContextFisher.cpp b/src/utils/Contexts/ContextFisher.cpp index 90f07282602dc1903765707cab819c025610866a..2e9387dddcab9d436e70741afa11d5b8f6e91a14 100644 --- a/src/utils/Contexts/ContextFisher.cpp +++ b/src/utils/Contexts/ContextFisher.cpp @@ -7,8 +7,7 @@ ContextFisher::ContextFisher() { setSortedInput(true); - setUseMergedIntervals(true); - + setUseMergedIntervals(false); } ContextFisher::~ContextFisher() { @@ -38,9 +37,16 @@ bool ContextFisher::parseCmdArgs(int argc, char **argv, int skipFirstArgs) else if (strcmp(_argv[_i], "-S") == 0) { if (!handle_S()) return false; } + else if (strcmp(_argv[_i], "-exclude") == 0) { + if (!handle_exclude()) return false; + } if (strcmp(_argv[_i], "-g") == 0) { if (!handle_g()) return false; } + if(strcmp(_argv[_i], "-m") == 0) { + markUsed(_i - _skipFirstArgs); + setUseMergedIntervals(true); + } } return ContextIntersect::parseCmdArgs(argc, argv, _skipFirstArgs); } @@ -112,4 +118,19 @@ bool ContextFisher::handle_S() { return false; } +bool ContextFisher::handle_exclude() +{ + if (_argc <= _i+1) { + _errorMsg = "\n***** ERROR: -exclude option given, but no file specified. *****"; + return false; + } + + do { + markUsed(_i - _skipFirstArgs); + _i++; + markUsed(_i - _skipFirstArgs); + } while (_argc > _i+1 && _argv[_i+1][0] != '-'); + setExcludeFile(string(_argv[_i])); + return true; +} diff --git a/src/utils/Contexts/ContextFisher.h b/src/utils/Contexts/ContextFisher.h index 49e7dcb2e49a0384d348d74cd0c739e9f6a7fa00..fd1d8ada4b414303a9ab225f72dbf2f81fd88273 100644 --- a/src/utils/Contexts/ContextFisher.h +++ b/src/utils/Contexts/ContextFisher.h @@ -17,10 +17,17 @@ public: ~ContextFisher(); virtual bool parseCmdArgs(int argc, char **argv, int skipFirstArgs); virtual bool isValidState(); + string getExcludeFile() { return _excludeFile; } + void setExcludeFile(string excludeFile) { _excludeFile = excludeFile; } + private: bool handle_s(); bool handle_S(); + bool handle_exclude(); + string _excludeFile; + + }; #endif /* CONTEXTFISHER_H_ */ diff --git a/src/utils/bedFile/bedFile.cpp b/src/utils/bedFile/bedFile.cpp index 7c1ae8dc1184fea620542ba4046cf1b2521fcb09..ed8d229d69accabb8fb9c3240ba71a97c1c736dc 100644 --- a/src/utils/bedFile/bedFile.cpp +++ b/src/utils/bedFile/bedFile.cpp @@ -657,12 +657,24 @@ void BedFile::setBed12 (bool isBed12) { this->isBed12 = isBed12; } void BedFile::loadBedFileIntoMap() { BED bedEntry; - + Open(); while (GetNextBed(bedEntry)) { if (_status == BED_VALID) { - BIN bin = getBin(bedEntry.start, bedEntry.end); - bedMap[bedEntry.chrom][bin].push_back(bedEntry); + addBEDIntoMap(bedEntry); + } + } + Close(); +} + +void BedFile::loadBedFileIntoMergedMap() { + + BED bedEntry; + + Open(); + while (GetNextMergedBed(bedEntry)) { + if (_status == BED_VALID) { + addBEDIntoMap(bedEntry); } } Close(); diff --git a/src/utils/bedFile/bedFile.h b/src/utils/bedFile/bedFile.h index 440986c0007fba7607d2106e58e7643f0955dccb..a738c21ecc686c7cc80accf773b526d41f448dfe 100644 --- a/src/utils/bedFile/bedFile.h +++ b/src/utils/bedFile/bedFile.h @@ -418,6 +418,7 @@ public: // load a BED file into a map keyed by chrom, then bin. value is // vector of BEDs void loadBedFileIntoMap(); + void loadBedFileIntoMergedMap(); // load a BED entry into and existing map void addBEDIntoMap(BED bedEntry); diff --git a/test/fisher/README.md b/test/fisher/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d0179ca956b1b6b5ca21a7191a96da78d8e0500d --- /dev/null +++ b/test/fisher/README.md @@ -0,0 +1,20 @@ +Fisher Testing +============== + +Fisher is now based on the count of interval overlaps, subject to `-f`. +We can compare the output of fisher on simulated data by running `python sim.py` +which will show the output from `bedtools fisher` and then running `bash shuf.sh` +which will repeatedly run + +```Shell + +bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l +``` + +and then report the proportion of times that number is >= the observed intersection. + +In `sim.py` changing the lenght of the intervals (via `maxA`, `maxB`) has the greatest effect on the +correspondence of the simulated p-value from `shuf.sh` and the one from `fisher`. The right-tailed p-value +from `fisher` should correspond well to the value from the simulation. + + diff --git a/test/fisher/a.bed b/test/fisher/a.bed new file mode 100644 index 0000000000000000000000000000000000000000..4c4b6b32e884118e66ff19d3ac24c80668a98e29 --- /dev/null +++ b/test/fisher/a.bed @@ -0,0 +1,3 @@ +chr1 10 20 +chr1 30 40 +chr1 51 52 diff --git a/test/fisher/b.bed b/test/fisher/b.bed new file mode 100644 index 0000000000000000000000000000000000000000..995e020ca9d3782da31754f901eda32a003d744d --- /dev/null +++ b/test/fisher/b.bed @@ -0,0 +1,2 @@ +chr1 15 25 +chr1 51 52 diff --git a/test/fisher/cmp.sh b/test/fisher/cmp.sh new file mode 100644 index 0000000000000000000000000000000000000000..5683e42748199b04a6dd43eeb4ad0d35cd69cf76 --- /dev/null +++ b/test/fisher/cmp.sh @@ -0,0 +1,9 @@ +set -eo pipefail + +echo "fisher,shuffled" + +for i in $(seq 1000); do + fisher=$(python ./sim.py | tail -1 | cut -f 2) + shuffle=$(bash shuf.sh) + echo "$fisher,$shuffle" +done diff --git a/test/fisher/plot.py b/test/fisher/plot.py new file mode 100644 index 0000000000000000000000000000000000000000..6d87a0289f41b18c89a1fb7a899544124d72e5cb --- /dev/null +++ b/test/fisher/plot.py @@ -0,0 +1,46 @@ +import sys + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +import seaborn as sns + +sns.set(style="darkgrid") + +fig, axs = plt.subplots(3, figsize=(4, 12)) +df = pd.read_csv(sys.argv[1]) + +axs[0].scatter(df.fisher, df.shuffled, s=4) +axs[0].set_xlim(0, 1) +axs[0].set_ylim(0, 1) +axs[0].set_xlabel('fisher p-value') +axs[0].set_ylabel('shuffled p-value') +axs[0].plot([0, 1], [0, 1], ls='--') + +x = -np.log10(df.fisher) +y = -np.log10(df.shuffled) + +m = int(max(x.max(), y.max())) + 1 +axs[1].scatter(x, y, s=4) +axs[1].set_xlim(0, m) +axs[1].set_ylim(0, m) +axs[1].set_xlabel('-log10(fisher p-value)') +axs[1].set_ylabel('-log10(shuffled p-value)') +axs[1].plot([0, m], [0, m], ls='--') + + +x = -np.log10(1 - np.minimum(1-1e-6, df.fisher)) +y = -np.log10(1 - np.minimum(1-1e-6, df.shuffled)) + +m = int(max(x.max(), y.max())) + 1 +axs[2].scatter(x, y, s=4) +axs[2].set_xlim(0, m) +axs[2].set_ylim(0, m) +axs[2].set_xlabel('-log10(1 - fisher p-value)') +axs[2].set_ylabel('-log10(1 - shuffled p-value)') +axs[2].plot([0, m], [0, m], ls='--') + +plt.tight_layout() +plt.savefig(sys.argv[1].replace('.txt', '') + '.png') + +fig.show() diff --git a/test/fisher/shuf.sh b/test/fisher/shuf.sh new file mode 100644 index 0000000000000000000000000000000000000000..3c3bd102797a57d8a0c439b4e23abf4dedae3ffe --- /dev/null +++ b/test/fisher/shuf.sh @@ -0,0 +1,14 @@ +set -eo pipefail + +obs=$(bedtools intersect -wo -a taa.bed -b tbb.bed | wc -l) + +p=$(seq 100 | xargs -P 7 -n 1 bash -c "bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l " | awk -vobs=$obs '{s += ($1 > obs)}END{print (1 + s) / (1 + NR)}') + +if [ '1' -eq $(echo $p'< 0.1' | bc -l) ] || [ '1' -eq $(echo $p'> 0.9' | bc -l) ]; then + p=$(seq 1000 | xargs -P 7 -n 1 bash -c "bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l " | awk -vobs=$obs '{s += ($1 > obs)}END{print (1 + s) / (1 + NR)}') +fi +if [ '1' -eq $(echo $p'< 0.01' | bc -l) ] || [ '1' -eq $(echo $p'> 0.99' | bc -l) ]; then + p=$(seq 5000 | xargs -P 7 -n 1 bash -c "bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l " | awk -vobs=$obs '{s += ($1 > obs)}END{print (1 + s) / (1 + NR)}') +fi + +echo $p diff --git a/test/fisher/sim.py b/test/fisher/sim.py new file mode 100644 index 0000000000000000000000000000000000000000..6d26d3a74f5907ef173259185beb92870ce27795 --- /dev/null +++ b/test/fisher/sim.py @@ -0,0 +1,31 @@ +import sys +from random import randint +from subprocess import check_output + +nA = 3000 +minA, maxA = (20, 5250) +#minA, maxA = (200, 250) + +bIntervals = [(x[0], int(x[1]), int(x[2])) for x in (l.split("\t") for l in + open('hg19.knownCanonical.bed')) if x[0] == "chr1" ] +bIntervals.sort() +genome_size = max(b[2] for b in bIntervals) + 50000 + +with open('tbb.bed', 'w') as fh: + for chrom, start, end in bIntervals: + fh.write("\t".join((chrom, str(start), str(end))) + "\n") + +with open('taa.bed', 'w') as fh: + vals = [] + for i in range(nA): + s = randint(0, genome_size - maxA) + e = randint(s + minA, min(genome_size, s + maxA)) + vals.append((s, e)) + for s, e in sorted(vals): + fh.write("chr1\t%i\t%i\n" % (s, e)) + fh.flush() + +print >> open('tgg.genome', 'w'), ("chr1\t%i" % genome_size) + +# NOTE: add -m here to make merged output +print check_output("../../bin/bedtools fisher -a taa.bed -b tbb.bed -g tgg.genome", shell=True).strip() diff --git a/test/fisher/t.500.genome b/test/fisher/t.500.genome new file mode 100644 index 0000000000000000000000000000000000000000..1b42eb3673964b293ab0c76758459dbfe7e29903 --- /dev/null +++ b/test/fisher/t.500.genome @@ -0,0 +1 @@ +chr1 500 diff --git a/test/fisher/t.60.genome b/test/fisher/t.60.genome new file mode 100644 index 0000000000000000000000000000000000000000..94a6fa1235c8ff85491e72413e0f07a3bf8ed36a --- /dev/null +++ b/test/fisher/t.60.genome @@ -0,0 +1 @@ +chr1 60 diff --git a/test/fisher/test-fisher.sh b/test/fisher/test-fisher.sh new file mode 100644 index 0000000000000000000000000000000000000000..72bd10ebb1c1f4714be6b865f21532bb53ebff86 --- /dev/null +++ b/test/fisher/test-fisher.sh @@ -0,0 +1,53 @@ +BT=${BT-../../bin/bedtools} + +check() +{ + if diff $1 $2; then + echo ok + return 1 + else + echo fail + return 0 + fi +} + +echo " fisher.t1...\c" +echo \ +"# Number of query intervals: 3 +# Number of db intervals: 2 +# Number of overlaps: 2 +# Number of possible intervals (estimated): 34 +# phyper(2 - 1, 3, 34 - 3, 2, lower.tail=F) +# Contingency Table Of Counts +#_________________________________________ +# | in -b | not in -b | +# in -a | 2 | 1 | +# not in -a | 0 | 31 | +#_________________________________________ +# p-values for fisher's exact test +left right two-tail ratio +1 0.0053476 0.0053476 inf" > exp +$BT fisher -a a.bed -b b.bed -g t.500.genome > obs +check obs exp +rm obs exp + + +echo " fisher.t2...\c" +echo \ +"# Number of query intervals: 3 +# Number of db intervals: 2 +# Number of overlaps: 2 +# Number of possible intervals (estimated): 4 +# phyper(2 - 1, 3, 4 - 3, 2, lower.tail=F) +# Contingency Table Of Counts +#_________________________________________ +# | in -b | not in -b | +# in -a | 2 | 1 | +# not in -a | 0 | 1 | +#_________________________________________ +# p-values for fisher's exact test +left right two-tail ratio +1 0.5 1 inf" > exp +$BT fisher -a a.bed -b b.bed -g t.60.genome > obs +check obs exp +rm obs exp diff --git a/test/test.sh b/test/test.sh index 76ec3f1eaab175eaa543e7af68a388797676d164..c583df2034447faebb5afbda043821db5391cf73 100644 --- a/test/test.sh +++ b/test/test.sh @@ -22,6 +22,9 @@ cd expand; bash test-expand.sh; cd .. echo " Testing bedtools flank:" cd flank; bash test-flank.sh; cd .. +echo " Testing bedtools fisher:" +cd fisher; bash test-fisher.sh; cd .. + echo " Testing bedtools genomecov:" cd genomecov; bash test-genomecov.sh; cd ..