Commit cf911bfa authored by arq5x's avatar arq5x

merge conflicts. integrate @brentp's nice improvments to fisher.

parents fc7d185b ed4e8a9a
......@@ -4,7 +4,8 @@
*fisher*
########
Perform fisher's exact test on the non/overlap between 2 files.
Perform fisher's exact test on the number of overlaps/unique intervals between
2 files.
|
......@@ -14,7 +15,9 @@ spatially, we resort to shuffling the genome and checking the simulated
scenarios using
`Fisher's Exact Test`_ .
This implementation can calculate the number of overlaps and the number
of intervals unique to each file and it infers (or accepts) the number
that are not present in each file.
Given a pair of input files `-a` and `-b` in the usual BedTools parlance:
......@@ -23,9 +26,11 @@ Given a pair of input files `-a` and `-b` in the usual BedTools parlance:
$ cat a.bed
chr1 10 20
chr1 30 40
chr1 51 52
$ cat b.bed
chr1 15 25
chr1 51 52
And a genome of 500 bases:
......@@ -40,44 +45,95 @@ can do this with ``fisher`` as:
.. code-block:: bash
$ bedtools fisher -a a.bed -b b.bed -g t.genome
# Contingency Table
# Number of query intervals: 3
# Number of db intervals: 2
# Number of overlaps: 2
# Number of possible intervals (estimated): 37
# phyper(2 - 1, 3, 37 - 3, 2, lower.tail=F)
# Contingency Table Of Counts
#_________________________________________
# | not in -b | in -b |
# not in -a | 475 | 5 |
# in -a | 15 | 5 |
# | in -b | not in -b |
# in -a | 2 | 1 |
# not in -a | 0 | 34 |
#_________________________________________
# p-values for fisher's exact test
left right two-tail ratio
1 1.3466e-05 1.3466e-05 31.667
1 0.0045045 0.0045045 inf
Where we can see the constructed contingency table and the pvalues for left, right
and two-tail tests.
From here, we can say that given **500 bases** of genome, it is unlikely that a region of
20 bases total from `-a` and 10 bases total from `-b` would share 5 bases if the regions
were randomly distributed. *Consult your statistician for a more precise definition*.
From here, we can say that given **500 bases** of genome, it is unlikely that we
would see **as many** overlaps as we do if the intervals from `a` and `b` were not
related.
.. note::
The above was highly significant, but if our genome were only **50 bases**:
the total number of **possible** intervals in the above example was
estimated to be 37. This is based on a heuristic that uses the mean sizes of
intervals in the `a` and `b` sets and the size of the genome. The reported
p-value will depend greatly on this. Below, we show how well the reported
value matches with simulations.
The above had a fairly low p-value (0.0045), but if our genome were only **60 bases**:
.. code-block:: bash
$ echo -e "chr1\t50" > t.genome
$ echo -e "chr1\t60" > t.genome
$ bedtools fisher -a a.bed -b b.bed -g t.genome
# Contingency Table
# Number of query intervals: 3
# Number of db intervals: 2
# Number of overlaps: 2
# Number of possible intervals (estimated): 4
# phyper(2 - 1, 3, 4 - 3, 2, lower.tail=F)
# Contingency Table Of Counts
#_________________________________________
# | not in -b | in -b |
# not in -a | 25 | 15 |
# in -a | 5 | 5 |
# | in -b | not in -b |
# in -a | 2 | 1 |
# not in -a | 0 | 1 |
#_________________________________________
# p-values for fisher's exact test
left right two-tail ratio
0.86011 0.35497 0.49401 1.667
1 0.5 1 inf
We can see that neither tail is significant. Intuitively, this makes sense;
if we randomly place 20 (from `-a`), and 10 (from `-b`) bases of intervals
within 50 bases, it doesn't seem unlikely that we'd see 5 bases of overlap.
if we randomly place 3 intervals (from `-a`), and 2 (from `-b`) intervals from
within 60 bases, it doesn't seem unlikely that we'd see 2 overlaps.
Note also that since the genome size is much smaller, the number of possible
intervals is also decreased.
==========
Evaluation
==========
The p-value depends on knowing or inferring the total number of possible
intervals (to fill in the lower right corner of the contingency table). This
inference is not straightforward since we will most likely have variable sized
intervals within and between files. Below, we show the correspondence of
the p-value reported by `fisher` and one from simulated data.
.. figure:: ../images/fisher/res-genes.png
:alt: plot of p-values of fisher vs. simulated
:figclass: align-center
The comparison of the p-value from 'fisher' to one derived by simulation
(see tests/fisher/ for details). The top plot shows the p-value distribution.
Since we are most interested in extreme p-values, the middle plot shows
-log10(p-value). The bottom plot is the same as the middle except looking at
the other tail of the p-value distribution.
Note that we do see inflation from the fisher test, but we do not see
'false-negatives'--that is, bedtools fisher is less likely to miss 'true'
candidates, but it will give many candidates for further exploration. As
such we recommend validating low p-values from fisher using simulation.
This evaluation used all known canonical genes on chromosome 1 and repeatedly
(1000 times) randomly generated 3000 intervals between 20 and 5250 bases. It
then calculated the p-value for each set using fisher and then using shuffled
data.
.. note::
......
......@@ -9,9 +9,21 @@ Fisher::Fisher(ContextFisher *context)
_unionVal(0),
_numIntersections(0),
_queryLen(0),
_dbLen(0)
_dbLen(0),
_queryCounts(0),
_dbCounts(0),
_overlapCounts(0)
{
_blockMgr = new BlockMgr(_context->getOverlapFraction(), _context->getReciprocal());
_haveExclude = false;
if(!(_context->getExcludeFile().empty())){
string ex = _context->getExcludeFile();
exclude = new BedFile(ex);
exclude->loadBedFileIntoMergedMap();
_haveExclude = true;
}
}
Fisher::~Fisher(void) {
......@@ -26,30 +38,42 @@ bool Fisher::calculate() {
}
// header
cout << "# Contingency Table" << endl;
// for fisher's exact test, we need the contingency table
// XXXXXXXX | not in A | in A
// not in B | n11: in neither | n12: only in A
// in B | n21: only in B | n22: in A & B
//
double left, right, two;
long long genomeSize = _context->getGenomeFile()->getGenomeSize();
// bases covered by neither a nor b
long long n11 = genomeSize - _queryLen - _dbLen + _intersectionVal;
// bases covered only by -b
long long n12 = _dbLen - _intersectionVal;
// bases covered only by -a
long long n21 = _queryLen - _intersectionVal;
// bases covered by both
long long n22 = _intersectionVal;
printf("#__________________________________________\n");
printf("# | %-12s | %-12s |\n", "not in -b", "in -b");
printf("# not in -a | %-12lld | %-12lld |\n", n11, n12);
printf("# in -a | %-12lld | %-12lld |\n", n21, n22);
printf("#__________________________________________\n");
if(_haveExclude){
genomeSize -= exclude->getTotalFlattenedLength();
}
// bases covered by neither
long long n22_full_bases = genomeSize;
//long long n22_bases = genomeSize - _queryLen - _dbLen + _intersectionVal;
long double dMean = 1.0 + _dbLen / (long double)_dbCounts;
long double qMean = 1.0 + _queryLen / (long double)_queryCounts;
// heursitic, but seems to work quite well -- better than doing more intuitive sum then divide.
long double bMean = (qMean + dMean);
//bMean = (_unionVal + 2.0 * _intersectionVal) / (long double)(_dbCounts + _queryCounts);
long long n11 = (long)_overlapCounts;
// this could be < 0 because multiple overlaps
long long n12 = (long)max(0L, (long)_queryCounts - (long)_overlapCounts);
long long n21 = max(0L, (long)(_dbCounts - _overlapCounts));
long long n22_full = max(n21 + n21 + n11, (long long)(n22_full_bases / bMean));
long long n22 = max(0L, (long)(n22_full - n12 - n21 - n11));
printf("# Number of query intervals: %lu\n", _queryCounts);
printf("# Number of db intervals: %lu\n", _dbCounts);
printf("# Number of overlaps: %lu\n", _overlapCounts);
printf("# Number of possible intervals (estimated): %lld\n", n22_full);
printf("# phyper(%lld - 1, %lu, %lld - %lu, %lu, lower.tail=F)\n", n11, _queryCounts, n22_full, _queryCounts, _dbCounts);
cout << "# Contingency Table Of Counts" << endl;
printf("#_________________________________________\n");
printf("# | %-12s | %-12s |\n", " in -b", "not in -b");
printf("# in -a | %-12lld | %-12lld |\n", n11, n12);
printf("# not in -a | %-12lld | %-12lld |\n", n21, n22);
printf("#_________________________________________\n");
kt_fisher_exact(n11, n12, n21, n22, &left, &right, &two);
double ratio = ((double)n11 / (double)n12) / ((double)n21 / (double)n22);
......@@ -57,7 +81,7 @@ bool Fisher::calculate() {
printf("# p-values for fisher's exact test\n");
printf("left\tright\ttwo-tail\tratio\n");
printf("%.5g\t%.5g\t%.5g\t%.3f\n", left, right, two, ratio);
return true;
}
......@@ -66,6 +90,7 @@ bool Fisher::getFisher() {
if (!sweep.init()) {
return false;
}
RecordKeyVector hitSet;
while (sweep.next(hitSet)) {
if (_context->getObeySplits()) {
......@@ -82,6 +107,9 @@ bool Fisher::getFisher() {
_queryLen = sweep.getQueryTotalRecordLength();
_dbLen = sweep.getDatabaseTotalRecordLength();
_queryCounts = sweep.getQueryTotalRecords();
_dbCounts = sweep.getDatabaseTotalRecords();
_unionVal = _queryLen + _dbLen;
return true;
}
......@@ -93,10 +121,14 @@ unsigned long Fisher::getTotalIntersection(RecordKeyVector &recList)
int keyStart = key->getStartPos();
int keyEnd = key->getEndPos();
_overlapCounts += recList.size();
_qsizes.push_back((keyEnd - keyStart));
int hitIdx = 0;
for (RecordKeyVector::const_iterator_type iter = recList.begin(); iter != recList.end(); iter = recList.next()) {
int maxStart = max((*iter)->getStartPos(), keyStart);
int minEnd = min((*iter)->getEndPos(), keyEnd);
_qsizes.push_back((int)(minEnd - maxStart));
if (_context->getObeySplits()) {
intersection += _blockMgr->getOverlapBases(hitIdx);
hitIdx++;
......@@ -107,4 +139,3 @@ unsigned long Fisher::getTotalIntersection(RecordKeyVector &recList)
_numIntersections += (int)recList.size();
return intersection;
}
#ifndef FISHER_H
#define FISHER_H
#include "bedFile.h"
#include "bedFilePE.h"
#include "ContextFisher.h"
class BlockMgr;
......@@ -19,11 +22,17 @@ private:
BlockMgr *_blockMgr;
unsigned long _intersectionVal;
unsigned long _unionVal;
bool _haveExclude;
int _numIntersections;
unsigned long _queryLen;
unsigned long _dbLen;
unsigned long _queryCounts;
unsigned long _dbCounts;
unsigned long _overlapCounts;
bool getFisher();
BedFile *exclude;
vector<int> _qsizes;
unsigned long getTotalIntersection(RecordKeyVector &hits);
};
......
......@@ -9,27 +9,32 @@ BIN_DIR = ../../bin/
INCLUDES = -I$(UTILITIES_DIR)/Contexts/ \
-I$(UTILITIES_DIR)/general/ \
-I$(UTILITIES_DIR)/fileType/ \
-I$(UTILITIES_DIR)/gzstream/ \
-I$(UTILITIES_DIR)/GenomeFile/ \
-I$(UTILITIES_DIR)/BamTools/include \
-I$(UTILITIES_DIR)/BamTools/src \
-I$(UTILITIES_DIR)/BlockedIntervals \
-I$(UTILITIES_DIR)/BamTools-Ancillary \
-I$(UTILITIES_DIR)/FileRecordTools/ \
-I$(UTILITIES_DIR)/FileRecordTools/FileReaders/ \
-I$(UTILITIES_DIR)/FileRecordTools/Records/ \
-I$(UTILITIES_DIR)/RecordOutputMgr/ \
-I$(UTILITIES_DIR)/gzstream/ \
-I$(UTILITIES_DIR)/bedFile/ \
-I$(UTILITIES_DIR)/bedFilePE/ \
-I$(UTILITIES_DIR)/GenomeFile/ \
-I$(UTILITIES_DIR)/lineFileUtilities/ \
-I$(UTILITIES_DIR)/BamTools/include \
-I$(UTILITIES_DIR)/BamTools/src \
-I$(UTILITIES_DIR)/BlockedIntervals \
-I$(UTILITIES_DIR)/BamTools-Ancillary \
-I$(UTILITIES_DIR)/FileRecordTools/ \
-I$(UTILITIES_DIR)/FileRecordTools/FileReaders/ \
-I$(UTILITIES_DIR)/FileRecordTools/Records/ \
-I$(UTILITIES_DIR)/RecordOutputMgr/ \
-I$(UTILITIES_DIR)/KeyListOps/ \
-I$(UTILITIES_DIR)/NewChromsweep \
-I$(UTILITIES_DIR)/VectorOps \
-I$(UTILITIES_DIR)/NewChromsweep \
-I$(UTILITIES_DIR)/VectorOps \
-I . \
-I$(UTILITIES_DIR)/BinTree \
-I$(UTILITIES_DIR)/version/
-I$(UTILITIES_DIR)/BinTree \
-I$(UTILITIES_DIR)/version/
# ----------------------------------
# define our source and object files
# ----------------------------------
SOURCES= fisherMain.cpp Fisher.cpp Fisher.h kfunc.c
SOURCES= fisherMain.cpp Fisher.cpp Fisher.h kfunc.c \
../utils/Contexts/ContextFisher.cpp \
../utils/Contexts/ContextFisher.h
OBJECTS= fisherMain.o Fisher.o
BUILT_OBJECTS= $(patsubst %,$(OBJ_DIR)/%,$(OBJECTS))
PROGRAM= Fisher
......
......@@ -32,13 +32,9 @@ void fisher_help(void) {
cerr << "\nTool: bedtools fisher (aka fisher)" << endl;
cerr << "Version: " << VERSION << "\n";
cerr << "Summary: Calculate Fisher statistic b/w two feature files."
<< endl
<< "\t Fisher is the length of the intersection over the union."
<< endl
<< "\t Values range from 0 (no intersection) to 1 (self intersection)."
<< endl << endl;
cerr << "Usage: " << PROGRAM_NAME << " [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf> -g <genome>" << endl << endl;
cerr << "Usage: " << PROGRAM_NAME << " [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf> -g <genome file>" << endl << endl;
cerr << "Options: " << endl;
......@@ -47,6 +43,9 @@ void fisher_help(void) {
cerr << "\t\t- Default is 1E-9 (i.e., 1bp)." << endl;
cerr << "\t\t- FLOAT (e.g. 0.50)" << endl << endl;
cerr << "\t-m\t" << "Merge overlapping intervals before" << endl;
cerr << "\t\t- looking at overlap." << endl << endl;
cerr << "\t-r\t" << "Require that the fraction overlap be reciprocal for A and B." << endl;
cerr << "\t\t- In other words, if -f is 0.90 and -r is used, this requires" << endl;
cerr << "\t\t that B overlap 90% of A and A _also_ overlaps 90% of B." << endl << endl;
......
......@@ -7,8 +7,7 @@
ContextFisher::ContextFisher() {
setSortedInput(true);
setUseMergedIntervals(true);
setUseMergedIntervals(false);
}
ContextFisher::~ContextFisher() {
......@@ -38,9 +37,16 @@ bool ContextFisher::parseCmdArgs(int argc, char **argv, int skipFirstArgs)
else if (strcmp(_argv[_i], "-S") == 0) {
if (!handle_S()) return false;
}
else if (strcmp(_argv[_i], "-exclude") == 0) {
if (!handle_exclude()) return false;
}
if (strcmp(_argv[_i], "-g") == 0) {
if (!handle_g()) return false;
}
if(strcmp(_argv[_i], "-m") == 0) {
markUsed(_i - _skipFirstArgs);
setUseMergedIntervals(true);
}
}
return ContextIntersect::parseCmdArgs(argc, argv, _skipFirstArgs);
}
......@@ -112,4 +118,19 @@ bool ContextFisher::handle_S() {
return false;
}
bool ContextFisher::handle_exclude()
{
if (_argc <= _i+1) {
_errorMsg = "\n***** ERROR: -exclude option given, but no file specified. *****";
return false;
}
do {
markUsed(_i - _skipFirstArgs);
_i++;
markUsed(_i - _skipFirstArgs);
} while (_argc > _i+1 && _argv[_i+1][0] != '-');
setExcludeFile(string(_argv[_i]));
return true;
}
......@@ -17,10 +17,17 @@ public:
~ContextFisher();
virtual bool parseCmdArgs(int argc, char **argv, int skipFirstArgs);
virtual bool isValidState();
string getExcludeFile() { return _excludeFile; }
void setExcludeFile(string excludeFile) { _excludeFile = excludeFile; }
private:
bool handle_s();
bool handle_S();
bool handle_exclude();
string _excludeFile;
};
#endif /* CONTEXTFISHER_H_ */
This diff is collapsed.
......@@ -51,6 +51,9 @@ public:
unsigned long getQueryTotalRecordLength() { return _queryRecordsTotalLength; }
unsigned long getDatabaseTotalRecordLength() { return _databaseRecordsTotalLength; }
unsigned long getQueryTotalRecords() { return _queryTotalRecords; }
unsigned long getDatabaseTotalRecords() { return _databaseTotalRecords; }
protected:
ContextIntersect *_context;
FileRecordMgr *_queryFRM;
......@@ -63,6 +66,9 @@ protected:
unsigned long _databaseRecordsTotalLength;
unsigned long _queryTotalRecords;
unsigned long _databaseTotalRecords;
bool _wasInitialized;
// a cache of still active features from the database file
......@@ -98,4 +104,4 @@ protected:
};
#endif /* NewChromSweep_H */
\ No newline at end of file
#endif /* NewChromSweep_H */
......@@ -657,12 +657,24 @@ void BedFile::setBed12 (bool isBed12) { this->isBed12 = isBed12; }
void BedFile::loadBedFileIntoMap() {
BED bedEntry;
Open();
while (GetNextBed(bedEntry)) {
if (_status == BED_VALID) {
BIN bin = getBin(bedEntry.start, bedEntry.end);
bedMap[bedEntry.chrom][bin].push_back(bedEntry);
addBEDIntoMap(bedEntry);
}
}
Close();
}
void BedFile::loadBedFileIntoMergedMap() {
BED bedEntry;
Open();
while (GetNextMergedBed(bedEntry)) {
if (_status == BED_VALID) {
addBEDIntoMap(bedEntry);
}
}
Close();
......
......@@ -418,6 +418,7 @@ public:
// load a BED file into a map keyed by chrom, then bin. value is
// vector of BEDs
void loadBedFileIntoMap();
void loadBedFileIntoMergedMap();
// load a BED entry into and existing map
void addBEDIntoMap(BED bedEntry);
......
Fisher Testing
==============
Fisher is now based on the count of interval overlaps, subject to `-f`.
We can compare the output of fisher on simulated data by running `python sim.py`
which will show the output from `bedtools fisher` and then running `bash shuf.sh`
which will repeatedly run
```Shell
bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l
```
and then report the proportion of times that number is >= the observed intersection.
In `sim.py` changing the lenght of the intervals (via `maxA`, `maxB`) has the greatest effect on the
correspondence of the simulated p-value from `shuf.sh` and the one from `fisher`. The right-tailed p-value
from `fisher` should correspond well to the value from the simulation.
set -eo pipefail
echo "fisher,shuffled"
for i in $(seq 1000); do
fisher=$(python ./sim.py | tail -1 | cut -f 2)
shuffle=$(bash shuf.sh)
echo "$fisher,$shuffle"
done
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
fig, axs = plt.subplots(3, figsize=(4, 12))
df = pd.read_csv(sys.argv[1])
axs[0].scatter(df.fisher, df.shuffled, s=4)
axs[0].set_xlim(0, 1)
axs[0].set_ylim(0, 1)
axs[0].set_xlabel('fisher p-value')
axs[0].set_ylabel('shuffled p-value')
axs[0].plot([0, 1], [0, 1], ls='--')
x = -np.log10(df.fisher)
y = -np.log10(df.shuffled)
m = int(max(x.max(), y.max())) + 1
axs[1].scatter(x, y, s=4)
axs[1].set_xlim(0, m)
axs[1].set_ylim(0, m)
axs[1].set_xlabel('-log10(fisher p-value)')
axs[1].set_ylabel('-log10(shuffled p-value)')
axs[1].plot([0, m], [0, m], ls='--')
x = -np.log10(1 - np.minimum(1-1e-6, df.fisher))
y = -np.log10(1 - np.minimum(1-1e-6, df.shuffled))
m = int(max(x.max(), y.max())) + 1
axs[2].scatter(x, y, s=4)
axs[2].set_xlim(0, m)
axs[2].set_ylim(0, m)
axs[2].set_xlabel('-log10(1 - fisher p-value)')
axs[2].set_ylabel('-log10(1 - shuffled p-value)')
axs[2].plot([0, m], [0, m], ls='--')
plt.tight_layout()
plt.savefig(sys.argv[1].replace('.txt', '') + '.png')
fig.show()
set -eo pipefail
obs=$(bedtools intersect -wo -a taa.bed -b tbb.bed | wc -l)
p=$(seq 100 | xargs -P 7 -n 1 bash -c "bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l " | awk -vobs=$obs '{s += ($1 > obs)}END{print (1 + s) / (1 + NR)}')
if [ '1' -eq $(echo $p'< 0.1' | bc -l) ] || [ '1' -eq $(echo $p'> 0.9' | bc -l) ]; then
p=$(seq 1000 | xargs -P 7 -n 1 bash -c "bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l " | awk -vobs=$obs '{s += ($1 > obs)}END{print (1 + s) / (1 + NR)}')
fi
if [ '1' -eq $(echo $p'< 0.01' | bc -l) ] || [ '1' -eq $(echo $p'> 0.99' | bc -l) ]; then
p=$(seq 5000 | xargs -P 7 -n 1 bash -c "bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l " | awk -vobs=$obs '{s += ($1 > obs)}END{print (1 + s) / (1 + NR)}')
fi
echo $p
import sys
from random import randint
from subprocess import check_output
nA = 3000
minA, maxA = (20, 5250)
#minA, maxA = (200, 250)
bIntervals = [(x[0], int(x[1]), int(x[2])) for x in (l.split("\t") for l in
open('hg19.knownCanonical.bed')) if x[0] == "chr1" ]
bIntervals.sort()
genome_size = max(b[2] for b in bIntervals) + 50000
with open('tbb.bed', 'w') as fh:
for chrom, start, end in bIntervals:
fh.write("\t".join((chrom, str(start), str(end))) + "\n")
with open('taa.bed', 'w') as fh:
vals = []
for i in range(nA):
s = randint(0, genome_size - maxA)
e = randint(s + minA, min(genome_size, s + maxA))
vals.append((s, e))
for s, e in sorted(vals):
fh.write("chr1\t%i\t%i\n" % (s, e))
fh.flush()
print >> open('tgg.genome', 'w'), ("chr1\t%i" % genome_size)
# NOTE: add -m here to make merged output
print check_output("../../bin/bedtools fisher -a taa.bed -b tbb.bed -g tgg.genome", shell=True).strip()
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment