diff --git a/docs/content/images/fisher/res-genes.png b/docs/content/images/fisher/res-genes.png new file mode 100644 index 0000000000000000000000000000000000000000..db3ba07893ccfc3c0dfb4092a63851373237836e Binary files /dev/null and b/docs/content/images/fisher/res-genes.png differ diff --git a/docs/content/tools/fisher.rst b/docs/content/tools/fisher.rst index aa771c07821cad0144acaa06628d4a7211d1843f..bd2f57f7b21a233153e9dab691e4d9f42b8c2373 100644 --- a/docs/content/tools/fisher.rst +++ b/docs/content/tools/fisher.rst @@ -4,7 +4,8 @@ *fisher* ######## -Perform fisher's exact test on the non/overlap between 2 files. +Perform fisher's exact test on the number of overlaps/unique intervals between +2 files. | @@ -14,7 +15,9 @@ spatially, we resort to shuffling the genome and checking the simulated scenarios using `Fisher's Exact Test`_ . - +This implementation can calculate the number of overlaps and the number +of intervals unique to each file and it infers (or accepts) the number +that are not present in each file. Given a pair of input files `-a` and `-b` in the usual BedTools parlance: @@ -23,9 +26,11 @@ Given a pair of input files `-a` and `-b` in the usual BedTools parlance: $ cat a.bed chr1 10 20 chr1 30 40 + chr1 51 52 $ cat b.bed chr1 15 25 + chr1 51 52 And a genome of 500 bases: @@ -40,44 +45,95 @@ can do this with ``fisher`` as: .. code-block:: bash $ bedtools fisher -a a.bed -b b.bed -g t.genome - # Contingency Table + # Number of query intervals: 3 + # Number of db intervals: 2 + # Number of overlaps: 2 + # Number of possible intervals (estimated): 37 + # phyper(2 - 1, 3, 37 - 3, 2, lower.tail=F) + # Contingency Table Of Counts #_________________________________________ - # | not in -b | in -b | - # not in -a | 475 | 5 | - # in -a | 15 | 5 | + # | in -b | not in -b | + # in -a | 2 | 1 | + # not in -a | 0 | 34 | #_________________________________________ # p-values for fisher's exact test left right two-tail ratio - 1 1.3466e-05 1.3466e-05 31.667 + 1 0.0045045 0.0045045 inf Where we can see the constructed contingency table and the pvalues for left, right and two-tail tests. -From here, we can say that given **500 bases** of genome, it is unlikely that a region of -20 bases total from `-a` and 10 bases total from `-b` would share 5 bases if the regions -were randomly distributed. *Consult your statistician for a more precise definition*. +From here, we can say that given **500 bases** of genome, it is unlikely that we +would see **as many** overlaps as we do if the intervals from `a` and `b` were not +related. + +.. note:: -The above was highly significant, but if our genome were only **50 bases**: + the total number of **possible** intervals in the above example was + estimated to be 37. This is based on a heuristic that uses the mean sizes of + intervals in the `a` and `b` sets and the size of the genome. The reported + p-value will depend greatly on this. Below, we show how well the reported + value matches with simulations. + +The above had a fairly low p-value (0.0045), but if our genome were only **60 bases**: .. code-block:: bash - $ echo -e "chr1\t50" > t.genome + $ echo -e "chr1\t60" > t.genome $ bedtools fisher -a a.bed -b b.bed -g t.genome - # Contingency Table + # Number of query intervals: 3 + # Number of db intervals: 2 + # Number of overlaps: 2 + # Number of possible intervals (estimated): 4 + # phyper(2 - 1, 3, 4 - 3, 2, lower.tail=F) + # Contingency Table Of Counts #_________________________________________ - # | not in -b | in -b | - # not in -a | 25 | 15 | - # in -a | 5 | 5 | + # | in -b | not in -b | + # in -a | 2 | 1 | + # not in -a | 0 | 1 | #_________________________________________ # p-values for fisher's exact test left right two-tail ratio - 0.86011 0.35497 0.49401 1.667 + 1 0.5 1 inf We can see that neither tail is significant. Intuitively, this makes sense; -if we randomly place 20 (from `-a`), and 10 (from `-b`) bases of intervals -within 50 bases, it doesn't seem unlikely that we'd see 5 bases of overlap. +if we randomly place 3 intervals (from `-a`), and 2 (from `-b`) intervals from +within 60 bases, it doesn't seem unlikely that we'd see 2 overlaps. + +Note also that since the genome size is much smaller, the number of possible +intervals is also decreased. + +========== +Evaluation +========== + +The p-value depends on knowing or inferring the total number of possible +intervals (to fill in the lower right corner of the contingency table). This +inference is not straightforward since we will most likely have variable sized +intervals within and between files. Below, we show the correspondence of +the p-value reported by `fisher` and one from simulated data. + +.. figure:: ../images/fisher/res-genes.png + :alt: plot of p-values of fisher vs. simulated + :figclass: align-center + + The comparison of the p-value from 'fisher' to one derived by simulation + (see tests/fisher/ for details). The top plot shows the p-value distribution. + Since we are most interested in extreme p-values, the middle plot shows + -log10(p-value). The bottom plot is the same as the middle except looking at + the other tail of the p-value distribution. + + Note that we do see inflation from the fisher test, but we do not see + 'false-negatives'--that is, bedtools fisher is less likely to miss 'true' + candidates, but it will give many candidates for further exploration. As + such we recommend validating low p-values from fisher using simulation. + + This evaluation used all known canonical genes on chromosome 1 and repeatedly + (1000 times) randomly generated 3000 intervals between 20 and 5250 bases. It + then calculated the p-value for each set using fisher and then using shuffled + data. .. note:: diff --git a/src/fisher/Fisher.cpp b/src/fisher/Fisher.cpp index 103456b64904a9db210a2bc6addfb8f0e8146703..3fb420c9a3acaaa10a8e7a4134fa7ac153455137 100644 --- a/src/fisher/Fisher.cpp +++ b/src/fisher/Fisher.cpp @@ -9,9 +9,21 @@ Fisher::Fisher(ContextFisher *context) _unionVal(0), _numIntersections(0), _queryLen(0), - _dbLen(0) + _dbLen(0), + _queryCounts(0), + _dbCounts(0), + _overlapCounts(0) + { _blockMgr = new BlockMgr(_context->getOverlapFraction(), _context->getReciprocal()); + _haveExclude = false; + + if(!(_context->getExcludeFile().empty())){ + string ex = _context->getExcludeFile(); + exclude = new BedFile(ex); + exclude->loadBedFileIntoMergedMap(); + _haveExclude = true; + } } Fisher::~Fisher(void) { @@ -26,30 +38,42 @@ bool Fisher::calculate() { } // header - cout << "# Contingency Table" << endl; - // for fisher's exact test, we need the contingency table - // XXXXXXXX | not in A | in A - // not in B | n11: in neither | n12: only in A - // in B | n21: only in B | n22: in A & B - // double left, right, two; long long genomeSize = _context->getGenomeFile()->getGenomeSize(); - // bases covered by neither a nor b - long long n11 = genomeSize - _queryLen - _dbLen + _intersectionVal; - // bases covered only by -b - long long n12 = _dbLen - _intersectionVal; - // bases covered only by -a - long long n21 = _queryLen - _intersectionVal; - // bases covered by both - long long n22 = _intersectionVal; - - printf("#__________________________________________\n"); - printf("# | %-12s | %-12s |\n", "not in -b", "in -b"); - printf("# not in -a | %-12lld | %-12lld |\n", n11, n12); - printf("# in -a | %-12lld | %-12lld |\n", n21, n22); - printf("#__________________________________________\n"); + if(_haveExclude){ + genomeSize -= exclude->getTotalFlattenedLength(); + } + // bases covered by neither + long long n22_full_bases = genomeSize; + //long long n22_bases = genomeSize - _queryLen - _dbLen + _intersectionVal; + long double dMean = 1.0 + _dbLen / (long double)_dbCounts; + long double qMean = 1.0 + _queryLen / (long double)_queryCounts; + + // heursitic, but seems to work quite well -- better than doing more intuitive sum then divide. + long double bMean = (qMean + dMean); + //bMean = (_unionVal + 2.0 * _intersectionVal) / (long double)(_dbCounts + _queryCounts); + + long long n11 = (long)_overlapCounts; + // this could be < 0 because multiple overlaps + long long n12 = (long)max(0L, (long)_queryCounts - (long)_overlapCounts); + long long n21 = max(0L, (long)(_dbCounts - _overlapCounts)); + long long n22_full = max(n21 + n21 + n11, (long long)(n22_full_bases / bMean)); + long long n22 = max(0L, (long)(n22_full - n12 - n21 - n11)); + + printf("# Number of query intervals: %lu\n", _queryCounts); + printf("# Number of db intervals: %lu\n", _dbCounts); + printf("# Number of overlaps: %lu\n", _overlapCounts); + printf("# Number of possible intervals (estimated): %lld\n", n22_full); + + printf("# phyper(%lld - 1, %lu, %lld - %lu, %lu, lower.tail=F)\n", n11, _queryCounts, n22_full, _queryCounts, _dbCounts); + cout << "# Contingency Table Of Counts" << endl; + printf("#_________________________________________\n"); + printf("# | %-12s | %-12s |\n", " in -b", "not in -b"); + printf("# in -a | %-12lld | %-12lld |\n", n11, n12); + printf("# not in -a | %-12lld | %-12lld |\n", n21, n22); + printf("#_________________________________________\n"); kt_fisher_exact(n11, n12, n21, n22, &left, &right, &two); double ratio = ((double)n11 / (double)n12) / ((double)n21 / (double)n22); @@ -57,7 +81,7 @@ bool Fisher::calculate() { printf("# p-values for fisher's exact test\n"); printf("left\tright\ttwo-tail\tratio\n"); printf("%.5g\t%.5g\t%.5g\t%.3f\n", left, right, two, ratio); - + return true; } @@ -66,6 +90,7 @@ bool Fisher::getFisher() { if (!sweep.init()) { return false; } + RecordKeyVector hitSet; while (sweep.next(hitSet)) { if (_context->getObeySplits()) { @@ -82,6 +107,9 @@ bool Fisher::getFisher() { _queryLen = sweep.getQueryTotalRecordLength(); _dbLen = sweep.getDatabaseTotalRecordLength(); + _queryCounts = sweep.getQueryTotalRecords(); + _dbCounts = sweep.getDatabaseTotalRecords(); + _unionVal = _queryLen + _dbLen; return true; } @@ -93,10 +121,14 @@ unsigned long Fisher::getTotalIntersection(RecordKeyVector &recList) int keyStart = key->getStartPos(); int keyEnd = key->getEndPos(); + _overlapCounts += recList.size(); + _qsizes.push_back((keyEnd - keyStart)); + int hitIdx = 0; for (RecordKeyVector::const_iterator_type iter = recList.begin(); iter != recList.end(); iter = recList.next()) { int maxStart = max((*iter)->getStartPos(), keyStart); int minEnd = min((*iter)->getEndPos(), keyEnd); + _qsizes.push_back((int)(minEnd - maxStart)); if (_context->getObeySplits()) { intersection += _blockMgr->getOverlapBases(hitIdx); hitIdx++; @@ -107,4 +139,3 @@ unsigned long Fisher::getTotalIntersection(RecordKeyVector &recList) _numIntersections += (int)recList.size(); return intersection; } - diff --git a/src/fisher/Fisher.h b/src/fisher/Fisher.h index 01f2157d32343f0ca08449f453d2380b526db69a..c9d1850e5b1015b87938a0e31495798c79984efb 100644 --- a/src/fisher/Fisher.h +++ b/src/fisher/Fisher.h @@ -1,6 +1,9 @@ #ifndef FISHER_H #define FISHER_H +#include "bedFile.h" +#include "bedFilePE.h" + #include "ContextFisher.h" class BlockMgr; @@ -19,11 +22,17 @@ private: BlockMgr *_blockMgr; unsigned long _intersectionVal; unsigned long _unionVal; + bool _haveExclude; int _numIntersections; unsigned long _queryLen; unsigned long _dbLen; + unsigned long _queryCounts; + unsigned long _dbCounts; + unsigned long _overlapCounts; bool getFisher(); + BedFile *exclude; + vector<int> _qsizes; unsigned long getTotalIntersection(RecordKeyVector &hits); }; diff --git a/src/fisher/Makefile b/src/fisher/Makefile index b94a5097c99b13b3b5924c6b4f3c11ad2bf19b40..e9b66511b73e4ec1c7c33a0731dec33d3b236436 100644 --- a/src/fisher/Makefile +++ b/src/fisher/Makefile @@ -9,27 +9,32 @@ BIN_DIR = ../../bin/ INCLUDES = -I$(UTILITIES_DIR)/Contexts/ \ -I$(UTILITIES_DIR)/general/ \ -I$(UTILITIES_DIR)/fileType/ \ - -I$(UTILITIES_DIR)/gzstream/ \ - -I$(UTILITIES_DIR)/GenomeFile/ \ - -I$(UTILITIES_DIR)/BamTools/include \ - -I$(UTILITIES_DIR)/BamTools/src \ - -I$(UTILITIES_DIR)/BlockedIntervals \ - -I$(UTILITIES_DIR)/BamTools-Ancillary \ - -I$(UTILITIES_DIR)/FileRecordTools/ \ - -I$(UTILITIES_DIR)/FileRecordTools/FileReaders/ \ - -I$(UTILITIES_DIR)/FileRecordTools/Records/ \ - -I$(UTILITIES_DIR)/RecordOutputMgr/ \ + -I$(UTILITIES_DIR)/gzstream/ \ + -I$(UTILITIES_DIR)/bedFile/ \ + -I$(UTILITIES_DIR)/bedFilePE/ \ + -I$(UTILITIES_DIR)/GenomeFile/ \ + -I$(UTILITIES_DIR)/lineFileUtilities/ \ + -I$(UTILITIES_DIR)/BamTools/include \ + -I$(UTILITIES_DIR)/BamTools/src \ + -I$(UTILITIES_DIR)/BlockedIntervals \ + -I$(UTILITIES_DIR)/BamTools-Ancillary \ + -I$(UTILITIES_DIR)/FileRecordTools/ \ + -I$(UTILITIES_DIR)/FileRecordTools/FileReaders/ \ + -I$(UTILITIES_DIR)/FileRecordTools/Records/ \ + -I$(UTILITIES_DIR)/RecordOutputMgr/ \ -I$(UTILITIES_DIR)/KeyListOps/ \ - -I$(UTILITIES_DIR)/NewChromsweep \ - -I$(UTILITIES_DIR)/VectorOps \ + -I$(UTILITIES_DIR)/NewChromsweep \ + -I$(UTILITIES_DIR)/VectorOps \ -I . \ - -I$(UTILITIES_DIR)/BinTree \ - -I$(UTILITIES_DIR)/version/ + -I$(UTILITIES_DIR)/BinTree \ + -I$(UTILITIES_DIR)/version/ # ---------------------------------- # define our source and object files # ---------------------------------- -SOURCES= fisherMain.cpp Fisher.cpp Fisher.h kfunc.c +SOURCES= fisherMain.cpp Fisher.cpp Fisher.h kfunc.c \ + ../utils/Contexts/ContextFisher.cpp \ + ../utils/Contexts/ContextFisher.h OBJECTS= fisherMain.o Fisher.o BUILT_OBJECTS= $(patsubst %,$(OBJ_DIR)/%,$(OBJECTS)) PROGRAM= Fisher diff --git a/src/fisher/fisherMain.cpp b/src/fisher/fisherMain.cpp index 755dc97a133c91db1e70cd49bc1919ab730636a2..584664e198bbdfc8ae0b854777f7f9122494e574 100644 --- a/src/fisher/fisherMain.cpp +++ b/src/fisher/fisherMain.cpp @@ -32,13 +32,9 @@ void fisher_help(void) { cerr << "\nTool: bedtools fisher (aka fisher)" << endl; cerr << "Version: " << VERSION << "\n"; cerr << "Summary: Calculate Fisher statistic b/w two feature files." - << endl - << "\t Fisher is the length of the intersection over the union." - << endl - << "\t Values range from 0 (no intersection) to 1 (self intersection)." << endl << endl; - cerr << "Usage: " << PROGRAM_NAME << " [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf> -g <genome>" << endl << endl; + cerr << "Usage: " << PROGRAM_NAME << " [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf> -g <genome file>" << endl << endl; cerr << "Options: " << endl; @@ -47,6 +43,9 @@ void fisher_help(void) { cerr << "\t\t- Default is 1E-9 (i.e., 1bp)." << endl; cerr << "\t\t- FLOAT (e.g. 0.50)" << endl << endl; + cerr << "\t-m\t" << "Merge overlapping intervals before" << endl; + cerr << "\t\t- looking at overlap." << endl << endl; + cerr << "\t-r\t" << "Require that the fraction overlap be reciprocal for A and B." << endl; cerr << "\t\t- In other words, if -f is 0.90 and -r is used, this requires" << endl; cerr << "\t\t that B overlap 90% of A and A _also_ overlaps 90% of B." << endl << endl; diff --git a/src/utils/Contexts/ContextFisher.cpp b/src/utils/Contexts/ContextFisher.cpp index 90f07282602dc1903765707cab819c025610866a..2e9387dddcab9d436e70741afa11d5b8f6e91a14 100644 --- a/src/utils/Contexts/ContextFisher.cpp +++ b/src/utils/Contexts/ContextFisher.cpp @@ -7,8 +7,7 @@ ContextFisher::ContextFisher() { setSortedInput(true); - setUseMergedIntervals(true); - + setUseMergedIntervals(false); } ContextFisher::~ContextFisher() { @@ -38,9 +37,16 @@ bool ContextFisher::parseCmdArgs(int argc, char **argv, int skipFirstArgs) else if (strcmp(_argv[_i], "-S") == 0) { if (!handle_S()) return false; } + else if (strcmp(_argv[_i], "-exclude") == 0) { + if (!handle_exclude()) return false; + } if (strcmp(_argv[_i], "-g") == 0) { if (!handle_g()) return false; } + if(strcmp(_argv[_i], "-m") == 0) { + markUsed(_i - _skipFirstArgs); + setUseMergedIntervals(true); + } } return ContextIntersect::parseCmdArgs(argc, argv, _skipFirstArgs); } @@ -112,4 +118,19 @@ bool ContextFisher::handle_S() { return false; } +bool ContextFisher::handle_exclude() +{ + if (_argc <= _i+1) { + _errorMsg = "\n***** ERROR: -exclude option given, but no file specified. *****"; + return false; + } + + do { + markUsed(_i - _skipFirstArgs); + _i++; + markUsed(_i - _skipFirstArgs); + } while (_argc > _i+1 && _argv[_i+1][0] != '-'); + setExcludeFile(string(_argv[_i])); + return true; +} diff --git a/src/utils/Contexts/ContextFisher.h b/src/utils/Contexts/ContextFisher.h index 49e7dcb2e49a0384d348d74cd0c739e9f6a7fa00..fd1d8ada4b414303a9ab225f72dbf2f81fd88273 100644 --- a/src/utils/Contexts/ContextFisher.h +++ b/src/utils/Contexts/ContextFisher.h @@ -17,10 +17,17 @@ public: ~ContextFisher(); virtual bool parseCmdArgs(int argc, char **argv, int skipFirstArgs); virtual bool isValidState(); + string getExcludeFile() { return _excludeFile; } + void setExcludeFile(string excludeFile) { _excludeFile = excludeFile; } + private: bool handle_s(); bool handle_S(); + bool handle_exclude(); + string _excludeFile; + + }; #endif /* CONTEXTFISHER_H_ */ diff --git a/src/utils/NewChromsweep/NewChromsweep.cpp b/src/utils/NewChromsweep/NewChromsweep.cpp index 0b84e6a545c710d83b4f488bffce404d71a046c5..82a9860deb8e16d5afc317f10f15875fd649c27f 100644 --- a/src/utils/NewChromsweep/NewChromsweep.cpp +++ b/src/utils/NewChromsweep/NewChromsweep.cpp @@ -1,252 +1,258 @@ -/***************************************************************************** - NewChromsweep.cpp - - (c) 2009 - Aaron Quinlan - Hall Laboratory - Department of Biochemistry and Molecular Genetics - University of Virginia - aaronquinlan@gmail.com - - Licenced under the GNU General Public License 2.0 license. -******************************************************************************/ - -#include "NewChromsweep.h" -#include "ContextIntersect.h" -#include "FileRecordMgr.h" - -NewChromSweep::NewChromSweep(ContextIntersect *context) -: _context(context), - _queryFRM(NULL), - _numDBs(_context->getNumDatabaseFiles()), - _queryRecordsTotalLength(0), - _databaseRecordsTotalLength(0), - _wasInitialized(false), - _currQueryRec(NULL), - _runToQueryEnd(false) -{ -} - - -bool NewChromSweep::init() { - - //Create new FileRecordMgrs for the input files. - //Open them, and get the first record from each. - //otherwise, return true. - _queryFRM = _context->getFile(_context->getQueryFileIdx()); - - _dbFRMs.resize(_numDBs, NULL); - for (int i=0; i < _numDBs; i++) { - _dbFRMs[i] = _context->getDatabaseFile(i); - } - - _currDbRecs.resize(_numDBs, NULL); - for (int i=0; i < _numDBs; i++) { - nextRecord(false, i); - } - - _caches.resize(_numDBs); - - //determine whether to stop when the database end is hit, or keep going until the - //end of the query file is hit as well. - - if (_context->getNoHit() || _context->getWriteCount() || _context->getWriteOverlap() || _context->getWriteAllOverlap() || _context->getLeftJoin()) { - _runToQueryEnd = true; - } - _wasInitialized = true; - return true; - } - -void NewChromSweep::closeOut() { - while (!_queryFRM->eof()) { - nextRecord(true); - } - - for (int i=0; i < _numDBs; i++) { - while (!_dbFRMs[i]->eof()) { - nextRecord(false, i); - } - } -} - -NewChromSweep::~NewChromSweep(void) { - if (!_wasInitialized) { - return; - } - _queryFRM->deleteRecord(_currQueryRec); - _currQueryRec = NULL; - - - for (int i=0; i < _numDBs; i++) { - _dbFRMs[i]->deleteRecord(_currDbRecs[i]); - _currDbRecs[i] = NULL; - } - - _queryFRM->close(); - - for (int i=0; i < _numDBs; i++) { - _dbFRMs[i]->close(); - } -} - - -void NewChromSweep::scanCache(int dbIdx, RecordKeyVector &retList) { - recListIterType cacheIter = _caches[dbIdx].begin(); - while (cacheIter != _caches[dbIdx].end()) - { - const Record *cacheRec = cacheIter->value(); - if (_currQueryRec->sameChrom(cacheRec) && !(_currQueryRec->after(cacheRec))) { - if (intersects(_currQueryRec, cacheRec)) { - retList.push_back(cacheRec); - } else break; // cacheRec is after the query rec, stop scanning. - cacheIter = _caches[dbIdx].next(); - } - else { - cacheIter = _caches[dbIdx].deleteCurrent(); - _dbFRMs[dbIdx]->deleteRecord(cacheRec); - } - } -} - -void NewChromSweep::clearCache(int dbIdx) -{ - //delete all objects pointed to by cache - recListType &cache = _caches[dbIdx]; - for (recListIterType iter = cache.begin(); iter != cache.end(); iter = cache.next()) { - _dbFRMs[dbIdx]->deleteRecord(iter->value()); - } - cache.clear(); -} - -void NewChromSweep::masterScan(RecordKeyVector &retList) { - for (int i=0; i < _numDBs; i++) { - if (dbFinished(i) || chromChange(i, retList)) { - continue; - } else { - - // scan the database cache for hits - scanCache(i, retList); - //skip if we hit the end of the DB - // advance the db until we are ahead of the query. update hits and cache as necessary - while (_currDbRecs[i] != NULL && - _currQueryRec->sameChrom(_currDbRecs[i]) && - !(_currDbRecs[i]->after(_currQueryRec))) { - if (intersects(_currQueryRec, _currDbRecs[i])) { - retList.push_back(_currDbRecs[i]); - } - if (_currQueryRec->after(_currDbRecs[i])) { - _dbFRMs[i]->deleteRecord(_currDbRecs[i]); - _currDbRecs[i] = NULL; - } else { - _caches[i].push_back(_currDbRecs[i]); - _currDbRecs[i] = NULL; - } - nextRecord(false, i); - } - } - } -} - -bool NewChromSweep::chromChange(int dbIdx, RecordKeyVector &retList) -{ - // the files are on the same chrom - if (_currDbRecs[dbIdx] == NULL || _currQueryRec->sameChrom(_currDbRecs[dbIdx])) { - return false; - } - // the query is ahead of the database. fast-forward the database to catch-up. - if (_currQueryRec->chromAfter(_currDbRecs[dbIdx])) { - - while (_currDbRecs[dbIdx] != NULL && - _currQueryRec->chromAfter(_currDbRecs[dbIdx])) { - _dbFRMs[dbIdx]->deleteRecord(_currDbRecs[dbIdx]); - nextRecord(false, dbIdx); - } - clearCache(dbIdx); - return false; - } - // the database is ahead of the query. - else { - // 1. scan the cache for remaining hits on the query's current chrom. - scanCache(dbIdx, retList); - - return true; - } - - //control can't reach here, but compiler still wants a return statement. - return true; -} - - -bool NewChromSweep::next(RecordKeyVector &retList) { - retList.clearVector(); - if (_currQueryRec != NULL) { - _queryFRM->deleteRecord(_currQueryRec); - } - - if (!nextRecord(true)) return false; // query EOF hit - - - if (allCurrDBrecsNull() && allCachesEmpty() && !_runToQueryEnd) { - return false; - } - _currChromName = _currQueryRec->getChrName(); - - masterScan(retList); - - if (_context->getSortOutput()) { - retList.sortVector(); - } - - retList.setKey(_currQueryRec); - return true; -} - -bool NewChromSweep::nextRecord(bool query, int dbIdx) { - if (query) { - _currQueryRec = _queryFRM->getNextRecord(); - if (_currQueryRec != NULL) { - _queryRecordsTotalLength += (unsigned long)(_currQueryRec->getEndPos() - _currQueryRec->getStartPos()); - return true; - } - return false; - } else { //database - Record *rec = _dbFRMs[dbIdx]->getNextRecord(); - _currDbRecs[dbIdx] = rec; - if (rec != NULL) { - _databaseRecordsTotalLength += (unsigned long)(rec->getEndPos() - rec->getStartPos()); - return true; - } - return false; - } -} - -bool NewChromSweep::intersects(const Record *rec1, const Record *rec2) const -{ - return rec1->sameChromIntersects(rec2, _context->getSameStrand(), _context->getDiffStrand(), - _context->getOverlapFraction(), _context->getReciprocal()); -} - - -bool NewChromSweep::allCachesEmpty() { - for (int i=0; i < _numDBs; i++) { - if (!_caches[i].empty()) { - return false; - } - } - return true; -} - -bool NewChromSweep::allCurrDBrecsNull() { - for (int i=0; i < _numDBs; i++) { - if (_currDbRecs[i] != NULL) { - return false; - } - } - return true; -} - -bool NewChromSweep::dbFinished(int dbIdx) { - if (_currDbRecs[dbIdx] == NULL && _caches[dbIdx].empty()) { - return true; - } - return false; -} +/***************************************************************************** + NewChromsweep.cpp + + (c) 2009 - Aaron Quinlan + Hall Laboratory + Department of Biochemistry and Molecular Genetics + University of Virginia + aaronquinlan@gmail.com + + Licenced under the GNU General Public License 2.0 license. +******************************************************************************/ + +#include "NewChromsweep.h" +#include "ContextIntersect.h" +#include "FileRecordMgr.h" + +NewChromSweep::NewChromSweep(ContextIntersect *context, + bool useMergedIntervals) +: _context(context), + _queryFRM(NULL), + _numDBs(_context->getNumDatabaseFiles()), + _useMergedIntervals(false), + _queryRecordsTotalLength(0), + _databaseRecordsTotalLength(0), + _queryTotalRecords(0), + _databaseTotalRecords(0), + _wasInitialized(false), + _currQueryRec(NULL), + _runToQueryEnd(false) +{ +} + + +bool NewChromSweep::init() { + + //Create new FileRecordMgrs for the input files. + //Open them, and get the first record from each. + //otherwise, return true. + _queryFRM = _context->getFile(_context->getQueryFileIdx()); + + _dbFRMs.resize(_numDBs, NULL); + for (int i=0; i < _numDBs; i++) { + _dbFRMs[i] = _context->getDatabaseFile(i); + } + + _currDbRecs.resize(_numDBs, NULL); + for (int i=0; i < _numDBs; i++) { + nextRecord(false, i); + } + + _caches.resize(_numDBs); + + //determine whether to stop when the database end is hit, or keep going until the + //end of the query file is hit as well. + + if (_context->getNoHit() || _context->getWriteCount() || _context->getWriteOverlap() || _context->getWriteAllOverlap() || _context->getLeftJoin()) { + _runToQueryEnd = true; + } + _wasInitialized = true; + return true; + } + +void NewChromSweep::closeOut() { + while (!_queryFRM->eof()) { + nextRecord(true); + } + + for (int i=0; i < _numDBs; i++) { + while (!_dbFRMs[i]->eof()) { + nextRecord(false, i); + } + } +} + +NewChromSweep::~NewChromSweep(void) { + if (!_wasInitialized) { + return; + } + _queryFRM->deleteRecord(_currQueryRec); + _currQueryRec = NULL; + + + for (int i=0; i < _numDBs; i++) { + _dbFRMs[i]->deleteRecord(_currDbRecs[i]); + _currDbRecs[i] = NULL; + } + + _queryFRM->close(); + + for (int i=0; i < _numDBs; i++) { + _dbFRMs[i]->close(); + } +} + + +void NewChromSweep::scanCache(int dbIdx, RecordKeyVector &retList) { + recListIterType cacheIter = _caches[dbIdx].begin(); + while (cacheIter != _caches[dbIdx].end()) + { + const Record *cacheRec = cacheIter->value(); + if (_currQueryRec->sameChrom(cacheRec) && !(_currQueryRec->after(cacheRec))) { + if (intersects(_currQueryRec, cacheRec)) { + retList.push_back(cacheRec); + } else break; // cacheRec is after the query rec, stop scanning. + cacheIter = _caches[dbIdx].next(); + } + else { + cacheIter = _caches[dbIdx].deleteCurrent(); + _dbFRMs[dbIdx]->deleteRecord(cacheRec); + } + } +} + +void NewChromSweep::clearCache(int dbIdx) +{ + //delete all objects pointed to by cache + recListType &cache = _caches[dbIdx]; + for (recListIterType iter = cache.begin(); iter != cache.end(); iter = cache.next()) { + _dbFRMs[dbIdx]->deleteRecord(iter->value()); + } + cache.clear(); +} + +void NewChromSweep::masterScan(RecordKeyVector &retList) { + for (int i=0; i < _numDBs; i++) { + if (dbFinished(i) || chromChange(i, retList)) { + continue; + } else { + + // scan the database cache for hits + scanCache(i, retList); + //skip if we hit the end of the DB + // advance the db until we are ahead of the query. update hits and cache as necessary + while (_currDbRecs[i] != NULL && + _currQueryRec->sameChrom(_currDbRecs[i]) && + !(_currDbRecs[i]->after(_currQueryRec))) { + if (intersects(_currQueryRec, _currDbRecs[i])) { + retList.push_back(_currDbRecs[i]); + } + if (_currQueryRec->after(_currDbRecs[i])) { + _dbFRMs[i]->deleteRecord(_currDbRecs[i]); + _currDbRecs[i] = NULL; + } else { + _caches[i].push_back(_currDbRecs[i]); + _currDbRecs[i] = NULL; + } + nextRecord(false, i); + } + } + } +} + +bool NewChromSweep::chromChange(int dbIdx, RecordKeyVector &retList) +{ + // the files are on the same chrom + if (_currDbRecs[dbIdx] == NULL || _currQueryRec->sameChrom(_currDbRecs[dbIdx])) { + return false; + } + // the query is ahead of the database. fast-forward the database to catch-up. + if (_currQueryRec->chromAfter(_currDbRecs[dbIdx])) { + + while (_currDbRecs[dbIdx] != NULL && + _currQueryRec->chromAfter(_currDbRecs[dbIdx])) { + _dbFRMs[dbIdx]->deleteRecord(_currDbRecs[dbIdx]); + nextRecord(false, dbIdx); + } + clearCache(dbIdx); + return false; + } + // the database is ahead of the query. + else { + // 1. scan the cache for remaining hits on the query's current chrom. + scanCache(dbIdx, retList); + + return true; + } + + //control can't reach here, but compiler still wants a return statement. + return true; +} + + +bool NewChromSweep::next(RecordKeyVector &retList) { + retList.clearVector(); + if (_currQueryRec != NULL) { + _queryFRM->deleteRecord(_currQueryRec); + } + + if (!nextRecord(true)) return false; // query EOF hit + + + if (allCurrDBrecsNull() && allCachesEmpty() && !_runToQueryEnd) { + return false; + } + _currChromName = _currQueryRec->getChrName(); + + masterScan(retList); + + if (_context->getSortOutput()) { + retList.sortVector(); + } + + retList.setKey(_currQueryRec); + return true; +} + +bool NewChromSweep::nextRecord(bool query, int dbIdx) { + if (query) { + _currQueryRec = _queryFRM->getNextRecord(); + if (_currQueryRec != NULL) { + _queryRecordsTotalLength += (unsigned long)(_currQueryRec->getEndPos() - _currQueryRec->getStartPos()); + _queryTotalRecords++; + return true; + } + return false; + } else { //database + Record *rec = _dbFRMs[dbIdx]->getNextRecord(); + _currDbRecs[dbIdx] = rec; + if (rec != NULL) { + _databaseRecordsTotalLength += (unsigned long)(rec->getEndPos() - rec->getStartPos()); + _databaseTotalRecords++; + return true; + } + return false; + } +} + +bool NewChromSweep::intersects(const Record *rec1, const Record *rec2) const +{ + return rec1->sameChromIntersects(rec2, _context->getSameStrand(), _context->getDiffStrand(), + _context->getOverlapFraction(), _context->getReciprocal()); +} + + +bool NewChromSweep::allCachesEmpty() { + for (int i=0; i < _numDBs; i++) { + if (!_caches[i].empty()) { + return false; + } + } + return true; +} + +bool NewChromSweep::allCurrDBrecsNull() { + for (int i=0; i < _numDBs; i++) { + if (_currDbRecs[i] != NULL) { + return false; + } + } + return true; +} + +bool NewChromSweep::dbFinished(int dbIdx) { + if (_currDbRecs[dbIdx] == NULL && _caches[dbIdx].empty()) { + return true; + } + return false; +} diff --git a/src/utils/NewChromsweep/NewChromsweep.h b/src/utils/NewChromsweep/NewChromsweep.h index 5ed9516d7ed57dae60eb0b45618c673bf74295eb..0f5392f72f9cdeee018025dcfab6a5a3e177646f 100644 --- a/src/utils/NewChromsweep/NewChromsweep.h +++ b/src/utils/NewChromsweep/NewChromsweep.h @@ -51,6 +51,9 @@ public: unsigned long getQueryTotalRecordLength() { return _queryRecordsTotalLength; } unsigned long getDatabaseTotalRecordLength() { return _databaseRecordsTotalLength; } + unsigned long getQueryTotalRecords() { return _queryTotalRecords; } + unsigned long getDatabaseTotalRecords() { return _databaseTotalRecords; } + protected: ContextIntersect *_context; FileRecordMgr *_queryFRM; @@ -63,6 +66,9 @@ protected: unsigned long _databaseRecordsTotalLength; + unsigned long _queryTotalRecords; + unsigned long _databaseTotalRecords; + bool _wasInitialized; // a cache of still active features from the database file @@ -98,4 +104,4 @@ protected: }; -#endif /* NewChromSweep_H */ \ No newline at end of file +#endif /* NewChromSweep_H */ diff --git a/src/utils/bedFile/bedFile.cpp b/src/utils/bedFile/bedFile.cpp index 7c1ae8dc1184fea620542ba4046cf1b2521fcb09..ed8d229d69accabb8fb9c3240ba71a97c1c736dc 100644 --- a/src/utils/bedFile/bedFile.cpp +++ b/src/utils/bedFile/bedFile.cpp @@ -657,12 +657,24 @@ void BedFile::setBed12 (bool isBed12) { this->isBed12 = isBed12; } void BedFile::loadBedFileIntoMap() { BED bedEntry; - + Open(); while (GetNextBed(bedEntry)) { if (_status == BED_VALID) { - BIN bin = getBin(bedEntry.start, bedEntry.end); - bedMap[bedEntry.chrom][bin].push_back(bedEntry); + addBEDIntoMap(bedEntry); + } + } + Close(); +} + +void BedFile::loadBedFileIntoMergedMap() { + + BED bedEntry; + + Open(); + while (GetNextMergedBed(bedEntry)) { + if (_status == BED_VALID) { + addBEDIntoMap(bedEntry); } } Close(); diff --git a/src/utils/bedFile/bedFile.h b/src/utils/bedFile/bedFile.h index 440986c0007fba7607d2106e58e7643f0955dccb..a738c21ecc686c7cc80accf773b526d41f448dfe 100644 --- a/src/utils/bedFile/bedFile.h +++ b/src/utils/bedFile/bedFile.h @@ -418,6 +418,7 @@ public: // load a BED file into a map keyed by chrom, then bin. value is // vector of BEDs void loadBedFileIntoMap(); + void loadBedFileIntoMergedMap(); // load a BED entry into and existing map void addBEDIntoMap(BED bedEntry); diff --git a/test/fisher/README.md b/test/fisher/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d0179ca956b1b6b5ca21a7191a96da78d8e0500d --- /dev/null +++ b/test/fisher/README.md @@ -0,0 +1,20 @@ +Fisher Testing +============== + +Fisher is now based on the count of interval overlaps, subject to `-f`. +We can compare the output of fisher on simulated data by running `python sim.py` +which will show the output from `bedtools fisher` and then running `bash shuf.sh` +which will repeatedly run + +```Shell + +bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l +``` + +and then report the proportion of times that number is >= the observed intersection. + +In `sim.py` changing the lenght of the intervals (via `maxA`, `maxB`) has the greatest effect on the +correspondence of the simulated p-value from `shuf.sh` and the one from `fisher`. The right-tailed p-value +from `fisher` should correspond well to the value from the simulation. + + diff --git a/test/fisher/cmp.sh b/test/fisher/cmp.sh new file mode 100644 index 0000000000000000000000000000000000000000..5683e42748199b04a6dd43eeb4ad0d35cd69cf76 --- /dev/null +++ b/test/fisher/cmp.sh @@ -0,0 +1,9 @@ +set -eo pipefail + +echo "fisher,shuffled" + +for i in $(seq 1000); do + fisher=$(python ./sim.py | tail -1 | cut -f 2) + shuffle=$(bash shuf.sh) + echo "$fisher,$shuffle" +done diff --git a/test/fisher/plot.py b/test/fisher/plot.py new file mode 100644 index 0000000000000000000000000000000000000000..6d87a0289f41b18c89a1fb7a899544124d72e5cb --- /dev/null +++ b/test/fisher/plot.py @@ -0,0 +1,46 @@ +import sys + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +import seaborn as sns + +sns.set(style="darkgrid") + +fig, axs = plt.subplots(3, figsize=(4, 12)) +df = pd.read_csv(sys.argv[1]) + +axs[0].scatter(df.fisher, df.shuffled, s=4) +axs[0].set_xlim(0, 1) +axs[0].set_ylim(0, 1) +axs[0].set_xlabel('fisher p-value') +axs[0].set_ylabel('shuffled p-value') +axs[0].plot([0, 1], [0, 1], ls='--') + +x = -np.log10(df.fisher) +y = -np.log10(df.shuffled) + +m = int(max(x.max(), y.max())) + 1 +axs[1].scatter(x, y, s=4) +axs[1].set_xlim(0, m) +axs[1].set_ylim(0, m) +axs[1].set_xlabel('-log10(fisher p-value)') +axs[1].set_ylabel('-log10(shuffled p-value)') +axs[1].plot([0, m], [0, m], ls='--') + + +x = -np.log10(1 - np.minimum(1-1e-6, df.fisher)) +y = -np.log10(1 - np.minimum(1-1e-6, df.shuffled)) + +m = int(max(x.max(), y.max())) + 1 +axs[2].scatter(x, y, s=4) +axs[2].set_xlim(0, m) +axs[2].set_ylim(0, m) +axs[2].set_xlabel('-log10(1 - fisher p-value)') +axs[2].set_ylabel('-log10(1 - shuffled p-value)') +axs[2].plot([0, m], [0, m], ls='--') + +plt.tight_layout() +plt.savefig(sys.argv[1].replace('.txt', '') + '.png') + +fig.show() diff --git a/test/fisher/shuf.sh b/test/fisher/shuf.sh new file mode 100644 index 0000000000000000000000000000000000000000..3c3bd102797a57d8a0c439b4e23abf4dedae3ffe --- /dev/null +++ b/test/fisher/shuf.sh @@ -0,0 +1,14 @@ +set -eo pipefail + +obs=$(bedtools intersect -wo -a taa.bed -b tbb.bed | wc -l) + +p=$(seq 100 | xargs -P 7 -n 1 bash -c "bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l " | awk -vobs=$obs '{s += ($1 > obs)}END{print (1 + s) / (1 + NR)}') + +if [ '1' -eq $(echo $p'< 0.1' | bc -l) ] || [ '1' -eq $(echo $p'> 0.9' | bc -l) ]; then + p=$(seq 1000 | xargs -P 7 -n 1 bash -c "bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l " | awk -vobs=$obs '{s += ($1 > obs)}END{print (1 + s) / (1 + NR)}') +fi +if [ '1' -eq $(echo $p'< 0.01' | bc -l) ] || [ '1' -eq $(echo $p'> 0.99' | bc -l) ]; then + p=$(seq 5000 | xargs -P 7 -n 1 bash -c "bedtools intersect -wo -a taa.bed -b <(bedtools shuffle -allowBeyondChromEnd -i tbb.bed -g tgg.genome) | wc -l " | awk -vobs=$obs '{s += ($1 > obs)}END{print (1 + s) / (1 + NR)}') +fi + +echo $p diff --git a/test/fisher/sim.py b/test/fisher/sim.py new file mode 100644 index 0000000000000000000000000000000000000000..6d26d3a74f5907ef173259185beb92870ce27795 --- /dev/null +++ b/test/fisher/sim.py @@ -0,0 +1,31 @@ +import sys +from random import randint +from subprocess import check_output + +nA = 3000 +minA, maxA = (20, 5250) +#minA, maxA = (200, 250) + +bIntervals = [(x[0], int(x[1]), int(x[2])) for x in (l.split("\t") for l in + open('hg19.knownCanonical.bed')) if x[0] == "chr1" ] +bIntervals.sort() +genome_size = max(b[2] for b in bIntervals) + 50000 + +with open('tbb.bed', 'w') as fh: + for chrom, start, end in bIntervals: + fh.write("\t".join((chrom, str(start), str(end))) + "\n") + +with open('taa.bed', 'w') as fh: + vals = [] + for i in range(nA): + s = randint(0, genome_size - maxA) + e = randint(s + minA, min(genome_size, s + maxA)) + vals.append((s, e)) + for s, e in sorted(vals): + fh.write("chr1\t%i\t%i\n" % (s, e)) + fh.flush() + +print >> open('tgg.genome', 'w'), ("chr1\t%i" % genome_size) + +# NOTE: add -m here to make merged output +print check_output("../../bin/bedtools fisher -a taa.bed -b tbb.bed -g tgg.genome", shell=True).strip()