Skip to content
Snippets Groups Projects
Code owners
Assign users and groups as approvers for specific file changes. Learn more.
fisher.rst 4.72 KiB

fisher

Perform fisher's exact test on the non/overlap between 2 files.


Traditionally, in order to test whether 2 sets of intervals are related spatially, we resort to shuffling the genome and checking the simulated (shuffled) versus the observed. We can do the same analytically for many scenarios using Fisher's Exact Test .

Given a pair of input files -a and -b in the usual BedTools parlance:

$ cat a.bed
chr1  10  20
chr1  30  40

$ cat b.bed
chr1  15   25

And a genome of 500 bases:

$ echo -e "chr1\t500" > t.genome

We may wish to know if the amount of overlap between the 2 sets of intervals is more than we would expect given their coverage and the size of the genome. We can do this with fisher as:

$ bedtools fisher -a a.bed -b b.bed -g t.genome
# Contingency Table
#_________________________________________
#           | not in -b    | in -b        |
# not in -a | 475          | 5           |
#     in -a | 15           | 5            |
#_________________________________________
# p-values for fisher's exact test
left    right   two-tail    ratio
1.00000 0.00001 0.00001 31.667

Where we can see the constructed contingency table and the pvalues for left, right and two-tail tests. From here, we can say that given 500 bases of genome, it is unlikely that a region of 20 bases total from -a and 10 bases total from -b would share 5 bases if the regions were randomly distributed. Consult your statistician for a more precise definition.

The above was highly significant, but if our genome were only 50 bases:

$ echo -e "chr1\t50" > t.genome
$ bedtools fisher -a a.bed -b b.bed -g t.genome
# Contingency Table
#_________________________________________
#           | not in -b    | in -b        |
# not in -a | 25           | 15           |
#     in -a | 5            | 5            |
#_________________________________________
# p-values for fisher's exact test
left    right   two-tail    ratio
0.86011 0.35497 0.49401 1.667

We can see that neither tail is significant. Intuitively, this makes sense; if we randomly place 20 (from -a), and 10 (from -b) bases of intervals within 50 bases, it doesn't seem unlikely that we'd see 5 bases of overlap.

Note

The fisher tool requires that your data is pre-sorted by chromosome and then by start position (e.g., sort -k1,1 -k2,2n in.bed > in.sorted.bed for BED files).

This uses Heng Li's implementation of Fisher's exact test in kfunc.c.

Usage and option summary

Usage:

bedtools fisher [OPTIONS] -a <BED/GFF/VCF> -b <BED/GFF/VCF> -g <genome>
Option Description
-a BED/GFF/VCF file A. Each feature in A is compared to B in search of overlaps. Use "stdin" if passing A with a UNIX pipe.
-b BED/GFF/VCF file B. Use "stdin" if passing B with a UNIX pipe.
-g genome file listing chromosome size.
-f Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
-r Require that the fraction of overlap be reciprocal for A and B. In other words, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B.
-s Force "strandedness". That is, only report hits in B that overlap A on the same strand. By default, overlaps are reported without respect to strand.
-S Require different strandedness. That is, only report hits in B that overlap A on the _opposite_ strand. By default, overlaps are reported without respect to strand.
-split Treat "split" BAM (i.e., having an "N" CIGAR operation) or BED12 entries as distinct BED intervals.