From e5c8310a630131adc38c0c8980b2313ecfce57cd Mon Sep 17 00:00:00 2001
From: arq5x <arq5x@virginia.edu>
Date: Fri, 13 Dec 2013 11:31:51 -0500
Subject: [PATCH] [DOC] tweak intersect docs

---
 docs/content/tools/intersect.rst | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/docs/content/tools/intersect.rst b/docs/content/tools/intersect.rst
index 4db9d7ce..51f03748 100755
--- a/docs/content/tools/intersect.rst
+++ b/docs/content/tools/intersect.rst
@@ -543,13 +543,23 @@ start position. When both input files are position-sorted, the algorithm can
 like the way database systems join two tables. This option is invoked with the
 ``-sorted`` option.
 
+.. note::
+
+    By default, the ``-sorted`` option requires that the records are **GROUPED**
+    by chromosome and that within each chromosome group, the records are sorted by
+    chromosome position. One way to achieve this (for BED files for example) is to use
+    the UNIX sort utility to sort both files by chromosome and then by position.
+    That is, ``sort -k1,1 -k2,2n in.bed > in.sorted.bed``. However, since we merely
+    require that the chromosomes are grouped (that is, all records for a given chromosome
+    come in a single block in the file), sorting criteria other than the alphanumeric
+    criterion that is used by the ``sort`` utility are fine. For example, you could use
+    the "version sort" (``-V``) option in newer versions of GNU sort to make the chromosomes
+    come in this (chr1, chr2, chr3) order instead of this (chr1, chr10, chr11) order.
+
+
 For example:
 
 .. code-block:: bash
-
-  $ sort -k1,1 -k2,2n big.bed > big.sorted.bed
-
-  $ sort -k1,1 -k2,2n huge.bed > huge.sorted.bed
 
   $ bedtools intersect -a big.sorted.bed -b huge.sorted.bed -sorted
 
@@ -557,14 +567,17 @@ For example:
 ==========================================================================
 ``-g`` Define an alternate chromosome sort order via a genome file.
 ==========================================================================
-By default, the ``-sorted`` option expects that the input files are sorted
-alphanumerically by chromosome. However, there arise cases where ones input
+As described above, the ``-sorted`` option expects that the input files are grouped
+by chromosome. However, there arise cases where one's input
 files are sorted by a different criteria and it is to computationally
 onerous to resort the files alphanumerically. For example, the GATK expects that
 BAM files are sorted in a very specific manner. The ``-g`` option allows
 one to specify an exact ording that should be expected in the input (e.g.,
 BAM, BED, etc.) files. All you need to do is re-order you genome file to
-specify the order.
+specify the order. Also, the use of a genome file to specify the expected
+order allows the ``intersect`` tool to detect when two files are internally
+grouped but each file actually follows a different order. Such a mismatch causes
+incorrect results, and the ``-g`` file will alert you to the problem.
 
 For example, an alphanumerically ordered genome file would look like the following:
 
-- 
GitLab