Skip to content

General Description of DAware

Christoph Diegelmann edited this page May 16, 2017 · 9 revisions

Purpose of this page is to give you an overview on how Linear Time Suffix Sorting By Depth Awareness works.

We define 2 types of groups:

  • F for suffixes which are sorted until their next smaller (finished / end is flagged)
  • S for suffixes i smaller than suffix i + depth (need to be sorted)

And 3 types of temporary partitions:

  • L for suffixes i larger than suffix i + depth
  • T for suffixes with tandem repeats (lead to their own outer group)
  • U for suffixes whose next char** has not been tested yet (unkown)

Pseudo Code:

Initially all groups are by definition type U
For each group:
  Partition into type L, T and S by their next char**
    Inducing the type T part yields only additional type L and S parts
    Flag type L and possibly unique S parts to make them type F
    Rename*
    Loop starting at the right most S part:
      If type F:
        Break out of the loop if it is the left most
        Skip to the next lower S part
      If type S:
        Sort into groups ONLY by their next char** [This is the O(n * log(n)) part]
        Flag all unique groups as type F as they are now sorted until $
        For the rest:
          Rename*
          Partition each group into type L, T and S by their next char**
            Inducing the type T part yields only additional type L and S parts
            Flag type L and possibly unique S parts to make them type F
            Rename*
Sort each group from left to right into their final order [This is also O(n * log(n))]

Notes:

  • We can savely flag the ends if we use improved two stage because it has a minimum reduction rate of 2 giving us a free bit.
  • O(n * log(n)) is due to the use of introsort rather than a linear time integer sort.
  • Let b be the size of a group, a be the suffixes bigger than the current, k the alphabet size and s the depth of a group:
    • If we combine radix sort [O(b * log(a) / log(b))] and introsort [O(b * log(b))] we can get a worst case of O(n * sqrt(log(n)).
  • Inducing mean scan the L/S and check if an element of the current group is before it. It is similar to induce copying but the order isn't final. It's just used to group repeats by their length.