/usr/include/pbseq/alignment/suffixarray/ssort.hpp is in libblasr-dev 0~20161219-2.
This file is owned by root:root, with mode 0o644.
The actual contents of the file can be viewed below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 | /****************************************************************
Copyright (C) Lucent Technologies 1997
All Rights Reserved
Permission to use, copy, modify, and distribute this software and
its documentation for any purpose and without fee is hereby
granted, provided that the above copyright notice appear in all
copies and that both that the copyright notice and this
permission notice and warranty disclaimer appear in supporting
documentation, and that the name Lucent Technologies or any of
its entities not be used in advertising or publicity pertaining
to distribution of the software without specific, written prior
permission.
LUCENT DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS.
IN NO EVENT SHALL LUCENT OR ANY OF ITS ENTITIES BE LIABLE FOR ANY
SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER
IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
THIS SOFTWARE.
****************************************************************/
/*
Suffix sort
Peter M. McIlroy
M. Douglas McIlroy
Prototype
int ssort(int a[], int s[]);
Purpose
Return in a[] a suffix array for the original
contents of a[]. (The original values in a[]
are typically serial numbers of distinct tokens
in some list.) Optionally return in s[] an
array of shared lengths between adjacent sorted
suffixes.
Precondition
Array a[] holds n values, with n>=1. Exactly k
distinct values, in the range 0..k-1, are present.
Value 0, an endmark, appears exactly once, at a[n-1].
Pointer s is 0 or points to an array of n elements.
Postcondition
Array a[] is a copy of the internal array p[]
that records the sorting permutation: if i<j
then the original suffix a[p[i]..n-1] is
lexicographically less than a[p[j]..n-1].
If s is nonzero, its elements are filled in with
"shared lengths": s[i] is the length of the agreeing
prefixes of a[p[i-1]..n-1] and a[p[i]..n-1].
s[0] is 0.
Return value
0 success
1 out of space; data unchanged
2 bad data; data may be changed
Cost
Sorting takes time O(n log m), where m is the longest
duplicate (possibly overlapped) substring, and uses n words
of temporary memory. There are at most log n passes,
each involving essentially 5 serial and 6 random
traversals of size-n arrays (a[] and p[]). Shared
lengths cost an extra n words of temporary memory;
for time cost see comment with function shared() below.
A tradeoff desribed after step (3) below saves
about 20% in time at the cost of another n words.
Terminology
"The h-gram at a[i]" is the string a[i], a[i+1], ...,
a[i+h-1], with subscripts taken mod n, except that
when h=0 the "0-gram" at a[i] is a[i]. (More succinctly,
the h-gram at a[i] is a[i], ..., a[i+max(h,1)-1].)
"The h-successor of the h-gram at a[i]" is the h-gram
at a[i+h].
Method
Order 2h-grams by doing steps (1)-(5) below
for h = 0, 1, 2, 4, ... Because of the unique endmark,
ordering n-grams (or longer grams) is exactly the same as
ordering the suffixes of a[].
This is a fillip on radix sort. The i-th stage of a
traditional radix sort distributes on the i-th "character"
counting from the right of the keys, thus arranging the keys
so that their i-character suffixes are in lexicographic order.
Here, instead, we distribute on h-gram prefixes of 2h-grams,
where h = 2^i, thus arranging 2h-grams in order. Each
h-gram is coded into one element of a[].
This is possible within a constant word width because
there are only n h-grams for any h. With 2^i-grams being
placed in order at stage i, and each stage taking time O(n),
sorting is complete after log n stages for a total
time of O(n log n).
The method for shared lengths is given with the
shared-length code far below.
History
The general radix-sorting notion is due to Manber and
Myers, SODA '91, who used a big-endian radix sort. They
also made the observation that by cleverly multiplexing
uses of the two arrays, one can get by with O(1) extra
space beyond arrays a[] and p[]. The circular-array idea,
which greatly simplifies the program, is due to P. McIlroy.
Steps (1) and (2) below are based on M&M's method for
the first cycle (h=0); they do something entirely different
for the h-gram doubling cycles and develop in a[] the
inverse of permutation p[].
The state of the art has improved since this code was
written. A variant developed independently by (at least)
Larssen at Lund, Sadakane at Tokyo and Quinlan at Bell Labs
is several times as fast; see Larssen and Sadakane
LU-CS-TR:99-214, Dept. of CS, Lund Univ. Sweden.
Data used in the sort
a[] array of h-gram codes.
al[] linked-list area, overlaid on a[].
p[] permutation.
pl[] list heads, overlaid on p[].
s[] array of shared lengths (optional).
q[] tree of bucket refinements, for calculating s[].
ORIG mark at the end of lists, originally on p[].
BUCK mark at start of "buckets", stretches of p[]
that point to identical codes in a[].
Were the overlaid arrays pulled apart, the algorithm would
simplify only slightly, mainly by dropping the inner loop from
steps (1) and (2).
Description of steps, except for optional shared lengths
(0) (0a) Initialize p[] to contain the identity permutation.
(0b) Place mark ORIG on every element of p[].
State before step 1
Each element a[i] encodes the h-gram at a[i].
The h-grams are coded 0,..,k-1 in increasing
lexicographic order. When h>0, permutation p[]
lists h-grams in lexicographic order: if i<j then
a[p[i]] <= a[p[j]]. When h=0, p[] contains
the input order 0,..,n-1.
(1) Construct in al[] linked lists of like-valued
codes in a[] ordered in reverse of the order
of their h-successors as given by p[].
(Done by iterating over elements of a[] in p[]-order
and operating on their h-predecessors.)
Place in pl[j] the head of the list of elements with
value j. Thus, if a[7]=a[3]=a[9]=2, and these
elements were visited in this order according to p[],
then pl[2]=9, al[9]=3, al[3]=7. The list is developed
by pushing previous contents of pl[j] onto the list.
In particular the original content of pl[j] (which
was p[j]) gets pushed to the end of the list (al[7]
in the example), still bearing the mark ORIG. The mark
serves both to note the end of the list and to flag the
fact that this element contains a quantity of a different
kind.
During step (1), a[] contains three kinds of data:
unprocessed elements of a[], list links of al[]
and list ends, marked ORIG, that contain original
data from p[]. p[] contains two kinds of data:
elements of p[], marked ORIG, and elements of pl[].
The elements of p[] that have been displaced by pl[]
are retrieved by chasing the lists; this is done by
a for loop. Each list will be chased at most once in
step (1) and once in step (2) and the total size of
all lists will never exceed n, so the overhead of list
chasing is at most linear in n.
State before step (2)
Array a[] has been wiped out and replaced by
the lists of al[], which together with pl[] give
exactly the same information. The first k elements
of p[] are occupied by pl[] and the remaining elements
are occupied by ORIG-marked elements of p[].
(2) Make p[] point to the (former) elements of a[]
in order of increasing value. This is done in reverse:
biggest values first, and for each value running out the
list created in (1), which is already in reverse order.
New values fill in the size-n array p[] from the top,
while the size-k array pl[] of list heads shrinks
toward the bottom. Because no list is empty, the
two uses of p[] cannot collide.
Place mark BUCK at each bucket start, namely the
element of p[] last transferred from each list.
State before step 3
Permutation p[] lists h-grams in lexicographic order.
It is bucketed by h-grams. Within each bucket h-grams
are ordered by their h-successors.
(3) Reconstruct codes in a[] from the buckets. The
element of a[] pointed to from the first bucket is 0;
the elements pointed to from the second bucket are 1;
and so on.
This step would be unnecessary if al[] did not share
storage with a[]. Then steps (2) and (4) could
be combined.
State before step 4
Permutation p[] is as before step 3, and code array
a[] is as before step 1.
(4) Refine buckets by values of h-successors in a[] by
placing a BUCK mark on each p[i] where the h-successors
of a[p[i]] and a[p[i-1]] differ.
State before step 5
Array p[] holds an ordered, bucketed list of 2h-grams.
(5) Recode a[] according to the new buckets, exactly as
a[] was reconstructed in step (3). Count the
buckets to determine a new value of k. If k==n, every
bucket is a singleton and sorting is complete.
Bad input
Code that invokes finish() is solely defensive. It
could be dropped if you like to live dangerously.
*/
#include <stdlib.h>
#include "../../pbdata/utils.hpp"
enum {
ORIG = ~(~0u>>1), /* sign bit */
BUCK = ~(~0u>>1)
};
void shared(SAIndex a[], SAIndex n, SAIndex p[], SAIndex q[], SAIndex s[], int h);
//#define pred(i, h) (((t=(i)-(h))<0) ? (t+n): (t))
#define pred(i, h) ( ((i)<(h)) ? ((i)-(h)+(n)) : ((i)-(h)))
#define succ(i, h) ( ((t=(i)+(h))>=n) ? (t-n): (t))
inline int
ssort(SAIndex a[], SAIndex s[])
{
SAIndex h, i, j, l, n;
SAIndex k = 0; /* initialized for lint */
SAIndex *p = 0;
int result = 0;
SAIndex *q = 0;
# define al a
# define pl p
//# define finish(r) if(1){result=r; goto out;}else
# define finish(r) do {result=r; goto out;} while(0)
for(j=n=0; a[n]>0; n++) /* find n */
if(a[n] > j)
j = a[n]; /* and max element */
//if(a[n++]<0 || j>=n)
n++;
if (j>= n)
finish(2);
p = ProtectedNew<SAIndex>(n);
if(p == nullptr)
finish(1);
for(i=0; i<n; i++) /* (0) initialize */
p[i] = i | ORIG;
if(s) { /* shared lengths */
// q = malloc(n*sizeof(int));
q = ProtectedNew<SAIndex>(n);
if(q == nullptr)
finish(1);
}
for(h = 0; ; h = h==0? 1: 2*h) {
for(i=0; i<n; i++) { /* (1) link */
for(j=p[i]; !(j&ORIG); j=al[j]);
j = pred(j&~ORIG, h);
l = a[j];
al[j] = pl[l];
pl[l] = j;
}
if(h == 0) { /* find k */
for(k=0; k<n; k++)
if(pl[k]&ORIG)
break;
for(i=k; i<n; i++) /* check input */
if(!(pl[i]&ORIG))
finish(2);
}
abort(); // else infinite loop! probably never called. ~cdunn 2016
#ifdef DEAD_CODE_20160111
// Compiler would warn here (because of the infinite loop):
for(i=n; --k>=0; ) { /* (2) order */
j = pl[k];
do
p[--i] = j;
while(!((j=al[j]) & ORIG));
p[i] |= BUCK;
}
for(i=0; i<n; i++) { /* (3) reconstruct */
if(p[i] & BUCK)
k++;
a[p[i]&~BUCK] = k;
}
for(i=0, j=-1; i<n; i++, j=l) { /* (4) refine */
l = a[succ(p[i]&~BUCK, h)];
if(l != j)
p[i] |= BUCK;
}
if(s)
shared(a, n, p, q, s, h);
for(i=0, k=-1; i<n; i++) { /* (5) recode */
if(p[i] & BUCK)
k++;
a[p[i]&~BUCK] = k;
p[i] |= ORIG; /* (0b) reinitialize */
}
if(++k >= n)
break;
#endif // DEAD_CODE
}
for(i=0; i<n; i++)
a[i] = p[i] & ~ORIG;
out:
delete [] p;
delete [] q;
return result;
}
/* shareda.c: shared lengths for suffix arrays by ancestor tree
a[] temporary space, used for pointers to containing buckets
n, p[], s[], h as in ssort()
q[] tree of bucket heads
Array q holds a tree of bucket heads. Initially q[0]
is 0 and q[i] = -1 for i>0. For each new bucket i
created in the current pass, q[i] is (notionally)
made to point to the previous bucket, j. The shared
length between the i-1st and i-th smallest suffix
is stored in s[i]. The shared length between two
arbitrary buckets is the minimum of the stored
shared lengths on the paths to their nearest common
ancestor (excluding that ancestor).
For efficient search, paths are compressed by making
each bucket point to the next bucket on the path to
the root with a strictly smaller stored shared length.
This may change the identity of the nearest common
ancestor, but does not affect shared lengths.
Cost
There are 3 serial passes and 1 random pass over
size-n arrays per call, plus the inner loops of steps
(2) and (3), each of which is executed n times fpr
each call of ssort(). The asymptotic worst case has
not been analyzed, however the code has always been
seen to be at least as fast as a guaranteed O(n log n)
method that takes twice as much code. (That method
keeps in element i of an integer array the minimum
value of s[] over the largest aligned power-of-2
interval beginning at i.)
(0) On the first pass, with h=0, the shared length between
buckets is zero, and every bucket bcomes a child of the
root node, 0.
(1) On subsequent passes the old bucket that each potential
new bucket refines is remembered in a[]. Array a[]
is ordered by substring places, while q[] is ordered
by substring values.
If the values placed in a[] were kept in another array,
the loops of shared() could be combined with loops of
lssort().
(2) When an h-gram bucket is split, the shared lengths between
adjacent subbuckets is h+u, where u is the shared length
between the h-successors that differed (causing the split).
To find u, we locate the nearest common ancestor.
The common ancestor is found by searching the path from
the higher-numbered bucket, k1, until reaching a bucket
whose number does not exceed that of the lower-numbered
bucket, k0. This method is justified by two observations.
All the new subbuckets have strictly longer shared lengths
than all the old buckets. All the subbuckets of a
bucket fall between the bucket head and its next higher
numbered sibling.
(3) After new buckets have been processed, the tree q is
updated. New buckets are recognized by having mark
BUCK and q[i]<0.
*/
inline void
shared(SAIndex a[], SAIndex n, SAIndex p[], SAIndex q[], SAIndex s[], int h)
{
SAIndex i, j, k0, k1, t, u;
if(h == 0) { /* (0) initialize */
for(i=0; i<n; i++)
if(p[i]&BUCK)
q[i] = s[i] = 0;
else
q[i] = -1;
return;
}
for(i=j=0; i<n; i++) { /* (1) get parents */
//if(q[i] >= 0)
if (true)
j = i;
a[p[i]&~BUCK] = j;
}
for(i=0; i<n; i++) { /* (2) find lengths */
if(!(p[i]&BUCK))
continue;
k0 = j; /* k0=garbage if i=0 */
k1 = j = a[succ(p[i]&~BUCK, h)];
//if(q[i] >= 0)
if (true)
continue;
for(u=n; k1>k0; k1=q[k1])
if(s[k1] < u)
u = s[k1];
s[i] = h + u;
}
for(i=0; i<n; i++) /* (3) update tree */
if(p[i]&BUCK) {
//if(q[i] < 0)
if (false)
for(t=j; ;t=q[t]) {
q[i] = t;
if(s[i] > s[t])
break;
}
j = i;
}
}
|