Generalized Suffix Trees
A Suffix Tree (ST) is a tree-like data structure for storing
a string. Suffix trees can be used to solve the exact string matching problem
in linear time, achieving about the same worst-case bound as the
Knuth-Morris-Pratt
and the Boyer-Moore
algorithms.
For a string S of length m, a substring T of length n can be matched in O(n) time by first constructing a suffix tree for S. This preprocessing will take linear O(m) time. The O(m) preprocessing time and O(n) search time makes suffix trees favorable with large strings or texts, since the search time depends on the length of T rather than on the length of S.
Definition ( suffix tree ). A suffix tree T for a string S (with n = |S|) is a rooted, labeled tree with a leaf for each non-empty suffix of S. Furthermore, a suffix tree satisfies the following properties:
- Each internal node, other than the root, has at least two children;
- Each edge leaving a particular node is labeled with a non-empty substring of S of which the first symbol is unique among all first symbols of the edge labels of the edges leaving this particular node;
- For any leaf in the tree, the concatenation of the edge labels on the path from the root to this leaf exactly spells out a non-empty suffix of s.
A Generalized Suffix Tree (GST) is a ST that represents the suffixes of a set {S1, ...., Sn} of strings. A few important applications of a GST are
- Finding the Longest Common Substring (LCS) of two strings;
- Solving the All-Pairs Suffix-Prefix Matching problem (APPSMP);
- Solving the k-mismatch problem;
In my Master's thesis, I use a GST to find regularity in corpora of natural language sentences using the ABL (Alignment-Based Learning) framework. The use of a suffix tree allows new alignment possibilities and allows alignment in time linear in the number of sentences in the corpus.
For any questions or comments, please do not hesitate to contact me.
References
- Boyer, R. S. and Moore, J. S. [1977].
A fast string search algorithm.
Communications of the Association for Computing Machinery, 20(10), 762-772.
[ PDF ] - Gusfield, D. [1997].
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology.
Cambridge, UK: Cambridge University Press. - Knuth, D. E., Morris, J. H. and Pratt, V. R. [1977].
Fast pattern matching in strings.
Society for Industrial and Applied Mathematics Journal on Computing, 6, 323-350.
[ PDF ] - McCreight, E. M. [1976].
A space-economical suffix tree construction algorithm.
Journal of the Association for Computing Machinery, 23(2), 262-272.
[ URL
]
- Ukkonen, E. [1995].
On-line construction of suffix trees.
Algorithmica, 14(3), 249-260.
