Information Technology & Communication
TRELLIS: An Efficient Algorithm . . .
TRELLIS: An Efficient Algorithm for Constructing Genome-Scale Disk-Based Suffix Trees
Given that sequence databases (such as GenBank) have already crossed a total of 100 gigabases from over 165,000 organisms, it has become critical to develop effective techniques for storing, querying, and analyzing these massive sequence data. A suffix tree is a versatile data structure that can be used to solve a variety of sequence-based problems, and it can be built in linear time and space, provided the resulting tree can reside in main-memory. To index large sequence databases, several disk-based suffix tree algorithms have been proposed in recent years. Some of these methods suffer from the data-skew problem, resulting in suffix tree partitions of non-uniform size, whereas others adopt a quadratic time construction algorithm, discarding the suffix link information, which can result in slower querying times. We propose a novel, linear time, disk-based suffix tree algorithm called Trellis, which does not suffer from the skew problem and maintains suffix links. We compare Trellis with competing methods and show that it outperforms previous methods in the construction time or query time by over an order of magnitude or more. Furthermore, Trellis is able to index the entire Human Genome (~3 gigabases) in about 48 hours using under 1GB of memory.
Benjarath Phoophakdee
Computer Science Department
Rensselaer Polytechnic Institute