Personal tools
สารบัญ
Thaischolar Email Log in
แจ้งจุดบกพร่องของเว็บไซต์
 
Document Actions

TRELLIS: An Efficient Algorithm for Constructing Genome-Scale Disk-Based Suffix Trees

by OEA last modified 2006-04-03 16:34

BENJARATH PHOOPHAKDEE, Computer Science Department Rensselaer Polytechnic Institute


ABSTRACT

Given that sequence databases (such as GenBank) have already crossed a total of 100 gigabases from over 165,000 organisms, it has become critical to develop effective techniques for storing, querying, and analyzing these massive sequence data. A suffix tree is a versatile data structure that can be used to solve a variety of sequence-based problems, and it can be built in linear time and space, provided the resulting tree can reside in main-memory. To index large sequence databases, several disk-based suffix tree algorithms have been proposed in recent years. Some of these methods suffer from the data-skew problem, resulting in suffix tree partitions of non-uniform size, whereas others adopt a quadratic time construction algorithm, discarding the suffix link information, which can result in slower querying times. We propose a novel, linear time, disk-based suffix tree algorithm called Trellis, which does not suffer from the skew problem and maintains suffix links. We compare Trellis with competing methods and show that it outperforms previous methods in the construction time or query time by over an order of magnitude or more. Furthermore, Trellis is able to index the entire Human Genome (~3 gigabases) in about 48 hours using under 1GB of memory.


สงวนลิขสิทธิ์ Copyright, All rights reserved. © 2000-2008 by Office of Educational Affairs

1906 23rd Street, N.W., Washington D.C. 20008 Tel. (202)667-8010 Fax. (202)265-7239