Searches of DNA sequence databases that can take biologists and medical researchers days can now be completed in minutes, thanks to a new search method developed by computer scientists at Carnegie Mellon University.
The method, developed by Carl Kingsford, associate professor of computational biology, and Brad Solomon, a Ph.D. student in the Computational Biology Department, is designed for searching so-called “short reads”: DNA and RNA sequences generated by high-throughput sequencing techniques. It relies on a new indexing data structure, the Sequence Bloom Tree (SBT), that the researchers describe in a report published online today by the journal Nature Biotechnology.
The National Institutes of Health maintains a huge database, called the Sequence Read Archive, that contains about three petabases, or sequences totaling three quadrillion base-pairs. The information is useful to a wide swath of researchers, from those asking questions about basic biological processes to those studying potential cancer cures.
“The database contains untold numbers of as-yet undiscovered insights and is heavily used,” Kingsford said. “Its main problem is that it’s very difficult to search.”