SDSC and UC San Diego Announce Web Engine for Comparing Protein and DNA Sequences

Published November 17, 1998

SAN DIEGO, CALIF. - Researchers at the University of California, San Diego (UCSD), and the San Diego Supercomputer Center (SDSC) announced the availability of a powerful, Web-accessible computational tool, called Meta-MEME, that will help biologists detect shared features and evolutionary relationships among the growing stream of protein and DNA sequence data being produced by the Human Genome Project and related sequencing efforts.

"Meta-MEME detects these family relationships by analyzing evolutionary fingerprints in the sequences with methods more common in speech recognition software," said William Grundy of UCSD. In a sense, Meta-MEME listens to the echoes of evolution, spoken in the language of protein sequences.

Molecular biologists worldwide can tap transparently into the computational power at SDSC by using the Meta-MEME software on the Web (http://metameme.sdsc.edu/). Created by Grundy and Charles Elkan in the Computer Science and Engineering department of UCSD's Irwin and Joan Jacobs School of Engineering and Timothy Bailey at SDSC, Meta-MEME compares families of evolutionarily related DNA or protein sequences using a Sun Microsystems Enterprise Server 10000 at SDSC. The Meta-MEME project is funded by the National Biomedical Computation Resource of SDSC, UCSD, and The Scripps Research Institute.

A biologist begins by submitting a family of similar DNA or protein sequences for analysis. After that, the entire Meta-MEME process is automatic. The biologist submits sequences and the MEME analysis via the Meta-MEME Web site, and up to four sets of results are e-mailed to the user: the statistical model, alignments showing where common features appear in the sequences, an alignment showing how the sequences are related to one another, and the results of searching a large sequence database using the model.

The statistical models and analyses produced by Meta-MEME can help biologists infer evolutionary family trees, uncover previously unrecognized relationships between species, or develop experiments to determine a protein's function.

EVOLUTIONARY FINGERPRINTS

By comparing corresponding proteins or stretches of DNA from different species, Meta-MEME can detect extremely subtle evolutionary relationships that might be missed by less sophisticated methods. Biologists may also discover previously unknown relationships by using Meta-MEME models to search publicly available databases of unannotated genetic data, which might turn up distant evolutionary relatives of the genetic sequence.

"If a gene that causes cancer in mice, for example, were shown to share a common ancestor with a human gene, this would strongly implicate the human gene as a cancer agent," Bailey said. "However, because the amino acid sequence of the common ancestor is almost never available, this common ancestor can only be inferred rather than proven."

Meta-MEME applies the power of probabilistic reasoning to the task of recognizing ancestral relationships among biological sequences. Although not the first software system to apply such methods to biological sequence modeling, Meta-MEME is the only such system to focus on evolutionary fingerprints, called motifs.

In this context, a motif is a short "word" in the code for a protein or DNA that appears in a similar form in all or most of the members of a given sequence family. The appearance of a motif in distantly related sequences implies that, over thousands or even millions of years, this particular region of the ancestral sequence has remained relatively unchanged. Such consistency nearly always means that the motif is required for the protein to function properly. Often the only evidence from which to infer a common ancestral relationship lies in a handful of these small motifs.
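To make the idea of a motif concrete, the following is a minimal Python sketch, not Meta-MEME's actual implementation. It assumes a handful of already-aligned occurrences of a short DNA "word," builds a simple position-by-position frequency profile from them, and slides that profile along a new sequence to find the best-matching window. The function names (build_profile, best_match), the pseudocount, the uniform background frequency, and the example sequences are all hypothetical choices made for illustration.

```python
import math


def build_profile(occurrences, alphabet="ACGT", pseudocount=1.0):
    """Estimate per-position letter probabilities from aligned motif occurrences.

    A pseudocount is added to every letter so that no probability is zero.
    """
    width = len(occurrences[0])
    profile = []
    for i in range(width):
        counts = {letter: pseudocount for letter in alphabet}
        for occurrence in occurrences:
            counts[occurrence[i]] += 1
        total = sum(counts.values())
        profile.append({letter: c / total for letter, c in counts.items()})
    return profile


def best_match(sequence, profile, background=0.25):
    """Return (start, score) of the window that best matches the profile.

    The score is a log-odds sum: how much more likely the window is under
    the motif profile than under a uniform background (an assumption here).
    Returns (-1, -inf) if the sequence is shorter than the motif.
    """
    width = len(profile)
    best_score, best_start = -math.inf, -1
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(
            math.log(profile[i][letter] / background)
            for i, letter in enumerate(window)
        )
        if score > best_score:
            best_score, best_start = score, start
    return best_start, best_score


# Hypothetical aligned occurrences of a six-letter motif.
EXAMPLE_OCCURRENCES = ["TATAAT", "TATGAT", "TACAAT"]
EXAMPLE_PROFILE = build_profile(EXAMPLE_OCCURRENCES)
```

Calling best_match("GGCCTATAATGGCC", EXAMPLE_PROFILE) locates the window that most resembles the motif occurrences, even when the match is not letter-for-letter identical to any one of them. That tolerance for small differences is what lets a motif model pick out distant relatives.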

INSPIRED BY SPEECH RECOGNITION

To characterize and infer relationships among sets of DNA or protein sequences, Meta-MEME employs machine learning techniques from artificial intelligence (AI). In the past two years, one of AI's major successes has resulted in speech recognition systems that cost less than $100 and have relatively low error rates for dictation.

Every commercially available speech recognition system on the market today uses a class of statistical models called hidden Markov models (HMMs) as the basis for its processing. Surprisingly, HMMs can also be applied to biological sequences. For Meta-MEME, the "speech" is the series of nucleotides or amino acids that make up the biological sequence.

Just as speech recognition software trained on utterances of the word "hello" can accurately recognize new instances of that utterance, a Meta-MEME model trained on a set of related hemoglobin sequences can recognize previously unannotated hemoglobins in a large protein database.

One of the primary strengths of hidden Markov models is their probabilistic underpinning. Given a Meta-MEME model and a candidate biological sequence, there are efficient algorithms that can answer questions such as, "What is the probability that this model generated this sequence?"
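One standard way to answer that question for a hidden Markov model is the forward algorithm, which sums over every possible path through the model's hidden states. The Python sketch below shows the technique on a toy two-state DNA model. It is an illustration under stated assumptions, not Meta-MEME's code: the state names, all of the probabilities, and the function name forward_log_probability are hypothetical.

```python
import math


def forward_log_probability(sequence, states, start_p, trans_p, emit_p):
    """Return log P(sequence | model) computed with the forward algorithm.

    alpha[s] holds the probability of the symbols seen so far, jointly with
    the model being in state s. It is rescaled at each step so that long
    sequences do not underflow; the scale factors are accumulated as logs.
    """
    if not sequence:
        raise ValueError("sequence must contain at least one symbol")

    alpha = {s: start_p[s] * emit_p[s][sequence[0]] for s in states}
    log_prob = 0.0
    for symbol in sequence[1:]:
        total = sum(alpha.values())
        log_prob += math.log(total)
        alpha = {s: a / total for s, a in alpha.items()}
        # Move one step: sum over all predecessor states, then emit symbol.
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return log_prob + math.log(sum(alpha.values()))


# Hypothetical toy model: a "conserved" state that favors G, and a
# "background" state that emits all four nucleotides equally.
STATES = ("conserved", "background")
START_P = {"conserved": 0.5, "background": 0.5}
TRANS_P = {
    "conserved": {"conserved": 0.8, "background": 0.2},
    "background": {"conserved": 0.2, "background": 0.8},
}
EMIT_P = {
    "conserved": {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
```

With this toy model, forward_log_probability("GGGG", STATES, START_P, TRANS_P, EMIT_P) is larger than the value for "ACTA", reflecting that the G-rich sequence is more likely to have been generated by the model. Comparing such probabilities across candidate sequences is the kind of ranking a database search can be built on.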

"Although humans are very good at recognizing shared features of spoken words, images, or even biological sequences, people are notoriously bad at estimating probabilities," Grundy said. "For detecting sequences that share a common ancestor, probabilities are precisely what is needed." A person may be able to predict a functional relationship between sequences, but before an expensive wet lab experiment is carried out to verify the prediction, an accurate estimate of the probability of the ancestral relationship is needed.

SDSC is a research unit of the University of California, San Diego, and the leading-edge site of the National Partnership for Advanced Computational Infrastructure (www.npaci.edu). SDSC is sponsored by the National Science Foundation through NPACI and by other federal agencies, the State and University of California, and private organizations. For additional information about SDSC, see www.sdsc.edu, or contact Ann Redelfs at SDSC, 619-534-5032, redelfs@sdsc.edu.

Contact:
David Hart, NPACI/SDSC
619-534-8314, dhart@sdsc.edu
