Genome sequences are now known for thousands of different species. We are at a remarkable time in biology where at last we can look at the "source code" for life—the DNA sequences that specify development, regulation, and function of organisms—but we are still far from adequately understanding how to read this vast trove of encoded information or being able to reconstruct how it evolved.
Our laboratory works on computational methods for large-scale genome sequence analysis. We use probabilistic modeling techniques to develop new algorithms to find features in DNA, RNA, and protein sequences.
One of our primary interests is in identifying novel structural and catalytic RNAs. Another is to recognize evolutionarily remote protein sequence homologies. Our core mission is the engineering of practical and robust software tools for the molecular biology community, capturing and transferring the best theoretical results in genome sequence analysis, in areas including mathematics, algorithms, computational science, and statistical inference. Two of our best-known software tools are HMMER and Infernal for protein and RNA sequence alignment and database homology searches.
One of the greatest mysteries in biology is how the genomic specification of complex neural systems evolves—how low-level changes in genome sequence give rise to all the glorious variation in neural circuitry and behavior that selection acts upon. This is an area in which biology still largely lacks quantitative language for asking precise questions. At Janelia, as one avenue forward, our laboratory is beginning to explore collaborations with neuroscientists working on the molecular regulatory specification of neural cell types in fly, worm, and mouse.
A palimpsest is a text that has been incompletely erased and overwritten many times. (Paper used to be precious.) Careful analysis of a palimpsest can reveal the shadows of old texts and illuminate lost history. Genome sequences are genetic palimpsests that have been overwritten by eons of evolution.
My laboratory uses computational sequence analysis to infer the structures, functions, and evolutionary histories of DNA sequences in modern genomes. We want to understand how genome sequences encode biological function and how complex high-level biological functions such as neural circuits and innate behavior evolve by low-level changes in genome sequence information.
Large-scale DNA sequencing technology has brought about a remarkable and revolutionary time in biology. We now have genome sequences for thousands of different species and have just begun to look systematically at the "source code" for life. Interpretation of genome sequences depends on computational analysis to discover genes and to infer the biochemical functions of those genes by recognizing similarities to known sequences or structures. To recognize the subtle shadows of ancient ancestry in these multibillion-year-old genetic palimpsests, my laboratory and others have pioneered the use of sophisticated probabilistic inference methods. My work aims to make fundamental technological contributions to the way that DNA sequences are analyzed. Our core mission is the engineering of practical and robust software tools for the molecular biology community, capturing and transferring the best theoretical results in genome sequence analysis, in areas including mathematics, algorithms, computational science, and statistical inference.
HMMER: A New Generation of Homology Search Tools
The main tool that geneticists use to recognize evolutionarily related sequences is a program called BLAST, first introduced in the early 1990s. BLAST is a fundamental tool in the field: molecular biology's Google. But the algorithms and mathematics that underlie sequence homology recognition went through a major revolution in the 1990s with the advent of probabilistic inference methods, particularly a class of methods called hidden Markov models (HMMs). At a theoretical level, we now have a powerful and extensible statistical inference framework that formalizes the problem of distant sequence homology recognition.
Over the past decade, many HMM-based approaches have been developed for sequence analysis, including a software package called HMMER from my lab. HMMER is widely used for so-called "profile" searches of sequence databases, starting with a multiple alignment of a protein sequence family of interest. HMMER is the software underlying many protein family and protein domain databases, including Pfam and SMART. But in theory, the power of HMM-based methods is much more general than profile searches. In particular, HMM methods offer what we believe is a more powerful foundation for single sequence comparison and database searches, where BLAST remains the traditional workhorse for the field.
The main reason that BLAST remains the workhorse is that for a long time BLAST was about 100-fold faster than the fastest implementations of the newer and supposedly better HMM-based methods. At Janelia, my laboratory launched a major effort to engineer software that delivers the power of HMM-based methods, while running at or above BLAST speed. Our aim with this project, called HMMER3, is to bring about a generational change in the most important tool of molecular sequence analysis.
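To make the idea of a "profile" search concrete, here is a minimal sketch in Python. It is not HMMER's algorithm: it keeps only the position-specific match scores of a profile and omits the insert and delete states that a real profile HMM adds. The function names, the uniform background, the pseudocount, and the toy alignment are all assumptions made for illustration.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores (bits) against a uniform background.

    Assumes an ungapped alignment whose sequences all have the same length.
    """
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for residue in ALPHABET:
            prob = (column.count(residue) + pseudocount) / total
            scores[residue] = math.log2(prob / background)
        profile.append(scores)
    return profile


def score(profile, seq):
    """Sum of per-position log-odds; seq must be as long as the profile."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must equal profile length")
    return sum(column[residue] for column, residue in zip(profile, seq))


# Hypothetical toy family; real profiles are built from curated alignments.
family = ["ACDW", "ACEW", "GCDW"]
profile = build_profile(family)
member_score = score(profile, "ACDW")     # resembles the family
unrelated_score = score(profile, "KLMN")  # shares no residues with it
```

A sequence that resembles the family gets a positive log-odds score, while one that shares no residues with it scores below zero. A full profile HMM extends this by modeling insertions and deletions probabilistically.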
Infernal: Identifying Homologous RNA Structures in Genome Sequences
According to the RNA world hypothesis, RNA catalysts and replicators preceded modern protein/DNA machines. This hypothesis arose from the discovery of catalytic RNAs and also from the fact that functional RNAs are used instead of protein enzymes in some ancient, highly conserved roles in modern organisms. Some proponents of the RNA world hypothesis view extant functional RNAs as "molecular fossils" of the RNA world. How many genes encode functional RNA rather than protein? What are their functions? How many are evolutionarily ancient? The answers to these questions might shed some light on the origin of life. Our laboratory has a particular interest in the computational analysis of RNA.
Many RNAs conserve a base-paired secondary structure. Secondary structure conservation induces strong constraints on the primary sequence of an RNA in the form of pairwise correlations between paired bases (usually Watson-Crick pairs of A:U and C:G). Human experts take secondary structure conservation into account when they look at structural RNAs; indeed, structural RNAs are usually depicted in figures as their two-dimensional secondary structures, whereas proteins are typically depicted by their one-dimensional sequence and linear domain structure. Sequence analysis methods like HMMER and BLAST only look at linear primary sequence information, so they do not consider RNA structure conservation when aligning RNAs or searching for probable homologs.
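The pairwise correlations described above can be measured directly from an alignment. The sketch below computes the mutual information between two alignment columns, a standard way to quantify covariation. The function name and the toy alignment are hypothetical, and the code assumes an ungapped alignment with no sequence weighting.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (bits) between columns i and j of an alignment."""
    n = len(alignment)
    freq_i = Counter(seq[i] for seq in alignment)
    freq_j = Counter(seq[j] for seq in alignment)
    freq_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: columns 0 and 4 always form a Watson-Crick
# pair even though neither column is conserved on its own.
toy = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
paired = column_mutual_information(toy, 0, 4)
unpaired = column_mutual_information(toy, 0, 1)
```

Here the paired columns carry the maximum of 2 bits of mutual information, while column 0 against the invariant column 1 carries none. A method that scores each position independently cannot see this signal.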
Our laboratory, at the same time as Yasu Sakakibara's lab, pioneered the use of a class of probabilistic models called "stochastic context-free grammars" (SCFGs) for capturing both the sequence and secondary structure features of RNAs. We developed a particular form of profile SCFG, analogous to profile hidden Markov models but with pairwise base-pair correlations included, that we call "covariance models". We maintain a software package called Infernal which implements covariance model alignment and database search methods for RNA. Infernal mirrors HMMER in many ways, but for RNA rather than protein. Just as HMMER underlies protein domain databases like Pfam and SMART, Infernal underlies the Rfam RNA families database.
Infernal's main limitation is its computational complexity. Our current research on Infernal focuses on accelerating its performance and making profile SCFG-based searches for RNAs as practical as possible.
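As a rough illustration of where that cost comes from, the sketch below implements the Nussinov base-pair maximization recursion. This is a swapped-in, non-probabilistic stand-in and not Infernal's covariance model algorithm, but SCFG-based methods build on the same kind of dynamic programming over nested subsequences. The function name and the `min_loop` parameter are assumptions made for illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested Watson-Crick pairs (Nussinov recursion).

    min_loop is the minimum number of unpaired bases inside a hairpin.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: i is unpaired. Case 2: j is unpaired.
            score = max(best[i + 1][j], best[i][j - 1])
            # Case 3: i pairs with j around the enclosed subsequence.
            if (seq[i], seq[j]) in pairs:
                score = max(score, best[i + 1][j - 1] + 1)
            # Case 4: bifurcation into two independent substructures.
            for k in range(i + 1, j):
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


hairpin_pairs = max_base_pairs("GGGAAACCC")  # three G:C pairs around AAA
no_pairs = max_base_pairs("AAAA")
```

Even this toy recursion takes time cubic in the sequence length because of the bifurcation loop, whereas linear sequence models need only a pass over a two-dimensional table.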
Identification of Novel Noncoding RNAs
Infernal finds homologs of known RNAs. What about finding entirely new functional RNAs? We have used SCFG-based approaches to identify ncRNA genes and regulatory RNA sequences by taking advantage of comparative genome sequence analysis — that is, by comparing the DNA sequences of related organisms such as human and mouse. The pattern of mutations we observe in a human sequence compared to the related mouse sequence tells us something about the function of the sequence. We construct three statistical models describing the pattern of mutation we expect to see in RNA genes, protein genes, and other conserved sequences, and we test each conserved genomic region for the model it seems to fit best. Our first large-scale test of this approach was done in the small genome of the bacterium Escherichia coli, where our program (called QRNA) predicted a few hundred new RNA genes. We have also conducted computational screens for new ncRNA genes in other organisms, including well-known organisms such as humans, nematodes, and yeast, and in less-well-known organisms such as the deep-sea-vent extremophile Pyrococcus and the pond ciliate Oxytricha, whose genomes have unusual properties that allowed us to conduct particularly simple screens for new ncRNAs.
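The final step of the three-model test described above, choosing the model a region fits best, can be sketched as a simple Bayesian comparison. This is not QRNA's code. It assumes the log-likelihood of an aligned region under each model has already been computed elsewhere and that the prior over models is uniform. The function name, model labels, and numbers are hypothetical.

```python
import math


def classify_region(log_likelihoods):
    """Return (best_model, posteriors) assuming a uniform prior over models.

    log_likelihoods maps a model name to the natural-log likelihood of the
    aligned region under that model.
    """
    peak = max(log_likelihoods.values())
    # Subtract the peak before exponentiating to avoid numerical underflow.
    weights = {m: math.exp(ll - peak) for m, ll in log_likelihoods.items()}
    total = sum(weights.values())
    posteriors = {m: w / total for m, w in weights.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Hypothetical log-likelihoods for one conserved human/mouse region.
example = {"RNA": -1040.2, "coding": -1046.9, "other": -1045.1}
best_model, posteriors = classify_region(example)
```

With these made-up numbers the RNA model is several log units ahead of the other two and takes nearly all of the posterior probability. In practice, a screen would also apply a score threshold before reporting a candidate.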
For those new genes where we have some indication of their function, most appear to be functioning as highly adapted regulatory molecules, which is not consistent with the idea that ncRNAs are ancient molecular fossils of the RNA world. I have argued instead for a "modern RNA world" view, where functional RNA is still actively deployed by evolution in roles where RNA is better suited than protein, such as sequence-specific recognition of other RNAs.
Molecular Regulatory Specification of Neural Cell Types
At Janelia, our laboratory has begun to explore questions in the overlap of neuroscience and computational genome analysis. One area of particular focus for us is the transcriptional regulation of neuronal cell type. Neuroscientists have an ever-increasing toolbox of molecular reporters and effectors, including new "optogenetic" effector genes that can turn neurons on and off under precise experimental control. The experiments that can be done with these molecular tools are limited by our ability to express the tools in precise cellular expression patterns. It would be desirable to be able to design synthetic enhancers that fire promoters only in specific cell types of interest. We are beginning to develop approaches toward that long-term synthetic goal.
All history was a palimpsest, scraped clean and re-inscribed exactly as often as necessary.
Hidden Markov models for sequence profile analysis
RNA structure analysis using covariance models
Database of protein family alignments and hidden Markov models
The Rfam database of RNA alignments, consensus secondary structures, and profile SCFGs
The Dfam database of repetitive DNA sequence elements