30 Publications

Showing 1-10 of 30 results
    Eddy/Rivas Lab
    10/01/09 | A new generation of homology search tools based on probabilistic inference.
    Eddy SR
    Genome Informatics. International Conference on Genome Informatics. 2009 Oct;23(1):205-11

    Many theoretical advances have been made in applying probabilistic inference methods to improve the power of sequence homology searches, yet the BLAST suite of programs is still the workhorse for most of the field. The main reason for this is practical: BLAST’s programs are about 100-fold faster than the fastest competing implementations of probabilistic inference methods. I describe recent work on the HMMER software suite for protein sequence analysis, which implements probabilistic inference using profile hidden Markov models. Our aim in HMMER3 is to achieve BLAST’s speed while further improving the power of probabilistic inference based methods. HMMER3 implements a new probabilistic model of local sequence alignment and a new heuristic acceleration algorithm. Combined with efficient vector-parallel implementations on modern processors, these improvements synergize. HMMER3 uses more powerful log-odds likelihood scores (scores summed over alignment uncertainty, rather than scoring a single optimal alignment); it calculates accurate expectation values (E-values) for those scores without simulation using a generalization of Karlin/Altschul theory; it computes posterior distributions over the ensemble of possible alignments and returns posterior probabilities (confidences) in each aligned residue; and it does all this at an overall speed comparable to BLAST. The HMMER project aims to usher in a new generation of more powerful homology search tools based on probabilistic inference methods.
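
    The difference between scoring a single optimal alignment and summing over alignment uncertainty can be illustrated with a toy hidden Markov model. The sketch below is not HMMER's profile architecture or its log-odds scoring: the two-state model, its probabilities, and the names `path_score` and `TOY_MODEL` are assumptions made up for illustration. It only shows that the same recursion yields a Viterbi score when paths are combined with `max` and a Forward score when they are combined with a log-sum.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) for a list of log-probabilities."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def path_score(obs, model, combine):
    """Log-likelihood of obs under a toy HMM.

    combine=max gives the Viterbi score (single best state path);
    combine=logsumexp gives the Forward score (sum over all state paths).
    """
    states, start, trans, emit = model
    prev = {s: math.log(start[s] * emit[s][obs[0]]) for s in states}
    for symbol in obs[1:]:
        prev = {
            s: math.log(emit[s][symbol])
               + combine([prev[r] + math.log(trans[r][s]) for r in states])
            for s in states
        }
    return combine(list(prev.values()))

# Hypothetical two-state model; every number here is an illustrative assumption.
TOY_MODEL = (
    ("match", "insert"),
    {"match": 0.9, "insert": 0.1},                      # start probabilities
    {"match": {"match": 0.8, "insert": 0.2},            # transition probabilities
     "insert": {"match": 0.5, "insert": 0.5}},
    {"match": {"A": 0.7, "C": 0.3},                     # emission probabilities
     "insert": {"A": 0.5, "C": 0.5}},
)

viterbi = path_score("ACA", TOY_MODEL, max)
forward = path_score("ACA", TOY_MODEL, logsumexp)
print(f"Viterbi {viterbi:.4f}  Forward {forward:.4f}")
```

    Because the Forward score adds the probability of every path, it is never lower than the Viterbi score for the same sequence.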

    Eddy/Rivas Lab
    05/30/08 | A probabilistic model of local sequence alignment that simplifies statistical significance estimation.
    Eddy SR
    PLoS Computational Biology. 2008 May 30;4:e1000069. doi: 10.1371/journal.pcbi.1000069

    Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (lambda) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty ("Forward" scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores ("Viterbi" scores) are Gumbel-distributed with constant lambda = log 2, and the high scoring tail of Forward scores is exponential with the same constant lambda. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
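
    As a rough sketch of how these conjectures turn a bit score into an E-value: with lambda fixed at log 2, only a location parameter remains to be supplied for each model. The function names and the location parameters `mu` and `tau` below are assumptions for illustration; how they are obtained for a given profile is not described in this abstract.

```python
import math

LAMBDA = math.log(2)  # constant slope stated in the abstract for bit scores

def viterbi_pvalue(score, mu):
    """P(S >= score) for a Gumbel distribution with location mu and slope LAMBDA."""
    return -math.expm1(-math.exp(-LAMBDA * (score - mu)))

def forward_pvalue(score, tau):
    """Exponential high-scoring tail with slope LAMBDA.

    tau is an assumed location parameter; the formula is only meant for
    scores in the tail, so the result is capped at 1.
    """
    return min(1.0, math.exp(-LAMBDA * (score - tau)))

def evalue(pvalue, n_targets):
    """Expected number of hits at least this good in n_targets comparisons."""
    return pvalue * n_targets

# Each extra bit of score halves the tail probability.
print(forward_pvalue(10.0, tau=0.0))                          # 2**-10
print(evalue(forward_pvalue(20.0, tau=0.0), n_targets=1_000_000))
```

    With lambda = log 2, a score that is x bits above the location parameter has a tail probability of about 2^-x under either distribution, which is why no per-model simulation of lambda is needed.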

    Eddy/Rivas Lab
    02/01/12 | A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more.
    Rivas E, Lang R, Eddy SR
    RNA. 2012 Feb;18:193-212. doi: 10.1261/rna.030049.111

    The standard approach for single-sequence RNA secondary structure prediction uses a nearest-neighbor thermodynamic model with several thousand experimentally determined energy parameters. An attractive alternative is to use statistical approaches with parameters estimated from growing databases of structural RNAs. Good results have been reported for discriminative statistical methods using complex nearest-neighbor models, including CONTRAfold, Simfold, and ContextFold. Little work has been reported on generative probabilistic models (stochastic context-free grammars [SCFGs]) of comparable complexity, although probabilistic models are generally easier to train and to use. To explore a range of probabilistic models of increasing complexity, and to directly compare probabilistic, thermodynamic, and discriminative approaches, we created TORNADO, a computational tool that can parse a wide spectrum of RNA grammar architectures (including the standard nearest-neighbor model and more) using a generalized super-grammar that can be parameterized with probabilities, energies, or arbitrary scores. By using TORNADO, we find that probabilistic nearest-neighbor models perform comparably to (but not significantly better than) discriminative methods. We find that complex statistical models are prone to overfitting RNA structure and that evaluations should use structurally nonhomologous training and test data sets. Overfitting has affected at least one published method (ContextFold). The most important barrier to improving statistical approaches for RNA secondary structure prediction is the lack of diversity of well-curated single-sequence RNA secondary structures in current RNA databases.

    01/01/17 | A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs.
    Rivas E, Clements J, Eddy SR
    Nature Methods. 2017 Jan 31;14(1):45-8

    Many functional RNAs have an evolutionarily conserved secondary structure. Conservation of RNA base pairing induces pairwise covariations in sequence alignments. We developed a computational method, R-scape (RNA Structural Covariation Above Phylogenetic Expectation), that quantitatively tests whether covariation analysis supports the presence of a conserved RNA secondary structure. R-scape analysis finds no statistically significant support for proposed secondary structures of the long noncoding RNAs HOTAIR, SRA, and Xist.
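
    To make "pairwise covariation" concrete, the sketch below computes mutual information between two alignment columns. This is a simple stand-in chosen for illustration, not R-scape's method: the abstract does not specify the covariation statistic, and the test against phylogenetic expectation that gives R-scape its name is not attempted here. The function name and the example columns are assumptions.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two equal-length alignment columns."""
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    pair_counts = Counter(zip(col_i, col_j))
    i_counts = Counter(col_i)
    j_counts = Counter(col_j)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((i_counts[a] / n) * (j_counts[b] / n)))
    return mi

# Compensatory changes keep the pair complementary: high covariation.
print(mutual_information("GCAU", "CGUA"))
# A perfectly conserved G-C pair carries no covariation signal.
print(mutual_information("GGGG", "CCCC"))
```

    A conserved base pair and a covarying base pair are both compatible with structure, but only the second provides the kind of pairwise signal that a covariation analysis can test.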

    Eddy/Rivas Lab
    01/01/09 | A survey of nematode SmY RNAs.
    Jones TA, Otto W, Marz M, Eddy SR, Stadler PF
    RNA Biology. 2009 Jan-Mar;6(1):5-8

    SmY RNAs are a family of approximately 70-90 nt small nuclear RNAs found in nematodes. In C. elegans, SmY RNAs copurify in a small ribonucleoprotein (snRNP) complex related to the SL1 and SL2 snRNPs that are involved in nematode mRNA trans-splicing. Here we describe a comprehensive computational analysis of SmY RNA homologs found in the currently available genome sequences. We identify homologs in all sequenced nematode genomes in class Chromadorea. We are unable to identify homologs in a more distantly related nematode species, Trichinella spiralis (class: Dorylaimia), and in representatives of non-nematode phyla that use trans-splicing. Using comparative RNA sequence analysis, we infer a conserved consensus SmY RNA secondary structure consisting of two stems flanking a consensus Sm protein binding site. A representative seed alignment of the SmY RNA family, annotated with the inferred consensus secondary structure, has been deposited with the Rfam RNA families database.

    Eddy/Rivas Lab
    07/01/09 | A tool for identification of genes expressed in patterns of interest using the Allen Brain Atlas.
    Davis FP, Eddy SR
    Bioinformatics. 2009 Jul 1;25(13):1647-54. doi: 10.1093/bioinformatics/btp288

    Gene expression patterns can be useful in understanding the structural organization of the brain and the regulatory logic that governs its myriad cell types. A particularly rich source of spatial expression data is the Allen Brain Atlas (ABA), a comprehensive genome-wide in situ hybridization study of the adult mouse brain. Here, we present an open-source program, ALLENMINER, that searches the ABA for genes that are expressed, enriched, patterned or graded in a user-specified region of interest.

    Eddy/Rivas Lab
    01/01/14 | Annotating functional RNAs in genomes using Infernal.
    Nawrocki EP
    Methods in Molecular Biology. 2014;1097:163-97. doi: 10.1007/978-1-62703-709-9_9

    Many different types of functional non-coding RNAs participate in a wide range of important cellular functions, but the large majority of these RNAs are not routinely annotated in published genomes. Several programs have been developed for identifying RNAs, including specific tools tailored to a particular RNA family as well as more general ones designed to work for any family. Many of these tools utilize covariance models (CMs), statistical models of the conserved sequence and structure of an RNA family. In this chapter, as an illustrative example, the Infernal software package and CMs from the Rfam database are used to identify RNAs in the genome of the archaeon Methanobrevibacter ruminantium, uncovering some additional RNAs not present in the genome’s initial annotation. Analysis of the results and comparison with family-specific methods demonstrate some important strengths and weaknesses of this general approach.

    Eddy/Rivas Lab
    09/02/15 | Combinatorial DNA rearrangement facilitates the origin of new genes in ciliates.
    Chen X, Jung S, Beh LY, Eddy SR, Landweber LF
    Genome Biology and Evolution. 2015 Sep 2;7(10):2859-70. doi: 10.1093/gbe/evv172

    Programmed genome rearrangements in the unicellular eukaryote Oxytricha trifallax produce a transcriptionally active somatic nucleus from a copy of its germline nucleus during development. This process eliminates noncoding sequences that interrupt coding regions in the germline genome, and joins over 225,000 remaining DNA segments, some of which require inversion or complex permutation to build functional genes. This dynamic genomic organization permits some single DNA segments in the germline to contribute to multiple, distinct somatic genes via alternative processing. Like alternative mRNA splicing, the combinatorial assembly of DNA segments contributes to genetic variation and facilitates the evolution of new genes. In this study, we use comparative genomic analysis to demonstrate that the emergence of alternative DNA splicing is associated with the origin of new genes. Short duplications give rise to alternative gene segments that are spliced to the shared gene segments. Alternative gene segments evolve faster than shared, constitutive segments. Genes with shared segments frequently have different expression profiles, permitting functional divergence. This study reports alternative DNA splicing as a mechanism of new gene origination, illustrating how the process of programmed genome rearrangement gives rise to evolutionary innovation.

    Eddy/Rivas Lab
    01/01/14 | Computational analysis of conserved RNA secondary structure in transcriptomes and genomes.
    Eddy SR
    Annual Review of Biophysics. 2014;43:433-56. doi: 10.1146/annurev-biophys-051013-022950

    Transcriptomics experiments and computational predictions both enable systematic discovery of new functional RNAs. However, many putative noncoding transcripts arise instead from artifacts and biological noise, and current computational prediction methods have high false positive rates. I discuss prospects for improving computational methods for analyzing and identifying functional RNAs, with a focus on detecting signatures of conserved RNA secondary structure. An interesting new front is the application of chemical and enzymatic experiments that probe RNA structure on a transcriptome-wide scale. I review several proposed approaches for incorporating structure probing data into the computational prediction of RNA secondary structure. Using probabilistic inference formalisms, I show how all these approaches can be unified in a well-principled framework, which in turn allows RNA probing data to be easily integrated into a wide range of analyses that depend on RNA secondary structure inference. Such analyses include homology search and genome-wide detection of new structural RNAs.

    Eddy/Rivas Lab
    01/01/13 | Dfam: a database of repetitive DNA based on profile hidden Markov models.
    Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, Smit AF, Finn RD
    Nucleic Acids Research. 2013 Jan;41:D70-82. doi: 10.1093/nar/gks1265

    We present a database of repetitive DNA elements, called Dfam (http://dfam.janelia.org). Many genomes contain a large fraction of repetitive DNA, much of which is made up of remnants of transposable elements (TEs). Accurate annotation of TEs enables research into their biology and can shed light on the evolutionary processes that shape genomes. Identification and masking of TEs can also greatly simplify many downstream genome annotation and sequence analysis tasks. The commonly used TE annotation tools RepeatMasker and Censor depend on sequence homology search tools such as cross_match and BLAST variants, as well as Repbase, a collection of known TE families each represented by a single consensus sequence. Dfam contains entries corresponding to all Repbase TE entries for which instances have been found in the human genome. Each Dfam entry is represented by a profile hidden Markov model, built from alignments generated using RepeatMasker and Repbase. When used in conjunction with the hidden Markov model search tool nhmmer, Dfam produces a 2.9% increase in coverage over consensus sequence search methods on a large human benchmark, while maintaining low false discovery rates, and coverage of the full human genome is 54.5%. The website provides a collection of tools and data views to support improved TE curation and annotation efforts. Dfam is also available for download in flat file format or in the form of MySQL table dumps.
