Main Menu (Mobile)- Block

Main Menu - Block

janelia7_blocks-janelia7_fake_breadcrumb | block
Koyama Lab / Publications
custom | custom


facetapi-Q2b17qCsTdECvJIqZJgYMaGsr8vANl1n | block

Associated Lab

facetapi-W9JlIB1X0bjs93n1Alu3wHJQTTgDCBGe | block

Associated Project Team

facetapi-PV5lg7xuz68EAY8eakJzrcmwtdGEnxR0 | block
facetapi-021SKYQnqXW6ODq5W5dPAFEDBaEJubhN | block
general_search_page-panel_pane_1 | views_panes

30 Publications

Showing 11-20 of 30 results
Your Criteria:
    Eddy/Rivas Lab
    01/01/13 | Dfam: a database of repetitive DNA based on profile hidden Markov models.
    Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, Smit AF, Finn RD
    Nucleic Acids Research. 2013 Jan;41:D70-82. doi: 10.1093/nar/gks1265

    We present a database of repetitive DNA elements, called Dfam ( Many genomes contain a large fraction of repetitive DNA, much of which is made up of remnants of transposable elements (TEs). Accurate annotation of TEs enables research into their biology and can shed light on the evolutionary processes that shape genomes. Identification and masking of TEs can also greatly simplify many downstream genome annotation and sequence analysis tasks. The commonly used TE annotation tools RepeatMasker and Censor depend on sequence homology search tools such as cross_match and BLAST variants, as well as Repbase, a collection of known TE families each represented by a single consensus sequence. Dfam contains entries corresponding to all Repbase TE entries for which instances have been found in the human genome. Each Dfam entry is represented by a profile hidden Markov model, built from alignments generated using RepeatMasker and Repbase. When used in conjunction with the hidden Markov model search tool nhmmer, Dfam produces a 2.9% increase in coverage over consensus sequence search methods on a large human benchmark, while maintaining low false discovery rates, and coverage of the full human genome is 54.5%. The website provides a collection of tools and data views to support improved TE curation and annotation efforts. Dfam is also available for download in flat file format or in the form of MySQL table dumps.

    View Publication Page
    Eddy/Rivas Lab
    02/01/12 | A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more.
    Rivas E, Lang R, Eddy SR
    RNA. 2012 Feb;18:193-212. doi: 10.1261/rna.030049.111

    The standard approach for single-sequence RNA secondary structure prediction uses a nearest-neighbor thermodynamic model with several thousand experimentally determined energy parameters. An attractive alternative is to use statistical approaches with parameters estimated from growing databases of structural RNAs. Good results have been reported for discriminative statistical methods using complex nearest-neighbor models, including CONTRAfold, Simfold, and ContextFold. Little work has been reported on generative probabilistic models (stochastic context-free grammars [SCFGs]) of comparable complexity, although probabilistic models are generally easier to train and to use. To explore a range of probabilistic models of increasing complexity, and to directly compare probabilistic, thermodynamic, and discriminative approaches, we created TORNADO, a computational tool that can parse a wide spectrum of RNA grammar architectures (including the standard nearest-neighbor model and more) using a generalized super-grammar that can be parameterized with probabilities, energies, or arbitrary scores. By using TORNADO, we find that probabilistic nearest-neighbor models perform comparably to (but not significantly better than) discriminative methods. We find that complex statistical models are prone to overfitting RNA structure and that evaluations should use structurally nonhomologous training and test data sets. Overfitting has affected at least one published method (ContextFold). The most important barrier to improving statistical approaches for RNA secondary structure prediction is the lack of diversity of well-curated single-sequence RNA secondary structures in current RNA databases.

    View Publication Page
    Eddy/Rivas Lab
    12/07/11 | Phosphorylation at the interface.
    Davis FP
    Structure . 2011 Dec 7;19:1726-7. doi: 10.1016/j.str.2011.11.006

    Proteomic studies have identified thousands of eukaryotic phosphorylation sites (phosphosites), but few are functionally characterized. Nishi et al., in this issue of Structure, characterize phosphosites at protein-protein interfaces and estimate the effect of their phosphorylation on interaction affinity, by combining proteomics data with protein structures.

    View Publication Page
    Eddy/Rivas Lab
    11/15/11 | Fast filtering for RNA homology search.
    Kolbe DL, Eddy SR
    Bioinformatics. 2011 Nov 15;27(22):3102-9. doi: 10.1093/bioinformatics/btr545

    MOTIVATION: Homology search for RNAs can use secondary structure information to increase power by modeling base pairs, as in covariance models, but the resulting computational costs are high. Typical acceleration strategies rely on at least one filtering stage using sequence-only search. RESULTS: Here we present the multi-segment CYK (MSCYK) filter, which implements a heuristic of ungapped structural alignment for RNA homology search. Compared to gapped alignment, this approximation has lower computation time requirements (O(N⁴) reduced to O(N³), and space requirements (O(N³) reduced to O(N²). A vector-parallel implementation of this method gives up to 100-fold speed-up; vector-parallel implementations of standard gapped alignment at two levels of precision give 3- and 6-fold speed-ups. These approaches are combined to create a filtering pipeline that scores RNA secondary structure at all stages, with results that are synergistic with existing methods.

    View Publication Page
    Eddy/Rivas Lab
    09/01/11 | Exploiting Oxytricha trifallax nanochromosomes to screen for non-coding RNA genes.
    Jung S, Swart EC, Minx PJ, Magrini V, Mardis ER, Landweber LF, Eddy SR
    Nucleic Acids Research. 2011 Sep 1;39:7529-47. doi: 10.1093/nar/gkr501

    We took advantage of the unusual genomic organization of the ciliate Oxytricha trifallax to screen for eukaryotic non-coding RNA (ncRNA) genes. Ciliates have two types of nuclei: a germ line micronucleus that is usually transcriptionally inactive, and a somatic macronucleus that contains a reduced, fragmented and rearranged genome that expresses all genes required for growth and asexual reproduction. In some ciliates including Oxytricha, the macronuclear genome is particularly extreme, consisting of thousands of tiny ’nanochromosomes’, each of which usually contains only a single gene. Because the organism itself identifies and isolates most of its genes on single-gene nanochromosomes, nanochromosome structure could facilitate the discovery of unusual genes or gene classes, such as ncRNA genes. Using a draft Oxytricha genome assembly and a custom-written protein-coding genefinding program, we identified a subset of nanochromosomes that lack any detectable protein-coding gene, thereby strongly enriching for nanochromosomes that carry ncRNA genes. We found only a small proportion of non-coding nanochromosomes, suggesting that Oxytricha has few independent ncRNA genes besides homologs of already known RNAs. Other than new members of known ncRNA classes including C/D and H/ACA snoRNAs, our screen identified one new family of small RNA genes, named the Arisong RNAs, which share some of the features of small nuclear RNAs.

    View Publication Page
    Eddy/Rivas Lab
    07/01/11 | HMMER web server: interactive sequence similarity searching.
    Finn RD, Clements J, Eddy SR
    Nucleic Acids Research. 2011 Jul;39:W29-37. doi: 10.1093/nar/gkr367

    HMMER is a software suite for protein sequence similarity searches using probabilistic methods. Previously, HMMER has mainly been available only as a computationally intensive UNIX command-line tool, restricting its use. Recent advances in the software, HMMER3, have resulted in a 100-fold speed gain relative to previous versions. It is now feasible to make efficient profile hidden Markov model (profile HMM) searches via the web. A HMMER web server ( has been designed and implemented such that most protein database searches return within a few seconds. Methods are available for searching either a single protein sequence, multiple protein sequence alignment or profile HMM against a target sequence database, and for searching a protein sequence against Pfam. The web server is designed to cater to a range of different user expertise and accepts batch uploading of multiple queries at once. All search methods are also available as RESTful web services, thereby allowing them to be readily integrated as remotely executed tasks in locally scripted workflows. We have focused on minimizing search times and the ability to rapidly display tabular results, regardless of the number of matches found, developing graphical summaries of the search results to provide quick, intuitive appraisement of them.

    View Publication Page
    Eddy/Rivas Lab
    01/01/11 | Rfam: Wikipedia, clans and the "decimal" release.
    Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A
    Nucleic Acids Research. 2011 Jan;39(Database issue):D141-5. doi: 10.1093/nar/gkq1129

    The Rfam database aims to catalogue non-coding RNAs through the use of sequence alignments and statistical profile models known as covariance models. In this contribution, we discuss the pros and cons of using the online encyclopedia, Wikipedia, as a source of community-derived annotation. We discuss the addition of groupings of related RNA families into clans and new developments to the website. Rfam is available on the Web at

    View Publication Page
    Eddy/Rivas Lab
    02/01/10 | The overlap of small molecule and protein binding sites within families of protein structures.
    Davis FP, Sali A
    PLoS Computational Biology. 2010 Feb;6(2):e1000668. doi: 10.1371/journal.pcbi.1000668

    Protein-protein interactions are challenging targets for modulation by small molecules. Here, we propose an approach that harnesses the increasing structural coverage of protein complexes to identify small molecules that may target protein interactions. Specifically, we identify ligand and protein binding sites that overlap upon alignment of homologous proteins. Of the 2,619 protein structure families observed to bind proteins, 1,028 also bind small molecules (250-1000 Da), and 197 exhibit a statistically significant (p<0.01) overlap between ligand and protein binding positions. These "bi-functional positions", which bind both ligands and proteins, are particularly enriched in tyrosine and tryptophan residues, similar to "energetic hotspots" described previously, and are significantly less conserved than mono-functional and solvent exposed positions. Homology transfer identifies ligands whose binding sites overlap at least 20% of the protein interface for 35% of domain-domain and 45% of domain-peptide mediated interactions. The analysis recovered known small-molecule modulators of protein interactions as well as predicted new interaction targets based on the sequence similarity of ligand binding sites. We illustrate the predictive utility of the method by suggesting structural mechanisms for the effects of sanglifehrin A on HIV virion production, bepridil on the cellular entry of anthrax edema factor, and fusicoccin on vertebrate developmental pathways. The results, available at, represent a comprehensive collection of structurally characterized modulators of protein interactions, and suggest that homologous structures are a useful resource for the rational design of interaction modulators.

    View Publication Page
    Eddy/Rivas Lab
    01/01/10 | Hidden Markov model speed heuristic and iterative HMM search procedure.
    Johnson LS, Eddy SR, Portugaly E
    BMC Bioinformatics. 2010;11:431. doi: 10.1186/1471-2105-11-431

    Profile hidden Markov models (profile-HMMs) are sensitive tools for remote protein homology detection, but the main scoring algorithms, Viterbi or Forward, require considerable time to search large sequence databases.

    View Publication Page
    Eddy/Rivas Lab
    01/01/10 | The Pfam protein families database.
    Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A
    Nucleic Acids Research. 2010 Jan;38:D211-22. doi: 10.1093/nar/gkp985

    Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (, the USA ( and Sweden (

    View Publication Page