Statistical Compression of Protein Folding Patterns for Inference of Recurrent Substructural Themes

Ramanan Subramanian, Lloyd Allison, Peter J Stuckey, Maria Garcia de la Banda, David Abramson, Arthur M Lesk, Arun S Konagurthu

    Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

    5 Citations (Scopus)


    Computational analyses of the growing corpus of three-dimensional (3D) structures of proteins have revealed a limited set of recurrent substructural themes, termed super-secondary structures. Knowledge of super-secondary structures is important for the study of protein evolution and for the modeling of proteins with unknown structures. Characterizing a comprehensive dictionary of these super-secondary structures has been an unanswered computational challenge in protein structural studies. This paper presents an unsupervised method for learning such a comprehensive dictionary using the statistical framework of lossless compression on a database comprised of concise geometric representations of protein 3D folding patterns. The best dictionary is defined as the one that yields the most compression of the database. Here we describe the inference methodology and the statistical models used to estimate the encoding lengths. An interactive website for this dictionary is available at
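The selection criterion in the abstract (the best dictionary is the one that compresses the database most) can be illustrated with a toy two-part message length. This is only a minimal sketch under stated assumptions, not the paper's method. The symbol string standing in for folding patterns, the uniform codes, the greedy segmentation, and the names `segment`, `message_length` and `best_dictionary` are all hypothetical. The statistical models the authors actually use to estimate encoding lengths are not given in this record.

```python
from math import log2


def segment(data, dictionary):
    """Greedily split `data` into dictionary words, falling back to single symbols.

    Assumption: longest-match-first segmentation, chosen only for simplicity.
    """
    words = sorted(dictionary, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(data):
        for w in words:
            if w and data.startswith(w, i):
                tokens.append(w)
                i += len(w)
                break
        else:
            # No dictionary word matches here: emit one raw symbol.
            tokens.append(data[i])
            i += 1
    return tokens


def message_length(data, dictionary, alphabet):
    """Toy two-part message length in bits: cost(dictionary) + cost(data | dictionary).

    Assumptions (hypothetical, not from the paper):
    - each dictionary word is spelled out symbol by symbol plus one terminator,
      using a uniform code over the alphabet and the terminator;
    - each token of the segmented data is sent with a uniform code over the
      vocabulary (alphabet symbols plus dictionary words).
    """
    a = len(alphabet)
    dict_bits = sum((len(w) + 1) * log2(a + 1) for w in dictionary)
    tokens = segment(data, dictionary)
    data_bits = len(tokens) * log2(a + len(dictionary))
    return dict_bits + data_bits


def best_dictionary(data, candidates, alphabet):
    """Return the candidate dictionary with the shortest total message."""
    return min(candidates, key=lambda d: message_length(data, d, alphabet))


if __name__ == "__main__":
    alphabet = "abc"
    data = "abc" * 8  # a highly repetitive toy "database"
    candidates = [(), ("abc",), ("abc", "cab", "bca")]
    for d in candidates:
        print(d, round(message_length(data, d, alphabet), 2))
    print("best:", best_dictionary(data, candidates, alphabet))
```

In this toy setting the single-word dictionary wins. The empty dictionary pays nothing up front but encodes every symbol separately, while the three-word dictionary pays for two words that the data never uses. That trade-off between the cost of stating the dictionary and the savings it buys is the general idea behind choosing by total compressed length.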

    Original language: English
    Title of host publication: Proceedings - DCC 2017, 2017 Data Compression Conference
    Subtitle of host publication: 4 - 7 April 2017, Snowbird, Utah, USA
    Editors: Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagrista, James A. Storer
    Place of Publication: Piscataway, NJ
    Publisher: IEEE, Institute of Electrical and Electronics Engineers
    Number of pages: 10
    ISBN (Electronic): 9781509067213
    ISBN (Print): 9781509067220
    Publication status: Published - 8 May 2017
    Event: Data Compression Conference 2017 - Snowbird, United States of America
    Duration: 4 Apr 2017 to 7 Apr 2017
    Conference number: 27th (IEEE Conference Proceedings)

    Publication series

    Name: Data Compression Conference. Proceedings
    Publisher: IEEE Computer Society
    ISSN (Print): 1068-0314


    Conference: Data Compression Conference 2017
    Abbreviated title: DCC 2017
    Country/Territory: United States of America
    Internet address


    • Minimum Message Length
    • MML
    • Protein structure
    • super-secondary structural patterns
