Statistical Compression of Protein Folding Patterns for Inference of Recurrent Substructural Themes

Ramanan Subramanian, Lloyd Allison, Peter J Stuckey, Maria Garcia de la Banda, David Abramson, Arthur M Lesk, Arun S Konagurthu

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

Abstract

Computational analyses of the growing corpus of three-dimensional (3D) structures of proteins have revealed a limited set of recurrent substructural themes, termed super-secondary structures. Knowledge of super-secondary structures is important for the study of protein evolution and for the modeling of proteins with unknown structures. Characterizing a comprehensive dictionary of these super-secondary structures has been an unanswered computational challenge in protein structural studies. This paper presents an unsupervised method for learning such a comprehensive dictionary using the statistical framework of lossless compression on a database comprised of concise geometric representations of protein 3D folding patterns. The best dictionary is defined as the one that yields the most compression of the database. Here we describe the inference methodology and the statistical models used to estimate the encoding lengths. An interactive website for this dictionary is available at http://lcb.infotech.monash.edu.au/proteinConcepts/scop100/dictionary.HTML.
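
The selection criterion stated in the abstract (the best dictionary is the one that compresses the database most) can be illustrated with a toy two-part message length. The sketch below is not the paper's method: it assumes, purely for illustration, that folding patterns are strings over a hypothetical three-letter alphabet (H, E, C), that dictionary entries are non-empty substrings, that the database is parsed greedily, and that tokens are coded with a simple adaptive (Laplace-smoothed) code. The names `tokenize`, `message_length`, `best_dictionary`, and `END` are all invented here, and the costs of stating the number of entries and sequences are left out.

```python
import math
from collections import Counter

END = "$"  # hypothetical end-of-sequence marker, assumed not to be in the alphabet


def tokenize(seq, dictionary):
    """Greedy parse: take the longest matching entry, else a single symbol."""
    entries = sorted(dictionary, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(seq):
        match = next((e for e in entries if seq.startswith(e, i)), seq[i])
        tokens.append(match)
        i += len(match)
    return tokens


def message_length(database, dictionary, alphabet):
    """Toy two-part length in bits: dictionary first, then data given it."""
    # Part 1: spell out each entry with a uniform code over the alphabet
    # plus one terminator symbol per entry (an illustrative assumption).
    bits_per_symbol = math.log2(len(alphabet) + 1)
    part1 = sum((len(e) + 1) * bits_per_symbol for e in dictionary)

    # Part 2: adaptive code over single symbols, dictionary entries and END.
    # Each token costs -log2 of its smoothed frequency among tokens seen so far.
    vocab_size = len(alphabet) + len(dictionary) + 1
    counts, seen, part2 = Counter(), 0, 0.0
    for seq in database:
        for tok in tokenize(seq, dictionary) + [END]:
            part2 -= math.log2((counts[tok] + 1) / (seen + vocab_size))
            counts[tok] += 1
            seen += 1
    return part1 + part2


def best_dictionary(database, candidates, alphabet):
    """Return the candidate dictionary with the shortest total message."""
    return min(candidates, key=lambda d: message_length(database, d, alphabet))


# Toy data: ten sequences, each repeating one helix-coil-strand-coil motif.
database = ["HHHHCEEEEC" * 3] * 10
candidates = [(), ("HHHHCEEEEC",), ("CCCCCCCC",)]
best = best_dictionary(database, candidates, "HEC")
```

On this toy data the dictionary holding the repeated motif wins: stating the entry costs a few bits up front, but each occurrence then costs one token instead of ten symbols. An entry that never occurs only adds cost, which is how a compression criterion penalises an over-large dictionary.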

Language: English
Title of host publication: Proceedings - DCC 2017, 2017 Data Compression Conference
Subtitle of host publication: 4 - 7 April 2017, Snowbird, Utah, USA
Editors: Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagrista, James A. Storer
Place of Publication: Piscataway, NJ
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 340-349
Number of pages: 10
ISBN (Electronic): 9781509067213
ISBN (Print): 9781509067220
DOI: 10.1109/DCC.2017.46
State: Published - 8 May 2017
Event: Data Compression Conference 2017 - Snowbird, United States
Duration: 4 Apr 2017 to 7 Apr 2017
http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7921793 (IEEE Conference Proceedings)

Publication series

Name: Data Compression Conference. Proceedings
Publisher: IEEE Computer Society
ISSN (Print): 1068-0314

Conference

Conference: Data Compression Conference 2017
Abbreviated title: DCC 2017
Country: United States
City: Snowbird
Period: 4/04/17 to 7/04/17

Keywords

  • Minimum Message Length
  • MML
  • Protein structure
  • super-secondary structural patterns

Cite this

Subramanian, R., Allison, L., Stuckey, P. J., de la Banda, M. G., Abramson, D., Lesk, A. M., & Konagurthu, A. S. (2017). Statistical Compression of Protein Folding Patterns for Inference of Recurrent Substructural Themes. In A. Bilgin, M. W. Marcellin, J. Serra-Sagrista, & J. A. Storer (Eds.), Proceedings - DCC 2017, 2017 Data Compression Conference: 4 - 7 April 2017, Snowbird, Utah, USA (pp. 340-349). [7923707] (Data Compression Conference. Proceedings). Piscataway, NJ: IEEE, Institute of Electrical and Electronics Engineers. DOI: 10.1109/DCC.2017.46
@inproceedings{f8344ee36f9b4906a97676443f3e2bcb,
title = "Statistical Compression of Protein Folding Patterns for Inference of Recurrent Substructural Themes",
abstract = "Computational analyses of the growing corpus of three-dimensional (3D) structures of proteins have revealed a limited set of recurrent substructural themes, termed super-secondary structures. Knowledge of super-secondary structures is important for the study of protein evolution and for the modeling of proteins with unknown structures. Characterizing a comprehensive dictionary of these super-secondary structures has been an unanswered computational challenge in protein structural studies. This paper presents an unsupervised method for learning such a comprehensive dictionary using the statistical framework of lossless compression on a database comprised of concise geometric representations of protein 3D folding patterns. The best dictionary is defined as the one that yields the most compression of the database. Here we describe the inference methodology and the statistical models used to estimate the encoding lengths. An interactive website for this dictionary is available at http://lcb.infotech.monash.edu.au/proteinConcepts/scop100/dictionary.HTML.",
keywords = "Minimum Message Length, MML, Protein structure, super-secondary structural patterns",
author = "Ramanan Subramanian and Lloyd Allison and Stuckey, {Peter J} and {de la Banda}, {Maria Garcia} and David Abramson and Lesk, {Arthur M} and Konagurthu, {Arun S}",
year = "2017",
month = "5",
day = "8",
doi = "10.1109/DCC.2017.46",
language = "English",
isbn = "9781509067220",
series = "Data Compression Conference. Proceedings",
publisher = "IEEE, Institute of Electrical and Electronics Engineers",
pages = "340--349",
editor = "Ali Bilgin and Marcellin, {Michael W.} and Serra-Sagrista, {Joan} and Storer, {James A.}",
booktitle = "Proceedings - DCC 2017, 2017 Data Compression Conference",
address = "United States",

}

TY  - GEN
T1  - Statistical Compression of Protein Folding Patterns for Inference of Recurrent Substructural Themes
AU  - Subramanian, Ramanan
AU  - Allison, Lloyd
AU  - Stuckey, Peter J
AU  - de la Banda, Maria Garcia
AU  - Abramson, David
AU  - Lesk, Arthur M
AU  - Konagurthu, Arun S
PY  - 2017/5/8
Y1  - 2017/5/8
AB  - Computational analyses of the growing corpus of three-dimensional (3D) structures of proteins have revealed a limited set of recurrent substructural themes, termed super-secondary structures. Knowledge of super-secondary structures is important for the study of protein evolution and for the modeling of proteins with unknown structures. Characterizing a comprehensive dictionary of these super-secondary structures has been an unanswered computational challenge in protein structural studies. This paper presents an unsupervised method for learning such a comprehensive dictionary using the statistical framework of lossless compression on a database comprised of concise geometric representations of protein 3D folding patterns. The best dictionary is defined as the one that yields the most compression of the database. Here we describe the inference methodology and the statistical models used to estimate the encoding lengths. An interactive website for this dictionary is available at http://lcb.infotech.monash.edu.au/proteinConcepts/scop100/dictionary.HTML.
KW  - Minimum Message Length
KW  - MML
KW  - Protein structure
KW  - super-secondary structural patterns
UR  - http://www.scopus.com/inward/record.url?scp=85019993661&partnerID=8YFLogxK
U2  - 10.1109/DCC.2017.46
DO  - 10.1109/DCC.2017.46
M3  - Conference Paper
SN  - 9781509067220
T3  - Data Compression Conference. Proceedings
SP  - 340
EP  - 349
BT  - Proceedings - DCC 2017, 2017 Data Compression Conference
PB  - IEEE, Institute of Electrical and Electronics Engineers
CY  - Piscataway, NJ
ER  -
