Graph-induced restricted Boltzmann machines for document modeling

Tu Dinh Nguyen, Truyen Tran, Dinh Phung, Svetha Venkatesh

Research output: Contribution to journal, Article, Research, peer-review

6 Citations (Scopus)

Abstract

Discovering knowledge from unstructured text is a central theme in data mining and machine learning. We focus on fast discovery of thematic structures from a corpus. Our approach is based on a versatile probabilistic formulation, the restricted Boltzmann machine (RBM), whose underlying graphical model is an undirected bipartite graph. Inference is efficient: a document representation can be computed with a single matrix projection, making RBMs suitable for the massive text corpora available today. Standard RBMs, however, operate under the bag-of-words assumption and ignore the inherent relational structures among words, which results in less coherent thematic word groupings. We introduce graph-based regularization schemes that exploit linguistic structures, which in turn can be constructed from either corpus statistics or domain knowledge. We demonstrate that the proposed technique improves group coherence, facilitates visualization, provides a means of estimating intrinsic dimensionality, reduces overfitting, and possibly leads to better classification accuracy.
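
The two mechanisms named in the abstract, a single-projection document representation and a graph-based regularizer built from corpus statistics, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the co-occurrence graph construction, and the graph-Laplacian smoothness penalty are assumptions chosen for illustration, since this record does not state the exact form of the regularization schemes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_representation(V, W, b):
    """Posterior P(h = 1 | v) for every document.

    V: (n_docs, n_words) word-count matrix
    W: (n_words, n_hidden) weight matrix, b: (n_hidden,) hidden biases
    One matrix projection followed by a logistic squash.
    """
    return sigmoid(V @ W + b)

def cooccurrence_graph(V, threshold=1):
    """Hypothetical word graph from corpus statistics.

    Edge weight = number of documents in which both words occur;
    edges below `threshold` are dropped. The result is symmetric.
    """
    B = (V > 0).astype(float)
    A = B.T @ B
    np.fill_diagonal(A, 0.0)
    A[A < threshold] = 0.0
    return A

def laplacian_penalty(W, A):
    """Assumed regularizer: 0.5 * sum_ij A_ij * ||W_i - W_j||^2.

    Equals trace(W^T L W) with L = D - A, so linked words are
    pushed toward similar weight rows (similar hidden-unit usage).
    """
    L = np.diag(A.sum(axis=1)) - A
    return float(np.trace(W.T @ L @ W))

def laplacian_penalty_grad(W, A):
    """Gradient of the penalty with respect to W (A symmetric)."""
    L = np.diag(A.sum(axis=1)) - A
    return 2.0 * L @ W

# Toy corpus: 3 documents over a 4-word vocabulary.
V = np.array([[1, 1, 0, 0],
              [2, 1, 0, 0],
              [0, 0, 1, 3]], dtype=float)
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 2))
b = np.zeros(2)

H = hidden_representation(V, W, b)   # one row of features per document
A = cooccurrence_graph(V)
penalty = laplacian_penalty(W, A)
```

In training, a term such as `lam * laplacian_penalty_grad(W, A)` (with a hypothetical strength `lam`) would be subtracted from the usual likelihood gradient on `W`; the projection step used at inference time is unchanged by the regularizer.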

Original language: English
Pages (from-to): 60-75
Number of pages: 16
Journal: Information Sciences
Volume: 328
DOI: 10.1016/j.ins.2015.08.023
Publication status: Published - Jan 2016
Externally published: Yes

Keywords

  • Document modeling
  • Feature group discovery
  • Restricted Boltzmann machine
  • Topic coherence
  • Word graphs

Cite this

Nguyen, Tu Dinh; Tran, Truyen; Phung, Dinh; Venkatesh, Svetha. Graph-induced restricted Boltzmann machines for document modeling. In: Information Sciences. 2016; Vol. 328. pp. 60-75.
@article{1b2bc5ce948d4210aba215bd34895b80,
title = "Graph-induced restricted Boltzmann machines for document modeling",
keywords = "Document modeling, Feature group discovery, Restricted Boltzmann machine, Topic coherence, Word graphs",
author = "Nguyen, Tu Dinh and Tran, Truyen and Phung, Dinh and Venkatesh, Svetha",
year = "2016",
month = "1",
doi = "10.1016/j.ins.2015.08.023",
language = "English",
volume = "328",
pages = "60--75",
journal = "Information Sciences",
issn = "0020-0255",
publisher = "Elsevier",
}

TY - JOUR

T1 - Graph-induced restricted Boltzmann machines for document modeling

AU - Nguyen, Tu Dinh

AU - Tran, Truyen

AU - Phung, Dinh

AU - Venkatesh, Svetha

PY - 2016/1

Y1 - 2016/1

KW - Document modeling

KW - Feature group discovery

KW - Restricted Boltzmann machine

KW - Topic coherence

KW - Word graphs

UR - http://www.scopus.com/inward/record.url?scp=84945529896&partnerID=8YFLogxK

U2 - 10.1016/j.ins.2015.08.023

DO - 10.1016/j.ins.2015.08.023

M3 - Article

VL - 328

SP - 60

EP - 75

JO - Information Sciences

JF - Information Sciences

SN - 0020-0255

ER -