Classifying networked text data with positive and unlabeled examples

Mei Li, Shirui Pan, Yang Zhang, Xiaoyan Cai

Research output: Contribution to journal › Article › Research › peer-review

Abstract

The rapid growth in the number of networked applications that naturally generate complex text data, which contain not only inner features but also inter-dependent relations, has created a demand for classifying such data efficiently. Many classification algorithms have been proposed, but they usually require fully labeled text examples as input. In many networked applications, however, labeling text data may be expensive, and hence a large amount of text may be unlabeled. In this paper we study the problem of classifying networked text data when only positive and unlabeled examples are available. We present a non-negative matrix factorization-based approach to networked text classification that factorizes the content matrix of the nodes and the topological network structure, and incorporates supervised information into the learning of the objective function via a consensus principle. We propose a novel learning algorithm, namely puNet (positive and unlabeled learning algorithm for Networked text data), for efficiently classifying networked text, even if the training datasets contain only a small number of positive examples and a large number of unlabeled ones. We conduct a series of experiments on benchmark networked datasets and illustrate the effectiveness of our algorithm.
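
The abstract does not give the objective function, so the following is only a minimal sketch of the general idea of factorizing node content and network structure together. It is not puNet: it leaves out the positive/unlabeled supervision and the consensus principle entirely. It assumes a hypothetical objective ||X - WH||² + alpha·||A - WB||² with a shared non-negative node factor W, which is the same as running plain NMF with the standard Lee-Seung multiplicative updates on the concatenated matrix [X | sqrt(alpha)·A]. The names `joint_nmf`, `alpha`, and the toy matrices `X` (node-by-term content) and `A` (adjacency) are illustrative assumptions, not taken from the paper.

```python
import random


def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    Qt = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]


def transpose(P):
    return [list(col) for col in zip(*P)]


def frob_loss(M, W, H):
    """Squared Frobenius norm of M - W H."""
    R = matmul(W, H)
    return sum((m - r) ** 2 for mr, rr in zip(M, R) for m, r in zip(mr, rr))


def joint_nmf(X, A, k=2, alpha=1.0, iters=200, seed=0, eps=1e-9):
    """Factorize content X and adjacency A with a shared node factor W.

    Assumed objective (not from the paper):
        ||X - W H||^2 + alpha * ||A - W B||^2
    solved as plain NMF on the concatenation M = [X | sqrt(alpha) * A].
    """
    rng = random.Random(seed)
    s = alpha ** 0.5
    M = [list(xr) + [s * a for a in ar] for xr, ar in zip(X, A)]
    n, m = len(M), len(M[0])
    # Strictly positive random initialization.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    losses = [frob_loss(M, W, H)]
    for _ in range(iters):
        # H <- H * (W^T M) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, M), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (M H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(M, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
        losses.append(frob_loss(M, W, H))
    return W, H, losses


# Toy data: 6 nodes, 4 terms, two densely linked groups of 3 nodes each.
X = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
     [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
A = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]

W, H, losses = joint_nmf(X, A, k=2)
# Assign each node to the latent factor with the largest weight.
labels = [max(range(2), key=lambda j: row[j]) for row in W]
print(labels)
print(losses[0], losses[-1])
```

In this sketch `alpha` trades off how strongly link structure, relative to text content, shapes the shared factor; a PU-learning method would additionally have to tie the factor to the known positive labels, which is deliberately not attempted here.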

Original language: English
Pages (from-to): 1-7
Number of pages: 7
Journal: Pattern Recognition Letters
Volume: 77
DOI: https://doi.org/10.1016/j.patrec.2016.03.006
Publication status: Published - 1 Jul 2016
Externally published: Yes

Keywords

  • Graph clustering
  • Matrix factorization
  • Networked text data
  • PU learning
  • Semi-supervised learning

Cite this

@article{12692b1f146d464db7dc5a2030744352,
title = "Classifying networked text data with positive and unlabeled examples",
abstract = "The rapid growth in the number of networked applications that naturally generate complex text data, which contains not only inner features but also inter-dependent relations, has created the demand of efficiently classifying such data. Many classification algorithms have been proposed, but they usually require as input fully labeled text examples. In many networked applications, however, the cost to label a text data may be expensive and hence a large amount of text may be unlabeled. In this paper we study the problem of classifying networked text data with only positive and unlabeled examples available. We present a non-negative matrix factorization-based approach to networked text classification by factorizing content matrix of the nodes and topological network structures, and by incorporating supervised information into the learning of objective function via a consensus principle. We propose a novel learning algorithm, namely puNet (positive and unlabeled learning algorithm for Networked text data), for efficiently classifying networked text, even if training datasets contain only a small amount of positive examples and a large amount of unlabeled ones. We conduct a series of experiments on benchmark networked datasets and illustrate the effectiveness of our algorithm.",
keywords = "Graph clustering, Matrix factorization, Networked text data, PU learning, Semi-supervised learning",
author = "Mei Li and Shirui Pan and Yang Zhang and Xiaoyan Cai",
year = "2016",
month = "7",
day = "1",
doi = "10.1016/j.patrec.2016.03.006",
language = "English",
volume = "77",
pages = "1--7",
journal = "Pattern Recognition Letters",
issn = "0167-8655",
publisher = "Elsevier",
}

Classifying networked text data with positive and unlabeled examples. / Li, Mei; Pan, Shirui; Zhang, Yang; Cai, Xiaoyan.

In: Pattern Recognition Letters, Vol. 77, 01.07.2016, p. 1-7.

TY - JOUR

T1 - Classifying networked text data with positive and unlabeled examples

AU - Li, Mei

AU - Pan, Shirui

AU - Zhang, Yang

AU - Cai, Xiaoyan

PY - 2016/7/1

Y1 - 2016/7/1

N2 - The rapid growth in the number of networked applications that naturally generate complex text data, which contains not only inner features but also inter-dependent relations, has created the demand of efficiently classifying such data. Many classification algorithms have been proposed, but they usually require as input fully labeled text examples. In many networked applications, however, the cost to label a text data may be expensive and hence a large amount of text may be unlabeled. In this paper we study the problem of classifying networked text data with only positive and unlabeled examples available. We present a non-negative matrix factorization-based approach to networked text classification by factorizing content matrix of the nodes and topological network structures, and by incorporating supervised information into the learning of objective function via a consensus principle. We propose a novel learning algorithm, namely puNet (positive and unlabeled learning algorithm for Networked text data), for efficiently classifying networked text, even if training datasets contain only a small amount of positive examples and a large amount of unlabeled ones. We conduct a series of experiments on benchmark networked datasets and illustrate the effectiveness of our algorithm.

AB - The rapid growth in the number of networked applications that naturally generate complex text data, which contains not only inner features but also inter-dependent relations, has created the demand of efficiently classifying such data. Many classification algorithms have been proposed, but they usually require as input fully labeled text examples. In many networked applications, however, the cost to label a text data may be expensive and hence a large amount of text may be unlabeled. In this paper we study the problem of classifying networked text data with only positive and unlabeled examples available. We present a non-negative matrix factorization-based approach to networked text classification by factorizing content matrix of the nodes and topological network structures, and by incorporating supervised information into the learning of objective function via a consensus principle. We propose a novel learning algorithm, namely puNet (positive and unlabeled learning algorithm for Networked text data), for efficiently classifying networked text, even if training datasets contain only a small amount of positive examples and a large amount of unlabeled ones. We conduct a series of experiments on benchmark networked datasets and illustrate the effectiveness of our algorithm.

KW - Graph clustering

KW - Matrix factorization

KW - Networked text data

KW - PU learning

KW - Semi-supervised learning

UR - http://www.scopus.com/inward/record.url?scp=84964059140&partnerID=8YFLogxK

U2 - 10.1016/j.patrec.2016.03.006

DO - 10.1016/j.patrec.2016.03.006

M3 - Article

VL - 77

SP - 1

EP - 7

JO - Pattern Recognition Letters

JF - Pattern Recognition Letters

SN - 0167-8655

ER -