Identifying multi-word terms by text-segments

Jisong Chen, Chung Hsing Yeh, Rowena Chau

    Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

    11 Citations (Scopus)

    Abstract

    Traditional statistical approaches for identifying multi-word terms have to handle a large amount of noisy data and are extremely time-consuming. This paper presents a new statistical approach for identifying multi-word terms based on the co-related text-segments that exist in a group of documents. The approach involves three stages: (a) using a short predefined stoplist as the initial input to segment a set of text documents into text-segments, (b) calculating the segment-weights of all text-segments, and (c) applying the shorter text-segments to segment the longer text-segments based on the weight values. The newly generated text-segments then segment each other again until no text-segment can be divided further. The resulting text-segments are identified as terms based on a specified threshold. Initial experimental results on a set of traditional Chinese documents show that this approach achieves a recall of at least 76.39% and a precision of at least 91.05% in retrieving terms that occur multiple times, including 18.30% newly identified terms. The approach can be applied to identify multi-word terms in any language.
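
The three stages in the abstract can be illustrated with a minimal sketch. The abstract does not define the segment-weight, the rule for when one text-segment may split another, or the threshold, so everything below is an assumption made for illustration: the weight is taken to be a plain occurrence count, a shorter segment splits a longer one only when its weight is at least as large, and terms are segments of two or more tokens whose weight reaches a hypothetical `threshold`. The function names are likewise hypothetical and are not taken from the paper. English word tokens are used for readability; for Chinese text each token could be a single character.

```python
from collections import Counter


def split_by_stoplist(tokens, stoplist):
    """Stage (a): cut a token sequence at every stopword."""
    segments, current = [], []
    for tok in tokens:
        if tok in stoplist:
            if current:
                segments.append(tuple(current))
            current = []
        else:
            current.append(tok)
    if current:
        segments.append(tuple(current))
    return segments


def split_by_segment(long_seg, short_seg):
    """Cut long_seg around each occurrence of short_seg, keeping short_seg as a piece."""
    m = len(short_seg)
    pieces, current, i = [], [], 0
    while i < len(long_seg):
        if long_seg[i:i + m] == short_seg:
            if current:
                pieces.append(tuple(current))
            pieces.append(short_seg)
            current = []
            i += m
        else:
            current.append(long_seg[i])
            i += 1
    if current:
        pieces.append(tuple(current))
    return pieces


def identify_terms(documents, stoplist, threshold=2):
    # Stages (a) and (b). ASSUMPTION: segment-weight = occurrence count.
    weights = Counter()
    for tokens in documents:
        weights.update(split_by_stoplist(tokens, stoplist))

    # Stage (c), repeated until no segment can be divided further.
    changed = True
    while changed:
        changed = False
        for short in sorted(weights, key=len):
            if short not in weights:
                continue  # already split away earlier in this pass
            for long_seg in [s for s in weights if len(s) > len(short)]:
                if long_seg not in weights:
                    continue
                # ASSUMPTION: only an equal or heavier segment may split another.
                if weights[short] < weights[long_seg]:
                    continue
                pieces = split_by_segment(long_seg, short)
                if len(pieces) > 1:
                    count = weights.pop(long_seg)
                    for piece in pieces:
                        weights[piece] += count  # pieces inherit the count
                    changed = True

    # ASSUMPTION: a term has at least two tokens and weight >= threshold.
    return {seg: w for seg, w in weights.items() if len(seg) > 1 and w >= threshold}


documents = [
    "the information retrieval system is fast".split(),
    "a study of information retrieval".split(),
    "information retrieval and data mining".split(),
    "data mining of the web".split(),
]
stoplist = {"the", "is", "a", "of", "and"}
terms = identify_terms(documents, stoplist)
print(terms)
```

In this toy run, the segment `information retrieval system` is split by the more frequent `information retrieval`, which then passes the threshold together with `data mining`. The loop terminates because every split replaces a segment with strictly shorter pieces.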

    Original language: English
    Title of host publication: Seventh International Conference on Web-Age Information Management Workshops, WAIM 2006
    Publication status: Published - 1 Dec 2006
    Event: 7th International Conference on Web-Age Information Management Workshops, WAIM 2006 - Hong Kong, China
    Duration: 17 Jun 2006 to 19 Jun 2006

    Conference

    Conference: 7th International Conference on Web-Age Information Management Workshops, WAIM 2006
    Country/Territory: China
    City: Hong Kong
    Period: 17/06/06 to 19/06/06
