Unsupervised text segmentation based on native language characteristics

Shervin Malmasi, Mark Dras, Mark Johnson, Lan Du, Magdalena Wolska

    Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

    7 Citations (Scopus)

    Abstract

    Most work on segmenting text does so on the basis of topic changes, but it can be of interest to segment by other, stylistically expressed characteristics such as change of authorship or native language. We propose a Bayesian unsupervised text segmentation approach to the latter. While baseline models achieve essentially random segmentation on our task, indicating its difficulty, a Bayesian model that incorporates appropriately compact language models and alternating asymmetric priors can achieve scores on the standard metrics around halfway to perfect segmentation.

    Original language: English
    Title of host publication: ACL 2017
    Subtitle of host publication: The 55th Annual Meeting of the Association for Computational Linguistics - Proceedings of the Conference, Vol. 1 (Long Papers)
    Editors: Regina Barzilay, Min-Yen Kan
    Place of Publication: Stroudsburg PA USA
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 1457-1469
    Number of pages: 13
    ISBN (Print): 9781945626753
    Publication status: Published - 2017
    Event: Annual Meeting of the Association for Computational Linguistics 2017 - Vancouver, Canada
    Duration: 30 Jul 2017 to 4 Aug 2017
    Conference number: 55th
    https://www.aclweb.org/anthology/events/acl-2017/ (Proceedings)

    Conference

    Conference: Annual Meeting of the Association for Computational Linguistics 2017
    Abbreviated title: ACL 2017
    Country/Territory: Canada
    City: Vancouver
    Period: 30/07/17 to 4/08/17