Unsupervised software-specific morphological forms inference from informal discussions

Chunyang Chen, Zhenchang Xing, Ximing Wang

Research output: Chapter in Book/Report/Conference proceeding; Conference Paper; Research; peer-review

57 Citations (Scopus)

Abstract

Informal discussions on social platforms (e.g., Stack Overflow) accumulate a large body of programming knowledge in natural language text. Natural language processing (NLP) techniques can be exploited to harvest this knowledge base for software engineering tasks. To make effective use of NLP techniques, a consistent vocabulary is essential. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms in informal discussions, such as abbreviations, synonyms and misspellings. Existing techniques to deal with such morphological forms are either designed for general English or predominantly rely on domain-specific lexical rules. A thesaurus of software-specific terms and commonly used morphological forms is desirable for normalizing software engineering text, but very difficult to build manually. In this work, we propose an automatic approach to build such a thesaurus. Our approach identifies software-specific terms by contrasting software-specific and general corpora, and infers morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations, and graph analysis of morphological relations. We evaluate the coverage and accuracy of the resulting thesaurus against community-curated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our thesaurus in a case study of normalizing questions from Stack Overflow and CodeProject.
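
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the frequency-ratio threshold, the subsequence-based abbreviation rule, and the hand-written `neighbors` dictionary (a stand-in for nearest neighbours that a trained word-embedding model would supply) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def domain_specific_terms(domain_tokens, general_tokens, min_ratio=1.5):
    """Keep terms that are relatively more frequent in the domain corpus.

    Assumption: a plain relative-frequency ratio with add-one smoothing.
    """
    d, g = Counter(domain_tokens), Counter(general_tokens)
    d_total, g_total = len(domain_tokens), len(general_tokens)
    return {
        term for term, count in d.items()
        if (count / d_total) / ((g[term] + 1) / (g_total + 1)) >= min_ratio
    }

def is_abbreviation(short, full):
    """Illustrative lexical rule: same first letter, and the characters of
    `short` appear in order inside `full`."""
    if len(short) >= len(full) or not short or short[0] != full[0]:
        return False
    remaining = iter(full)
    return all(ch in remaining for ch in short)  # subsequence check

def group_forms(pairs):
    """Treat (term, form) pairs as graph edges and return the connected
    components, each as a sorted list."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, groups = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current not in component:
                component.add(current)
                stack.extend(graph[current] - component)
        seen |= component
        groups.append(sorted(component))
    return groups

# Toy data (hypothetical, not from the paper).
domain = ["javascript", "js", "the", "database", "the", "db"]
general = ["the", "a", "the", "of", "the", "and", "the", "in", "the"]
# Stand-in for embedding nearest neighbours of each term.
neighbors = {"javascript": ["js", "jquery"], "database": ["db", "sql"]}

terms = domain_specific_terms(domain, general)
pairs = [(t, c) for t in sorted(terms)
         for c in neighbors.get(t, []) if is_abbreviation(c, t)]
thesaurus = group_forms(pairs)
```

In this sketch the embedding neighbours only propose candidates; the lexical rule filters them, and the graph step merges accepted pairs into groups of equivalent forms. A rule this simple over-accepts (for example, "java" is a character subsequence of "javascript"), which suggests why the abstract describes combining several signals rather than relying on lexical rules alone.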

Original language: English
Title of host publication: Proceedings of the 39th International Conference on Software Engineering
Subtitle of host publication: 20-28 May 2017, Buenos Aires, Argentina
Editors: Alessandro Orso, Martin Robillard
Place of Publication: Los Alamitos CA USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 450-461
Number of pages: 12
ISBN (Print): 9781538638682
Publication status: Published - 19 Jul 2017
Externally published: Yes
Event: International Conference on Software Engineering 2017 - Buenos Aires, Argentina
Duration: 20 May 2017 to 28 May 2017
Conference number: 39th
http://icse2017.gatech.edu/
https://ieeexplore.ieee.org/xpl/conhome/7976701/proceeding (Proceedings)

Conference

Conference: International Conference on Software Engineering 2017
Abbreviated title: ICSE 2017
Country/Territory: Argentina
City: Buenos Aires
Period: 20/05/17 to 28/05/17
Other: IEEE/ACM International Conference on Software Engineering Companion (ICSE-C 2017)

Keywords

  • Abbreviation
  • Morphological Form
  • Stack Overflow
  • Synonym
  • Word embedding
