Abstract
Informal discussions on social platforms (e.g., Stack Overflow) accumulate a large body of programming knowledge in natural language text. Natural language processing (NLP) techniques can be exploited to harvest this knowledge base for software engineering tasks. To make effective use of NLP techniques, a consistent vocabulary is essential. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms in informal discussions, such as abbreviations, synonyms and misspellings. Existing techniques to deal with such morphological forms are either designed for general English or predominantly rely on domain-specific lexical rules. A thesaurus of software-specific terms and commonly used morphological forms is desirable for normalizing software engineering text, but very difficult to build manually. In this work, we propose an automatic approach to build such a thesaurus. Our approach identifies software-specific terms by contrasting software-specific and general corpora, and infers morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations, and graph analysis of morphological relations. We evaluate the coverage and accuracy of the resulting thesaurus against community-curated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our thesaurus in a case study of normalizing questions from Stack Overflow and CodeProject.
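As a rough illustration of the corpus-contrast idea mentioned in the abstract, the sketch below flags tokens that are relatively much more frequent in a domain corpus than in a general one. It is not the authors' method: the function name `domain_specific_terms`, the add-one smoothing, and the `min_ratio` and `min_count` thresholds are all assumptions chosen for this example.

```python
from collections import Counter


def domain_specific_terms(domain_tokens, general_tokens, min_ratio=5.0, min_count=2):
    """Return {token: ratio} for tokens over-represented in the domain corpus.

    Hypothetical helper for illustration only; the scoring (a smoothed
    relative-frequency ratio) is an assumption, not the paper's formula.
    """
    domain_counts = Counter(domain_tokens)
    general_counts = Counter(general_tokens)
    domain_total = sum(domain_counts.values())
    general_total = sum(general_counts.values())
    # Shared vocabulary size, used for add-one smoothing in both corpora.
    vocab_size = len(set(domain_counts) | set(general_counts))

    terms = {}
    for token, count in domain_counts.items():
        if count < min_count:
            continue  # skip tokens too rare to judge
        # Add-one smoothing avoids division by zero for tokens that
        # never occur in the general corpus.
        domain_freq = (count + 1) / (domain_total + vocab_size)
        general_freq = (general_counts[token] + 1) / (general_total + vocab_size)
        ratio = domain_freq / general_freq
        if ratio >= min_ratio:
            terms[token] = ratio
    return terms


if __name__ == "__main__":
    # Toy corpora; real input would be tokenized Stack Overflow text
    # versus a general-English corpus.
    domain = "the javascript js code uses javascript and js in the browser".split()
    general = "the cat sat on the mat and the dog sat in the sun".split()
    print(sorted(domain_specific_terms(domain, general, min_ratio=2.0)))
```

On realistic corpora the thresholds would need tuning, and a later step (word embeddings, lexical rules, graph analysis, per the abstract) would be needed to link forms such as "js" and "javascript" to each other; this sketch only covers candidate term selection.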
Original language | English |
---|---|
Title of host publication | Proceedings of the 39th International Conference on Software Engineering |
Subtitle of host publication | 20-28 May 2017, Buenos Aires, Argentina |
Editors | Alessandro Orso, Martin Robillard |
Place of Publication | Los Alamitos CA USA |
Publisher | IEEE, Institute of Electrical and Electronics Engineers |
Pages | 450-461 |
Number of pages | 12 |
ISBN (Print) | 9781538638682 |
Publication status | Published - 19 Jul 2017 |
Externally published | Yes |
Event | International Conference on Software Engineering 2017, Buenos Aires, Argentina; Duration: 20 May 2017 → 28 May 2017; Conference number: 39th; http://icse2017.gatech.edu/; https://ieeexplore.ieee.org/xpl/conhome/7976701/proceeding (Proceedings) |
Conference
Conference | International Conference on Software Engineering 2017 |
---|---|
Abbreviated title | ICSE 2017 |
Country/Territory | Argentina |
City | Buenos Aires |
Period | 20/05/17 → 28/05/17 |
Other | IEEE/ACM International Conference on Software Engineering Companion (ICSE-C 2017) |
Keywords
- Abbreviation
- Morphological Form
- Stack Overflow
- Synonym
- Word embedding