TY - JOUR
T1 - Automated extraction and clustering of requirements glossary terms
AU - Arora, Chetan
AU - Sabetzadeh, Mehrdad
AU - Briand, Lionel
AU - Zimmer, Frank
N1 - Funding Information:
This project has received funding from Luxembourg’s National Research Fund (grant agreement numbers FNR/P10/03 and FNR-6911386), and from the European Research Council under the European Union’s Horizon 2020 research and innovation program (grant agreement number 694277). We are grateful to Dan Isaac, Konstantinos Liolis, Sunil Nair, and Jose Luis de la Vara for useful discussions and contributions to our case studies. We would further like to thank the anonymous reviewers of IEEE TSE for their valuable comments.
Publisher Copyright:
© 1976-2012 IEEE.
PY - 2017/10/1
Y1 - 2017/10/1
AB - A glossary is an important part of any software requirements document. By making explicit the technical terms in a domain and providing definitions for them, a glossary helps mitigate imprecision and ambiguity. A key step in building a glossary is to decide upon the terms to include in the glossary and to find any related terms. Doing so manually is laborious, particularly for large requirements documents. In this article, we develop an automated approach for extracting candidate glossary terms and their related terms from natural language requirements documents. Our approach differs from existing work on term extraction mainly in that it clusters the extracted terms by relevance, instead of providing a flat list of terms. We provide an automated, mathematically-based procedure for selecting the number of clusters. This procedure makes the underlying clustering algorithm transparent to users, thus alleviating the need for any user-specified parameters. To evaluate our approach, we report on three industrial case studies, as part of which we also examine the perceptions of the involved subject matter experts about the usefulness of our approach. Our evaluation notably suggests that: (1) Over requirements documents, our approach is more accurate than major generic term extraction tools. Specifically, in our case studies, our approach leads to gains of 20 percent or more in terms of recall when compared to existing tools, while at the same time either improving precision or leaving it virtually unchanged. And, (2) the experts involved in our case studies find the clusters generated by our approach useful as an aid for glossary construction.
KW - case study research
KW - clustering
KW - natural language processing
KW - requirements glossaries
KW - term extraction
UR - https://www.scopus.com/pages/publications/85037034683
U2 - 10.1109/TSE.2016.2635134
DO - 10.1109/TSE.2016.2635134
M3 - Article
AN - SCOPUS:85037034683
SN - 0098-5589
VL - 43
SP - 918
EP - 945
JO - IEEE Transactions on Software Engineering
JF - IEEE Transactions on Software Engineering
IS - 10
ER -