TY - JOUR
T1 - Fusing multi-abstraction vector space models for concern localization
AU - Zhang, Yun
AU - Lo, David
AU - Xia, Xin
AU - Scanniello, Giuseppe
AU - Le, Tien Duy B.
AU - Sun, Jianling
PY - 2018/8
Y1 - 2018/8
N2 - Concern localization refers to the process of locating code units that match a particular textual description. It takes as input textual documents such as bug reports and feature requests and outputs a list of candidate code units that are relevant to the bug reports or feature requests. Many information retrieval (IR) based concern localization techniques have been proposed in the literature. These techniques typically represent code units and textual descriptions as a bag of tokens at one level of abstraction, e.g., each token is a word, or each token is a topic. In this work, we propose a multi-abstraction concern localization technique named M ULAB. M ULAB represents a code unit and a textual description at multiple abstraction levels. Similarity of a textual description and a code unit is now made by considering all these abstraction levels. We combine a vector space model (VSM) and multiple topic models to compute the similarity and apply a genetic algorithm to infer semi-optimal topic model configurations. We also propose 12 variants of M ULAB by using different data fusion methods. We have evaluated our solution on 175 concerns from 9 open source Java software systems. The experimental results show that variant CombMNZ-Def performs better than other variants, and also outperforms the state-of-art baseline called P R (PageRank based algorithm), which is proposed by Scanniello et al. (Empir Softw Eng 20(6):1666–1720 2015) in terms of effectiveness and rank.
AB - Concern localization refers to the process of locating code units that match a particular textual description. It takes as input textual documents such as bug reports and feature requests and outputs a list of candidate code units that are relevant to the bug reports or feature requests. Many information retrieval (IR) based concern localization techniques have been proposed in the literature. These techniques typically represent code units and textual descriptions as a bag of tokens at one level of abstraction, e.g., each token is a word, or each token is a topic. In this work, we propose a multi-abstraction concern localization technique named M ULAB. M ULAB represents a code unit and a textual description at multiple abstraction levels. Similarity of a textual description and a code unit is now made by considering all these abstraction levels. We combine a vector space model (VSM) and multiple topic models to compute the similarity and apply a genetic algorithm to infer semi-optimal topic model configurations. We also propose 12 variants of M ULAB by using different data fusion methods. We have evaluated our solution on 175 concerns from 9 open source Java software systems. The experimental results show that variant CombMNZ-Def performs better than other variants, and also outperforms the state-of-art baseline called P R (PageRank based algorithm), which is proposed by Scanniello et al. (Empir Softw Eng 20(6):1666–1720 2015) in terms of effectiveness and rank.
KW - Concern localization
KW - Data fusion
KW - Multi-Abstraction
KW - Text retrieval
KW - Topic modeling
UR - http://www.scopus.com/inward/record.url?scp=85039838834&partnerID=8YFLogxK
U2 - 10.1007/s10664-017-9585-2
DO - 10.1007/s10664-017-9585-2
M3 - Article
AN - SCOPUS:85039838834
SN - 1382-3256
VL - 23
SP - 2279
EP - 2322
JO - Empirical Software Engineering
JF - Empirical Software Engineering
IS - 4
ER -