
Answer Summarization for Technical Queries: Benchmark and New Approach

  • Chengran Yang
  • Bowen Xu
  • Ferdian Thung
  • Yucen Shi
  • Ting Zhang
  • Zhou Yang
  • Xin Zhou
  • Jieke Shi
  • Junda He
  • Donggyun Han
  • David Lo

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

Abstract

Prior studies have demonstrated that approaches to generate an answer summary for a given technical query in Software Question and Answer (SQA) sites are desired. We find that existing approaches are assessed solely through user studies. Hence, a new user study needs to be performed every time a new approach is introduced; this is time-consuming and slows down the development of the new approach, and results from different user studies may not be comparable to each other. There is a need for a benchmark with ground truth summaries to complement assessment through user studies. Unfortunately, no such benchmark exists for answer summarization for technical queries from SQA sites. To fill the gap, we manually construct a high-quality benchmark to enable automatic evaluation of answer summarization for technical queries on SQA sites. It contains 111 query-summary pairs extracted from 382 Stack Overflow answers with 2,014 sentence candidates. Using the benchmark, we comprehensively evaluate the performance of existing approaches and find that there is still considerable room for improvement. Motivated by the results, we propose a new approach, TechSumBot, with three key modules: 1) a Usefulness Ranking module; 2) a Centrality Estimation module; and 3) a Redundancy Removal module. We evaluate TechSumBot both automatically (i.e., using our benchmark) and manually (i.e., via a user study). The results from both evaluations consistently demonstrate that TechSumBot outperforms the best-performing baseline approaches from both the SE and NLP domains by a large margin: 10.83%-14.90%, 32.75%-36.59%, and 12.61%-17.54% in terms of ROUGE-1, ROUGE-2, and ROUGE-L on automatic evaluation, and 5.79%-9.23% and 17.03%-17.68% in terms of average usefulness and diversity score on human evaluation. This highlights that automatic evaluation on our benchmark can uncover findings similar to the ones found through user studies. More importantly, the automatic evaluation has a much lower cost, especially when it is used to assess a new approach. We also conducted an ablation study, which demonstrates that each module in TechSumBot contributes to boosting its overall performance. We release the benchmark as well as the replication package of our experiment at https://github.com/TechSumBot/TechSumBot.
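
To make the automatic evaluation concrete, the following is a minimal sketch of how ROUGE-1, ROUGE-2, and ROUGE-L F1 scores can be computed between a generated summary and a ground truth summary. It is an illustration only, not the evaluation code from the replication package: the function names (`rouge_n`, `rouge_l`, and the helpers) are hypothetical, tokenization is plain lowercased whitespace splitting, and the stemming and multi-reference handling found in standard ROUGE tools are left out. The exact ROUGE configuration used in the paper is not stated in this abstract.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """F1 from an overlap count and the candidate/reference totals."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    # Assumption: lowercased whitespace tokenization, no stemming.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    # Made-up sentences for illustration; not taken from the benchmark.
    generated = "use a list comprehension to filter items"
    ground_truth = "use a list comprehension to filter the items"
    print(rouge_n(generated, ground_truth, 1))
    print(rouge_n(generated, ground_truth, 2))
    print(rouge_l(generated, ground_truth))
```

Scores of this kind can be averaged over all query-summary pairs in a benchmark, which is what lets a new approach be compared against earlier ones without running a fresh user study.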

Original language: English
Title of host publication: 37th IEEE/ACM International Conference on Automated Software Engineering, ASE 2022
Editors: Julia Rubin, Shahar Maoz
Place of Publication: New York NY USA
Publisher: Association for Computing Machinery (ACM)
Number of pages: 13
ISBN (Electronic): 9781450396240
Publication status: Published - 2022
Externally published: Yes
Event: Automated Software Engineering Conference 2022 - Rochester, United States of America
Duration: 10 Oct 2022 to 14 Oct 2022
Conference number: 37th
https://dl.acm.org/doi/proceedings/10.1145/3551349 (Proceedings)
https://ase-conferences.org/ (Website)

Conference

Conference: Automated Software Engineering Conference 2022
Abbreviated title: ASE 2022
Country/Territory: United States of America
City: Rochester
Period: 10/10/22 to 14/10/22

Keywords

  • Pre-Trained Models
  • Question Retrieval
  • Summarization
