TY - JOUR
T1 - Generate, annotate, and learn: NLP with synthetic text
AU - He, Xuanli
AU - Nassar, Islam
AU - Kiros, Jamie
AU - Haffari, Gholamreza
AU - Norouzi, Mohammad
N1 - Funding Information:
We would like to thank the anonymous reviewers and action editor André F.T. Martins for their comments and suggestions on this work. The computational resources of this work are partly supported by the Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE) (www.massive.org.au). This material is partly based on research sponsored by Air Force Research Laboratory and DARPA under agreement number FA8750-19-2-0501. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.
Publisher Copyright:
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
PY - 2022
Y1 - 2022
N2 - This paper studies the use of language models as a source of synthetic unlabeled text for NLP. We formulate a general framework called "generate, annotate, and learn (GAL)" to take advantage of synthetic text within knowledge distillation, self-training, and few-shot learning applications. To generate high-quality task-specific text, we either fine-tune LMs on inputs from the task of interest, or prompt large LMs with few examples. We use the best available classifier to annotate synthetic text with soft pseudo labels for knowledge distillation and self-training, and use LMs to obtain hard labels for few-shot learning. We train new supervised models on the combination of labeled and pseudo-labeled data, which results in significant gains across several applications. We investigate key components of GAL and present theoretical and empirical arguments against the use of class-conditional LMs to generate synthetic labeled text instead of unlabeled text. GAL achieves new state-of-the-art knowledge distillation results for 6-layer transformers on the GLUE leaderboard.
UR - http://www.scopus.com/inward/record.url?scp=85138749499&partnerID=8YFLogxK
U2 - 10.1162/tacl_a_00492
DO - 10.1162/tacl_a_00492
M3 - Article
AN - SCOPUS:85138749499
SN - 2307-387X
VL - 10
SP - 826
EP - 842
JO - Transactions of the Association for Computational Linguistics
JF - Transactions of the Association for Computational Linguistics
ER -
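Illustrative note (not part of the bibliographic record): the abstract describes the GAL pipeline as (1) generate task-specific unlabeled text with an LM, (2) annotate it with the best available classifier using soft pseudo labels, and (3) train a new model on the combined labeled and pseudo-labeled data. The sketch below is a minimal, self-contained toy rendering of that loop. The featurizer, the tiny softmax classifier, and the hard-coded "synthetic" sentences are assumptions made purely for demonstration; the paper itself fine-tunes large LMs and transformer classifiers.

```python
# Toy sketch of a GAL-style generate/annotate/learn loop, reconstructed
# from the abstract only. All names and data here are illustrative
# assumptions, not the authors' implementation.
import numpy as np

def featurize(texts, dim=32):
    # Stand-in for a real text encoder: hash tokens into bag-of-words counts.
    X = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            X[i, hash(tok) % dim] += 1.0
    return X

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_classifier(X, Y_soft, lr=0.5, steps=500):
    # Minimize cross-entropy against (possibly soft) target distributions.
    W = np.zeros((X.shape[1], Y_soft.shape[1]))
    for _ in range(steps):
        P = softmax(X @ W)
        W -= lr * X.T @ (P - Y_soft) / len(X)
    return W

# 1) Small labeled set (toy binary sentiment task).
labeled_texts = ["great film loved it", "terrible plot awful acting",
                 "wonderful and moving", "boring and dull"]
Y = np.eye(2)[np.array([1, 0, 1, 0])]
X = featurize(labeled_texts)

# 2) "Generate": in GAL these would be sampled from an LM fine-tuned on
#    task inputs (or prompted with a few examples); here they are hard-coded.
synthetic_texts = ["loved the wonderful acting", "awful boring film",
                   "great moving plot", "dull terrible film"]
X_syn = featurize(synthetic_texts)

# 3) "Annotate": the best available classifier (teacher) assigns soft
#    pseudo labels to the synthetic text.
W_teacher = train_softmax_classifier(X, Y)
soft_pseudo = softmax(X_syn @ W_teacher)

# 4) "Learn": train a new (student) model on labeled + pseudo-labeled data.
X_all = np.vstack([X, X_syn])
Y_all = np.vstack([Y, soft_pseudo])
W_student = train_softmax_classifier(X_all, Y_all)

print("teacher soft labels:", np.round(soft_pseudo, 2))
print("student predictions:", softmax(X @ W_student).argmax(axis=1))
```

In the same spirit, the paper's few-shot variant replaces step 3 with hard labels obtained from the LM itself; only the annotation source changes, while the final supervised training on labeled plus pseudo-labeled data stays the same.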