Abstract
In this chapter, we provide an overview of the main concepts relating to corpus annotation, along with some discussion of the practical aspects of creating annotated texts and working with them. Our overview is restricted to automatic annotation of electronic text, which is the most common kind of annotation in the context of contemporary corpus linguistics. We focus on the annotation of texts which typically follow established orthographic principles and consider the following four main types of annotation, using English for the purposes of illustration: (1) part-of-speech (POS) tagging, (2) lemmatization, (3) syntactic parsing, and (4) semantic annotation. The accuracy of annotation is a key factor in any evaluation of annotation schemes and we discuss annotation accuracy, including precision and recall measures. Finally, we briefly consider newer developments in two broad areas: the annotation of multimodal corpora and the annotation of Indigenous and endangered language materials. Both of these developments reflect changing priorities on the part of linguistic researchers, and both present significant challenges when it comes to automated annotation.
| Original language | English |
|---|---|
| Title of host publication | A Practical Handbook of Corpus Linguistics |
| Editors | Magali Paquot, Stefan Th. Gries |
| Place of Publication | Cham, Switzerland |
| Publisher | Springer |
| Chapter | 2 |
| Pages | 25-48 |
| Number of pages | 24 |
| Edition | 1st |
| ISBN (Electronic) | 9783030462161 |
| ISBN (Print) | 9783030462154 |
| Publication status | Published - 2020 |
Keywords
- Linguistics
- Annotation
- Parsing
- Corpus linguistics