Corpus annotation

John Newman, Christopher Cox

Research output: Chapter in Book/Report/Conference proceeding > Chapter (Book) > Other > peer-review

Abstract

In this chapter, we provide an overview of the main concepts relating to corpus annotation, along with some discussion of the practical aspects of creating annotated texts and working with them. Our overview is restricted to automatic annotation of electronic text, which is the most common kind of annotation in the context of contemporary corpus linguistics. We focus on the annotation of texts which typically follow established orthographic principles and consider the following four main types of annotation, using English for the purposes of illustration: (1) part-of-speech (POS) tagging, (2) lemmatization, (3) syntactic parsing, and (4) semantic annotation. The accuracy of annotation is a key factor in any evaluation of annotation schemes and we discuss annotation accuracy, including precision and recall measures. Finally, we briefly consider newer developments in two broad areas: the annotation of multimodal corpora and the annotation of Indigenous and endangered language materials. Both of these developments reflect changing priorities on the part of linguistic researchers, and both present significant challenges when it comes to automated annotation.
Original language: English
Title of host publication: A Practical Handbook of Corpus Linguistics
Editors: Magali Paquot, Stefan Th. Gries
Place of Publication: Cham, Switzerland
Publisher: Springer
Chapter: 2
Pages: 25-48
Number of pages: 24
Edition: 1st
ISBN (Electronic): 9783030462161
ISBN (Print): 9783030462154
Publication status: Published - 2020

Keywords

  • Linguistics
  • annotation
  • parsing
  • Corpus linguistics
