Towards automated content analysis of rhetorical structure of written essays using sequential content-independent features in Portuguese

Rafael Ferreira Mello, Giuseppe Fiorentino, Hilário Oliveira, Péricles Miranda, Mladen Rakovic, Dragan Gasevic

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review


Brazilian universities have included essay writing assignments in the entrance examination procedure to select prospective students. The essay scorers manually look for the presence of required Rhetorical Structure Theory (RST) categories and evaluate essay coherence. However, identifying RST categories is a time-consuming task. The literature reports several attempts to automate the identification of RST categories in essays with machine learning. Still, previous studies have focused on machine learning algorithms trained on content-dependent features, which can diminish classification performance, leading to over-fitting and hindering model generalisability. Therefore, this paper proposes: (i) an analysis of state-of-the-art classifiers and content-independent features applied to the task of identifying RST rhetorical moves; (ii) a new approach that considers the sequence of the text to extract features, i.e. sequential content-independent features; (iii) an empirical study of the generalisability of the machine learning models and sequential content-independent features in this context; (iv) the identification of the most predictive features for automated identification of RST categories in essays written in Portuguese. The best performing classifier, XGBoost, based on sequential content-independent features, outperformed the classifiers used in the literature, which are based on traditional content-dependent features. The XGBoost classifier based on sequential content-independent features also reached promising accuracy when tested for generalisability.

Original language: English
Title of host publication: LAK22 Conference Proceedings
Editors: Alyssa Friend Wise, Roberto Martinez-Maldonado, Isabel Hilliger
Place of Publication: New York NY USA
Publisher: Association for Computing Machinery (ACM)
Number of pages: 11
ISBN (Electronic): 9781450395731
Publication status: Published - 2022
Event: International Conference on Learning Analytics and Knowledge 2022: Learning Analytics for Transition, Disruption and Social Change - Online, United States of America
Duration: 21 Mar 2022 - 25 Mar 2022
Conference number: 12th (Proceedings)


Conference: International Conference on Learning Analytics and Knowledge 2022
Abbreviated title: LAK 2022
Country/Territory: United States of America


  • content analytics
  • context analysis
  • essay analysis
  • natural language processing
  • rhetoric structure
