Improved automated essay scoring using Gaussian Multi-Class SMOTE for dataset sampling

Jih Soong Tan, Ian K.T. Tan, Lay Ki Soon, Huey Fang Ong

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

1 Citation (Scopus)


Automated Essay Scoring (AES) research primarily focuses on feature engineering and on building machine learning models that attain higher agreement with human graders. In academic grading such as essay scoring, scores naturally follow a normal distribution, commonly known as the bell curve. However, the datasets used do not always follow this distribution, a mismatch that is often overlooked in machine learning settings. This paper proposes a Gaussian Multi-Class Synthetic Minority Over-sampling Technique (GMC-SMOTE) for imbalanced datasets. GMC-SMOTE generates new synthetic data to complement the existing datasets so that the scores follow a normal distribution. Across several labeled essay sets, some of which already show substantial agreement between the machine learning model and human graders, learning from normally distributed datasets yields significant improvements. An improvement of 0.038 in quadratic weighted kappa (QWK), or 5.8%, over the imbalanced dataset was observed. The experimental results also show that the naturally occurring distribution in the automated essay scoring domain yields the most appropriate training dataset for machine learning purposes.
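The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of the general idea, not the authors' GMC-SMOTE algorithm. It assumes integer scores 0..n_classes-1, a bell curve centred on the middle score with a standard deviation of n_classes / 6, and plain SMOTE-style interpolation between same-class neighbours; the function names and all of these parameter choices are hypothetical.

```python
import math
import random
from collections import Counter


def gaussian_target_counts(labels, n_classes):
    """Per-class target sizes that follow a discretised bell curve.

    Assumption: mean at the middle score, sigma = n_classes / 6.
    The curve is scaled up until no class would need to shrink,
    so the result can be reached by oversampling alone.
    """
    counts = Counter(labels)
    mu, sigma = (n_classes - 1) / 2, n_classes / 6
    weights = [math.exp(-0.5 * ((s - mu) / sigma) ** 2) for s in range(n_classes)]
    scale = max(counts[s] / weights[s] for s in range(n_classes))
    return [max(counts[s], math.ceil(weights[s] * scale)) for s in range(n_classes)]


def smote_class(points, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a random
    sample and one of its k nearest neighbours from the same class."""
    if n_new <= 0 or len(points) < 2:
        return []  # nothing to add, or too few samples to interpolate
    k = min(k, len(points) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        others = sorted(
            (p for p in points if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(others)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def gaussian_oversample(X, y, n_classes, k=5, seed=0):
    """Append synthetic samples so class sizes approximate a bell curve."""
    rng = random.Random(seed)
    targets = gaussian_target_counts(y, n_classes)
    X_out, y_out = list(X), list(y)
    for c in range(n_classes):
        points = [x for x, label in zip(X, y) if label == c]
        new = smote_class(points, targets[c] - len(points), k, rng)
        X_out.extend(new)
        y_out.extend([c] * len(new))
    return X_out, y_out
```

In this sketch X is a list of numeric feature vectors (for example, engineered essay features) and y the matching list of scores. Original samples are kept unchanged and only synthetic ones are appended, which mirrors the abstract's description of complementing the existing data rather than replacing it.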

Original language: English
Title of host publication: Proceedings of the 15th International Conference on Educational Data Mining
Editors: Antonija Mitrovic, Nigel Bosch
Place of Publication: Durham UK
Publisher: International Educational Data Mining Society
Number of pages: 5
ISBN (Electronic): 9781733673631
Publication status: Published - 2022
Event: Educational Data Mining 2022 - Durham, United Kingdom
Duration: 24 Jul 2022 to 27 Jul 2022
Conference number: 15th


Conference: Educational Data Mining 2022
Abbreviated title: EDM 2022
Country/Territory: United Kingdom


  • Automated Essay Scoring
  • Boosting
  • Data Pre-Processing
  • Data Sampling
  • Gaussian Distribution
