Abstract
Automated Essay Scoring (AES) research primarily focuses on feature engineering and on building machine learning models that achieve higher agreement with human graders. In academic grading such as essay scoring, scores naturally follow a normal distribution, commonly referred to as the bell curve. However, the datasets used do not always follow this distribution, an issue that is often overlooked in machine learning settings. This paper proposes a Gaussian Multi-Class Synthetic Minority Over-sampling Technique (GMC-SMOTE) for imbalanced datasets. GMC-SMOTE generates synthetic data to complement an existing dataset so that its scores follow a normal distribution. Across several labeled essay sets, some of which already show substantial agreement between the machine learning model and human graders, learning from normally distributed datasets yields significant improvements: a gain of 0.038 in Quadratic Weighted Kappa (QWK) score (5.8%) over the imbalanced dataset was observed. The experimental results also show that the naturally occurring distribution in the automated essay scoring domain provides the most appropriate training dataset for machine learning purposes.
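The idea of oversampling score classes until their counts follow a bell curve can be illustrated with a minimal sketch. This is not the authors' GMC-SMOTE implementation, whose details are not given in the abstract. It assumes a Gaussian centred on the midpoint of the score range with a standard deviation of a quarter of that range, never down-samples any class, and uses plain SMOTE-style interpolation between same-class neighbours. The function names `gaussian_targets`, `smote_sample`, and `oversample_to_gaussian` are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def gaussian_targets(labels):
    """Per-class target counts shaped like a Gaussian over the score values.

    Assumption (not from the paper): mean = midpoint of the score range,
    std = range / 4, and the curve is scaled up so no class loses samples.
    """
    counts = Counter(labels)
    classes = sorted(counts)
    lo, hi = classes[0], classes[-1]
    mean = (lo + hi) / 2
    std = max((hi - lo) / 4, 1e-9)
    weight = {c: math.exp(-0.5 * ((c - mean) / std) ** 2) for c in classes}
    # Smallest scale at which every class target is >= its current count.
    scale = max(counts[c] / weight[c] for c in classes)
    return {c: max(counts[c], round(scale * weight[c])) for c in classes}


def smote_sample(points, k, rng):
    """Interpolate between a random point and one of its k nearest neighbours."""
    base = rng.choice(points)
    neighbours = sorted(
        (p for p in points if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    if not neighbours:  # a single-sample class can only be duplicated
        return list(base)
    other = rng.choice(neighbours)
    gap = rng.random()
    return [b + gap * (o - b) for b, o in zip(base, other)]


def oversample_to_gaussian(X, y, k=5, seed=0):
    """Append synthetic samples until class counts match the Gaussian targets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(list(features))
    X_out, y_out = [list(f) for f in X], list(y)
    for label, target in gaussian_targets(y).items():
        for _ in range(target - len(by_class[label])):
            X_out.append(smote_sample(by_class[label], k, rng))
            y_out.append(label)
    return X_out, y_out


# Toy feature vectors and scores 0..4, purely illustrative data.
X = [[float(i), float(i % 3)] for i in range(20)]
y = [0] * 2 + [1] * 3 + [2] * 10 + [3] * 3 + [4] * 2
X_new, y_new = oversample_to_gaussian(X, y)
print(sorted(Counter(y_new).items()))
```

Scaling the curve upward rather than normalising it keeps every original essay in the training set, so only synthetic samples are added. Whether the paper makes the same choice is not stated in the abstract.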
Original language | English |
---|---|
Title of host publication | Proceedings of the 15th International Conference on Educational Data Mining |
Editors | Antonija Mitrovic, Nigel Bosch |
Place of Publication | Durham UK |
Publisher | International Educational Data Mining Society |
Number of pages | 5 |
ISBN (Electronic) | 9781733673631 |
Publication status | Published - 2022 |
Event | Educational Data Mining 2022, Durham, United Kingdom. Duration: 24 Jul 2022 → 27 Jul 2022. Conference number: 15th. Website: https://educationaldatamining.org/edm2022/. Proceedings: https://educationaldatamining.org/edm2022/proceedings/ |
Conference
Conference | Educational Data Mining 2022 |
---|---|
Abbreviated title | EDM 2022 |
Country/Territory | United Kingdom |
City | Durham |
Period | 24/07/22 → 27/07/22 |
Keywords
- Automated Essay Scoring
- Boosting
- Data Pre-Processing
- Data Sampling
- Gaussian Distribution