SFE Latxa 8B & 70B (HABE-HiTZ C1 AES)
Model Details
- Model Names: SFE Latxa 8B, SFE Latxa 70B
- Base Models: Latxa 8B, Latxa 70B
- Language: Basque (eu)
- Task: Automatic Essay Scoring (AES) and Feedback Generation
- Proficiency Level Supported: CEFR C1
- Developers: Ekhi Azurmendi, Xabier Arregi, Oier Lopez de Lacalle (HiTZ Center, University of the Basque Country)
- Reference Paper: Automatic Essay Scoring and Feedback Generation in Basque Language Learning (accepted at LREC 2026)
Intended Use
These models are built for educational Natural Language Processing (NLP) in Basque. They are specifically designed to assess and provide feedback on learner essays at the C1 proficiency level.
- Automatic Essay Scoring (AES): Scores essays on the correctness criterion.
- Feedback Generation: Produces pedagogically grounded, natural-language feedback and extracts explicit erroneous sentences from the text to help learners understand their mistakes.
Training Data
The models were fine-tuned on the HABE-HiTZ C1 Dataset, which comprises 3,200 Basque essays written by language learners. The dataset includes rich annotations by expert evaluators:
- Correctness score (A through E)
- Natural language feedback for each criterion
- Identified learner errors (error examples) and their categorizations
Note: These models were fine-tuned only on the correctness criterion.
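As an illustration, a single annotated essay restricted to the correctness criterion might be represented as follows. This is a hypothetical sketch only; the field names and example values are illustrative, not the dataset's actual schema.

```python
# Hypothetical sketch of one HABE-HiTZ C1 record for the correctness
# criterion; field names and values are illustrative, not the real schema.
record = {
    "essay": "Euskara ikasten ari naiz ...",              # learner essay text
    "correctness_score": "B",                             # expert score, A through E
    "feedback": "Oro har zuzen idatzita dago, baina ...", # natural-language feedback
    "error_examples": [                                   # erroneous sentences + categories
        {"sentence": "Nik etorri naiz atzo.", "category": "case marking"},
    ],
}

# The score scale described in the card is A through E.
assert record["correctness_score"] in {"A", "B", "C", "D", "E"}
```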
Training Procedure & Hyperparameters
The models were adapted to the task using Supervised Fine-Tuning (SFT). The following hyperparameters were used for fine-tuning both the 8B and 70B variants:
| Hyperparameter | Value |
|---|---|
| Batch size | 64 |
| Learning Rate | 5e-6 |
| Weight Decay | 0.1 |
| Epochs | 10 |
| LR Decay | Cosine |
| Warmup ratio | 0.1 |
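The table above can be collected into a plain configuration dict, as sketched below. The key names are an assumption; how they are passed to the training stack (TRL, a custom loop, etc.) is not specified in this card.

```python
# SFT hyperparameters from the table above, gathered into a config dict.
# Key names are illustrative; the actual trainer interface is not specified.
sft_config = {
    "batch_size": 64,
    "learning_rate": 5e-6,
    "weight_decay": 0.1,
    "num_train_epochs": 10,
    "lr_scheduler_type": "cosine",  # cosine LR decay
    "warmup_ratio": 0.1,
}
```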
Prompt Engineering Configurations
The models were fine-tuned using various input/output structural configurations to predict Scores (S), Feedback (F), and Error-examples (E). Experimental results showed that generating the Score first, followed by Feedback and Error-examples (the SFE configuration), yielded superior performance in scoring and consistency.
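Because the SFE configuration emits the score first, followed by feedback and error examples, downstream code can split a generation into its three parts. A minimal parsing sketch, assuming hypothetical `Score:` / `Feedback:` / `Errors:` section markers (the actual output template used in training is not shown in this card):

```python
import re

def parse_sfe(generation: str) -> dict:
    """Split an SFE-ordered generation (Score -> Feedback -> Error examples).

    The 'Score:' / 'Feedback:' / 'Errors:' markers are an assumed template,
    not necessarily the one the model was fine-tuned with.
    """
    m = re.search(
        r"Score:\s*(?P<score>[A-E])\s*"
        r"Feedback:\s*(?P<feedback>.*?)\s*"
        r"Errors:\s*(?P<errors>.*)",
        generation,
        flags=re.DOTALL,
    )
    if m is None:
        raise ValueError("generation does not match the assumed SFE template")
    return {
        "score": m.group("score"),
        "feedback": m.group("feedback"),
        "error_examples": [e.strip() for e in m.group("errors").splitlines() if e.strip()],
    }

out = parse_sfe("Score: B\nFeedback: Testua ulergarria da.\nErrors:\n- Nik etorri naiz.")
```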
Evaluation Results
The fine-tuned Latxa models were evaluated against open-source encoder models (RoBERTa-EusCrawl) and SoTA closed-source systems (GPT-5, Claude Sonnet 4.5):
- Scoring Consistency: SFE Latxa 70B achieves high consistency (~96.84%) between the generated correctness score and the provided feedback. In contrast, GPT-5 and Claude Sonnet 4.5 showed poor alignment (44.07% and 78.46%, respectively).
- Error Extraction and Categorization: In manual evaluation, the SFE Latxa 70B model extracted erroneous sentences with a Fidelity Rate (FR) of 98.08%, an Extraction Accuracy (EA) of 66.19%, and a Categorization Accuracy (CA) of 71.63%.
- Qualitative Advantage: The SFE Latxa models identified a much broader and pedagogically useful range of error types across all grammatical and structural categories compared to proprietary models, which primarily focused on simple spelling and lexical errors.
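The scoring-consistency figures above measure agreement between the generated score and the score implied by the accompanying feedback. A toy sketch of such a metric follows; how the feedback-implied score is obtained (e.g., by a separate extraction or judging step) is an assumption and outside this sketch.

```python
def scoring_consistency(generated_scores, feedback_implied_scores):
    """Percentage of essays whose generated letter score (A-E) matches
    the score implied by the accompanying feedback.

    How feedback-implied scores are extracted is not covered here.
    """
    pairs = list(zip(generated_scores, feedback_implied_scores))
    if not pairs:
        return 0.0
    agree = sum(1 for g, f in pairs if g == f)
    return 100.0 * agree / len(pairs)

# Toy example: 3 of 4 essays agree, so consistency is 75.0
rate = scoring_consistency(["A", "B", "C", "E"], ["A", "B", "D", "E"])
```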
Limitations
- Language & Level Restriction: The models are highly specialized for Basque C1 essays; degraded performance is expected when evaluating other proficiency levels (e.g., A1-B2) or texts in other languages.
- Generative Hallucinations: Generative LLMs can occasionally hallucinate error examples or miscategorize mistakes.
Paper abstract
This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion-specific scores covering correctness, richness, coherence, cohesion, and task alignment, enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for scoring. We focus on the correctness criterion for explanation generation, adapting Latxa to correctly predict both scores and explanations. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.
Citation:
To cite our work, please use:
@misc{azurmendi2025automaticessayscoringfeedback,
  title={Automatic Essay Scoring and Feedback Generation in Basque Language Learning},
  author={Ekhi Azurmendi and Xabier Arregi and Oier Lopez de Lacalle},
  year={2025},
  eprint={2512.08713},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.08713},
}