SFE Latxa 8B & 70B (HABE-HiTZ C1 AES)
Model Details
- Model Names: SFE Latxa 8B, SFE Latxa 70B
- Base Models: Latxa 8B, Latxa 70B
- Language: Basque (eu)
- Task: Automatic Essay Scoring (AES) and Feedback Generation
- Proficiency Level Supported: CEFR C1
- Developers: Ekhi Azurmendi, Xabier Arregi, Oier Lopez de Lacalle (HiTZ Center, University of the Basque Country)
- Reference Paper: Automatic Essay Scoring and Feedback Generation in Basque Language Learning (accepted at LREC 2026)
Intended Use
These models are built for educational Natural Language Processing (NLP) in Basque. They are specifically designed to assess and provide feedback on learner essays at the C1 proficiency level.
- Automatic Essay Scoring (AES): Scores essays on the correctness criterion.
- Feedback Generation: Produces pedagogically grounded, natural-language feedback and extracts explicit erroneous sentences from the text to help learners understand their mistakes.
Training Data
The models were fine-tuned on the HABE-HiTZ C1 Dataset, which comprises 3,200 Basque essays written by language learners. The dataset includes rich annotations by expert evaluators:
- Correctness score (A through E)
- Natural language feedback for each criterion
- Identified learner errors (error examples) and their categorizations
Note: These models were fine-tuned only on the correctness criterion.
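As an illustration, a single annotated essay restricted to the correctness criterion might be represented as follows. This is a hypothetical sketch only; the field names and example values are illustrative, not the dataset's actual schema.

```python
# Hypothetical sketch of one HABE-HiTZ C1 record for the correctness
# criterion; field names and values are illustrative, not the real schema.
record = {
    "essay": "Euskara ikasten ari naiz ...",              # learner essay text
    "correctness_score": "B",                             # expert score, A through E
    "feedback": "Oro har zuzen idatzita dago, baina ...", # natural-language feedback
    "error_examples": [                                   # erroneous sentences + categories
        {"sentence": "Nik etorri naiz atzo.", "category": "case marking"},
    ],
}

# The score scale described in the card is A through E.
assert record["correctness_score"] in {"A", "B", "C", "D", "E"}
```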
Training Procedure & Hyperparameters
The models were adapted to the task using Supervised Fine-Tuning (SFT). The following hyperparameters were used for fine-tuning both the 8B and 70B variants:
| Hyperparameter | Value |
|---|---|
| Batch size | 64 |
| Learning Rate | 5e-6 |
| Weight Decay | 0.1 |
| Epochs | 10 |
| LR Decay | Cosine |
| Warmup ratio | 0.1 |
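The table above can be collected into a plain configuration dict, as sketched below. The key names are an assumption; how they are passed to the training stack (TRL, a custom loop, etc.) is not specified in this card.

```python
# SFT hyperparameters from the table above, gathered into a config dict.
# Key names are illustrative; the actual trainer interface is not specified.
sft_config = {
    "batch_size": 64,
    "learning_rate": 5e-6,
    "weight_decay": 0.1,
    "num_train_epochs": 10,
    "lr_scheduler_type": "cosine",  # cosine LR decay
    "warmup_ratio": 0.1,
}
```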
Prompt Engineering Configurations
The models were fine-tuned using various input/output structural configurations to predict Scores (S), Feedback (F), and Error-examples (E). Experimental results showed that generating the Score first, followed by Feedback and Error-examples (the SFE configuration), yielded superior performance in scoring and consistency.
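Because the SFE configuration emits the score first, followed by feedback and error examples, downstream code can split a generation into its three parts. A minimal parsing sketch, assuming hypothetical `Score:` / `Feedback:` / `Errors:` section markers (the actual output template used in training is not shown in this card):

```python
import re

def parse_sfe(generation: str) -> dict:
    """Split an SFE-ordered generation (Score -> Feedback -> Error examples).

    The 'Score:' / 'Feedback:' / 'Errors:' markers are an assumed template,
    not necessarily the one the model was fine-tuned with.
    """
    m = re.search(
        r"Score:\s*(?P<score>[A-E])\s*"
        r"Feedback:\s*(?P<feedback>.*?)\s*"
        r"Errors:\s*(?P<errors>.*)",
        generation,
        flags=re.DOTALL,
    )
    if m is None:
        raise ValueError("generation does not match the assumed SFE template")
    return {
        "score": m.group("score"),
        "feedback": m.group("feedback"),
        "error_examples": [e.strip() for e in m.group("errors").splitlines() if e.strip()],
    }

out = parse_sfe("Score: B\nFeedback: Testua ulergarria da.\nErrors:\n- Nik etorri naiz.")
```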
Evaluation Results
The fine-tuned Latxa models were evaluated against open-source encoder models (RoBERTa-EusCrawl) and SoTA closed-source systems (GPT-5, Claude Sonnet 4.5):
- Scoring Consistency: SFE Latxa 70B achieves high consistency (~96.84%) between the generated correctness score and the provided feedback. In contrast, GPT-5 and Claude Sonnet 4.5 showed poor alignment (44.07% and 78.46%, respectively).
- Error Extraction and Categorization: In manual evaluation, the SFE Latxa 70B model extracted erroneous sentences with a Fidelity Rate (FR) of 98.08%, an Extraction Accuracy (EA) of 66.19%, and a Categorization Accuracy (CA) of 71.63%.
- Qualitative Advantage: The SFE Latxa models identified a much broader and pedagogically useful range of error types across all grammatical and structural categories compared to proprietary models, which primarily focused on simple spelling and lexical errors.
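The scoring-consistency figures above measure agreement between the generated score and the score implied by the accompanying feedback. A toy sketch of such a metric follows; how the feedback-implied score is obtained (e.g., by a separate extraction or judging step) is an assumption and outside this sketch.

```python
def scoring_consistency(generated_scores, feedback_implied_scores):
    """Percentage of essays whose generated letter score (A-E) matches
    the score implied by the accompanying feedback.

    How feedback-implied scores are extracted is not covered here.
    """
    pairs = list(zip(generated_scores, feedback_implied_scores))
    if not pairs:
        return 0.0
    agree = sum(1 for g, f in pairs if g == f)
    return 100.0 * agree / len(pairs)

# Toy example: 3 of 4 essays agree, so consistency is 75.0
rate = scoring_consistency(["A", "B", "C", "E"], ["A", "B", "D", "E"])
```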
Limitations
- Language & Level Restriction: The models are highly specialized for Basque C1 essays; degraded performance is expected when evaluating other proficiency levels (e.g., A1-B2) or texts in other languages.
- Generative Hallucinations: Generative LLMs can occasionally hallucinate error examples or miscategorize mistakes.
Paper abstract
This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion-specific scores covering correctness, richness, coherence, cohesion, and task alignment, enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for scoring. We focus on the correctness criterion for explanation generation, adapting Latxa to correctly predict both scores and explanations. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.
Citation:
To cite our work, please use:
@misc{azurmendi2025automaticessayscoringfeedback,
  title={Automatic Essay Scoring and Feedback Generation in Basque Language Learning},
  author={Ekhi Azurmendi and Xabier Arregi and Oier Lopez de Lacalle},
  year={2025},
  eprint={2512.08713},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.08713},
}