EVT Item Classifier — Expectancy-Value Theory Construct Tagger

A fine-tuned text classifier that assigns psychological test items (e.g., survey questions, scale items) to one of six categories derived from Expectancy-Value Theory (Eccles et al., 1983; Eccles & Wigfield, 2002):

| Label | EVT Construct | Example item |
| --- | --- | --- |
| ATTAINMENT_VALUE | Personal importance of doing well | "Doing well in math is important to who I am." |
| COST | Perceived negative consequences of engagement | "I have to give up too much to succeed in this class." |
| EXPECTANCY | Beliefs about future success | "I am confident I can master the skills taught in this course." |
| INTRINSIC_VALUE | Enjoyment and interest | "I find the content of this course very interesting." |
| UTILITY_VALUE | Usefulness for future goals | "What I learn in this class will be useful for my career." |
| OTHER | Not classifiable as an EVT construct | "I usually sit in the front row." |

Intended Use

This model is intended for academic research in educational psychology, motivation science, and psychometrics. Typical use cases include:

  • Automated content analysis of existing item pools and questionnaire banks for EVT construct coverage
  • Scale development assistance — screening candidate items during the item-writing phase
  • Systematic reviews — coding large corpora of test items from published instruments
  • Construct validation — checking whether items align with their intended EVT construct

Out-of-Scope Uses

  • Clinical or diagnostic decision-making — This model classifies test items, not respondents. It should not be used to assess individuals.
  • Replacement for human coding — The model achieves substantial but not perfect agreement with human coders (κ = .77). It is intended as a tool to assist researchers, not to replace expert judgment.
  • Non-English items — The model was trained and evaluated on English-language items only.
  • Non-EVT constructs — The model will force any input into one of the six categories. Items from unrelated theoretical frameworks (e.g., Big Five personality, cognitive load) will be classified as OTHER at best, or spuriously assigned to an EVT category.

How to Use

Quick Start

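from transformers import AutoTokenizer, AutoModel
import torch

THRESHOLD = 0.50  # default decision threshold

# Load tokenizer and model (the custom head requires trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("christiqn/mpnet-EVT-classifier")
model = AutoModel.from_pretrained("christiqn/mpnet-EVT-classifier", trust_remote_code=True)
model.eval()

# Inference
text = "I expect to do well in this course."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    # One independent sigmoid probability per EVT construct
    probs = torch.sigmoid(outputs.logits)[0]

# OTHER is not a learned output: assign it when no construct exceeds the threshold.
# Assumes the logits follow the construct order in model.config.id2label.
best_prob, best_idx = probs.max(dim=0)
label = model.config.id2label[best_idx.item()] if best_prob.item() > THRESHOLD else "OTHER"
print(f"Predicted: {label} (p = {best_prob.item():.2f})")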

Adjusting the Decision Threshold

The default threshold of 0.50 balances precision and recall. You can adjust it:

  • Lower threshold (e.g., 0.35): More inclusive — fewer items classified as OTHER, higher recall on EVT constructs, but more false positives.
  • Higher threshold (e.g., 0.65): More conservative — more items fall to OTHER, higher precision on EVT constructs, but more false negatives.

Choose based on your application: use a lower threshold for exploratory screening, a higher one when precision matters.
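The effect of the threshold can be sketched with a small decision function. This is a minimal illustration, not the model's packaged inference code: `assign_label`, the construct order in `CONSTRUCTS`, and the example probabilities are all hypothetical.

```python
# Assumed construct order; check model.config.id2label for the real mapping.
CONSTRUCTS = ["ATTAINMENT_VALUE", "COST", "EXPECTANCY", "INTRINSIC_VALUE", "UTILITY_VALUE"]


def assign_label(probs, threshold=0.50):
    """Return the highest-scoring construct, or OTHER if none exceeds the threshold."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return CONSTRUCTS[best] if probs[best] > threshold else "OTHER"


# Hypothetical sigmoid outputs for one ambiguous item
probs = [0.05, 0.10, 0.42, 0.08, 0.12]

for t in (0.35, 0.50, 0.65):
    print(t, assign_label(probs, threshold=t))
```

With these made-up probabilities, the item is assigned to EXPECTANCY at the lower threshold and falls to OTHER at the default and higher thresholds.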

Model Description

Architecture

The model uses a custom classification head on top of sentence-transformers/all-mpnet-base-v2 (110M parameters):

MPNet encoder
    ↓
ConcatPooling (mean + max over token embeddings → 2 × hidden_size)
    ↓
Dropout(0.2) → Dense(2h → h) → GELU → Dropout(0.2)
    ↓
NormLinear head (cosine similarity classifier, τ = 20)
    ↓
5 independent sigmoid outputs (one per EVT construct)

The OTHER class is not explicitly learned. Instead, an item is classified as OTHER when no construct's sigmoid probability exceeds the decision threshold (default: 0.50). This One-vs-Rest (OvR) formulation avoids forcing the model to learn a coherent representation for the heterogeneous OTHER category.
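To illustrate the last two stages of the diagram, the sketch below computes cosine-similarity logits scaled by τ = 20 and passes them through a sigmoid. It is a pure-Python illustration under assumptions: the function names, the toy 3-dimensional vectors, and the absence of a bias term are not taken from the released code.

```python
import math

TAU = 20.0  # scale factor from the architecture description


def l2_normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def norm_linear(features, class_weights, tau=TAU):
    """Cosine similarity between the features and each class weight vector, scaled by tau."""
    f = l2_normalise(features)
    return [tau * sum(a * b for a, b in zip(f, l2_normalise(w))) for w in class_weights]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Toy example: one feature vector and two hypothetical class weight vectors
features = [2.0, 0.0, 0.0]
class_weights = [
    [1.0, 0.0, 0.0],  # same direction as the features: cosine = 1
    [0.0, 3.0, 0.0],  # orthogonal to the features: cosine = 0
]
logits = norm_linear(features, class_weights)
probs = [sigmoid(z) for z in logits]
print(logits, probs)
```

Because both vectors are normalised, rescaling the features leaves the logits unchanged; only their direction matters.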

Key Design Choices

| Component | Rationale |
| --- | --- |
| ConcatPooling | Concatenating mean and max pooling captures both distributional and salient features from short text (Howard & Ruder, 2018) |
| NormLinear (cosine) head | L2-normalised features and weights make classification direction-based, improving robustness to domain shift (Wang et al., 2018) |
| Asymmetric Loss (γ_neg=4, γ_pos=1) | Down-weights easy negatives in the multi-label sigmoid setup, preventing the dominant negative class from overwhelming the loss (Ridnik et al., 2021) |
| FGM adversarial training | Embedding-space perturbations regularise the model, improving generalisation from synthetic to real items (Miyato et al., 2017) |
| LLRD (decay = 0.9) | Lower layers receive smaller learning rates, preserving pre-trained linguistic knowledge (Sun et al., 2019) |
| Gradual unfreezing | Head-only training for epoch 1, then full end-to-end fine-tuning, to prevent catastrophic forgetting (Howard & Ruder, 2018) |
| SWA | Averaging top-3 checkpoints produces a flatter minimum and marginally improves κ (Izmailov et al., 2018) |
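The down-weighting of easy negatives can be illustrated with a simplified form of the asymmetric focusing terms from Ridnik et al. (2021), applied to a single sigmoid output with γ_neg = 4 and γ_pos = 1. This is a sketch only: the probability-margin (clipping) component of the published loss is omitted, the function name is hypothetical, and the actual training code may differ.

```python
import math


def asymmetric_loss(p, y, gamma_pos=1.0, gamma_neg=4.0, eps=1e-8):
    """Simplified asymmetric loss for one sigmoid probability p and binary target y."""
    if y == 1:
        # Positive term: (1 - p)^gamma_pos * -log(p)
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Negative term: p^gamma_neg * -log(1 - p); small p shrinks the loss sharply
    return -(p ** gamma_neg) * math.log(max(1.0 - p, eps))


# Compare an easy negative (p = 0.1, y = 0) with plain binary cross-entropy
easy_negative = asymmetric_loss(0.1, 0)
plain_bce = -math.log(1.0 - 0.1)
print(easy_negative, plain_bce)
```

With γ_neg = 4, a negative predicted at p = 0.1 is scaled by 0.1⁴, so it contributes roughly four orders of magnitude less than it would under plain cross-entropy.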

Training

Training Data

The model was fine-tuned on the expectancy_value_pool_v2 dataset — approximately 15,000 synthetic psychological test items generated with construct-level attribution, filtered to remove participial-phrase fragments. The data was split 85/7.5/7.5 into train/validation/test sets with stratified sampling.

Training Procedure

  • Epochs: 12 (with early stopping, patience = 4, monitored on validation κ)
  • Batch size: 24 per device × 2 gradient accumulation steps = 48 effective
  • Optimizer: AdamW (lr = 3e-5, weight decay = 0.01)
  • Scheduler: Linear with 10% warmup
  • Precision: bf16 mixed-precision
  • Hardware: Single NVIDIA H100 GPU
  • Post-training: Stochastic Weight Averaging of top-3 checkpoints
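How the base learning rate, the layer-wise decay, and the warmup schedule interact can be sketched as follows. The assumptions are substantial: 12 encoder layers (the MPNet base configuration), the top encoder layer trained at the base rate, the decay applied once per layer going down, and linear decay to zero after warmup. The function names are hypothetical and the actual training script may organise this differently.

```python
BASE_LR = 3e-5
DECAY = 0.9
NUM_LAYERS = 12  # assumption: MPNet base encoder depth


def layer_lr(layer_index, base_lr=BASE_LR, decay=DECAY, num_layers=NUM_LAYERS):
    """Learning rate for an encoder layer (0 = lowest, num_layers - 1 = top)."""
    return base_lr * decay ** (num_layers - 1 - layer_index)


def lr_multiplier(step, total_steps, warmup_fraction=0.10):
    """Linear warmup to 1.0 over the first 10% of steps, then linear decay to 0.0."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Per-layer learning rates, lowest layer first
print([f"{layer_lr(i):.2e}" for i in range(NUM_LAYERS)])
# Schedule multiplier halfway through warmup, at its end, and at the last step
print(lr_multiplier(50, 1000), lr_multiplier(100, 1000), lr_multiplier(1000, 1000))
```

Under these assumptions the lowest layer trains at roughly a third of the base rate (0.9¹¹ ≈ 0.31).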

Training Results (Held-Out Synthetic Test Set)

| Metric | Value |
| --- | --- |
| Accuracy | 0.990 |
| Macro F1 | 0.990 |
| Cohen's κ | 0.990 |

Note: These near-perfect numbers reflect performance on the synthetic data distribution. Real-world performance on human-authored items is reported below.

Evaluation

Test Set

The model was evaluated on N = 1,284 human-coded test items from christiqn/EVT-items, drawn from published and unpublished psychological instruments spanning 95 distinct scales. Items were independently coded by a human expert and by the model. Items marked as formatting artefacts, third-person phrasing, or multiple anchors were excluded prior to evaluation. All confidence intervals are bias-corrected and accelerated (BCa) bootstrap CIs with B = 10,000.

Agreement with Human Coder (Full 6-Class)

| Metric | Value | 95% BCa CI |
| --- | --- | --- |
| Cohen's κ (unweighted) | 0.767 | [0.741, 0.793] |
| Cohen's κ (linear weighted) | 0.768 | [0.735, 0.797] |
| Krippendorff's α | 0.767 | [0.740, 0.793] |
| PABAK | 0.772 | |
| Overall accuracy | 0.810 | [0.787, 0.830] |
| Macro F1 | 0.812 | [0.790, 0.834] |
| Weighted F1 | 0.807 | [0.785, 0.828] |

Cohen's κ = .77 indicates substantial agreement according to Landis & Koch (1977).

Per-Class Performance (6-Class)

| Class | Precision [95% BCa CI] | Recall [95% BCa CI] | F1 [95% BCa CI] | κ (OvR) | n |
| --- | --- | --- | --- | --- | --- |
| ATTAINMENT_VALUE | .800 [.726, .868] | .865 [.798, .926] | .831 [.775, .880] | .815 | 111 |
| COST | .795 [.732, .854] | .859 [.800, .912] | .826 [.778, .869] | .802 | 149 |
| EXPECTANCY | .843 [.801, .881] | .886 [.850, .921] | .864 [.834, .892] | .820 | 308 |
| INTRINSIC_VALUE | .839 [.793, .883] | .893 [.852, .931] | .865 [.831, .896] | .834 | 234 |
| OTHER | .726 [.668, .780] | .636 [.580, .691] | .678 [.631, .722] | .595 | 283 |
| UTILITY_VALUE | .846 [.793, .897] | .774 [.716, .831] | .808 [.764, .849] | .775 | 199 |

Confusion Matrix (6-Class)

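Rows are human-coded labels and columns are model predictions (AV = ATTAINMENT_VALUE, CO = COST, EX = EXPECTANCY, IV = INTRINSIC_VALUE, OT = OTHER, UV = UTILITY_VALUE).

| | Pred: AV | Pred: CO | Pred: EX | Pred: IV | Pred: OT | Pred: UV |
| --- | --- | --- | --- | --- | --- | --- |
| True: AV | 96 | 1 | 1 | 5 | 4 | 4 |
| True: CO | 1 | 128 | 1 | 10 | 7 | 2 |
| True: EX | 0 | 11 | 273 | 5 | 18 | 1 |
| True: IV | 4 | 4 | 0 | 209 | 15 | 2 |
| True: OT | 13 | 14 | 40 | 17 | 180 | 19 |
| True: UV | 6 | 3 | 9 | 3 | 24 | 154 |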

Cramér's V = 0.782 (6-class).
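The headline agreement figures can be recomputed directly from the confusion matrix. The sketch below uses the standard formulas for accuracy, Cohen's κ, PABAK, and per-class precision and recall; the function and variable names are illustrative, and the bootstrap confidence intervals are not reproduced.

```python
LABELS = ["AV", "CO", "EX", "IV", "OT", "UV"]
# Rows: human-coded label, columns: model prediction (values from the confusion matrix)
CM = [
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]


def agreement_stats(cm):
    """Return observed agreement (accuracy), Cohen's kappa, and PABAK."""
    n = sum(map(sum, cm))
    k = len(cm)
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(row[j] for row in cm) for j in range(k)]
    p_obs = sum(cm[i][i] for i in range(k)) / n
    p_exp = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    pabak = (k * p_obs - 1) / (k - 1)
    return p_obs, kappa, pabak


accuracy, kappa, pabak = agreement_stats(CM)
recall = {LABELS[i]: CM[i][i] / sum(CM[i]) for i in range(len(CM))}
precision = {LABELS[j]: CM[j][j] / sum(row[j] for row in CM) for j in range(len(CM))}

print(f"accuracy={accuracy:.3f} kappa={kappa:.3f} PABAK={pabak:.3f}")
```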

Marginal Homogeneity (Stuart-Maxwell Test)

The Stuart-Maxwell test for marginal homogeneity is significant (χ²(5) = 18.62, p = .002), indicating a systematic difference between the label distributions of the model and the human coder. Relative to the human coder, the model over-predicts EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9), and under-predicts OTHER (−35) and UTILITY_VALUE (−17). The shift away from OTHER is consistent with the OvR + asymmetric loss design, which prioritises recall on core constructs; UTILITY_VALUE is the exception, with recall (.774) below precision (.846).
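The over- and under-prediction counts are the differences between the column totals (model predictions) and row totals (human labels) of the confusion matrix. A minimal sketch, with illustrative variable names; it reproduces the marginal differences only, not the Stuart-Maxwell statistic itself:

```python
LABELS = ["AV", "CO", "EX", "IV", "OT", "UV"]
# Rows: human-coded label, columns: model prediction
CM = [
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]

human_totals = [sum(row) for row in CM]
model_totals = [sum(row[j] for row in CM) for j in range(len(CM))]

# Positive values: the model assigns the label more often than the human coder
marginal_diff = {
    label: predicted - coded
    for label, predicted, coded in zip(LABELS, model_totals, human_totals)
}
print(marginal_diff)
```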

Core Construct Discrimination (5-Class, Excluding OTHER)

The OTHER category is not a psychological construct — it is a heterogeneous residual class that the model was never explicitly trained to recognise. To separately assess the model's ability to discriminate among the five core EVT constructs, a restricted analysis was conducted on items where both the human coder and the model assigned a core construct (n = 933, 72.7% of all items).

| Metric | Value | 95% BCa CI |
| --- | --- | --- |
| Cohen's κ (5-class) | 0.899 | [0.876, 0.920] |
| Krippendorff's α | 0.899 | [0.876, 0.921] |
| Overall accuracy | 0.922 | [0.901, 0.937] |
| Macro F1 | 0.915 | [0.895, 0.933] |
| PABAK | 0.902 | |

Cohen's κ = .90 indicates almost perfect agreement on construct discrimination.

| Class | Precision [95% BCa CI] | Recall [95% BCa CI] | F1 [95% BCa CI] | n |
| --- | --- | --- | --- | --- |
| ATTAINMENT_VALUE | .897 [.838, .950] | .897 [.835, .951] | .897 [.851, .937] | 107 |
| COST | .871 [.812, .922] | .901 [.849, .948] | .886 [.843, .922] | 142 |
| EXPECTANCY | .961 [.937, .982] | .941 [.912, .967] | .951 [.932, .968] | 290 |
| INTRINSIC_VALUE | .901 [.861, .938] | .954 [.924, .980] | .927 [.901, .950] | 219 |
| UTILITY_VALUE | .945 [.907, .977] | .880 [.830, .926] | .911 [.877, .941] | 175 |

Additionally, when evaluating all items the human coded as a core construct (n = 1,001) — allowing the model to predict OTHER — the model achieved κ = 0.822 [0.795, 0.848] and accuracy = 0.859 [0.835, 0.878]. Of these 1,001 items, 68 (6.8%) were incorrectly pushed to OTHER by the model.
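Both restricted analyses can be derived from the 6-class confusion matrix by dropping or zeroing the OTHER entries. The sketch below is one way to reproduce the point estimates; the helper names are illustrative, and treating the second scenario as a 6-category κ with an empty human OTHER row is an assumption that happens to match the reported value.

```python
# Rows: human-coded label, columns: model prediction (order: AV, CO, EX, IV, OT, UV)
CM = [
    [96, 1, 1, 5, 4, 4],
    [1, 128, 1, 10, 7, 2],
    [0, 11, 273, 5, 18, 1],
    [4, 4, 0, 209, 15, 2],
    [13, 14, 40, 17, 180, 19],
    [6, 3, 9, 3, 24, 154],
]
OT = 4  # index of OTHER


def cohen_kappa(cm):
    """Return (kappa, accuracy, n) for a square confusion matrix."""
    n = sum(map(sum, cm))
    k = len(cm)
    rows = [sum(r) for r in cm]
    cols = [sum(r[j] for r in cm) for j in range(k)]
    p_obs = sum(cm[i][i] for i in range(k)) / n
    p_exp = sum(r * c for r, c in zip(rows, cols)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp), p_obs, n


# Scenario 1: both raters assigned a core construct (drop the OTHER row and column)
core_only = [[v for j, v in enumerate(row) if j != OT]
             for i, row in enumerate(CM) if i != OT]
kappa_5, acc_5, n_5 = cohen_kappa(core_only)

# Scenario 2: human = core, model unrestricted (zero out the human OTHER row)
human_core = [row if i != OT else [0] * len(row) for i, row in enumerate(CM)]
kappa_hc, acc_hc, n_hc = cohen_kappa(human_core)
pushed_to_other = sum(row[OT] for i, row in enumerate(CM) if i != OT)

print(n_5, round(kappa_5, 3), round(acc_5, 3))
print(n_hc, round(kappa_hc, 3), round(acc_hc, 3), pushed_to_other)
```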

Summary Across Evaluation Scenarios

| Scenario | κ | Macro F1 | N |
| --- | --- | --- | --- |
| Full 6-class (with OTHER) | 0.767 | 0.812 | 1,284 |
| 5-class: both raters assigned core construct | 0.899 | 0.915 | 933 |
| Human = core, model unrestricted | 0.822 | | 1,001 |

The increase from κ = .77 (full 6-class) to κ = .90 (items where both raters assigned a core construct) indicates that the model's primary weakness lies at the EVT/non-EVT detection boundary (the OTHER category), not in discriminating among the five core constructs.

Base Model Comparison

To quantify the effect of domain-specific fine-tuning, the classifier was compared against the un-fine-tuned base model (all-mpnet-base-v2). The base model achieved κ = −0.054 [−0.073, −0.033] vs. κ = 0.767 [0.741, 0.793] for the fine-tuned model (Δκ = +0.821 [0.787, 0.854]). A McNemar test indicated a highly significant difference in error rates, χ²(1) = 867.91, p < .001 (fine-tuned correct only: 911 items; base correct only: 14 items). The fine-tuned model outperforms the base model on all six classes.

| Class | Base F1 | Fine-Tuned F1 | Δ |
| --- | --- | --- | --- |
| ATTAINMENT_VALUE | 0.051 | 0.831 | +0.780 |
| COST | 0.090 | 0.826 | +0.736 |
| EXPECTANCY | 0.169 | 0.864 | +0.695 |
| INTRINSIC_VALUE | 0.200 | 0.865 | +0.666 |
| OTHER | 0.007 | 0.678 | +0.671 |
| UTILITY_VALUE | 0.010 | 0.808 | +0.799 |
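The McNemar statistic reported above depends only on the two discordant counts (911 items correct for the fine-tuned model only, 14 for the base model only). The sketch below applies the standard continuity-corrected formula, which appears to be the variant used, since it matches the reported value; the function name is illustrative.

```python
def mcnemar_chi2(b, c, continuity=True):
    """McNemar chi-square statistic (1 df) from the two discordant counts."""
    diff = abs(b - c) - (1 if continuity else 0)
    return diff ** 2 / (b + c)


# Discordant counts from the base model comparison
stat = mcnemar_chi2(911, 14)
print(f"chi2(1) = {stat:.2f}")
```

Without the continuity correction the same counts give a slightly larger statistic (about 869.85), so the choice of variant does not affect the conclusion.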

Known Limitations and Biases

OTHER is the weakest category (κ = .60, F1 = .68). Because OTHER is defined as "no construct activated above threshold" rather than a learned class, the model tends to find some EVT signal in ambiguous items that a human would code as non-EVT. The most frequent error type is predicting EXPECTANCY when the true label is OTHER (40 cases, 16.4% of all errors). When OTHER is excluded, the model reaches κ = .90 — almost perfect agreement — on the five core constructs.

Systematic label distribution differences. The Stuart-Maxwell test is significant (χ²(5) = 18.62, p = .002). The model over-predicts EXPECTANCY (+16), INTRINSIC_VALUE (+15), COST (+12), and ATTAINMENT_VALUE (+9), and under-predicts OTHER (−35) and UTILITY_VALUE (−17).

UTILITY_VALUE under-recall. The model is conservative on utility value (recall = .774 vs. precision = .846), suggesting that items referencing indirect or long-term benefits may be harder for the model to detect.

Trained on synthetic data. The training set consists of synthetically generated items, not real published instruments. While the model generalises well (κ = .77 overall, κ = .90 on core constructs), there may be stylistic patterns in human-authored items from specific subfields or cultures that the model has not encountered.

English only. The model has not been evaluated on items in other languages. Cross-lingual transfer should not be assumed.

References

  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
  • Eccles, J. S., et al. (1983). Expectancies, values, and academic behaviors. In J. T. Spence (Ed.), Achievement and achievement motives (pp. 75–146). W. H. Freeman.
  • Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. Annual Review of Psychology, 53(1), 109–132.
  • Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. Proceedings of ACL, 328–339.
  • Izmailov, P., et al. (2018). Averaging weights leads to wider optima and better generalization. Proceedings of UAI, 876–885.
  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
  • Miyato, T., et al. (2017). Adversarial training methods for semi-supervised text classification. Proceedings of ICLR.
  • Ridnik, T., et al. (2021). Asymmetric loss for multi-label classification. Proceedings of ICCV, 82–91.
  • Sun, C., et al. (2019). How to fine-tune BERT for text classification. Proceedings of CCL, 194–206.
  • Wang, H., et al. (2018). CosFace: Large margin cosine loss for deep face recognition. Proceedings of CVPR, 5265–5274.
