arxiv:2606.00467

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Published on May 30

Submitted by Rafal Kocielnik on Jun 12

California Institute of Technology



Abstract

Large language models exhibit limited ability to correct zero-shot errors through prompting, with model performance more strongly linked to definition-specific familiarity than text-level memorization metrics.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
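To make the "partial r after controlling for dataset-level confounds" idea concrete, the sketch below shows one common way to compute such a statistic: subtract each dataset's mean from both variables, then correlate the residuals. This is an assumption about the general technique, not the authors' procedure, which the abstract does not specify. The function names (`pearson`, `demean_within`, `partial_corr`) and the toy numbers are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / sqrt(vx * vy)


def demean_within(values, groups):
    """Subtract each group's mean from its members (removes group-level offsets)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def partial_corr(x, y, groups):
    """Correlation of x and y after removing per-group means from both."""
    return pearson(demean_within(x, groups), demean_within(y, groups))


# Hypothetical toy data: one row per (model, dataset) pair.
dataset = ["A", "A", "A", "B", "B", "B"]
dsf = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]          # made-up familiarity scores
perf = [0.80, 0.85, 0.90, 0.50, 0.55, 0.60]   # made-up performance scores

raw_r = pearson(dsf, perf)
partial_r = partial_corr(dsf, perf, dataset)
print(f"raw r = {raw_r:.2f}, partial r = {partial_r:.2f}")
```

In this toy setup the raw correlation is negative because dataset B is both harder and has higher familiarity scores, while the within-dataset relationship is positive. That is the kind of dataset-level confound a partial correlation is meant to remove.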


Community

RKocielnik

Paper submitter about 21 hours ago

Do LLM annotators actually follow the definitions we give them?

Our paper studies how model-internalized priors shape LLM annotation behavior. Across 9 models and 5 toxicity-related datasets, we find that performance is better explained by definition alignment than by text memorization. We introduce Definition-Specific Familiarity (DSF), a lightweight diagnostic for measuring whether a model’s internal concept matches the task definition.

We also find strong “decision stickiness”: most zero-shot errors persist even after aligned definitions and few-shot examples, and high-confidence errors are especially hard to correct. Models can also confidently follow misaligned definitions, making confidence an unreliable indicator of whether the intended labeling standard is being applied.
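The rescue rate quoted in the abstract (fraction of initial errors corrected by prompting) can be illustrated with a short sketch. It assumes three aligned label lists: gold labels, zero-shot predictions, and predictions after adding definitions or examples. The function name `rescue_rate` and the toy labels are hypothetical, and the paper's exact bookkeeping may differ.

```python
def rescue_rate(gold, zero_shot, prompted):
    """Fraction of zero-shot errors that the enriched prompt corrects.

    Only items the zero-shot run got wrong are counted; items that were
    right at zero-shot and became wrong afterwards are ignored here.
    Returns None when there are no zero-shot errors to rescue.
    """
    errors = [(g, p) for g, z, p in zip(gold, zero_shot, prompted) if z != g]
    if not errors:
        return None
    rescued = sum(1 for g, p in errors if p == g)
    return rescued / len(errors)


# Hypothetical toy labels (1 = toxic, 0 = not toxic).
gold      = [1, 0, 1, 1, 0, 0]
zero_shot = [0, 0, 0, 1, 1, 0]   # wrong on items 0, 2, 4
prompted  = [1, 0, 0, 1, 1, 1]   # item 0 fixed; items 2 and 4 remain wrong

rate = rescue_rate(gold, zero_shot, prompted)
print(f"rescue rate = {rate:.3f}")
```

Under this definition, a reported rescue rate of 34.8% means roughly one in three zero-shot mistakes was fixed by the added prompt information, with the rest persisting.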

The takeaway: LLM annotation pipelines should not assume that prompt definitions fully control model behavior. Definition design and definition alignment need to be measured explicitly.



