On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance
Abstract
Large language models exhibit limited ability to correct zero-shot errors through prompting, with model performance more strongly linked to definition-specific familiarity than text-level memorization metrics.
Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
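The "partial r" reported above can be illustrated with a minimal sketch. It assumes that "controlling for dataset-level confounds" is approximated by subtracting each dataset's mean from both variables before correlating them; the paper's exact procedure may differ. The function names and the toy numbers below are hypothetical and are not the paper's data.

```python
from collections import defaultdict
from math import sqrt


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def center_within_groups(values, groups):
    """Subtract each group's mean from its members (removes group-level offsets)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def partial_corr(x, y, groups):
    """Correlation of x and y after removing per-group means from both."""
    return pearson(center_within_groups(x, groups),
                   center_within_groups(y, groups))


# Hypothetical toy data: a familiarity score and a performance score for
# three models on each of two datasets, where dataset "B" is much harder.
familiarity = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
performance = [10.0, 11.0, 12.0, 0.0, 1.0, 2.0]
dataset = ["A", "A", "A", "B", "B", "B"]

pooled_r = pearson(familiarity, performance)
partial_r = partial_corr(familiarity, performance, dataset)
print(f"pooled r = {pooled_r:+.2f}, partial r = {partial_r:+.2f}")
```

In this toy case the pooled correlation is negative only because one dataset is harder overall, while the within-dataset association is positive. That is the kind of confound a partial correlation is meant to remove.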
Community
Do LLM annotators actually follow the definitions we give them?
Our paper studies how model-internalized priors shape LLM annotation behavior. Across 9 models and 5 toxicity-related datasets, we find that performance is better explained by definition alignment than by text memorization. We introduce Definition-Specific Familiarity (DSF), a lightweight diagnostic for measuring whether a model’s internal concept matches the task definition.
We also find strong “decision stickiness”: most zero-shot errors persist even after aligned definitions and few-shot examples, and high-confidence errors are especially hard to correct. Models can also confidently follow misaligned definitions, making confidence an unreliable indicator of whether the intended labeling standard is being applied.
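As a rough illustration of how "decision stickiness" could be quantified, the sketch below computes a rescue rate (the fraction of zero-shot errors that become correct after additional prompting) overall and split by zero-shot confidence. The record fields, the 0.8 confidence threshold, and the toy records are assumptions made for illustration, not the paper's setup or results.

```python
def rescue_rate(records):
    """Fraction of zero-shot errors that are correct after re-prompting.

    Returns None when there are no zero-shot errors to rescue.
    """
    errors = [r for r in records if not r["zero_shot_correct"]]
    if not errors:
        return None
    rescued = sum(1 for r in errors if r["prompted_correct"])
    return rescued / len(errors)


def rescue_rate_by_confidence(records, threshold=0.8):
    """Rescue rate for high- and low-confidence zero-shot predictions.

    The threshold is a hypothetical cut-off chosen for this sketch.
    """
    high = [r for r in records if r["confidence"] >= threshold]
    low = [r for r in records if r["confidence"] < threshold]
    return {"high": rescue_rate(high), "low": rescue_rate(low)}


# Hypothetical toy records: one dict per annotated item.
records = [
    {"zero_shot_correct": False, "prompted_correct": False, "confidence": 0.95},
    {"zero_shot_correct": False, "prompted_correct": False, "confidence": 0.90},
    {"zero_shot_correct": False, "prompted_correct": True, "confidence": 0.55},
    {"zero_shot_correct": False, "prompted_correct": False, "confidence": 0.60},
    {"zero_shot_correct": True, "prompted_correct": True, "confidence": 0.90},
]

overall = rescue_rate(records)
by_confidence = rescue_rate_by_confidence(records)
print(overall, by_confidence)
```

Items that were already correct under zero-shot prompting are excluded from the denominator, so the metric isolates how often prompting repairs an initial mistake.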
The takeaway: LLM annotation pipelines should not assume that prompt definitions fully control model behavior. Definition design and definition alignment need to be measured explicitly.
