Paradigm 1
Procedural Memory
Tests whether models internalize new routines from minimal exposure and execute them after misleading interference.
- Tool and API usage
- Linguistic formats
- Logic, rules, and creative constraints
ACL 2026 Oral
From what agents recall to what they automatically enact.
(1) The University of Hong Kong  (2) Harbin Institute of Technology
Abstract
Existing memory benchmarks for LLM agents evaluate explicit recall of facts but overlook implicit memory, in which experience becomes automated behavior without conscious retrieval. This gap matters: an effective assistant must apply learned procedures and avoid previously failed actions automatically, without explicit reminders.
We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs: Procedural Memory, Priming, and Classical Conditioning. Our 300-item suite uses a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring.
Across 17 models, no system exceeds 66% overall. The strongest performers are DeepSeek-R1 at 65.3%, Qwen3-32B at 64.1%, and GPT-5 at 63.0%, all far below the 100% human baseline.
Benchmark
Paradigm 1
Procedural Memory
Tests whether models internalize new routines from minimal exposure and execute them after misleading interference.
Paradigm 2
Priming
Measures thematic carryover by comparing experimental and control instances with identical test probes.
Paradigm 3
Classical Conditioning
Evaluates whether repeated negative outcomes shape immediate avoidance or adaptation when a stimulus returns.
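The experimental-versus-control comparison used for priming can be pictured as a paired difference on one shared probe. The sketch below is only an illustration: `theme_score`, `priming_effect`, and the word list are hypothetical stand-ins, not the benchmark's actual scorer.

```python
import re
from typing import Set


def theme_score(text: str, theme_words: Set[str]) -> float:
    """Fraction of tokens in `text` that belong to the theme (hypothetical proxy)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in theme_words for t in tokens) / len(tokens)


def priming_effect(primed_reply: str, control_reply: str, theme_words: Set[str]) -> float:
    """Paired difference: both replies answer the same test probe."""
    return theme_score(primed_reply, theme_words) - theme_score(control_reply, theme_words)


# Toy replies to one identical probe (made-up text, not benchmark data).
ocean = {"tide", "wave", "harbor"}
effect = priming_effect("the tide and wave rise", "the market will rise", ocean)
```

Because the probe is identical in both conditions, any difference in the score can be attributed to the earlier priming context rather than to the wording of the test itself.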
Dataset
Generation protocol. Each item is built from structured blueprints, generated into dialogues, checked for leakage and format shortcuts, and validated with rule-based or LLM-judge scoring.
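One way to picture the leakage check is as a filter that rejects an item when its interference turns or test probe restate the rule the model is supposed to have internalized. The `Blueprint` fields and the substring heuristic below are assumptions for illustration; the page does not describe the actual checks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Blueprint:
    rule: str                # routine the model should internalize (hypothetical field)
    learning: List[str]      # turns that teach the rule
    interference: List[str]  # unrelated turns
    probe: str               # test probe


def probe_leaks_rule(bp: Blueprint) -> bool:
    """True if the probe repeats the rule text (case-insensitive substring match)."""
    return bp.rule.lower() in bp.probe.lower()


def interference_restates_rule(bp: Blueprint) -> bool:
    """True if any interference turn repeats the rule text."""
    return any(bp.rule.lower() in turn.lower() for turn in bp.interference)


def is_valid(bp: Blueprint) -> bool:
    return not probe_leaks_rule(bp) and not interference_restates_rule(bp)


clean = Blueprint(
    rule="reply in uppercase",
    learning=["From now on, reply in uppercase."],
    interference=["Tell me about tides."],
    probe="What is your favorite color?",
)
leaky = Blueprint(
    rule="reply in uppercase",
    learning=["From now on, reply in uppercase."],
    interference=["Tell me about tides."],
    probe="Please reply in uppercase: what is your favorite color?",
)
```

Here `is_valid(clean)` holds while `is_valid(leaky)` does not, since the second probe reminds the model of the rule and would turn an implicit-memory test into explicit instruction following.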
Evaluation protocol. Models receive learning or priming context, unrelated interference, then a test probe where only the first attempt counts.
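The protocol above can be sketched as a loop that makes exactly one model call per item. The `Model` interface, the `Item` fields, and the stub model are hypothetical; this is a minimal sketch under those assumptions, not the benchmark's harness.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical interface: dialogue turns in, one reply out.
Model = Callable[[List[str]], str]


@dataclass
class Item:
    context: List[str]             # learning or priming turns
    interference: List[str]        # unrelated turns placed before the probe
    probe: str                     # the test probe
    passes: Callable[[str], bool]  # rule-based check (an LLM judge could stand in here)


def first_attempt_accuracy(model: Model, items: Sequence[Item]) -> float:
    """Percentage of items whose single, first reply passes the check."""
    hits = 0
    for item in items:
        turns = item.context + item.interference + [item.probe]
        reply = model(turns)  # exactly one call per item: no retries, no feedback
        hits += item.passes(reply)
    return 100.0 * hits / len(items)


# Toy run with a stub model (made-up items, not benchmark data).
def stub_model(turns: List[str]) -> str:
    return "Result: 4 [END]"


items = [
    Item(["Always end replies with [END]."], ["What is the capital of France?"],
         "What is 2 + 2?", lambda r: r.endswith("[END]")),
    Item(["Always start replies with ACK."], ["Name a prime number."],
         "What is 3 + 3?", lambda r: r.startswith("ACK")),
]
score = first_attempt_accuracy(stub_model, items)  # 50.0 for this stub
```

Scoring only the first reply is what separates this setup from explicit-recall tests: the model gets no second chance in which a correction could act as a reminder.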
Results
| Rank | Model | Procedural (%) | Classical (%) | Priming (%) | Overall (%) |
|---|---|---|---|---|---|
| 1 | DeepSeek-R1 | 76.33 | 69.67 | 49.90 | 65.30 |
| 2 | Qwen3-32B | 75.67 | 67.00 | 49.73 | 64.13 |
| 3 | GPT-5 | 75.33 | 64.00 | 49.67 | 63.00 |
| 4 | Qwen3-8B | 75.33 | 64.00 | 47.73 | 62.35 |
| 5 | GPT-o3 | 76.00 | 57.67 | 51.70 | 61.79 |
| 6 | GPT-o4-mini-high | 70.67 | 60.00 | 51.95 | 60.87 |
| 7 | GLM-4.5 | 76.33 | 53.33 | 46.12 | 58.59 |
| 8 | Gemini-2.5-pro | 74.33 | 47.33 | 45.42 | 55.69 |
| 9 | Claude-4.1-opus | 76.67 | 41.67 | 48.60 | 55.65 |
| 10 | Gemini-2.5-flash | 72.33 | 49.00 | 44.97 | 55.43 |
| 11 | GPT-4o-mini | 61.67 | 44.00 | 46.98 | 50.88 |
| 12 | Qwen-2.5-72B | 61.00 | 47.00 | 44.33 | 50.78 |
| 13 | GPT-4o | 61.67 | 43.67 | 45.62 | 50.32 |
| 14 | Claude-4-sonnet | 51.67 | 51.67 | 46.17 | 49.84 |
| 15 | LLaMA-3.3-70B | 58.33 | 47.33 | 42.67 | 49.44 |
| 16 | LLaMA-3.1-8B | 46.67 | 38.33 | 47.53 | 44.18 |
| 17 | Qwen-2.5-7B | 50.67 | 35.67 | 44.12 | 43.49 |
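The Overall column appears to be the unweighted mean of the three paradigm scores. That is an inference from the numbers in the table rather than a stated definition; the short check below reproduces it for the top three rows.

```python
# (procedural, classical, priming, reported overall), copied from the table above.
rows = {
    "DeepSeek-R1": (76.33, 69.67, 49.90, 65.30),
    "Qwen3-32B": (75.67, 67.00, 49.73, 64.13),
    "GPT-5": (75.33, 64.00, 49.67, 63.00),
}


def mean_of_paradigms(procedural: float, classical: float, priming: float) -> float:
    """Unweighted mean of the three paradigm scores (assumed aggregation)."""
    return (procedural + classical + priming) / 3


# Allow for two-decimal rounding in the reported figures.
matches = {
    model: abs(mean_of_paradigms(p, c, pr) - overall) < 0.006
    for model, (p, c, pr, overall) in rows.items()
}
```

Under this reading each paradigm carries equal weight, so a model's rank can be pulled down by one weak paradigm even when it leads another.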
Findings
66%
No model exceeds 66% overall. DeepSeek-R1, Qwen3-32B, and GPT-5 form the leading tier, but all remain far below the 100% human baseline.
17.6%
Inhibitory adaptation averages only 17.6%, whereas preference-based adaptation reaches 75.0%. Jargon avoidance is the weakest case, at roughly 4%.
33.8 pts
Top models reach 93.8% on surface formatting but only 60.0% on deep multi-rule protocols, a 33.8-point gap that exposes brittle proceduralization.
r = 0.63
Stronger priming effects correlate with more constraint violations, suggesting models mimic style rather than extract abstract thematic structure.
35 pts
Claude-4.1-opus tops procedural memory at 76.67 yet falls to 41.67 on classical conditioning, a 35-point drop. Strong performance in one paradigm does not predict strong performance in another.
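The 35-point figure can be recomputed from the results table as the difference between the Procedural and Classical columns. The snippet below uses the table values as listed; the variable names are illustrative.

```python
# (procedural, classical) scores copied from the results table.
scores = {
    "DeepSeek-R1": (76.33, 69.67),
    "Qwen3-32B": (75.67, 67.00),
    "GPT-5": (75.33, 64.00),
    "Qwen3-8B": (75.33, 64.00),
    "GPT-o3": (76.00, 57.67),
    "GPT-o4-mini-high": (70.67, 60.00),
    "GLM-4.5": (76.33, 53.33),
    "Gemini-2.5-pro": (74.33, 47.33),
    "Claude-4.1-opus": (76.67, 41.67),
    "Gemini-2.5-flash": (72.33, 49.00),
    "GPT-4o-mini": (61.67, 44.00),
    "Qwen-2.5-72B": (61.00, 47.00),
    "GPT-4o": (61.67, 43.67),
    "Claude-4-sonnet": (51.67, 51.67),
    "LLaMA-3.3-70B": (58.33, 47.33),
    "LLaMA-3.1-8B": (46.67, 38.33),
    "Qwen-2.5-7B": (50.67, 35.67),
}

# Gap in points between procedural memory and classical conditioning.
gaps = {model: round(p - c, 2) for model, (p, c) in scores.items()}
largest = max(gaps, key=gaps.get)
print(largest, gaps[largest])  # Claude-4.1-opus 35.0
```

The same model holds both the highest procedural score and the widest gap, while Claude-4-sonnet scores identically on the two paradigms.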
5
Five behaviors remain hard across model families: jargon avoidance, API distrust, context-dependent behavior, API aversion, and emotion-driven strategy shift.
Citation
@misc{qin2026implicitmembenchmeasuringunconsciousbehavioral,
title={ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models},
author={Chonghan Qin and Xiachong Feng and Weitao Ma and Xiaocheng Feng and Lingpeng Kong},
year={2026},
eprint={2604.08064},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2604.08064},
}