Abstract

Implicit memory for LLM agents

Existing memory benchmarks for LLM agents evaluate explicit recall of facts but overlook implicit memory, in which experience becomes automated behavior without conscious retrieval. This gap matters: an effective assistant must apply learned procedures and avoid previously failed actions automatically, without explicit reminders.

We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs: Procedural Memory, Priming, and Classical Conditioning. Our 300-item suite uses a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring.

Across 17 models, no system exceeds 66% overall. The strongest performers are DeepSeek-R1 at 65.3%, Qwen3-32B at 64.1%, and GPT-5 at 63.0%, all far below the 100% human baseline.

Benchmark

Learning, interference, and first-attempt behavior

End-to-end construction and evaluation pipeline for ImplicitMemBench.

Paradigm 1

Procedural Memory

Tests whether models internalize new routines from minimal exposure and execute them after misleading interference.

  • Tool and API usage
  • Linguistic formats
  • Logic, rules, and creative constraints

Paradigm 2

Priming

Measures thematic carryover by comparing experimental and control instances with identical test probes.

  • Rich thematic exposure
  • Neutral interference
  • Creative generation tests
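The experimental-versus-control comparison can be sketched as a difference in how strongly each response reflects the primed theme. This is a minimal illustration, not the benchmark's scoring code: the lexicon-overlap metric `theme_score`, the function `priming_effect`, and the example sentences are all assumptions made for the sketch.

```python
import re
from typing import Iterable


def theme_score(text: str, theme_words: Iterable[str]) -> float:
    """Fraction of tokens in `text` that belong to the theme lexicon (toy metric)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    lexicon = {w.lower() for w in theme_words}
    return sum(t in lexicon for t in tokens) / len(tokens)


def priming_effect(experimental: str, control: str, theme_words: Iterable[str]) -> float:
    """Theme strength after priming minus theme strength without priming.

    Both responses answer the same test probe; only the earlier context differs.
    """
    lexicon = list(theme_words)
    return theme_score(experimental, lexicon) - theme_score(control, lexicon)


# Hypothetical responses to the same probe ("write a short line of verse").
ocean_theme = ["tide", "harbor", "wave", "salt"]
effect = priming_effect("the tide and the harbor", "the road and the hill", ocean_theme)
```

A positive value means the primed response carries more of the theme than its matched control; a value near zero means no measurable carryover under this toy metric.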

Paradigm 3

Classical Conditioning

Evaluates whether repeated negative outcomes shape immediate avoidance or adaptation when a stimulus returns.

  • Tool safety
  • Conversational adaptation
  • System protection

Dataset

300 items across three implicit memory paradigms

300 Total items
18 Task families
3 Paradigms
Paradigm                Items  Coverage
Procedural Memory       100    Tool, linguistic, logic, rules, creative constraints
Priming                 100    10 thematic domains with matched controls
Classical Conditioning  100    Tool safety, conversation, system protection

Generation protocol. Each item starts from a structured blueprint, is expanded into a dialogue, is checked for leakage and format shortcuts, and is validated with rule-based or LLM-judge scoring.

Evaluation protocol. Models receive learning or priming context, unrelated interference, then a test probe where only the first attempt counts.
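The three-phase flow with first-attempt scoring can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's harness: the `Item` fields, the `check` callback, and both toy models are hypothetical, and the sketch assumes the model is called only once, at the test probe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    """One benchmark item. Field names are illustrative, not the dataset schema."""
    context: List[str]            # learning or priming turns
    interference: List[str]       # unrelated distractor turns
    probe: str                    # test probe
    check: Callable[[str], bool]  # hypothetical rule-based scorer


def run_item(item: Item, model: Callable[[List[str]], str]) -> bool:
    """Present the three phases in order and score only the first response."""
    history = item.context + item.interference + [item.probe]
    first_attempt = model(history)  # single call: no retries, no feedback
    return item.check(first_attempt)


def accuracy(items: List[Item], model: Callable[[List[str]], str]) -> float:
    """Percentage of items passed on the first attempt."""
    if not items:
        return 0.0
    return 100.0 * sum(run_item(it, model) for it in items) / len(items)


# Toy usage: the "procedure" is to end every answer with the token DONE.
item = Item(
    context=["From now on, end every answer with DONE."],
    interference=["What is the capital of France?"],
    probe="Summarize our chat in one line.",
    check=lambda reply: reply.rstrip().endswith("DONE"),
)


def forgetful_model(history: List[str]) -> str:
    return "We talked about France."  # ignores the learned routine


def attentive_model(history: List[str]) -> str:
    return "We talked about France. DONE"


print(accuracy([item], forgetful_model))  # 0.0
print(accuracy([item], attentive_model))  # 100.0
```

Because `run_item` calls the model exactly once per item, a model that would only recover after a reminder or a second try gets no credit, which is the point of first-attempt scoring.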

Phase structure and token distribution across the three paradigms.

Results

Current models still struggle with implicit adaptation

No evaluated model exceeds 66% overall, while the human baseline reaches 100%.
Rank  Model             Procedural  Classical  Priming  Overall
1     DeepSeek-R1       76.33       69.67      49.90    65.30
2     Qwen3-32B         75.67       67.00      49.73    64.13
3     GPT-5             75.33       64.00      49.67    63.00
4     Qwen3-8B          75.33       64.00      47.73    62.35
5     GPT-o3            76.00       57.67      51.70    61.79
6     GPT-o4-mini-high  70.67       60.00      51.95    60.87
7     GLM-4.5           76.33       53.33      46.12    58.59
8     Gemini-2.5-pro    74.33       47.33      45.42    55.69
9     Claude-4.1-opus   76.67       41.67      48.60    55.65
10    Gemini-2.5-flash  72.33       49.00      44.97    55.43
11    GPT-4o-mini       61.67       44.00      46.98    50.88
12    Qwen-2.5-72B      61.00       47.00      44.33    50.78
13    GPT-4o            61.67       43.67      45.62    50.32
14    Claude-4-sonnet   51.67       51.67      46.17    49.84
15    LLaMA-3.3-70B     58.33       47.33      42.67    49.44
16    LLaMA-3.1-8B      46.67       38.33      47.53    44.18
17    Qwen-2.5-7B       50.67       35.67      44.12    43.49
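The Overall column is numerically consistent with an unweighted mean of the three paradigm scores. That is an observation from the numbers in the table, not a formula stated on this page, so the sketch below should be read as a consistency check; the helper name `overall` is an assumption.

```python
from statistics import mean

# (procedural, classical, priming, reported_overall), copied from the results table.
rows = {
    "DeepSeek-R1": (76.33, 69.67, 49.90, 65.30),
    "Qwen3-32B": (75.67, 67.00, 49.73, 64.13),
    "Claude-4.1-opus": (76.67, 41.67, 48.60, 55.65),
    "Qwen-2.5-7B": (50.67, 35.67, 44.12, 43.49),
}


def overall(procedural: float, classical: float, priming: float) -> float:
    """Assumed aggregation: unweighted mean of the three paradigm scores."""
    return round(mean([procedural, classical, priming]), 2)


for model, (proc, cond, prim, reported) in rows.items():
    computed = overall(proc, cond, prim)
    # Allow for rounding in the published two-decimal figures.
    status = "matches" if abs(computed - reported) <= 0.01 else "differs"
    print(f"{model}: computed {computed:.2f}, reported {reported:.2f} ({status})")
```

Under this reading each paradigm carries equal weight, so a weak Classical Conditioning score pulls the overall figure down as much as a weak Procedural Memory score would.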

Findings

Implicit memory reveals different failure modes

66%

Overall ceiling remains low

No model exceeds 66% overall. DeepSeek-R1, Qwen3-32B, and GPT-5 form the leading tier, but all remain far below the 100% human baseline.

17.6%

Inhibition is the bottleneck

Inhibitory adaptation averages only 17.6%, while preference-based adaptation reaches 75.0%. Jargon avoidance is the weakest case, at roughly 4%.

33.8 pts

Surface rules beat deep protocols

Top models reach 93.8% on surface formatting, but only 60.0% on deep multi-rule protocols, exposing brittle proceduralization.

r = 0.63

Priming behaves like style bias

Stronger priming effects correlate with more constraint violations, suggesting models mimic style rather than extract abstract thematic structure.

35 pts

Capabilities dissociate

Claude-4.1-opus tops procedural memory, yet drops sharply on classical conditioning. Strong performance in one paradigm does not predict another.
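The 35-point figure can be reproduced from the results table as the difference between a model's Procedural and Classical scores. The sketch below is illustrative arithmetic on three rows copied from that table; the helper name `paradigm_gap` is an assumption.

```python
# (procedural, classical) scores copied from the results table.
scores = {
    "Claude-4.1-opus": (76.67, 41.67),
    "DeepSeek-R1": (76.33, 69.67),
    "Claude-4-sonnet": (51.67, 51.67),
}


def paradigm_gap(procedural: float, classical: float) -> float:
    """Percentage-point gap between procedural and classical-conditioning scores."""
    return round(procedural - classical, 2)


gaps = {model: paradigm_gap(*pair) for model, pair in scores.items()}
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {gap:.2f} pts")
```

Among these three rows the gap ranges from zero to 35 points, which illustrates why a single paradigm score is a poor proxy for the others.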

5

Universal bottlenecks persist

Jargon avoidance, API distrust, context-dependent behavior, API aversion, and emotion-driven strategy shift remain hard across model families.

Detailed analysis of behavioral asymmetries, robustness failures, and model capability trade-offs.

Citation

Use this BibTeX

@misc{qin2026implicitmembenchmeasuringunconsciousbehavioral,
      title={ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models}, 
      author={Chonghan Qin and Xiachong Feng and Weitao Ma and Xiaocheng Feng and Lingpeng Kong},
      year={2026},
      eprint={2604.08064},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.08064}, 
}