ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

Abstract

Implicit memory for LLM agents

Existing memory benchmarks for LLM agents evaluate explicit recall of facts but overlook implicit memory, in which experience becomes automated behavior without conscious retrieval. This gap matters: effective assistants must apply learned procedures and avoid previously failed actions automatically, without explicit reminders.

We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs: Procedural Memory, Priming, and Classical Conditioning. Our 300-item suite uses a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring.

Across 17 models, no system exceeds 66% overall. The strongest performers are DeepSeek-R1 at 65.3%, Qwen3-32B at 64.1%, and GPT-5 at 63.0%, all far below the 100% human baseline.

Benchmark

Learning, interference, and first-attempt behavior

Paradigm 1

Procedural Memory

Tests whether models internalize new routines from minimal exposure and execute them after misleading interference.

Tool and API usage
Linguistic formats
Logic, rules, and creative constraints

Paradigm 2

Priming

Measures thematic carryover by comparing experimental and control instances with identical test probes.

Rich thematic exposure
Neutral interference
Creative generation tests
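
The matched-control comparison can be sketched as a paired difference: score the same test probe once after thematic exposure and once without it, then average the differences. This is only an illustration of the idea. The function name `priming_effect` and the use of a plain mean difference are assumptions, not the benchmark's published metric.

```python
from statistics import mean
from typing import Sequence


def priming_effect(experimental: Sequence[float], control: Sequence[float]) -> float:
    """Mean paired difference between primed and unprimed runs of identical probes.

    Hypothetical helper: position i in both sequences must refer to the same
    test probe, scored with and without the priming context.
    """
    if len(experimental) != len(control):
        raise ValueError("each experimental instance needs a matched control")
    if not experimental:
        raise ValueError("need at least one matched pair")
    return mean(e - c for e, c in zip(experimental, control))
```

A positive value means the primed runs scored higher on the thematic measure than their controls. A value near zero means the exposure left no measurable carryover.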

Paradigm 3

Classical Conditioning

Evaluates whether repeated negative outcomes shape immediate avoidance or adaptation when a stimulus returns.

Tool safety
Conversational adaptation
System protection

Dataset

300 items across three implicit memory paradigms

300 Total items

18 Task families

3 Paradigms

Paradigm	Items	Coverage
Procedural Memory	100	Tool, linguistic, logic, rules, creative constraints
Priming	100	10 thematic domains with matched controls
Classical Conditioning	100	Tool safety, conversation, system protection

Generation protocol. Each item is built from a structured blueprint, expanded into a dialogue, checked for leakage and format shortcuts, and validated with rule-based or LLM-judge scoring.

Evaluation protocol. Models receive learning or priming context, unrelated interference, then a test probe where only the first attempt counts.
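
A minimal sketch of the three-phase flow described above, assuming a model is any callable that maps the dialogue so far to a reply. The `Item` fields, `run_item`, and `accuracy` are hypothetical names chosen for illustration. They are not the released dataset schema or evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List

Model = Callable[[List[str]], str]  # assumed interface: dialogue history -> reply


@dataclass
class Item:
    """One benchmark item (hypothetical field names)."""
    context_turns: List[str]        # learning or priming phase
    interference_turns: List[str]   # unrelated distractor turns
    test_probe: str                 # final probe
    scorer: Callable[[str], bool]   # rule-based or judge-based check


def run_item(model: Model, item: Item) -> bool:
    """Play the phases in order and score only the first reply to the probe."""
    history: List[str] = []
    for turn in item.context_turns + item.interference_turns:
        history.append(turn)
        history.append(model(history))  # the model's reply joins the dialogue
    history.append(item.test_probe)
    first_attempt = model(history)      # single call: no retries, no feedback
    return item.scorer(first_attempt)


def accuracy(model: Model, items: List[Item]) -> float:
    """Percentage of items passed on the first attempt."""
    if not items:
        raise ValueError("no items to evaluate")
    return 100.0 * sum(run_item(model, it) for it in items) / len(items)
```

Because `run_item` calls the model exactly once on the test probe, a model that only succeeds after a reminder or a second try still counts as a failure.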

The dataset is available on Hugging Face.

Figure: Dataset statistics by paradigm, showing token distribution, phase token proportions, and turn structure.

Results

Current models still struggle with implicit adaptation

Figure: ImplicitMemBench ranking across evaluated models. No evaluated model exceeds 66% overall, while the human baseline reaches 100%.

Rank	Model	Procedural	Classical	Priming	Overall
1	DeepSeek-R1	76.33	69.67	49.90	65.30
2	Qwen3-32B	75.67	67.00	49.73	64.13
3	GPT-5	75.33	64.00	49.67	63.00
4	Qwen3-8B	75.33	64.00	47.73	62.35
5	GPT-o3	76.00	57.67	51.70	61.79
6	GPT-o4-mini-high	70.67	60.00	51.95	60.87
7	GLM-4.5	76.33	53.33	46.12	58.59
8	Gemini-2.5-pro	74.33	47.33	45.42	55.69
9	Claude-4.1-opus	76.67	41.67	48.60	55.65
10	Gemini-2.5-flash	72.33	49.00	44.97	55.43
11	GPT-4o-mini	61.67	44.00	46.98	50.88
12	Qwen-2.5-72B	61.00	47.00	44.33	50.78
13	GPT-4o	61.67	43.67	45.62	50.32
14	Claude-4-sonnet	51.67	51.67	46.17	49.84
15	LLaMA-3.3-70B	58.33	47.33	42.67	49.44
16	LLaMA-3.1-8B	46.67	38.33	47.53	44.18
17	Qwen-2.5-7B	50.67	35.67	44.12	43.49
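
The Overall column is consistent with an unweighted mean of the three paradigm scores. The page does not state the formula, so treat this as an inference from the numbers. The sketch below checks it on three rows copied from the table. `macro_average` is a hypothetical helper name.

```python
# (Procedural, Classical, Priming, reported Overall), copied from the table above.
rows = {
    "DeepSeek-R1": (76.33, 69.67, 49.90, 65.30),
    "Claude-4.1-opus": (76.67, 41.67, 48.60, 55.65),
    "Qwen-2.5-7B": (50.67, 35.67, 44.12, 43.49),
}


def macro_average(procedural: float, classical: float, priming: float) -> float:
    """Unweighted mean of the three paradigm scores, rounded to two decimals."""
    return round((procedural + classical + priming) / 3, 2)


for model, (proc, cond, prim, reported) in rows.items():
    computed = macro_average(proc, cond, prim)
    # Allow a small tolerance for rounding in the published figures.
    assert abs(computed - reported) <= 0.015, (model, computed, reported)
```

Under this reading each paradigm carries equal weight, so a weak Classical Conditioning score lowers the overall rank as much as a weak Procedural Memory score would.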

Findings

Implicit memory reveals different failure modes

66%

Overall ceiling remains low

No model exceeds 66% overall. DeepSeek-R1, Qwen3-32B, and GPT-5 form the leading tier, but all remain far below the 100% human baseline.

17.6%

Inhibition is the bottleneck

Inhibitory adaptation averages only 17.6%, while preference-based adaptation reaches 75.0%. Jargon avoidance is especially severe at roughly 4%.

33.8 pts

Surface rules beat deep protocols

Top models reach 93.8% on surface formatting, but only 60.0% on deep multi-rule protocols, exposing brittle proceduralization.

r = 0.63

Priming behaves like style bias

Stronger priming effects correlate with more constraint violations, suggesting models mimic style rather than extract abstract thematic structure.

35 pts

Capabilities dissociate

Claude-4.1-opus tops procedural memory (76.67) yet drops to 41.67 on classical conditioning, a 35-point gap. Strong performance in one paradigm does not predict performance in another.
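
The 35-point figure can be reproduced from the results table as the difference between the Procedural and Classical columns. The sketch below copies those two columns and ranks models by that gap. The helper `gap` is a hypothetical name.

```python
# (Procedural, Classical) scores copied from the results table.
scores = {
    "DeepSeek-R1": (76.33, 69.67),
    "Qwen3-32B": (75.67, 67.00),
    "GPT-5": (75.33, 64.00),
    "Qwen3-8B": (75.33, 64.00),
    "GPT-o3": (76.00, 57.67),
    "GPT-o4-mini-high": (70.67, 60.00),
    "GLM-4.5": (76.33, 53.33),
    "Gemini-2.5-pro": (74.33, 47.33),
    "Claude-4.1-opus": (76.67, 41.67),
    "Gemini-2.5-flash": (72.33, 49.00),
    "GPT-4o-mini": (61.67, 44.00),
    "Qwen-2.5-72B": (61.00, 47.00),
    "GPT-4o": (61.67, 43.67),
    "Claude-4-sonnet": (51.67, 51.67),
    "LLaMA-3.3-70B": (58.33, 47.33),
    "LLaMA-3.1-8B": (46.67, 38.33),
    "Qwen-2.5-7B": (50.67, 35.67),
}


def gap(model: str) -> float:
    """Procedural minus Classical score, in percentage points."""
    procedural, classical = scores[model]
    return round(procedural - classical, 2)


best_procedural = max(scores, key=lambda m: scores[m][0])
largest_gap = max(scores, key=gap)
print(best_procedural, largest_gap, gap(largest_gap))
```

In this table the model with the highest Procedural score is also the one with the widest gap, while Claude-4-sonnet scores identically on both paradigms.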

5

Universal bottlenecks persist

Jargon avoidance, API distrust, context-dependent behavior, API aversion, and emotion-driven strategy shift remain hard across model families.

Figure: Detailed analysis of behavioral asymmetries, robustness failures, and model capability trade-offs.

Citation

Use this BibTeX

@misc{qin2026implicitmembenchmeasuringunconsciousbehavioral,
      title={ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models}, 
      author={Chonghan Qin and Xiachong Feng and Weitao Ma and Xiaocheng Feng and Lingpeng Kong},
      year={2026},
      eprint={2604.08064},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.08064}, 
}