Letta Research

Context-Bench: A benchmark for agentic context engineering

Context-Bench measures an agent's ability to perform agentic context engineering.

Last updated: Dec 18, 2025

Interested in contributing a task or model? Email us or open an issue on GitHub.

| # | Model | Filesystem Rubric* | Cost |
|---|-------|--------------------|------|
| 1 | openai/gpt-5.2-2025-12-11 (xhigh) | 82.61% | $84.66 |
| 2 | openai/gpt-5.2-2025-12-11 (high) | 80.5% | $68.53 |
| 3 | anthropic/claude-opus-4-5-20251101 | 76.8% | $39.91 |
| 4 | openai/gpt-5.1-2025-11-13 | 76.59% | $47.22 |
| 5 | openai/gpt-5.1-codex | 76.17% | $63.99 |
| 6 | google/gemini-3-pro-preview | 75.33% | $95.87 |
| 7 | anthropic/claude-sonnet-4-5-20250929 | 74% | $24.58 |
| 8 | openai/gpt-5.2-2025-12-11 (medium) | 73.32% | $60.34 |
| 9 | deepseek/deepseek-reasoner | 73.05% | $16.03 |
| 10 | openai/gpt-5-2025-08-07 | 72.67% | $43.56 |
| 11 | openai/gpt-5.1-codex-mini | 70.83% | $14.05 |
| 12 | google/gemini-3-flash-preview | 68.17% | $36.74 |
| 13 | openai/gpt-5-mini-2025-08-07 | 64.33% | $12.45 |
| 14 | deepseek/deepseek-chat | 63.35% | $13.05 |
| 15 | anthropic/claude-haiku-4-5-20251001 | 61.02% | $10.18 |
| 16 | anthropic/claude-opus-4-1-20250805 | 61% | $110.89 |
| 17 | z-ai/glm-4.6 | 56.83% | $21.32 |
| 18 | minimax/minimax-m2 | 56.83% | $5.21 |
| 19 | moonshotai/kimi-k2-0905 | 55.13% | $12.08 |
| 20 | moonshotai/kimi-k2-thinking | 55% | $21.83 |
| 21 | openai/gpt-5-nano-2025-08-07 | 44.83% | $2.43 |
| 22 | openai/gpt-4.1-2025-04-14 | 36.68% | $36.85 |
| 23 | openai/gpt-4.1-mini-2025-04-14 | 36.3% | $13.56 |
| 24 | mistralai/mistral-large-2512 | 23.8% | $8.97 |
| 25 | openai/gpt-oss-120b | 20.2% | $2.19 |
| 26 | openai/gpt-4.1-nano-2025-04-14 | 16.2% | $0.98 |
| 27 | deepseek/deepseek-chat-v3.1 | 11.97% | $2.54 |
| 28 | openai/gpt-oss-20b | 6.67% | $0.54 |

\*Filesystem Rubric: LLM-as-a-judge rubric to evaluate the agent's ability to correctly retrieve and analyze information from filesystem data files.
| # | Model | Task Completion Rubric* | Skill Use Rubric** |
|---|-------|-------------------------|--------------------|
| 1 | openai/gpt-5.2-2025-12-11 (xhigh) | 85.31% | 63.12% |
| 2 | openai/gpt-5.2-2025-12-11 (high) | 84.47% | 54.21% |
| 3 | openai/gpt-5.2-2025-12-11 (medium) | 77.55% | 49.74% |
| 4 | anthropic/claude-sonnet-4-5-20250929 | 76.5% | 72% |
| 5 | anthropic/claude-opus-4-5-20251101 | 75.54% | 68.82% |
| 6 | deepseek/deepseek-chat | 75.33% | 53.62% |
| 7 | anthropic/claude-opus-4-1-20250805 | 74.4% | 71.8% |
| 8 | google/gemini-3-pro | 72.73% | 64.29% |
| 9 | deepseek/deepseek-reasoner | 70.48% | 56.12% |
| 10 | openai/gpt-5-2025-08-07 | 70.2% | 51.4% |
| 11 | anthropic/claude-haiku-4-5-20251001 | 69.7% | 57.3% |
| 12 | openai/gpt-5-mini-2025-08-07 | 68.8% | 45.5% |
| 13 | z-ai/glm-4.6 | 65.9% | 50% |
| 14 | openai/gpt-5.1-codex | 64.84% | 55.73% |
| 15 | openai/gpt-5.1 | 63.75% | 55.75% |
| 16 | google/gemini-3-flash | 59.54% | 59.54% |
| 17 | mistralai/mistral-large-3 | 56.25% | 36.93% |
| 18 | openai/gpt-5-nano-2025-08-07 | 52.8% | 24% |
| 19 | openai/gpt-5.1-codex-mini | 49.72% | 23.58% |
| 20 | openai/gpt-4.1-2025-04-14 | 36.1% | 31.3% |

\*Task Completion Rubric: LLM-as-a-judge rubric to evaluate the agent's ability to complete tasks.
\*\*Skill Use Rubric: LLM-as-a-judge rubric to evaluate the agent's ability to select, load, and use skills.
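All three rubric columns are LLM-as-a-judge scores. As an illustration only (not Letta's actual evaluation harness), a minimal sketch of how per-criterion judge verdicts could be aggregated into a leaderboard percentage; the `RubricResult` type, task names, and criterion counts below are hypothetical:

```python
# Hypothetical sketch of LLM-as-a-judge rubric aggregation:
# a judge model marks each rubric criterion pass/fail for every
# task transcript, and the reported score is the mean pass rate.

from dataclasses import dataclass

@dataclass
class RubricResult:
    task_id: str
    passed: list[bool]  # one judge verdict per rubric criterion

def rubric_score(results: list[RubricResult]) -> float:
    """Mean fraction of rubric criteria the judge marked as passed, in percent."""
    verdicts = [v for r in results for v in r.passed]
    return 100.0 * sum(verdicts) / len(verdicts) if verdicts else 0.0

# Example: two hypothetical tasks, judge verdicts on three criteria each.
demo = [
    RubricResult("find-invoice", [True, True, False]),
    RubricResult("cross-file-join", [True, False, False]),
]
print(f"{rubric_score(demo):.2f}%")  # 50.00%
```

The actual harness may weight criteria or average per task rather than per criterion; this sketch only shows the shape of the computation behind a single percentage.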

Leaderboard Updates

- December 18, 2025
- December 11, 2025
- December 9, 2025
- November 26, 2025
- November 7, 2025
- November 4, 2025
- October 28, 2025