Letta Research

Context-Bench: A benchmark for agentic context engineering

Context-Bench measures an agent's ability to perform context engineering.

Last updated: Dec 11, 2025

Interested in contributing a task or model? Email us or open an issue on GitHub.

Filesystem Rubric: an LLM-as-a-judge rubric that evaluates the agent's ability to correctly retrieve and analyze information from filesystem data files.

| Model | Filesystem Rubric | Cost |
|---|---|---|
| openai/gpt-5.2-2025-12-11 (xhigh) | 82.61% | $84.66 |
| openai/gpt-5.2-2025-12-11 (high) | 80.5% | $68.53 |
| anthropic/claude-opus-4-5-20251101 | 76.8% | $39.91 |
| openai/gpt-5.1-2025-11-13 | 76.59% | $47.22 |
| openai/gpt-5.1-codex | 76.17% | $63.99 |
| google/gemini-3-pro-preview | 75.33% | $95.87 |
| anthropic/claude-sonnet-4-5-20250929 | 74% | $24.58 |
| openai/gpt-5.2-2025-12-11 (medium) | 73.32% | $60.34 |
| deepseek/deepseek-reasoner | 73.05% | $16.03 |
| openai/gpt-5-2025-08-07 | 72.67% | $43.56 |
| openai/gpt-5.1-codex-mini | 70.83% | $14.05 |
| openai/gpt-5-mini-2025-08-07 | 64.33% | $12.45 |
| deepseek/deepseek-chat | 63.35% | $13.05 |
| anthropic/claude-haiku-4-5-20251001 | 61.02% | $10.18 |
| anthropic/claude-opus-4-1-20250805 | 61% | $110.89 |
| z-ai/glm-4.6 | 56.83% | $21.32 |
| minimax/minimax-m2 | 56.83% | $5.21 |
| moonshotai/kimi-k2-0905 | 55.13% | $12.08 |
| moonshotai/kimi-k2-thinking | 55% | $21.83 |
| openai/gpt-5-nano-2025-08-07 | 44.83% | $2.43 |
| openai/gpt-4.1-2025-04-14 | 36.68% | $36.85 |
| openai/gpt-4.1-mini-2025-04-14 | 36.3% | $13.56 |
| mistralai/mistral-large-2512 | 23.8% | $8.97 |
| openai/gpt-oss-120b | 20.2% | $2.19 |
| openai/gpt-4.1-nano-2025-04-14 | 16.2% | $0.98 |
| deepseek/deepseek-chat-v3.1 | 11.97% | $2.54 |
| openai/gpt-oss-20b | 6.67% | $0.54 |
Task Completion Rubric: an LLM-as-a-judge rubric that evaluates the agent's ability to complete tasks. Skill Use Rubric: an LLM-as-a-judge rubric that evaluates the agent's ability to select, load, and use skills.

| Model | Task Completion | Skill Use |
|---|---|---|
| openai/gpt-5.2-2025-12-11 (xhigh) | 85.31% | 63.12% |
| openai/gpt-5.2-2025-12-11 (high) | 84.47% | 54.21% |
| openai/gpt-5.2-2025-12-11 (medium) | 77.55% | 49.74% |
| anthropic/claude-sonnet-4-5-20250929 | 76.5% | 72% |
| anthropic/claude-opus-4-5-20251101 | 75.54% | 68.82% |
| deepseek/deepseek-chat | 75.33% | 53.62% |
| anthropic/claude-opus-4-1-20250805 | 74.4% | 71.8% |
| google/gemini-3-pro | 72.73% | 64.29% |
| deepseek/deepseek-reasoner | 70.48% | 56.12% |
| openai/gpt-5-2025-08-07 | 70.2% | 51.4% |
| anthropic/claude-haiku-4-5-20251001 | 69.7% | 57.3% |
| openai/gpt-5-mini-2025-08-07 | 68.8% | 45.5% |
| z-ai/glm-4.6 | 65.9% | 50% |
| openai/gpt-5.1-codex | 64.84% | 55.73% |
| openai/gpt-5.1 | 63.75% | 55.75% |
| mistralai/mistral-large-3 | 56.25% | 36.93% |
| openai/gpt-5-nano-2025-08-07 | 52.8% | 24% |
| openai/gpt-5.1-codex-mini | 49.72% | 23.58% |
| openai/gpt-4.1-2025-04-14 | 36.1% | 31.3% |

Leaderboard Updates

- December 11, 2025
- December 9, 2025
- November 26, 2025
- November 7, 2025
- November 4, 2025
- October 28, 2025