Letta Research

Context-Bench: A benchmark for agentic context engineering

Context-Bench measures an agent's ability to perform context engineering, scored by three LLM-as-a-judge rubrics: filesystem retrieval and analysis, task completion, and skill use.

Interested in contributing a task or model? Email us or open an issue on GitHub.

Filesystem Rubric: an LLM-as-a-judge rubric to evaluate the agent's ability to correctly retrieve and analyze information from filesystem data files.

| Model | Filesystem Rubric | Cost |
| --- | --- | --- |
| anthropic/claude-sonnet-4-5-20250929 | 74% | $24.58 |
| openai/gpt-5-2025-08-07 | 72.67% | $43.56 |
| openai/gpt-5-mini-2025-08-07 | 64.33% | $12.45 |
| anthropic/claude-haiku-4-5-20251001 | 61.02% | $10.18 |
| anthropic/claude-opus-4-1-20250805 | 61% | $110.89 |
| z-ai/glm-4.6 | 56.83% | $21.32 |
| moonshotai/kimi-k2-0905 | 55.13% | $12.08 |
| moonshotai/kimi-k2-thinking | 55% | $21.83 |
| openai/gpt-5-nano-2025-08-07 | 44.83% | $2.43 |
| openai/gpt-4.1-2025-04-14 | 36.68% | $36.85 |
| openai/gpt-4.1-mini-2025-04-14 | 36.3% | $13.56 |
| openai/gpt-oss-120b | 20.2% | $2.19 |
| openai/gpt-4.1-nano-2025-04-14 | 16.2% | $0.98 |
| deepseek/deepseek-chat-v3.1 | 11.97% | $2.54 |
| openai/gpt-oss-20b | 6.67% | $0.54 |
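
All three rubrics are LLM-as-a-judge evaluations: a judge model checks the agent's output against a list of rubric criteria, and the percentage is the fraction of criteria satisfied. Below is a minimal sketch of how such a rubric score might be computed, assuming the OpenAI Python SDK as the judge backend; the prompt wording, judge model, and function names (judge_criterion, rubric_score) are illustrative assumptions, not Context-Bench's actual harness.

```python
# Minimal sketch of LLM-as-a-judge rubric grading (illustrative only;
# not Context-Bench's actual harness). Assumes the OpenAI Python SDK
# as the judge backend; prompt and rubric format are hypothetical.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an agent's answer against a rubric.

Rubric criterion: {criterion}
Task: {task}
Agent answer: {answer}

Reply with exactly one word: PASS or FAIL."""


def judge_criterion(criterion: str, task: str, answer: str) -> bool:
    """Ask the judge model whether the answer satisfies one rubric criterion."""
    response = client.chat.completions.create(
        model="gpt-4.1",  # choice of judge model is an assumption
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                criterion=criterion, task=task, answer=answer
            ),
        }],
        temperature=0,
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return verdict.startswith("PASS")


def rubric_score(criteria: list[str], task: str, answer: str) -> float:
    """Fraction of rubric criteria the judge marks as satisfied
    (the percentages reported in the leaderboards)."""
    passed = sum(judge_criterion(c, task, answer) for c in criteria)
    return passed / len(criteria)
```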
Task Completion Rubric: an LLM-as-a-judge rubric to evaluate the agent's ability to complete tasks.

Skill Use Rubric: an LLM-as-a-judge rubric to evaluate the agent's ability to select, load, and use skills.

| Model | Task Completion Rubric | Skill Use Rubric |
| --- | --- | --- |
| anthropic/claude-sonnet-4-5-20250929 | 76.5% | 72% |
| anthropic/claude-opus-4-1-20250805 | 74.4% | 71.8% |
| anthropic/claude-haiku-4-5-20251001 | 69.7% | 57.3% |
| openai/gpt-5-2025-08-07 | 70.2% | 51.4% |
| z-ai/glm-4.6 | 65.9% | 50% |
| openai/gpt-5-mini-2025-08-07 | 68.8% | 45.5% |
| openai/gpt-5-nano-2025-08-07 | 52.8% | 24% |
| openai/gpt-4.1-2025-04-14 | 36.1% | 31.3% |
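
As a usage example, hypothetical criteria for the Skill Use Rubric could be fed through rubric_score() from the sketch above. The criterion wording here is an assumption for illustration; Context-Bench's actual rubric items are not shown on this page.

```python
# Hypothetical skill-use rubric criteria, reusing rubric_score() from
# the sketch above. Wording is illustrative, not Context-Bench's rubric.
skill_use_criteria = [
    "The agent selected the skill relevant to the task.",
    "The agent loaded the skill's instructions before acting on them.",
    "The agent's final answer follows the loaded skill's procedure.",
]

task = "Summarize the quarterly report using the report-summary skill."  # example task
transcript = "...agent tool calls and final answer..."  # example agent output

print(f"Skill-use score: {rubric_score(skill_use_criteria, task, transcript):.0%}")
```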