Letta Research

Context-Bench: A benchmark for agentic context engineering

Context-Bench measures an agent's ability to perform context engineering.

Last updated: Mar 13, 2026

Interested in contributing a task or model? Email us or open an issue on GitHub.

Filesystem Rubric: an LLM-as-a-judge rubric to evaluate a coding agent's ability to correctly retrieve, analyze, and reason about information from filesystem data files.

| # | Model | Filesystem Rubric | Cost |
|---|-------|-------------------|------|
| 1 | openai/gpt-5.2-codex-xhigh | 93% | $44.46 |
| 2 | openai/gpt-5.4-xhigh | 89% | $43.52 |
| 3 | anthropic/claude-sonnet-4-6 | 88% | $116.09 |
| 4 | openai/gpt-5.2-xhigh | 87% | $76.72 |
| 5 | openai/gpt-5.3-codex-xhigh | 85% | $40.83 |
| 6 | anthropic/claude-opus-4-6 | 84% | $268.44 |
| 7 | google/gemini-3-flash | 82% | $37.80 |
| 8 | google/gemini-3.1 | 79% | $97.68 |
| 9 | openai/gpt-5-mini-high | 67% | $35.93 |
| 10 | anthropic/claude-opus-4-5-20251101 | 63% | $567.66 |
| 11 | moonshotai/kimi-k2.5 | 57% | $27.36 |
| 12 | anthropic/claude-sonnet-4-5-20250929 | 48% | $405.01 |
| 13 | anthropic/claude-haiku-4-5 | 43% | $117.55 |
| 14 | z-ai/glm-5 | 39% | $85.87 |
| 15 | minimax/minimax-m2.5 | 37% | $50.28 |
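The scores above come from an LLM-as-a-judge: a grader model reviews each agent transcript against rubric criteria and reports which criteria passed. A minimal sketch of that scoring loop follows; the rubric items, prompt format, and JSON verdict schema are illustrative assumptions, not Context-Bench's actual harness.

```python
import json

# Illustrative rubric items -- placeholders, not Context-Bench's real criteria.
RUBRIC = [
    "Retrieved the relevant file(s) from the filesystem",
    "Analyzed the file contents correctly",
    "Final answer is supported by the retrieved information",
]

def build_judge_prompt(rubric, transcript):
    """Format a judge prompt asking for a pass/fail verdict per criterion."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(rubric))
    return (
        "You are grading an agent transcript against a rubric.\n"
        f"Rubric:\n{criteria}\n\n"
        f"Transcript:\n{transcript}\n\n"
        'Reply with JSON: {"passed": [<criterion numbers that passed>]}'
    )

def score(rubric, judge_reply):
    """Convert the judge's JSON verdict into a percentage score."""
    passed = set(json.loads(judge_reply)["passed"])
    return 100.0 * len(passed) / len(rubric)
```

In a real harness, `build_judge_prompt(...)` would be sent to the grader model and its reply passed to `score`; for example, a verdict of `{"passed": [1, 3]}` against the three-item rubric above scores about 66.7%.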
Task Completion Rubric: an LLM-as-a-judge rubric to evaluate the agent's ability to complete tasks.
Skill Use Rubric: an LLM-as-a-judge rubric to evaluate the agent's ability to select, load, and use skills.

| # | Model | Task Completion Rubric | Skill Use Rubric |
|---|-------|------------------------|------------------|
| 1 | openai/gpt-5.2-2025-12-11 (xhigh) | 85.31% | 63.12% |
| 2 | openai/gpt-5.2-2025-12-11 (high) | 84.47% | 54.21% |
| 3 | anthropic/claude-opus-4-6 | 81.32% | 62.64% |
| 4 | anthropic/claude-sonnet-4-6 | 79.08% | 64.8% |
| 5 | openai/gpt-5.2-2025-12-11 (medium) | 77.55% | 49.74% |
| 6 | anthropic/claude-sonnet-4-5-20250929 | 76.5% | 72% |
| 7 | anthropic/claude-opus-4-5-20251101 | 75.54% | 68.82% |
| 8 | deepseek/deepseek-chat | 75.33% | 53.62% |
| 9 | anthropic/claude-opus-4-1-20250805 | 74.4% | 71.8% |
| 10 | google/gemini-3-pro | 72.73% | 64.29% |
| 11 | deepseek/deepseek-reasoner | 70.48% | 56.12% |
| 12 | openai/gpt-5-2025-08-07 | 70.2% | 51.4% |
| 13 | anthropic/claude-haiku-4-5-20251001 | 69.7% | 57.3% |
| 14 | openai/gpt-5-mini-2025-08-07 | 68.8% | 45.5% |
| 15 | z-ai/glm-4.6 | 65.9% | 50% |
| 16 | openai/gpt-5.1-codex | 64.84% | 55.73% |
| 17 | openai/gpt-5.1 | 63.75% | 55.75% |
| 18 | google/gemini-3-flash | 59.54% | 59.54% |
| 19 | mistralai/mistral-large-3 | 56.25% | 36.93% |
| 20 | openai/gpt-5-nano-2025-08-07 | 52.8% | 24% |
| 21 | openai/gpt-5.1-codex-mini | 49.72% | 23.58% |
| 22 | openai/gpt-4.1-2025-04-14 | 36.1% | 31.3% |

Leaderboard Updates

- March 13, 2026
- February 17, 2026
- February 5, 2026
- December 18, 2025
- December 11, 2025
- December 9, 2025
- November 26, 2025
- November 7, 2025
- November 4, 2025
- October 28, 2025