Context-Bench measures an agent's ability to perform context engineering with:

Filesystem Suite: Evaluates how well language models can chain file operations, trace entity relationships, and manage multi-step information retrieval.
Skills Suite: Evalutes how well language models can discover and load relevant skills from a library to complete tasks.

Last updated: Mar 13, 2026

Interested in contributing a task or model? Email us or open an issue on GitHub.

#

Model

Filesystem Rubric

1

openai/gpt-5.2-codex-xhigh

93%$44.46

2

openai/gpt-5.4-xhigh

89%$43.52

3

anthropic/claude-sonnet-4-6

88%$116.09

4

openai/gpt-5.2-xhigh

87%$76.72

5

openai/gpt-5.3-codex-xhigh

85%$40.83

6

anthropic/claude-opus-4-6

84%$268.44

7

google/gemini-3-flash

82%$37.8

8

google/gemini-3.1

79%$97.68

9

openai/gpt-5-mini-high

67%$35.93

10

anthropic/claude-opus-4-5-20251101

63%$567.66

11

moonshotai/kimi-k2.5

57%$27.36

12

anthropic/claude-sonnet-4-5-20250929

48%$405.01

13

anthropic/claude-haiku-4-5

43%$117.55

14

z-ai/glm-5

39%$85.87

15

minimax/minimax-m2.5

37%$50.28

#

Model

Task Completion Rubric

Skill Use Rubric

1

openai/gpt-5.2-2025-12-11 (xhigh)

85.31%

63.12%

2

openai/gpt-5.2-2025-12-11 (high)

84.47%

54.21%

3

anthropic/claude-opus-4-6

81.32%

62.64%

4

anthropic/claude-sonnet-4-6

79.08%

64.8%

5

openai/gpt-5.2-2025-12-11 (medium)

77.55%

49.74%

6

anthropic/claude-sonnet-4-5-20250929

76.5%

72%

7

anthropic/claude-opus-4-5-20251101

75.54%

68.82%

8

deepseek/deepseek-chat

75.33%

53.62%

9

anthropic/claude-opus-4-1-20250805

74.4%

71.8%

10

google/gemini-3-pro

72.73%

64.29%

11

deepseek/deepseek-reasoner

70.48%

56.12%

12

openai/gpt-5-2025-08-07

70.2%

51.4%

13

anthropic/claude-haiku-4-5-20251001

69.7%

57.3%

14

openai/gpt-5-mini-2025-08-07

68.8%

45.5%

15

z-ai/glm-4.6

65.9%

50%

16

openai/gpt-5.1-codex

64.84%

55.73%

17

openai/gpt-5.1

63.75%

55.75%

18

google/gemini-3-flash

59.54%

59.54%

19

mistralai/mistral-large-3

56.25%

36.93%

20

openai/gpt-5-nano-2025-08-07

52.8%

24%

21

openai/gpt-5.1-codex-mini

49.72%

23.58%

22

openai/gpt-4.1-2025-04-14

36.1%

31.3%

Leaderboard Updates

March 13, 2026

Filesystem v2 full model refresh
- Re-ran all models on updated v2 dataset with code-verified ground truths
- Added GPT 5.4, GPT 5.3 Codex, GPT 5.2 Codex, Gemini 3.1 Pro, GLM 5, Kimi K2.5, Minimax M2.5
- GPT 5.2 Codex takes #1 on Filesystem at 93.0 with only $44.46 total cost
- GPT 5.4 at #2 (89.0) and Sonnet 4.6 at #3 (88.0)

February 17, 2026

Filesystem v2 refresh
- Letta code agents with real filesystem and client-side tools
- Multi-hop navigation and reasoning
Sonnet 4.6
- #3 on Filesystem and #4 on Skills
- 70% improvement in token efficiency with 38% improvement in accuracy over Sonnet 4.5

February 5, 2026

Opus 4.6
- #1 on Filesystem and #3 on Skills
- Same price as Opus 4.5 but less token efficient leading to higher costs

December 18, 2025

Gemini 3 Flash
- Outperforms Haiku and GPT 5 Mini on Filesystem
- Less token efficient, leading to higher costs

December 11, 2025

GPT 5.2 (medium, high, xhigh)
- #1 on Filesystem (6% ↑) and Skills (9% ↑)
- 1.5-2.1x Opus cost but still cheaper than Gemini 3 Pro

December 9, 2025

Deepseek v3.2 (reasoner, chat)
- #1 OSS and only 1% behind Sonnet 4.5
- 16% ↑ on Filesystem and 10% ↑ on Skills over GLM 4.6
Mistral Large 3
- Ranks below GPT 4.1 Mini on Filesystem

November 26, 2025

GPT 5.1, 5.1 Codex, 5.1 Codex Mini
Gemini 3 Pro
Claude Opus 4.5

November 7, 2025

Launched Skills Suite
Kimi K2 Thinking

November 4, 2025

Minimax M2

October 28, 2025

Launched leaderboard website
Launched Filesystem Suite