Context-Bench measures an agent's ability to perform context engineering across two suites:
- Filesystem Suite: Evaluates how well language models can chain file operations, trace entity relationships, and manage multi-step information retrieval.
- Skills Suite: Evaluates how well language models can discover and load relevant skills from a library to complete tasks.
Interested in contributing a task or model? Email us or open an issue on GitHub.