| Tasks Evaluated | Scored Runs | Top Model Lift |
|---|---|---|
| 20 | 1,820 | +52pp |
Introduction
Most enterprise AI deployments stall at the same point: the models are increasingly capable, but they don’t know your team’s taxonomy, your incident classification system, or your internal configuration vocabulary. They can read the logs. They can’t speak your language. Skill documents close that gap, and this dataset measures by how much: an average lift of 32 to 52 percentage points across four frontier models.
What We Built
G2i built a curated 20-task data sample for the SkillsBench evaluation program. Each task is a real-world engineering or DevOps scenario: an agent is paired with a skill document, works through the environment, and has its output scored against a test suite.
Validation ran four frontier models across all tasks, with and without skills loaded, producing 1,820 scored runs. The 20 submitted tasks were selected for reliable performance lift, cross-model consistency, and clean run histories (no flaky tests, no environment failures).
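As a rough check on the run count, the evaluation grid can be sketched as below. The grid dimensions come from the text above; the per-cell average is derived here and is not a figure reported in the source.

```python
# Evaluation grid implied by the text: tasks x models x conditions.
TASKS = 20
MODELS = 4
CONDITIONS = 2  # with skills loaded, without skills loaded
SCORED_RUNS = 1_820

cells = TASKS * MODELS * CONDITIONS
avg_runs_per_cell = SCORED_RUNS / cells

print(f"{cells} cells, {avg_runs_per_cell:.3f} scored runs per cell on average")
# prints "160 cells, 11.375 scored runs per cell on average"
```

The average is not a whole number, so either the cells did not all receive the same number of runs or the total includes runs outside this 20-task grid. The source does not say which.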
How the Dataset Is Organized
The 20-task portfolio covers five major domains:
| Domain | Tasks |
|---|---|
| Engineering: General SWE | 9 |
| DevOps: Incident Response & Observability | 5 |
| DevOps: Infrastructure (Kubernetes, Cloud) | 2 |
| Engineering: Security & Compliance | 3 |
| Other (SRE, Networking, Messaging) | 1 |
Diversity was a deliberate design goal. Tasks vary across domain, the number of skills required (1 to 4+), environment complexity (minimal single-file setups to 16+ file environments with distractors), test count, and difficulty tier.
Difficulty Classification
Tasks are classified using the official SkillsBench rubric (arXiv:2602.12670v1), which defines three difficulty tiers based on the estimated completion time for a median domain specialist working without AI assistance:
| Tier | Human time | Typical baseline (no skills) |
|---|---|---|
| Easy (Core) | Under 60 minutes | Above 50% |
| Medium (Extended) | 1–4 hours | 10–50% |
| Hard (Extreme) | 4+ hours | Under 15% |
The data sample portfolio contains 19 medium tasks and 1 hard task. Easy tasks were excluded because models solve them without any skill augmentation. The skew toward medium over hard is also deliberate: medium tasks with zero baselines generate a cleaner signal than hard tasks, where even the best models partially fail with skills loaded.
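The human-time column of the tier table can be read as a simple threshold rule. The sketch below is illustrative only: the function name is hypothetical, and treating exactly 4 hours as Hard is an assumption, since the table lists both "1–4 hours" and "4+ hours".

```python
from enum import Enum


class Tier(Enum):
    EASY = "Easy (Core)"
    MEDIUM = "Medium (Extended)"
    HARD = "Hard (Extreme)"


def tier_for_human_minutes(minutes: float) -> Tier:
    """Map estimated specialist completion time to a difficulty tier."""
    if minutes < 60:
        return Tier.EASY
    if minutes < 240:  # assumption: exactly 4 hours falls into Hard
        return Tier.MEDIUM
    return Tier.HARD


print(tier_for_human_minutes(45).value)   # Easy (Core)
print(tier_for_human_minutes(150).value)  # Medium (Extended)
print(tier_for_human_minutes(300).value)  # Hard (Extreme)
```

Only the time thresholds are encoded here. The "typical baseline" column describes observed model performance and is not used as a classification input in this sketch.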
What the Data Shows
Skill documents produce large, consistent performance gains when the task follows the investigation pattern: the agent reads system state or logs, the skill teaches an enumerated vocabulary of failure modes, and the agent classifies its findings accordingly. This pattern accounts for the strongest results in the dataset.
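To make the investigation pattern concrete, here is a minimal sketch of how a test might check that an agent's finding uses an enumerated vocabulary. The `FailureMode` values, the `failure_mode` field, and the `uses_vocabulary` helper are all hypothetical; they are not taken from any SkillsBench task or skill document.

```python
from enum import Enum


class FailureMode(Enum):
    # Hypothetical vocabulary a skill document might define.
    CONFIG_DRIFT = "config_drift"
    STALE_CACHE = "stale_cache"
    PERMISSION_DENIED = "permission_denied"


def uses_vocabulary(finding: dict) -> bool:
    """True if the finding's 'failure_mode' is one of the enumerated values."""
    allowed = {mode.value for mode in FailureMode}
    return finding.get("failure_mode") in allowed


# Without the skill, an agent may describe the problem in free text.
free_text = {"failure_mode": "the config on disk differs from what is running"}
# With the skill, it can answer in the team's own terms.
classified = {"failure_mode": "config_drift"}

print(uses_vocabulary(free_text))   # False
print(uses_vocabulary(classified))  # True
```

Both findings describe the same observation, but only the second would pass a test that expects an exact enum value. That is one plausible reason the without-skills score can sit at or near zero on such tasks.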
The gap the benchmark measures
Without skills, agents can observe and describe. With skills, they classify and act using the same language your team uses.
Performance across all four models:
| Model | With skills | Without skills | Avg delta |
|---|---|---|---|
| claude-opus-4-6 | 89% | 37% | +52pp |
| claude-sonnet-4-6 | 79% | 37% | +42pp |
| openai/gpt-5.4 | 57% | 18% | +39pp |
| openai/gpt-5.3-codex | 49% | 17% | +32pp |
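The delta column is a difference in percentage points (with-skills score minus without-skills score), not a relative change. A short sketch that recomputes it from the table above, using only the published figures:

```python
# (with skills %, without skills %) per model, copied from the table above.
results = {
    "claude-opus-4-6": (89, 37),
    "claude-sonnet-4-6": (79, 37),
    "openai/gpt-5.4": (57, 18),
    "openai/gpt-5.3-codex": (49, 17),
}

# Percentage-point delta: absolute difference between the two scores.
deltas = {
    model: with_skills - without_skills
    for model, (with_skills, without_skills) in results.items()
}
top_model = max(deltas, key=deltas.get)

for model, delta in deltas.items():
    print(f"{model}: +{delta}pp")
print(f"top lift: {top_model} (+{deltas[top_model]}pp)")
# last line prints "top lift: claude-opus-4-6 (+52pp)"
```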
How to Use This Dataset
The portfolio is structured for immediate use in the SkillsBench data sample evaluation. Each task is self-contained and includes the scenario environment, the skill document, and the test suite. The grading and scoring methodology follows the SkillsBench standard, so results are directly comparable to other submissions in the program.
Beyond the data sample, the dataset illustrates a broader principle: the gap between what a model knows from training and what it needs to know to operate in a specific team’s environment is measurable and significant. The 20 tasks in this portfolio demonstrate that across security, DevOps, general software engineering, and infrastructure domains.
Appendix: Supporting Data
A. Strongest Tasks in the Portfolio
Ranked by binary delta; the lift is consistent across all four models:
| Task | Delta | Domain |
|---|---|---|
| phantom-config-override | +70pp | Engineering / Config Management |
| ownership-inference | +64pp | Engineering / General SWE |
| config-inotify-partial | +60pp | DevOps / Infrastructure |
| sabot-trace | +50pp | Engineering / General SWE |
| haskell-strict-map-leak | +48pp | Engineering / Performance |
| secret-rotation-failure | +46pp | Engineering / Security |
| sg-conntrack-revoke-persist | +45pp | Engineering / Networking |
| custom-load-balancer | +40pp | DevOps / Infrastructure |
| rate-limit-bypass-investigation | +40pp | Engineering / Security |
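One way to read the "binary delta" used in this ranking is sketched below. It assumes each run is scored pass or fail and that the delta is the with-skills pass rate minus the without-skills pass rate, in percentage points. That reading, the function names, and the run outcomes are illustrative assumptions, not the published SkillsBench scoring code or real run data.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Percentage of runs that passed; each run is a binary pass/fail."""
    if not outcomes:
        raise ValueError("at least one run is required")
    return 100 * sum(outcomes) / len(outcomes)


def binary_delta_pp(with_skills: list[bool], without_skills: list[bool]) -> float:
    """With-skills pass rate minus without-skills pass rate, in points."""
    return pass_rate(with_skills) - pass_rate(without_skills)


# Hypothetical outcomes for one task: 7 of 10 passes with the skill, 0 of 10 without.
with_skills_runs = [True] * 7 + [False] * 3
without_skills_runs = [False] * 10

print(f"+{binary_delta_pp(with_skills_runs, without_skills_runs):.0f}pp")  # +70pp
```

Under this reading, a task with a 0% baseline has a delta equal to its with-skills pass rate.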
B. Zero-Baseline Tasks
These are tasks where all four models scored 0% without skills, so the skill document is the only path to a passing result. With a 0% baseline, the delta equals the with-skills score.
| Task | With-skills delta | What the skill provides |
|---|---|---|
| ownership-inference | +64pp | Weighted ownership scoring + stale-fragment rules |
| phantom-config-override | +70pp | systemd override mechanism enum + recommended_action enum |
| sabot-trace | +50pp | Sabot stack-language semantics reference |
| rate-limiter-ip-leak | +16pp | Gin middleware ordering + trusted-IP resolution |
| mruby-utf8-byte-search | +30pp | UTF-8 boundary + cursor forward-progress invariants |
C. Dataset Composition
Environment complexity (files per scenario):
| Environment files | Count |
|---|---|
| 1–3 files | 4 |
| 4–7 files | 4 |
| 8–15 files | 7 |
| 16+ files | 5 |
D. Referenced Standards and Research
SkillsBench (arXiv:2602.12670v1)
Defines the three-tier difficulty rubric, task.toml schema, and binary/partial scoring methodology used throughout this engagement.
Harbor Evaluation Framework
Open-source agentic evaluation harness used for container build, skill injection, test execution, and run logging (laude-institute/harbor).
Contextual benchmarks
SWE-bench Verified (Princeton/OpenAI, 2024) established agent evaluation on real GitHub issues with verified test suites; SkillsBench extends this to knowledge-augmented agents. GAIA (Mialon et al., 2023) benchmarks general assistant capabilities across tool use and multi-step reasoning; SkillsBench tasks are narrower in scope and deeper in domain specificity. HumanEval and MBPP serve as code-generation baselines; strong HumanEval-class performance does not translate to domain classification tasks without explicit vocabulary augmentation.
Benchmark conducted April 2026 by G2i, Inc. Difficulty classifications per the SkillsBench rubric (arXiv:2602.12670v1).
