AI Skills Augmentation: Data Sample Dataset and Findings

Tasks Evaluated: 20
Scored Runs: 1,820
Top Model Lift: +52pp

Introduction

Most enterprise AI deployments stall at the same point: the models are increasingly capable, but they don’t know your team’s taxonomy, your incident classification system, or your internal configuration vocabulary. They can read the logs. They can’t speak your language. Skill documents close that gap, and this dataset shows it across 1,820 scored runs on four frontier models.

What We Built

G2i built a curated 20-task data sample dataset for the SkillsBench evaluation program. Each task is a real-world engineering or DevOps scenario. An agent is paired with a skill document, works through the environment, and its output is scored against a test suite.

Validation ran four frontier models across all tasks, with and without skills loaded, producing 1,820 scored runs. The 20 submitted tasks were selected for reliable performance lift, cross-model consistency, and clean run histories (no flaky tests, no environment failures).
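The shape of a single scored run can be sketched as a small record. This is a minimal illustration, not the SkillsBench or Harbor schema: the class name, the field names, and the test counts in the example are all hypothetical. It also assumes that "binary" scoring means every test in the suite passed and that "partial" scoring means the fraction of tests passed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredRun:
    """One agent attempt at one task (hypothetical record layout)."""

    task: str
    model: str
    skills_loaded: bool
    tests_passed: int
    tests_total: int

    @property
    def partial_score(self) -> float:
        # Assumed definition: fraction of the test suite that passed.
        if self.tests_total == 0:
            return 0.0
        return self.tests_passed / self.tests_total

    @property
    def binary_pass(self) -> bool:
        # Assumed definition: the run passes only if every test passed.
        return self.tests_total > 0 and self.tests_passed == self.tests_total


# Illustrative values only; the test counts are not taken from the dataset.
run = ScoredRun(
    task="phantom-config-override",
    model="claude-opus-4-6",
    skills_loaded=True,
    tests_passed=7,
    tests_total=8,
)
```

Under these assumptions, a run that passes 7 of 8 tests earns partial credit but does not count as a binary pass.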

How the Dataset Is Organized

The 20-task portfolio covers five major domains:

Task count by domain across the 20-task portfolio
Domain	Tasks
Engineering: General SWE	9
DevOps: Incident Response & Observability	5
DevOps: Infrastructure (Kubernetes, Cloud)	2
Engineering: Security & Compliance	3
Other (SRE, Networking, Messaging)	1

Diversity was a deliberate design goal. Tasks vary in domain, number of skills required (1 to 4+), environment complexity (from minimal single-file setups to environments of 16+ files with distractors), test count, and difficulty tier.

Difficulty Classification

Tasks are graded using the official SkillsBench rubric (arXiv:2602.12670v1), which defines three tiers based on estimated human completion time for a median domain specialist working without AI assistance:

SkillsBench difficulty tiers by estimated human completion time
Tier	Human time	Typical baseline (no skills)
Easy (Core)	Under 60 minutes	Above 50%
Medium (Extended)	1–4 hours	10–50%
Hard (Extreme)	4+ hours	Under 15%
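The time thresholds in the table above can be expressed as a small lookup. This is a sketch under stated assumptions: the function name is hypothetical, and the handling of the exact boundaries (60 minutes counted as Medium, 4 hours counted as Hard) is our reading of "Under 60 minutes" and "4+ hours", not something the rubric text quoted here spells out.

```python
def difficulty_tier(human_minutes: float) -> str:
    """Map estimated human completion time (minutes) to a tier label."""
    if human_minutes < 0:
        raise ValueError("completion time cannot be negative")
    if human_minutes < 60:
        return "Easy (Core)"
    if human_minutes < 240:  # 1 hour up to, but not including, 4 hours
        return "Medium (Extended)"
    return "Hard (Extreme)"  # 4 hours or more


# A task estimated at 90 minutes for a median domain specialist:
tier = difficulty_tier(90)
```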

The data sample portfolio contains 19 medium tasks and 1 hard task. Easy tasks were excluded because models solve them without any skill augmentation. The skew toward medium over hard is also deliberate: tasks with zero baselines generate cleaner signal than hard tasks, where even the best models partially fail with skills loaded.

What the Data Shows

Skill documents produce large, consistent performance gains when the task follows the investigation pattern: the agent reads system state or logs, the skill teaches an enumerated vocabulary of failure modes, and the agent classifies its findings accordingly. This pattern accounts for the strongest results in the dataset.
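To make the investigation pattern concrete, the sketch below checks whether an agent's finding uses an enumerated vocabulary. Everything in it is hypothetical: the enum names, their members, and the finding fields are invented for illustration and are not copied from any skill document or test suite in the dataset.

```python
from enum import Enum


class FailureMode(Enum):
    # Hypothetical vocabulary a skill document might teach.
    CONFIG_OVERRIDE = "config_override"
    STALE_CREDENTIAL = "stale_credential"
    MIDDLEWARE_ORDER = "middleware_order"


class RecommendedAction(Enum):
    # Hypothetical action vocabulary paired with the failure modes.
    REMOVE_OVERRIDE = "remove_override"
    ROTATE_SECRET = "rotate_secret"
    REORDER_MIDDLEWARE = "reorder_middleware"


def vocabulary_errors(finding: dict[str, str]) -> list[str]:
    """Return one message per field whose value is outside its vocabulary."""
    errors = []
    checks = (
        ("failure_mode", FailureMode),
        ("recommended_action", RecommendedAction),
    )
    for field, vocab in checks:
        allowed = {member.value for member in vocab}
        value = finding.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not one of {sorted(allowed)}")
    return errors


# An agent without the skill can describe the problem in free text...
free_text = {
    "failure_mode": "a config file seems to win over the unit file",
    "recommended_action": "fix it",
}
# ...while an agent with the skill answers in the enumerated vocabulary.
classified = {
    "failure_mode": "config_override",
    "recommended_action": "remove_override",
}
```

In this sketch the free-text finding fails both vocabulary checks even though its description may be accurate, while the classified finding passes. That is one way a test suite could separate describing a problem from classifying it.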

The gap the benchmark measures

Without skills, agents can observe and describe. With skills, they classify and act using the same language your team uses. That gap is what the benchmark measures.

Performance across all four models:

Average task pass rate with and without skill documents, by model
Model	With skills	Without skills	Avg delta
claude-opus-4-6	89%	37%	+52pp
claude-sonnet-4-6	79%	37%	+42pp
openai/gpt-5.4	57%	18%	+39pp
openai/gpt-5.3-codex	49%	17%	+32pp
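The delta column is a difference in percentage points (pp), not a relative percentage change. A short sketch recomputes it from the pass rates in the table above; the variable and function names are ours.

```python
# (with skills, without skills) average pass rates in percent, from the table.
PASS_RATES = {
    "claude-opus-4-6": (89, 37),
    "claude-sonnet-4-6": (79, 37),
    "openai/gpt-5.4": (57, 18),
    "openai/gpt-5.3-codex": (49, 17),
}


def delta_pp(with_skills: int, without_skills: int) -> int:
    """Lift in percentage points: a plain difference of two pass rates."""
    return with_skills - without_skills


deltas = {model: delta_pp(*rates) for model, rates in PASS_RATES.items()}
top_model = max(deltas, key=deltas.get)  # model with the largest lift
```

For claude-opus-4-6 this gives 89 - 37 = 52, which matches the +52pp top model lift.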

How to Use This Dataset

The portfolio is structured for immediate use in the SkillsBench data sample evaluation. Each task is self-contained and includes the scenario environment, the skill document, and the test suite. The grading and scoring methodology follows the SkillsBench standard, so results are directly comparable to other submissions in the program.

Beyond the data sample, the dataset illustrates a broader principle: the gap between what a model knows from training and what it needs to know to operate in a specific team’s environment is measurable and significant. The 20 tasks in this portfolio demonstrate that across security, DevOps, general software engineering, and infrastructure domains.

Appendix: Supporting Data

A. Strongest Tasks in the Portfolio

Tasks are ranked by binary delta; the lift is consistent across all four models:

Strongest tasks ranked by binary delta, consistent across all four models
Task	Delta	Domain
phantom-config-override	+70pp	Engineering / Config Management
ownership-inference	+64pp	Engineering / General SWE
config-inotify-partial	+60pp	DevOps / Infrastructure
sabot-trace	+50pp	Engineering / General SWE
haskell-strict-map-leak	+48pp	Engineering / Performance
secret-rotation-failure	+46pp	Engineering / Security
sg-conntrack-revoke-persist	+45pp	Engineering / Networking
custom-load-balancer	+40pp	DevOps / Infrastructure
rate-limit-bypass-investigation	+40pp	Engineering / Security

B. Zero-Baseline Tasks

Tasks where all four models scored 0% without skills. The skill document is the only path to a passing result.

Zero-baseline tasks — all four models scored 0% without skills
Task	With-skills delta	What the skill provides
ownership-inference	+64pp	Weighted ownership scoring + stale-fragment rules
phantom-config-override	+70pp	systemd override mechanism enum + recommended_action enum
sabot-trace	+50pp	Sabot stack-language semantics reference
rate-limiter-ip-leak	+16pp	Gin middleware ordering + trusted-IP resolution
mruby-utf8-byte-search	+30pp	UTF-8 boundary + cursor forward-progress invariants
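The zero-baseline criterion can be sketched as a predicate over per-model pass rates without skills. The function names are hypothetical, and the second helper assumes the reported delta is the with-skills pass rate minus the no-skills pass rate, so that a 0% baseline makes the two numbers equal.

```python
MODELS = (
    "claude-opus-4-6",
    "claude-sonnet-4-6",
    "openai/gpt-5.4",
    "openai/gpt-5.3-codex",
)

# With-skills deltas (pp) for the zero-baseline tasks, from the table above.
ZERO_BASELINE_DELTAS = {
    "ownership-inference": 64,
    "phantom-config-override": 70,
    "sabot-trace": 50,
    "rate-limiter-ip-leak": 16,
    "mruby-utf8-byte-search": 30,
}


def is_zero_baseline(no_skill_rates: dict[str, float]) -> bool:
    """True only if every one of the four models scored 0% without skills."""
    return set(no_skill_rates) == set(MODELS) and all(
        rate == 0 for rate in no_skill_rates.values()
    )


def with_skills_rate(baseline_pct: float, delta: float) -> float:
    """Assumed relation: with-skills pass rate = baseline + delta (in pp)."""
    return baseline_pct + delta


# With a 0% baseline, the delta is itself the with-skills pass rate.
implied_rates = {
    task: with_skills_rate(0, delta) for task, delta in ZERO_BASELINE_DELTAS.items()
}
```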

C. Dataset Composition

Environment complexity (files per scenario):

Dataset composition by environment file count
Environment files	Count
1–3 files	4
4–7 files	4
8–15 files	7
16+ files	5

D. Referenced Standards and Research

SkillsBench (arXiv:2602.12670v1)

Defines the three-tier difficulty rubric, task.toml schema, and binary/partial scoring methodology used throughout this engagement.

Harbor Evaluation Framework

Open-source agentic evaluation harness used for container build, skill injection, test execution, and run logging (laude-institute/harbor).

Contextual benchmarks

SWE-bench Verified (Princeton/OpenAI, 2024) established agent evaluation on real GitHub issues with verified test suites; SkillsBench extends this to knowledge-augmented agents.

GAIA (Mialon et al., 2023) benchmarks general assistant capabilities across tool use and multi-step reasoning; SkillsBench tasks are narrower in scope and deeper in domain specificity.

HumanEval / MBPP code-generation baselines confirm that strong HumanEval-class performance does not translate to domain classification tasks without explicit vocabulary augmentation.

Benchmark conducted April 2026 by G2i, Inc. Difficulty classifications per the SkillsBench rubric (arXiv:2602.12670v1).
