AI Skills Augmentation: Data Sample Dataset and Findings

By G2i AI Team / May 2026 / 6 Min read / Research

Tasks Evaluated: 20

Scored Runs: 1,820

Top Model Lift: +52pp

Introduction

Most enterprise AI deployments stall at the same point: the models are increasingly capable but don’t know your team’s taxonomy, your incident classification system, or your internal configuration vocabulary. They can read the logs. They can’t speak your language. Skill documents close that gap, and this dataset shows by how much across four frontier models.

What We Built

G2i built a curated 20-task data sample dataset for the SkillsBench evaluation program. Each task is a real-world engineering or DevOps scenario. An agent is paired with a skill document, works through the environment, and its output is scored against a test suite.

Validation ran four frontier models across all tasks, with and without skills loaded, producing 1,820 scored runs. The 20 submitted tasks were selected for reliable performance lift, cross-model consistency, and clean run histories (no flaky tests, no environment failures).

How the Dataset Is Organized

The 20-task portfolio covers five major domains:

Task count by domain across the 20-task portfolio

| Domain | Tasks |
| --- | --- |
| Engineering: General SWE | 9 |
| DevOps: Incident Response & Observability | 5 |
| DevOps: Infrastructure (Kubernetes, Cloud) | 2 |
| Engineering: Security & Compliance | 3 |
| Other (SRE, Networking, Messaging) | 1 |

Diversity was a deliberate design goal. Tasks vary across domain, the number of skills required (1 to 4+), environment complexity (minimal single-file setups to 16+ file environments with distractors), test count, and difficulty tier.

Difficulty Classification

Tasks are graded using the official SkillsBench rubric (arXiv:2602.12670v1), which defines three tiers based on estimated human completion time for a median domain specialist working without AI assistance:

SkillsBench difficulty tiers by estimated human completion time

| Tier | Human time | Typical baseline (no skills) |
| --- | --- | --- |
| Easy (Core) | Under 60 minutes | Above 50% |
| Medium (Extended) | 1–4 hours | 10–50% |
| Hard (Extreme) | 4+ hours | Under 15% |
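
The time boundaries in the tier table can be written as a small lookup. This is a minimal sketch, not code from the SkillsBench rubric: the function name `difficulty_tier` is hypothetical, and the handling of the exact boundaries (60 minutes counts as Medium, 240 minutes counts as Hard) is an assumption, because the table lists both "1–4 hours" and "4+ hours" without saying which side is inclusive.

```python
def difficulty_tier(human_minutes: float) -> str:
    """Map estimated human completion time (in minutes) to a tier label.

    Assumption: exactly 60 minutes is Medium and exactly 240 minutes is
    Hard; the published table does not state which side is inclusive.
    """
    if human_minutes < 0:
        raise ValueError("completion time cannot be negative")
    if human_minutes < 60:
        return "Easy (Core)"
    if human_minutes < 240:
        return "Medium (Extended)"
    return "Hard (Extreme)"


for minutes in (45, 90, 300):
    print(minutes, "->", difficulty_tier(minutes))
```

The sketch uses only the time estimate. The "typical baseline" column is descriptive and is not used here as a classification input.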

The data sample portfolio contains 19 medium tasks and 1 hard task. Easy tasks were excluded because models already solve them without skill augmentation, which leaves little lift to measure. The skew toward medium over hard is also deliberate: medium tasks with zero baselines generate a cleaner signal than hard tasks, where even the best models partially fail with skills loaded.

What the Data Shows

Skill documents produce large, consistent performance gains when the task follows the investigation pattern: the agent reads system state or logs, the skill teaches an enumerated vocabulary of failure modes, and the agent classifies its findings accordingly. This pattern accounts for the strongest results in the dataset.
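
One way to picture the investigation pattern is as a check against a closed vocabulary. The sketch below is illustrative only: the `FailureMode` values and the `parse_classification` helper are hypothetical, and it assumes a test suite that accepts only exact vocabulary labels, which the source does not specify.

```python
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    # Hypothetical vocabulary; a real skill document defines its own terms.
    STALE_CACHE = "stale_cache"
    CONFIG_OVERRIDE = "config_override"
    CREDENTIAL_EXPIRED = "credential_expired"


def parse_classification(label: str) -> Optional[FailureMode]:
    """Return the matching enum member, or None if the label is off-vocabulary."""
    try:
        return FailureMode(label.strip().lower())
    except ValueError:
        return None


# A free-text description may be accurate but is not in the vocabulary.
print(parse_classification("a drop-in file seems to shadow the main config"))
# An enumerated label is something an exact-match test can verify.
print(parse_classification("config_override"))
```

Under this assumption, an agent without the skill can describe the problem correctly and still fail, because it has no way to know which labels the team treats as valid.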

The gap the benchmark measures

Without skills, agents can observe and describe. With skills, they classify and act using the same language your team uses. That gap is what the benchmark measures.

Performance across all four models:

Average task pass rate with and without skill documents, by model

| Model | With skills | Without skills | Avg delta |
| --- | --- | --- | --- |
| claude-opus-4-6 | 89% | 37% | +52pp |
| claude-sonnet-4-6 | 79% | 37% | +42pp |
| openai/gpt-5.4 | 57% | 18% | +39pp |
| openai/gpt-5.3-codex | 49% | 17% | +32pp |
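
The delta column is expressed in percentage points (pp), meaning a difference between two pass rates rather than a ratio. The following sketch recomputes it from the published figures. It assumes the average delta equals the with-skills average minus the without-skills average, which the numbers above are consistent with; the names `PASS_RATES` and `delta_pp` are hypothetical.

```python
# Pass rates in percent, copied from the table above: (with skills, without skills).
PASS_RATES = {
    "claude-opus-4-6": (89, 37),
    "claude-sonnet-4-6": (79, 37),
    "openai/gpt-5.4": (57, 18),
    "openai/gpt-5.3-codex": (49, 17),
}


def delta_pp(with_skills: int, without_skills: int) -> int:
    """Lift in percentage points: a difference of percentages, not a ratio."""
    return with_skills - without_skills


deltas = {model: delta_pp(*rates) for model, rates in PASS_RATES.items()}
top_model = max(deltas, key=deltas.get)

for model, delta in deltas.items():
    print(f"{model}: +{delta}pp")
print(f"top lift: {top_model} (+{deltas[top_model]}pp)")
```

Reporting the lift in percentage points keeps models with different baselines comparable; a relative figure would exaggerate gains for models that start near zero.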

How to Use This Dataset

The portfolio is structured for immediate use in the SkillsBench data sample evaluation. Each task is self-contained and includes the scenario environment, the skill document, and the test suite. The grading and scoring methodology follows the SkillsBench standard, so results are directly comparable to other submissions in the program.

Beyond the data sample, the dataset illustrates a broader principle: the gap between what a model knows from training and what it needs to know to operate in a specific team’s environment is measurable and significant. The 20 tasks in this portfolio demonstrate that across security, DevOps, general software engineering, and infrastructure domains.

Appendix: Supporting Data

A. Strongest Tasks in the Portfolio

Ranked by binary-scoring delta; the lift was consistent across all four models:

Strongest tasks ranked by binary delta, consistent across all four models

| Task | Delta | Domain |
| --- | --- | --- |
| phantom-config-override | +70pp | Engineering / Config Management |
| ownership-inference | +64pp | Engineering / General SWE |
| config-inotify-partial | +60pp | DevOps / Infrastructure |
| sabot-trace | +50pp | Engineering / General SWE |
| haskell-strict-map-leak | +48pp | Engineering / Performance |
| secret-rotation-failure | +46pp | Engineering / Security |
| sg-conntrack-revoke-persist | +45pp | Engineering / Networking |
| custom-load-balancer | +40pp | DevOps / Infrastructure |
| rate-limit-bypass-investigation | +40pp | Engineering / Security |

B. Zero-Baseline Tasks

Tasks where all four models scored 0% without skills. The skill document is the only path to a passing result.

Zero-baseline tasks: all four models scored 0% without skills

| Task | With-skills delta | What the skill provides |
| --- | --- | --- |
| ownership-inference | +64pp | Weighted ownership scoring + stale-fragment rules |
| phantom-config-override | +70pp | systemd override mechanism enum + recommended_action enum |
| sabot-trace | +50pp | Sabot stack-language semantics reference |
| rate-limiter-ip-leak | +16pp | Gin middleware ordering + trusted-IP resolution |
| mruby-utf8-byte-search | +30pp | UTF-8 boundary + cursor forward-progress invariants |
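
The zero-baseline criterion is a filter over per-model pass rates without skills. A minimal sketch follows; the function name, the task and model labels, and every number in `runs` are placeholders for illustration and are not results from the dataset.

```python
def is_zero_baseline(no_skill_rates: dict[str, float]) -> bool:
    """True when every model's pass rate without skills is exactly 0."""
    return bool(no_skill_rates) and all(
        rate == 0 for rate in no_skill_rates.values()
    )


# Placeholder numbers for illustration only; not results from the dataset.
runs = {
    "task-a": {"model-1": 0.0, "model-2": 0.0},
    "task-b": {"model-1": 0.0, "model-2": 0.25},
}

zero_baseline = [task for task, rates in runs.items() if is_zero_baseline(rates)]
print(zero_baseline)
```

For a task that passes this filter, the with-skills delta is the same number as the with-skills pass rate, because the baseline it is measured against is zero.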

C. Dataset Composition

Environment complexity (files per scenario):

Dataset composition by environment file count

| Environment files | Count |
| --- | --- |
| 1–3 files | 4 |
| 4–7 files | 4 |
| 8–15 files | 7 |
| 16+ files | 5 |

D. Referenced Standards and Research

SkillsBench (arXiv:2602.12670v1)

Defines the three-tier difficulty rubric, task.toml schema, and binary/partial scoring methodology used throughout this engagement.

Harbor Evaluation Framework

Open-source agentic evaluation harness used for container build, skill injection, test execution, and run logging (laude-institute/harbor).

Contextual benchmarks

SWE-bench Verified (Princeton/OpenAI, 2024) established agent evaluation on real GitHub issues with verified test suites; SkillsBench extends this to knowledge-augmented agents. GAIA (Mialon et al., 2023) benchmarks general assistant capabilities across tool use and multi-step reasoning; SkillsBench tasks are narrower in scope and deeper in domain specificity. HumanEval / MBPP code-generation baselines confirm that strong HumanEval-class performance does not translate to domain classification tasks without explicit vocabulary augmentation.

Benchmark conducted April 2026 by G2i, Inc. Difficulty classifications per the SkillsBench rubric (arXiv:2602.12670v1).
