Following OpenAI's methodology for SWE-bench Verified, we conducted a rigorous human verification of TypeScript tasks in Multi-SWE-bench. Our findings reveal critical quality patterns that explain low AI agent resolution rates and establish a verified subset for reliable evaluation.
Introduction
In 2024, OpenAI set a new standard for benchmark quality assurance by creating SWE-bench Verified, a rigorously human-validated 500-task subset of the original SWE-bench. This verified benchmark quickly became the gold standard for training and evaluating AI coding agents, demonstrating the critical importance of high-quality, human-validated evaluation data in advancing autonomous software engineering capabilities.
Building on this foundation, we undertook a similar verification effort focused on the TypeScript portion of Multi-SWE-bench, a multilingual benchmark introduced by ByteDance that extends software engineering evaluation beyond Python to multiple programming languages. We focused on TypeScript for two reasons: first, TypeScript tasks exhibited one of the lowest resolution rates among AI agents in the benchmark, suggesting either heightened task difficulty or quality issues that warranted closer examination; second, our organization has a historically strong community of TypeScript engineers, positioning us well to conduct a thorough technical evaluation of these tasks.
Annotation Pipeline
To ensure methodological consistency and enable meaningful comparison with existing verified benchmarks, we adopted OpenAI's annotation guidelines as our framework, implementing them without modification. This approach allows the research community to build upon established best practices while extending verification efforts to new programming languages.
Our verification process followed a multi-stage structure designed to maximize both thoroughness and quality:
Dual Evaluation
- Each task was independently assessed by two experienced developers, ensuring multiple perspectives on task validity, clarity, and solvability.
Consensus-Based Review
- Following the dual evaluation, each task proceeded to a review stage whose depth depended on whether the two developers reached consensus. Tasks with agreement moved forward quickly, while those with divergent assessments received additional scrutiny to resolve ambiguities.
Quality Control
- To validate the integrity of our annotation process, we implemented selective quality checks as a control measure, sampling a subset of evaluated tasks for additional review by senior engineers. A minimal sketch of how the full pipeline can be modeled appears below.
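To make the stages concrete, here is a minimal TypeScript sketch of how such a pipeline can be modeled. The type and function names (Annotation, needsAdjudication, sampleForQC) are hypothetical and not part of Multi-SWE-bench or the OpenAI guidelines; the rating scale mirrors the Good/Ok/Bad/Fail categories used in the results below.

```typescript
// Hypothetical data model for the three-stage verification pipeline.
type Rating = "Good" | "Ok" | "Bad" | "Fail";
type Difficulty = "<15 min" | "15 min - 1 hour" | "1-4 hours" | "4+ hours";

interface Annotation {
  taskId: string;
  annotator: string;
  specificationQuality: Rating;
  testQuality: Rating;
  estimatedDifficulty: Difficulty;
}

// Stages 1-2: each task gets two independent annotations; disagreement on any
// dimension sends the task to an additional consensus review.
function needsAdjudication(a: Annotation, b: Annotation): boolean {
  return (
    a.specificationQuality !== b.specificationQuality ||
    a.testQuality !== b.testQuality ||
    a.estimatedDifficulty !== b.estimatedDifficulty
  );
}

// Stage 3: a random sample of completed tasks is re-reviewed by senior
// engineers as a quality-control check (Fisher-Yates shuffle, then slice).
function sampleForQC<T>(tasks: T[], fraction: number): T[] {
  const shuffled = [...tasks];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return shuffled.slice(0, Math.ceil(tasks.length * fraction));
}
```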
Annotation Results
Our verification effort evaluated 210 TypeScript tasks from Multi-SWE-bench, assessing both the clarity of issue specifications and the quality of test coverage. The results reveal a mixed quality landscape with distinct patterns across specification quality and test adequacy.
Overall Quality Distribution
Issue Specification Quality
- Requirements were clear and unambiguous
- Notable gaps in clarity
- Some ambiguity present
- Failed to provide sufficient information for meaningful solution attempts

Test Quality Distribution
- Tests comprehensively covered all reasonable solutions
- Adequate but imperfect test coverage that might miss edge cases
- Tests that could reasonably reject valid solutions
- Tests that were either too narrow, too broad, or misaligned with the stated issue
Correlation Between Specification and Test Quality
A notable pattern emerges when examining the relationship between issue specification quality and test adequacy. Among the 137 well-specified tasks, 87 (63.5%) also featured excellent test coverage, establishing a core subset of 41.4% of all tasks with both dimensions rated as "Good." However, even among well-specified issues, test quality varied considerably: 35 had acceptable tests, 12 had problematic tests, and 3 had failing tests.
This correlation weakens significantly for lower-quality specifications. Tasks with poorly specified issues (category 3 - Bad) showed scattered test quality: 20 had good tests, 14 had acceptable tests, 7 had bad tests, and 4 had failing tests. This suggests that specification clarity and test quality, while correlated, represent distinct dimensions of task quality that require independent evaluation.
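The arithmetic behind the verified subset is a simple cross-tabulation of the two consensus ratings. A minimal sketch, assuming a hypothetical VerifiedTask shape that holds one consensus rating per task (same Good/Ok/Bad/Fail scale as above):

```typescript
// Same rating scale as in the pipeline sketch above.
type Rating = "Good" | "Ok" | "Bad" | "Fail";

// Hypothetical consensus record per task after review.
interface VerifiedTask {
  taskId: string;
  specificationQuality: Rating;
  testQuality: Rating;
}

// A task enters the verified subset only if both dimensions are rated "Good".
function verifiedSubset(tasks: VerifiedTask[]): VerifiedTask[] {
  return tasks.filter(
    (t) => t.specificationQuality === "Good" && t.testQuality === "Good"
  );
}

// With the counts reported above: 87 of the 137 well-specified tasks also have
// good tests, i.e. 87 / 137 ≈ 63.5%, and 87 / 210 ≈ 41.4% of all tasks.
console.log(((87 / 137) * 100).toFixed(1)); // "63.5"
console.log(((87 / 210) * 100).toFixed(1)); // "41.4"
```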
Difficulty Distribution and Quality Patterns
Task difficulty was concentrated in the moderate range, with the majority of tasks estimated to take between 15 minutes and 4 hours to solve. Only 5 tasks (2.4%) were categorized as very simple (under 15 minutes), while 84 (40%) were estimated at 15 minutes to 1 hour, 74 (35.2%) at 1 to 4 hours, and 47 (22.4%) at 4 or more hours.
When examining difficulty in relation to quality, well-specified tasks with good tests distributed relatively evenly across difficulty levels, suggesting that task complexity doesn't necessarily correlate with quality issues. However, tasks requiring extended solution time (4+ hours) showed slightly higher rates of specification ambiguity, which may indicate inherent complexity that's difficult to capture concisely in issue descriptions.
Implications for Benchmark Refinement
These results suggest that approximately 41% of the TypeScript Multi-SWE-bench tasks (87 out of 210) meet the stringent criteria of both excellent specification and comprehensive test coverage, establishing a potential "verified" subset comparable to SWE-bench Verified. If the benchmark were refined to include only these verified tasks (those rated as both well-specified and having good tests), we estimate that the resolution rate of top agents on the leaderboard would increase by approximately 7 percentage points.
This improvement would reflect not enhanced agent capabilities but the removal of tasks where agents fail because of ambiguous requirements or flawed tests rather than genuine technical difficulty. The relatively high rate of test quality issues (20% with significant problems) helps explain the notably low resolution rates previously observed for AI agents on TypeScript tasks. By focusing on verified tasks, the benchmark would provide a clearer signal of actual agent performance on well-defined software engineering problems, enabling a more accurate assessment of progress in autonomous coding capabilities across multiple programming languages.
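For readers who want to reproduce this kind of estimate, the calculation is a comparison of resolution rates over the full task set and over the verified subset. The sketch below is illustrative only: ScoredTask and resolvedByAgent are hypothetical names, and the per-task agent outcomes are not published here.

```typescript
// Illustrative shape: one record per task outcome for a single agent.
interface ScoredTask {
  taskId: string;
  verified: boolean;        // specification and tests both rated "Good"
  resolvedByAgent: boolean; // whether the agent's patch passed the tests
}

function resolutionRate(tasks: ScoredTask[]): number {
  if (tasks.length === 0) return 0;
  return tasks.filter((t) => t.resolvedByAgent).length / tasks.length;
}

// Positive values mean the agent scores higher once unverified tasks are
// excluded; the ~7 point estimate above reflects this kind of comparison.
function upliftInPercentagePoints(tasks: ScoredTask[]): number {
  const overall = resolutionRate(tasks);
  const verifiedOnly = resolutionRate(tasks.filter((t) => t.verified));
  return (verifiedOnly - overall) * 100;
}
```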
Analysis of Specification Issues
We analyzed the 73 tasks (35% of the total dataset) that were not rated as well-specified to understand common patterns in specification quality problems. The issues fall into ten distinct categories, with many tasks exhibiting multiple specification problems simultaneously.
Primary Categories of Specification Problems:
1. Vague or Unclear Problem Description (27%)
The most prevalent issue across all severity levels. Issue descriptions were ambiguous or difficult to comprehend, making it unclear what problem needed to be solved. In many cases, the actual issue only became apparent after examining the pull request or fix implementation. For example, one task's description made it "very difficult to understand what this issue is trying to fix—it is not obvious until you take a look at PR that it is basically about fixing a warning message when it seems upon reading the issue that it might be about fixing underlying implementation."
2. Unclear Requirements or Scope (15%)
Issues lacked specificity about what exactly needed to be implemented, including details on variants, features, edge cases, or customization requirements. A typical example stated simply "Add the ToggleButton component" without specifying "what type of toggle button is needed, how many variants should be included, whether toggle groups are required, or what level of customization should be supported." Some issues also presented multiple proposed solutions, creating ambiguity about the intended approach.
3. Missing or Inadequate Reproduction Steps/Examples (15%)
Tasks lacked clear steps to reproduce the problem or code examples demonstrating the issue. Annotators noted missing information such as "examples on how the component was used to produce the same issue" or ambiguity in reproduction where "it has to be an instant movement" to trigger the bug, making it difficult to verify the problem existed.
4. Missing Context or Background Information (12%)
Issues provided insufficient background to understand the motivation or broader context. Descriptions would reference problems without explaining why changes were needed or how they fit into the larger system. One annotator noted: "The issue description lacks a lot of information and context to understand the issue. The only context that we have is that it is related to Grid2 and Dialog."
5. Missing Technical Details (11%)
Critical technical information was frequently omitted, including library versions, environment configurations, or dependency details. Annotators emphasized that information "like the version of the MUI library that was being used at that time...plays a crucial role in identifying the exact component and current functionalities."
6. Over-Reliance on External Resources (10%)
Issue descriptions depended heavily on external links (CodeSandbox, Loom videos, screenshots) rather than including critical information in the description itself. This created problems when "a lot of information is in the Loom and the Playground" and reviewers must infer the issue, potentially leading them "down the wrong path." This problem was compounded when links broke over time.
7. Unclear or Misleading Titles (8%)
Issue titles failed to accurately convey the problem or assumed reviewers possessed full context. As one annotator noted, "The issue title is very unclear and assumes the reviewer has the full context of the issue."
8. Overly Complex or Phased Implementation (7%)
Some issues involved multiple phases, complex rollouts, or too many interdependencies, making it difficult to determine a clear path forward. These "complex changes" and discussions about "how it can be introduced in phased manner" would "lead to longer conversations between team members" rather than enabling direct implementation.
9. Too Generic or Broad (4%)
Issues were occasionally too high-level or lacking in specificity. Some "reported very generic" problems that needed "more description" or resembled feature requests rather than specific bugs with concrete solutions.
10. Broken or Non-Working Links (3%)
Referenced external resources such as sandboxes or demos were no longer accessible, preventing verification of reported issues.
Severity Distribution
Among the 73 non-well-specified tasks, 62% were rated as category 2 (Ok—some blanks to fill in but sensible interpretation possible), 29% as category 3 (Bad—vague with room for ambiguity), and 10% as category 4 (Fail—nearly impossible to understand without additional information).
Key Patterns
The analysis reveals that the most common issues—vague descriptions, unclear requirements, and missing reproduction steps—collectively account for 57% of specification problems. Many tasks compound multiple issues, such as combining vague descriptions with missing examples or relying on external resources while also lacking technical details. External resource dependency emerged as a significant concern, particularly when links break over time, permanently degrading task quality.
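Because a single task can carry several of the labels above, the per-category percentages do not sum to 100%. A small sketch of how such multi-label counts can be tallied (SpecIssue and its fields are hypothetical, not the actual annotation schema):

```typescript
// Hypothetical multi-label record for a task that was not rated well-specified.
interface SpecIssue {
  taskId: string;
  problems: string[]; // e.g. ["vague description", "missing repro steps"]
}

// Count how many tasks carry each label; a task with two labels contributes
// to two categories, which is why the percentages above can exceed 100%.
function labelFrequencies(tasks: SpecIssue[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const task of tasks) {
    for (const label of new Set(task.problems)) {
      counts.set(label, (counts.get(label) ?? 0) + 1);
    }
  }
  return counts;
}

// Share of tasks carrying a given label, expressed as a percentage.
function labelSharePercent(tasks: SpecIssue[], label: string): number {
  const count = tasks.filter((t) => t.problems.includes(label)).length;
  return (count / tasks.length) * 100;
}
```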
Conclusion
Our verification of 210 TypeScript tasks from Multi-SWE-bench reveals that while the benchmark contains valuable evaluation material, significant quality variation exists that impacts the reliability of agent performance measurement.
By establishing a verified subset representing 41% of tasks with both excellent specifications and comprehensive test coverage, we provide the research community with a higher-quality evaluation benchmark. The estimated 7 percentage point improvement in resolution rates when using only verified tasks demonstrates that current low performance on TypeScript tasks is partially attributable to benchmark quality rather than solely agent capability limitations.
This work extends the verification methodology pioneered by OpenAI to multilingual contexts and provides actionable insights for improving benchmark quality across programming languages. Our detailed analysis of specification issues offers concrete guidance for future benchmark development and task authoring.
Interested in Collaborating?
We're always looking to partner with AI labs on research that advances the field.
