News · May 4, 2026 · 2 min read

AI systems now solve 12-hour tasks independently and score 93.9% on real-world coding

Multiple benchmarks show AI systems crossing key automation thresholds for software development and research tasks that define modern AI R&D workflows.

Our Take

The convergence across coding, task duration, and scientific benchmarks makes a credible case that AI R&D automation is months, not years, away.

Why it matters

AI researchers already delegate most coding work to AI systems, and the time horizons now match the granularity of actual research tasks. If this trend continues, human-led AI development becomes optional by 2028.

Do this week

AI teams: audit which research tasks take under 12 hours and test automation pilots this quarter so you can stay competitive as manual workflows become obsolete.

AI systems hit 93.9% on real GitHub issues, work 12 hours solo

Multiple independent benchmarks show AI systems crossing critical thresholds for autonomous work. On SWE-Bench, which tests AI systems against real GitHub issues, Claude Mythos Preview scored 93.9% (per Import AI analysis), up from Claude 2's ~2% in late 2023. The benchmark appears saturated, with remaining failures likely due to label quality rather than AI capability.

Time horizons for independent work have expanded dramatically. METR's analysis shows AI systems can now reliably complete 12-hour tasks (Opus 4.6 in 2026), compared to 30 seconds for GPT-3.5 in 2022 (per METR tracking). The progression: 4 minutes (GPT-4, 2023), 40 minutes (o1, 2024), 6 hours (GPT 5.2 High, 2025).
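To make that trend concrete, here is a quick back-of-the-envelope fit of the reported horizons. This is a sketch, not METR's methodology: release dates are rounded to calendar years, per the figures above.

```python
import math

# Reported autonomous-task horizons (year, hours), per the METR tracking
# cited above. Exact release dates are approximated by calendar year.
points = [
    (2022, 30 / 3600),  # GPT-3.5: 30 seconds
    (2023, 4 / 60),     # GPT-4: 4 minutes
    (2024, 40 / 60),    # o1: 40 minutes
    (2025, 6.0),        # GPT 5.2 High: 6 hours
    (2026, 12.0),       # Opus 4.6: 12 hours
]

# Least-squares fit of log2(horizon) vs. year: the slope is doublings/year.
xs = [year for year, _ in points]
ys = [math.log2(hours) for _, hours in points]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)

print(f"~{slope:.1f} doublings per year "
      f"(doubling time ~{12 / slope:.1f} months)")
```

On those five points the horizon doubles roughly every four to five months, which is why a 12-hour horizon reads as a waypoint rather than a ceiling.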

On scientific reproduction tasks, CORE-Bench went from 21.5% (GPT-4o, September 2024) to 95.5% (Opus 4.5, December 2025). The benchmark tests whether AI systems can install dependencies, run research code, and answer questions about outputs. MLE-Bench, testing Kaggle competition performance, rose from 16.9% (o1, October 2024) to 64.4% (Gemini 3, February 2026).
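For readers unfamiliar with the setup, the mechanics of a CORE-Bench-style reproduction task look roughly like the sketch below. The repo path, entry point, and claimed value are hypothetical placeholders, not the benchmark's actual harness.

```python
import subprocess
import sys

# Hypothetical illustration of a reproduction check: install a research
# repo's dependencies, run its code, and verify that a claimed result
# appears in the output. REPO_DIR and CLAIMED_ACCURACY are placeholders.
REPO_DIR = "paper-repo"
CLAIMED_ACCURACY = "0.87"

def reproduce(repo_dir: str, claimed: str) -> bool:
    """Install deps, run the repo's entry point, check the claimed value."""
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r",
         f"{repo_dir}/requirements.txt"],
        check=True,
    )
    result = subprocess.run(
        [sys.executable, f"{repo_dir}/train.py"],
        capture_output=True, text=True, check=True,
    )
    # CORE-Bench then asks questions about the outputs; the simplest such
    # question is whether the reported number actually shows up.
    return claimed in result.stdout

if __name__ == "__main__":
    print("reproduced" if reproduce(REPO_DIR, CLAIMED_ACCURACY) else "mismatch")
```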

Most AI research tasks now sit within AI capability windows

The 12-hour threshold matters because it matches the granularity of actual research work: cleaning datasets, launching experiments, reading papers, implementing baselines. Jack Clark notes that "a lot of their tasks boil down into things that might take a person a few hours to do." These tasks now fall within demonstrated AI capabilities.

PostTrainBench results show AI systems scoring 25-28% (per April 2026 scores) against human baselines of 51%, roughly half of human-level fine-tuning performance. The human baselines represent production models from frontier labs, not academic exercises. Even partial automation at this level would materially accelerate research cycles.

The coding saturation is already reshaping workflows. Clark reports that "the vast majority of people I meet at frontier labs and around Silicon Valley now code entirely through AI systems." As coding moves from bottleneck to solved problem, research attention shifts to higher-level design decisions.

Test automation pilots before manual workflows become obsolete

Teams should audit their current research pipelines and identify tasks under the 12-hour threshold. Start with data preprocessing, baseline implementations, and result reproduction. The PostTrainBench results suggest even complex model training workflows are becoming automatable.
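One way to start that audit is a simple inventory: tag each pipeline task with a rough duration estimate and flag everything under the horizon. The task names and hour figures below are illustrative placeholders, not measurements.

```python
# Toy audit sketch: flag pipeline tasks under the 12-hour autonomous-task
# horizon as automation-pilot candidates. Tasks and hours are illustrative.
PIPELINE_TASKS = {
    "clean and dedupe eval dataset": 3,
    "implement published baseline": 8,
    "reproduce headline result": 10,
    "design new training objective": 40,  # above horizon: keep human-led
}

HORIZON_HOURS = 12  # current demonstrated autonomous-task horizon

candidates = [task for task, hours in PIPELINE_TASKS.items()
              if hours <= HORIZON_HOURS]

print("Pilot automation candidates:")
for task in candidates:
    print(f"  - {task}")
```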

The kernel optimization work shows AI systems tackling domains once reserved for specialists. Multiple groups report success with GPU kernel generation, including DeepSeek, Meta, and Huawei implementations. While kernel design benefits from verifiable rewards, the pattern suggests AI automation will expand beyond obviously measurable tasks.
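The "verifiable rewards" point is worth unpacking: a generated kernel can be scored automatically against a trusted reference on random inputs, with no human judgment in the loop. Below is a minimal sketch of that check; both implementations are stand-ins, not any lab's actual kernels.

```python
import numpy as np

def reference_softmax(x: np.ndarray) -> np.ndarray:
    """Trusted baseline implementation used as the ground truth."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def candidate_softmax(x: np.ndarray) -> np.ndarray:
    """Stand-in for a generated kernel under evaluation."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def verify(candidate, reference, trials: int = 100) -> bool:
    """Score the candidate against the reference on random inputs."""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        x = rng.standard_normal((8, 128)).astype(np.float32)
        if not np.allclose(candidate(x), reference(x), atol=1e-5):
            return False  # any mismatch -> zero reward
    return True

print("reward:", 1.0 if verify(candidate_softmax, reference_softmax) else 0.0)
```

Because the check is fully mechanical, it can serve directly as a training or selection signal, which is exactly what makes kernel generation an early automation win.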

Clark's 60%+ confidence in no-human-involved AI R&D by 2028 reflects the compound effect of these individual capabilities. Each benchmark crossing human-level performance removes another constraint on autonomous research workflows.

#LLM #Agents #Research #Developer Tools