News · May 4, 2026 · 2 min read

AI systems now solve 12-hour tasks independently and score 93.9% on real-world coding

Multiple benchmarks show AI systems crossing key automation thresholds for software development and research tasks that define modern AI R&D workflows.

Our Take

The convergence across coding, task duration, and scientific benchmarks makes a credible case that AI R&D automation is months, not years, away.

Why it matters

AI researchers already delegate most coding work to AI systems, and the time horizons now match the granularity of actual research tasks. If this trend continues, human-led AI development becomes optional by 2028.

Do this week

AI teams: audit which research tasks take under 12 hours and test automation pilots this quarter so you can stay competitive as manual workflows become obsolete.

AI systems hit 93.9% on real GitHub issues, work 12 hours solo

Multiple independent benchmarks show AI systems crossing critical thresholds for autonomous work. On SWE-Bench, which tests AI systems against real GitHub issues, Claude Mythos Preview scored 93.9% (per Import AI analysis), up from Claude 2's ~2% in late 2023. The benchmark appears saturated, with remaining failures likely due to label quality rather than AI capability.

Time horizons for independent work have expanded dramatically. METR's analysis shows AI systems can now reliably complete 12-hour tasks (Opus 4.6 in 2026), compared to 30 seconds for GPT-3.5 in 2022 (per METR tracking). The progression: 4 minutes (GPT-4, 2023), 40 minutes (o1, 2024), 6 hours (GPT 5.2 High, 2025).
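To make that trend concrete, here is a quick back-of-the-envelope fit of the reported horizons. This is a sketch, not METR's methodology: release dates are rounded to calendar years, per the figures above.

```python
import math

# Reported autonomous-task horizons (year, hours), per the METR tracking
# cited above. Exact release dates are approximated by calendar year.
points = [
    (2022, 30 / 3600),  # GPT-3.5: 30 seconds
    (2023, 4 / 60),     # GPT-4: 4 minutes
    (2024, 40 / 60),    # o1: 40 minutes
    (2025, 6.0),        # GPT 5.2 High: 6 hours
    (2026, 12.0),       # Opus 4.6: 12 hours
]

# Least-squares fit of log2(horizon) vs. year: the slope is doublings/year.
xs = [year for year, _ in points]
ys = [math.log2(hours) for _, hours in points]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)

print(f"~{slope:.1f} doublings per year "
      f"(doubling time ~{12 / slope:.1f} months)")
```

On those five points the horizon doubles roughly every four to five months, which is why a 12-hour horizon reads as a waypoint rather than a ceiling.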

On scientific reproduction tasks, CORE-Bench went from 21.5% (GPT-4o, September 2024) to 95.5% (Opus 4.5, December 2025). The benchmark tests whether AI systems can install dependencies, run research code, and answer questions about outputs. MLE-Bench, testing Kaggle competition performance, rose from 16.9% (o1, October 2024) to 64.4% (Gemini 3, February 2026).
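For readers unfamiliar with the setup, the mechanics of a CORE-Bench-style reproduction task look roughly like the sketch below. The repo path, entry point, and claimed value are hypothetical placeholders, not the benchmark's actual harness.

```python
import subprocess
import sys

# Hypothetical illustration of a reproduction check: install a research
# repo's dependencies, run its code, and verify that a claimed result
# appears in the output. REPO_DIR and CLAIMED_ACCURACY are placeholders.
REPO_DIR = "paper-repo"
CLAIMED_ACCURACY = "0.87"

def reproduce(repo_dir: str, claimed: str) -> bool:
    """Install deps, run the repo's entry point, check the claimed value."""
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r",
         f"{repo_dir}/requirements.txt"],
        check=True,
    )
    result = subprocess.run(
        [sys.executable, f"{repo_dir}/train.py"],
        capture_output=True, text=True, check=True,
    )
    # CORE-Bench then asks questions about the outputs; the simplest such
    # question is whether the reported number actually shows up.
    return claimed in result.stdout

if __name__ == "__main__":
    print("reproduced" if reproduce(REPO_DIR, CLAIMED_ACCURACY) else "mismatch")
```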

Most AI research tasks now sit within AI capability windows

The 12-hour threshold matters because it matches the granularity of actual research work: cleaning datasets, launching experiments, reading papers, implementing baselines. Jack Clark notes that "a lot of their tasks boil down into things that might take a person a few hours to do." These tasks now fall within demonstrated AI capabilities.

PostTrainBench results show AI systems scoring 25-28% (per April 2026 scores) against human baselines of 51%, roughly half of human-level fine-tuning performance. The human baselines represent production models from frontier labs, not academic exercises. Even partial automation at this level would materially accelerate research cycles.

The coding saturation is already reshaping workflows. Clark reports that "the vast majority of people I meet at frontier labs and around Silicon Valley now code entirely through AI systems." As coding moves from bottleneck to solved problem, research attention shifts to higher-level design decisions.

Test automation pilots before manual workflows become obsolete

Teams should audit their current research pipelines and identify tasks under the 12-hour threshold. Start with data preprocessing, baseline implementations, and result reproduction. The PostTrainBench results suggest even complex model training workflows are becoming automatable.
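One way to start that audit is a simple inventory: tag each pipeline task with a rough duration estimate and flag everything under the horizon. The task names and hour figures below are illustrative placeholders, not measurements.

```python
# Toy audit sketch: flag pipeline tasks under the 12-hour autonomous-task
# horizon as automation-pilot candidates. Tasks and hours are illustrative.
PIPELINE_TASKS = {
    "clean and dedupe eval dataset": 3,
    "implement published baseline": 8,
    "reproduce headline result": 10,
    "design new training objective": 40,  # above horizon: keep human-led
}

HORIZON_HOURS = 12  # current demonstrated autonomous-task horizon

candidates = [task for task, hours in PIPELINE_TASKS.items()
              if hours <= HORIZON_HOURS]

print("Pilot automation candidates:")
for task in candidates:
    print(f"  - {task}")
```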

The kernel optimization work shows AI systems tackling domains once reserved for specialists. Multiple groups report success with GPU kernel generation, including DeepSeek, Meta, and Huawei implementations. While kernel design benefits from verifiable rewards, the pattern suggests AI automation will expand beyond obviously measurable tasks.
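The "verifiable rewards" point is worth unpacking: a generated kernel can be scored automatically against a trusted reference on random inputs, with no human judgment in the loop. Below is a minimal sketch of that check; both implementations are stand-ins, not any lab's actual kernels.

```python
import numpy as np

def reference_softmax(x: np.ndarray) -> np.ndarray:
    """Trusted baseline implementation used as the ground truth."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def candidate_softmax(x: np.ndarray) -> np.ndarray:
    """Stand-in for a generated kernel under evaluation."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def verify(candidate, reference, trials: int = 100) -> bool:
    """Score the candidate against the reference on random inputs."""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        x = rng.standard_normal((8, 128)).astype(np.float32)
        if not np.allclose(candidate(x), reference(x), atol=1e-5):
            return False  # any mismatch -> zero reward
    return True

print("reward:", 1.0 if verify(candidate_softmax, reference_softmax) else 0.0)
```

Because the check is fully mechanical, it can serve directly as a training or selection signal, which is exactly what makes kernel generation an early automation win.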

Clark's 60%+ confidence in no-human-involved AI R&D by 2028 reflects the compound effect of these individual capabilities. Each benchmark crossing human-level performance removes another constraint on autonomous research workflows.

#LLM #Agents #Research #Developer Tools