IBM cuts agent token use by 30× on legacy code tasks

IBM tested agent logic across four enterprise domains

IBM Research built and deployed agents equipped with program analysis libraries, knowledge graphs, and policy-enforcement algorithms across four high-friction enterprise workflows. Each agent was designed to steer an LLM (Mistral Medium, Devstral 24B, Claude, Gemini, GPT variants) through structured reasoning paths rather than freeform context expansion.

Application understanding (legacy code modernization): The App Insights agent in IBM watsonx Code Assistant for Z uses static analysis to pre-index code dependencies across systems of up to 1 million lines. When asked about application behavior, the agent retrieves structured facts from this pre-computed graph rather than forcing the LLM to reason over raw source. Result: ~30× lower token consumption than a baseline LLM-only approach while maintaining answer accuracy on mainframe systems.
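
To make the retrieval idea concrete, here is a minimal sketch of answering an impact question from a pre-computed call graph instead of raw source. It is not IBM's implementation: the program names, the `CALLS` index, and the functions `build_reverse_index`, `impacted_by`, and `facts_for_prompt` are all hypothetical, and a real static-analysis index would carry far more than call edges.

```python
from collections import defaultdict, deque

# Hypothetical pre-computed index (caller -> callees), standing in for
# the output of an offline static-analysis pass. Names are illustrative.
CALLS = {
    "PAYROLL": ["TAXCALC", "DBWRITE"],
    "TAXCALC": ["RATETBL"],
    "DBWRITE": [],
    "RATETBL": [],
}


def build_reverse_index(calls):
    """Invert the call graph so we can look up callers of a program."""
    reverse = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            reverse[callee].add(caller)
    return reverse


def impacted_by(target, reverse):
    """Return every program that directly or transitively calls `target`."""
    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for caller in reverse.get(node, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)


def facts_for_prompt(target, calls):
    """Render a compact fact string to hand to the LLM instead of source code."""
    callers = impacted_by(target, build_reverse_index(calls))
    return f"Programs affected by a change to {target}: {', '.join(callers) or 'none'}"


print(facts_for_prompt("RATETBL", CALLS))
```

The point of the sketch is the shape of the prompt input: the model receives one short line of verified facts, so the token cost no longer scales with the size of the codebase.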

Test generation: Aster, IBM's proprietary program analysis library, feeds structured call graphs and data-flow output into an LLM tasked with generating unit, integration, and API tests. Sub-agents handle coverage gaps and compilation errors. Tested on 75+ Java applications of up to 67K lines of code: 20–45% improvement in line/branch/method coverage compared to zero-shot LLMs and open-source coding agents, with up to 15× lower token consumption (measured on open-source test suites as an independent benchmark).

Incident response and root cause: A multi-agent system combines knowledge graphs of microservices, databases, and observability data (Instana model) with program dependency graphs. Each agent reasons over a constrained, relevant subset of the full IT stack rather than the entire deployed environment. The proprietary I3 agent achieved 4.0× better performance than ReAct with GPT-5.1 on ITBench; even with newer Gemini 3 Flash, it consumed 1.6× fewer tokens while staying within 17% of I3 performance. For source code analysis and bug repair, agents using dependency graphs outperformed state-of-the-art coding agents by 3.0× and 1.6× respectively, with 3.7× and 5.9× lower token use.

Compliance automation: A multi-agent system using adaptive planning and dynamic task decomposition automates compliance control creation, assessment, and remediation. Instead of fixed planning, it sequences steps with continuous feedback. Results on ITBench: 1.3–2.0× better performance than prior agents using static plans; success rates climbed from single digits to 80% in complex scenarios (Claude 4 Sonnet baseline).
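
The difference between a static plan and feedback-driven sequencing can be sketched in a few lines. This is an illustration under stated assumptions, not the system evaluated on ITBench: the controls, the `state` dictionary, and the functions `assess`, `adaptive_run`, and `fix_encryption` are hypothetical, and the "planner" here is a trivial rule (fix the first failing control) where a real agent would use an LLM.

```python
def assess(state, controls):
    """Return the names of controls whose check fails against the current state."""
    return [name for name, (check, _) in controls.items() if not check(state)]


def adaptive_run(state, controls, max_steps=10):
    """Re-assess after every remediation instead of following a fixed plan."""
    log = []
    for _ in range(max_steps):
        failing = assess(state, controls)
        if not failing:
            break
        name = failing[0]          # trivial stand-in for the planning step
        _, remediate = controls[name]
        remediate(state)
        log.append(name)
    return log, not assess(state, controls)


def fix_encryption(state):
    """Hypothetical remediation with a side effect on another control."""
    state["encrypted"] = True
    state["audit_log"] = False     # e.g. the change resets logging config


# Each control maps to a (check, remediate) pair. All names are illustrative.
CONTROLS = {
    "audit-logging": (lambda s: s["audit_log"], lambda s: s.update(audit_log=True)),
    "disk-encryption": (lambda s: s["encrypted"], fix_encryption),
}

state = {"audit_log": True, "encrypted": False}
log, compliant = adaptive_run(state, CONTROLS)
print(log, compliant)
```

A plan fixed at the start would contain only the encryption step, because audit logging passes the initial assessment; it would finish with logging broken. The loop catches the regression because it assesses again after each action.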

Two additional case studies showed similar patterns: a configurable generalist agent (CUGA) applying policy-as-code achieved 15–26% accuracy gains across model families in healthcare customer service; an asset maintenance agent using directed acyclic graphs reduced analysis time from 15–20 minutes to 15–30 seconds (97% improvement) and cut token use by 77% on average.
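
As a quick sanity check on the time figures, the reported ranges can be turned into percentage reductions. The helper `reduction_pct` is a hypothetical name, and pairing the extremes of the two ranges is an assumption made here for illustration; the source reports only the ranges and the 97% figure.

```python
def reduction_pct(before_s, after_s):
    """Percentage reduction in elapsed time, both arguments in seconds."""
    return 100 * (1 - after_s / before_s)


# Pair the extremes of the reported ranges (an assumption, see above).
largest = reduction_pct(20 * 60, 15)   # 20 minutes down to 15 seconds
smallest = reduction_pct(15 * 60, 30)  # 15 minutes down to 30 seconds

print(f"{smallest:.1f}% to {largest:.2f}%")
```

Under that pairing the reduction falls between roughly 96.7% and 98.75%, which is consistent with the 97% figure quoted for the asset maintenance agent.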

Context and guidance are not the same thing

The article frames agent logic as a "guide" that constrains the LLM to relevant reasoning paths. This is not a novel idea in isolation (RAG systems, tool use, and structured prompting already exist), but the specificity matters: IBM is reporting that pre-computed graphs, static analysis output, and policy constraints reduce token cost and hallucination risk at the same time, across real production systems.

The benchmark comparisons are meaningful because they isolate the agent logic layer. Token savings and accuracy gains are measured against both vendor-published baselines (LLM-only) and independent benchmarks (ITBench for incident response and compliance; open-source test suites for Aster). The claims are not presented as vendor magic but as engineering discipline applied to structured domains.

However, these are not recipes that transfer easily. The App Insights agent works because legacy systems have stable, analyzable code; the I3 incident agent works because observability data is queryable and bounded; the compliance agent works because policies are formalizable. None of these solves the general problem of guiding LLMs through novel, unstructured workflows. They are domain-specific engineering wins, not proof that agent logic is a universal scaling lever.

Audit your critical workflows for structure before betting on agents

If your enterprise workflow involves static, analyzable artifacts (source code, policies, service dependencies, asset inventories), agent logic applied to that layer will likely cut both cost and error rate. Invest in pre-computation: static analysis, knowledge graphs, policy engines. Feed the structured output to the LLM, not the raw problem.

If your workflow is primarily unstructured (customer negotiations, novel diagnosis, creative problem-solving), agent logic as shown here will not help much. The cost reduction and accuracy gains depend on having a trusted, pre-indexed reference layer to query. Without that layer, you are back to raw context expansion and the associated hallucination risk.

Start by mapping your top three AI pilot failures. Which ones failed because the LLM reasoned over incomplete or noisy data? Those are candidates for agent logic. Which ones failed because the problem itself was intractable or required human judgment? Those need a different approach.
