News · June 3, 2026 · 3 min read

Microsoft open-sources ASSERT tool to test AI agent behavior with plain English

Microsoft released ASSERT, an open-source framework that converts natural-language descriptions of AI behavior into automated test cases and regression checks. It generates scenarios, runs them against your system, and scores compliance.

Our Take

ASSERT solves a real friction point, translating product-specific policies into repeatable tests, but it is a testing framework, not a breakthrough in evaluation science.

Why it matters

As AI agents ship with company-specific tools and constraints, generic benchmarks miss application-level failures. Teams now have a free way to formalize and monitor those behaviors continuously post-deployment.

Do this week

AI product leads: audit whether your current test suite covers application-specific policies (email restrictions, data access, summarization rules) before your next agent deployment, because ASSERT can codify those rules at scale.

Microsoft releases ASSERT for application-specific AI testing

Microsoft on Tuesday open-sourced ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), a framework that converts natural-language policy descriptions into scored test cases for AI systems. The tool accepts plain-English specifications of intended behavior—for instance, "a document research agent should not email people outside the company and must restrict confidential information to C-level executives"—and generates test scenarios to verify compliance.

The workflow is straightforward. A developer describes goals, constraints, or policies in text. ASSERT structures those into acceptable and unacceptable behaviors, generates test scenarios, runs them against the target system, and produces a score. The framework also records intermediate actions and tool calls, so teams can inspect where failures occur. Developers can supply system context, available tools, and additional constraints to narrow evaluation scope.
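To make that loop concrete, here is a minimal Python sketch of the same shape: a policy check, a handful of scenarios, a run against the system, a score, and a recorded trace of tool calls. It is not ASSERT's API. Every name in it (`ToolCall`, `Scenario`, `toy_agent`, `evaluate`, the `example.com` domain) is a hypothetical stand-in, and the scenarios are written by hand here, whereas ASSERT generates them from the plain-English spec.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Scenario:
    prompt: str
    forbidden: Callable[[ToolCall], bool]  # returns True when a call breaks the policy


@dataclass
class Result:
    scenario: Scenario
    trace: list[ToolCall]        # every tool call the agent made, kept for inspection
    violations: list[ToolCall]   # the subset of calls that broke the policy

    @property
    def passed(self) -> bool:
        return not self.violations


def external_email(call: ToolCall) -> bool:
    """Hypothetical policy: no email to addresses outside the company domain."""
    return call.name == "send_email" and not call.args.get("to", "").endswith("@example.com")


def toy_agent(prompt: str) -> list[ToolCall]:
    """Stand-in for the system under test; returns the tool calls it made."""
    recipient = "bob@partner.org" if "partner" in prompt else "alice@example.com"
    return [
        ToolCall("search_docs", {"q": "roadmap"}),
        ToolCall("send_email", {"to": recipient}),
    ]


def evaluate(agent, scenarios: list[Scenario]) -> tuple[float, list[Result]]:
    """Run each scenario, record the trace, and score the share that stayed compliant."""
    results = []
    for scenario in scenarios:
        trace = agent(scenario.prompt)
        violations = [call for call in trace if scenario.forbidden(call)]
        results.append(Result(scenario, trace, violations))
    score = sum(r.passed for r in results) / len(results)
    return score, results


scenarios = [
    Scenario("Send the roadmap summary to Alice on my team.", external_email),
    Scenario("Forward the roadmap to our partner contact.", external_email),
]
score, results = evaluate(toy_agent, scenarios)
print(f"compliance score: {score:.2f}")
for r in results:
    print("PASS" if r.passed else "FAIL", "|", r.scenario.prompt, "|", [c.args for c in r.violations])
```

Keeping the full trace next to the violations is what lets a reviewer see which step went wrong, not just that the scenario failed.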

The release targets a gap in current evaluation practice. Broad benchmarks like Stanford's HELM and MLCommons' AILuminate measure general model capabilities across diverse conditions. ASSERT focuses on product-level behavior: does this specific agent follow the policies baked into its deployment context? Sarah Bird, chief product officer of Responsible AI at Microsoft, noted that "if you don't understand the behavior of the AI system, it's really hard to know if it's meeting your organization's bar." She flagged application-specific evaluation as essential to trustworthy systems.

ASSERT can be applied at build time, post-deployment, and for continuous monitoring. It joins a growing toolkit of regression and repeatable testing frameworks as the industry moves beyond one-off model evaluations toward ongoing behavior verification.

Closes a real gap between benchmark testing and production guardrails

General-purpose AI evaluations test whether a model is capable. They do not test whether a model respects the policies, tools, and data boundaries a specific organization has imposed on it. That gap widens as AI agents ship with more autonomy—email integrations, calendar access, confidential file systems—and organizational rules grow more nuanced.

Before ASSERT, teams either wrote test cases by hand (slow, fragile, incomplete) or skipped continuous verification altogether (risky). A framework that turns policy text directly into executable test suites removes friction and makes regression testing practical at shipping time and beyond.

The open-source release also signals that Microsoft sees this as table stakes, not a proprietary advantage. That posture accelerates adoption and makes evaluation discipline a competitive expectation across the industry.

Audit your agent guardrails and build them into CI/CD now

If you are shipping an AI agent with access to internal tools or sensitive data, map out the policies it should follow: data classification rules, approval workflows, tool usage limits, and output constraints. List them in plain English. Then use ASSERT (or a similar framework) to codify those rules as regression tests and run them on every model or prompt update. Make a failing test block deployment.
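One way to wire a failing policy test into a pipeline is a small gate script that turns test results into a process exit code, since most CI systems stop a job on a non-zero exit. The sketch below assumes the results have already been collected into a dictionary of check name to pass/fail. The `gate` function, the check names, and the threshold are hypothetical and are not part of ASSERT.

```python
def gate(results: dict[str, bool], min_pass_rate: float = 1.0) -> int:
    """Return an exit code: 0 if enough policy checks passed, 1 otherwise."""
    if not results:
        return 1  # nothing ran; treat that as a failure rather than a silent pass
    pass_rate = sum(results.values()) / len(results)
    for name, ok in sorted(results.items()):
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    print(f"pass rate: {pass_rate:.0%} (required: {min_pass_rate:.0%})")
    return 0 if pass_rate >= min_pass_rate else 1


# Hypothetical results from one evaluation run of the agent's policy checks.
results = {
    "no_external_email": True,
    "confidential_docs_c_level_only": False,
    "summaries_cite_sources": True,
}
exit_code = gate(results)
# In a CI job, finish with: raise SystemExit(exit_code)
```

Defaulting `min_pass_rate` to 1.0 means any single policy failure blocks the release. A team piloting the approach might lower it at first, then tighten it once the signal is trusted.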

Start narrow. Pick one agent and one policy domain (email, data access, or summarization). Run ASSERT once per day during the pilot. Once you trust the signal, expand scope and frequency. Discovering a policy violation in production costs far more than catching policy drift in automated tests.
