Microsoft open-sources ASSERT tool to test AI agent behavior with plain English

Microsoft releases ASSERT for application-specific AI testing

Microsoft on Tuesday open-sourced ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), a framework that converts natural-language policy descriptions into scored test cases for AI systems. The tool accepts plain-English specifications of intended behavior—for instance, "a document research agent should not email people outside the company and must restrict confidential information to C-level executives"—and generates test scenarios to verify compliance.

The workflow is straightforward. A developer describes goals, constraints, or policies in text. ASSERT structures those into acceptable and unacceptable behaviors, generates problem scenarios, runs them against the target system, and produces a score. The framework also records intermediate actions and tool calls, so teams can inspect where failures occur. Developers can supply system context, available tools, and additional constraints to narrow evaluation scope.

The release targets a gap in current evaluation practice. Broad benchmarks like Stanford's HELM and MLCommons' AILuminate measure general model capabilities across diverse conditions. ASSERT focuses on product-level behavior: does this specific agent follow the policies baked into its deployment context? Sarah Bird, chief product officer of Responsible AI at Microsoft, noted that "if you don't understand the behavior of the AI system, it's really hard to know if it's meeting your organization's bar." She flagged application-specific evaluation as essential to trustworthy systems.

ASSERT can be applied at build time, post-deployment, and for continuous monitoring. It joins a growing toolkit of regression and repeatable testing frameworks as the industry moves beyond one-off model evaluations toward ongoing behavior verification.

Closes a real gap between benchmark testing and production guardrails

General-purpose AI evaluations test whether a model is capable. They do not test whether a model respects the policies, tools, and data boundaries a specific organization has imposed on it. That gap widens as AI agents ship with more autonomy—email integrations, calendar access, confidential file systems—and organizational rules grow more nuanced.

Before ASSERT, teams either wrote test cases by hand (slow, fragile, incomplete) or skipped continuous verification altogether (risky). A framework that turns policy text directly into executable test suites removes friction and makes regression testing practical at shipping time and beyond.

The open-source release also signals that Microsoft sees this as table stakes, not a proprietary advantage. That posture accelerates adoption and makes evaluation discipline a competitive expectation across the industry.

Audit your agent guardrails and build them into CI/CD now

If you are shipping an AI agent with access to internal tools or sensitive data, map out the policies it should follow—data classification rules, approval workflows, tool usage limits, output constraints. List them in plain English. Then use ASSERT (or a similar framework) to codify those rules as regression tests and run them on every model or prompt update. Make test failure block deployment.

Start narrow. Pick one agent, one policy domain (email, data access, or summarization). Run ASSERT once per day during the pilot. Once you trust the signal, expand scope and frequency. The cost of discovery in production is far higher than the cost of catching policy drift in automated tests.

Microsoft open-sources ASSERT tool to test AI agent behavior with plain English

Our Take

Why it matters

Do this week

Microsoft releases ASSERT for application-specific AI testing

Closes a real gap between benchmark testing and production guardrails

Audit your agent guardrails and build them into CI/CD now

One daily brief. Every story gets a hype verdict.

Related stories

Fenergo hires Finastra CRO to lead global revenue expansion

UK banks have 18 months to map third-party risks under PS26/2

Quantifind Lands $200M to Scale AI-Native Financial Crime Detection