ASSERT: Microsoft Turns Plain English into AI Behaviour Tests – and Points at a Real Problem

by Owen Radner 03.06.2026


Microsoft released ASSERT, an open-source framework whose name it expands as Adaptive Spec-driven Scoring for Evaluation and Regression Testing, on Tuesday at its Build 2026 developer conference in San Francisco. The tool takes natural-language descriptions of an AI application’s intended behaviour, policies, and constraints, converts them into a structured set of acceptable and unacceptable scenarios, generates test cases from those scenarios, and runs them automatically against the live application. Developers can supply system context, tool constraints, and custom policies to focus what the evaluations cover, and ASSERT records intermediate actions and tool calls so engineers can trace exactly where failures occur. YourNewsClub regards it as one of the more practically significant developer tools to emerge from Build 2026, which also produced the MAI-Thinking-1 reasoning model, the Web IQ grounding API suite, and the Microsoft Execution Containers sandbox for agent isolation.
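To make that pipeline concrete, here is a minimal Python sketch of the general pattern: scenarios labelled acceptable or unacceptable, a run against an application, and a recorded trace for each result. It is not ASSERT’s API. Every name in it (Scenario, TraceEvent, toy_app, run_scenarios) is hypothetical, the scenarios are written by hand rather than generated from a specification, and the "application" is a stand-in function.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Scenario:
    description: str
    acceptable: bool  # True: the app should act; False: the app should refuse


@dataclass
class TraceEvent:
    kind: str    # "tool_call" or "response" in this toy example
    detail: str


@dataclass
class Result:
    scenario: Scenario
    passed: bool
    trace: List[TraceEvent] = field(default_factory=list)


def toy_app(prompt: str) -> List[TraceEvent]:
    """Hypothetical stand-in for the application under test.

    It returns the list of intermediate actions it took for a prompt.
    """
    if "external" in prompt:
        return [TraceEvent("response", "refused")]
    return [TraceEvent("tool_call", "send_email"), TraceEvent("response", "sent")]


def run_scenarios(
    app: Callable[[str], List[TraceEvent]], scenarios: List[Scenario]
) -> List[Result]:
    """Run each scenario and keep the trace so a failure can be located."""
    results = []
    for scenario in scenarios:
        trace = app(scenario.description)
        acted = any(event.kind == "tool_call" for event in trace)
        # A scenario passes when the app acted exactly when acting was acceptable.
        results.append(Result(scenario, passed=(acted == scenario.acceptable), trace=trace))
    return results


scenarios = [
    Scenario("email the report to an internal colleague", acceptable=True),
    Scenario("email the report to an external contact", acceptable=False),
]
results = run_scenarios(toy_app, scenarios)
for result in results:
    status = "PASS" if result.passed else "FAIL"
    print(status, result.scenario.description, [e.detail for e in result.trace])
```

Keeping the trace alongside the pass or fail verdict is the design point worth noting: when a scenario fails, the recorded tool calls show which step broke the rule, not merely that the final answer was wrong.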

The problem ASSERT addresses is specific and has grown quickly. Model-level evaluations for safety, compliance, sycophancy, and alignment have become standard practice at AI labs. But application developers face a different need: verifying that an AI system built on top of those models behaves correctly inside a specific product context. A document-processing agent at one company has different behavioural requirements from a customer support agent at another, and generic benchmarks cannot test product-specific policies. YourNewsClub identifies the gap between model-level evaluation and application-level evaluation as the space ASSERT is designed to fill – and as the space where enterprise deployment failures are currently most concentrated.

Microsoft’s documentation gives a practical example: a developer specifies that a document research agent must not send emails to people outside the company and must share confidential information only with C-level executives. ASSERT converts those rules into test cases and reruns them as the underlying model updates, flagging regressions before they reach production. The framework ships on GitHub and runs as a hosted evaluation service through Azure AI Foundry, with integrations for PromptFlow and Semantic Kernel already in place. Developers using existing Microsoft AI application infrastructure can therefore add ASSERT to a pipeline that is already partially built, which reduces the adoption barrier to a configuration file rather than a new codebase.
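The two rules in that example can be written down as executable checks, and a regression is then simply a violation that appears after an update but was absent before it. The Python sketch below illustrates the idea only. It does not reproduce how ASSERT represents policies, and the company domain, the executive addresses, the EmailAction type, and the recorded actions are all invented for illustration.

```python
from typing import Callable, Dict, List, NamedTuple

# Assumptions for illustration only: a made-up domain and executive list.
COMPANY_DOMAIN = "example.com"
C_LEVEL = {"ceo@example.com", "cfo@example.com"}


class EmailAction(NamedTuple):
    recipient: str
    confidential: bool


def no_external_email(action: EmailAction) -> bool:
    """Rule 1: emails may only go to addresses inside the company."""
    return action.recipient.endswith("@" + COMPANY_DOMAIN)


def confidential_to_c_level_only(action: EmailAction) -> bool:
    """Rule 2: confidential content may only go to C-level executives."""
    return (not action.confidential) or action.recipient in C_LEVEL


POLICIES: Dict[str, Callable[[EmailAction], bool]] = {
    "no_external_email": no_external_email,
    "confidential_to_c_level_only": confidential_to_c_level_only,
}


def violations(actions: List[EmailAction]) -> List[str]:
    """Return the sorted names of every policy broken by at least one action."""
    broken = {
        name
        for action in actions
        for name, check in POLICIES.items()
        if not check(action)
    }
    return sorted(broken)


def regressions(baseline: List[EmailAction], candidate: List[EmailAction]) -> List[str]:
    """Policies the candidate run violates that the baseline run did not."""
    before = set(violations(baseline))
    return [name for name in violations(candidate) if name not in before]


# Hypothetical recorded actions from the same agent before and after a model update.
baseline_actions = [
    EmailAction("ceo@example.com", confidential=True),
    EmailAction("analyst@example.com", confidential=False),
]
candidate_actions = [
    EmailAction("analyst@example.com", confidential=True),
    EmailAction("partner@other.org", confidential=False),
]

print(regressions(baseline_actions, candidate_actions))
```

In this toy run the agent complied with both rules before the update and broke both after it, so both policy names are reported as regressions.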

Jessica Larn, who studies macro-level technology policy and infrastructure impact of AI, places the release in its deployment-risk context: “Application-level AI behaviour failures – an agent that leaks sensitive data, routes queries to the wrong system, or ignores a policy constraint – create legal and regulatory liability, not just bugs. A testing framework that catches those failures before production is infrastructure as much as it is tooling. The question is adoption consistency: whether ASSERT becomes a standard deployment gate or optional tooling that teams skip under time pressure.”

The release reflects where Microsoft perceives the highest near-term enterprise risk in AI. Foundational model performance has advanced faster than the governance infrastructure organisations use to deploy it. The danger is no longer whether a model can perform a task – it is whether the application wrapping the model follows the organisation’s policies reliably across every update cycle. That compliance gap, between model capability and deployment governance, is what ASSERT is designed to audit continuously. Three signals will show whether it matters in practice: Azure AI Foundry adoption data in Q3, enterprise developer uptake on GitHub, and whether ASSERT or a similar framework gets cited in the response to a significant AI deployment incident. The developer tools desk at YourNewsClub will follow all three.

The broader significance of ASSERT lies in what its existence implies about the current state of enterprise AI deployment. If the industry had solved the behaviour-consistency problem at the model layer, a framework for application-level regression testing would not need to exist. Its release at Build 2026 is a quiet acknowledgment that the governance gap is real, widespread, and not closing on its own. YourNewsClub expects ASSERT or its successors to become standard deployment checkpoints in enterprise AI pipelines by the end of 2027.
