Every time an AI agent fails in production, the post-mortem follows a familiar pattern. Someone asks what the agent was thinking. Someone else pulls the traces. A third person checks the prompt. And the conclusion — almost invariably — is that the problem was not the model. The model did what it was asked to do. The problem was the code around it.

The agent had no timeout on its external API call. Or it had no exit condition on its retry loop. Or it passed unvalidated LLM output directly to a database query. Or it executed a tool call with no log entry. In every case, the failure mode was structural — an architectural gap that no amount of prompt engineering could have closed.
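To make those four gaps concrete, here is a minimal Python sketch of a tool call with each guard in place. Every name in it (`call_with_guards`, `validate_customer_id`, `flaky_tool`, the constants) is hypothetical and is not part of Maxwell or any agent framework. The timeout is simply passed to the callee, which is assumed to honor it.

```python
import logging
import re

logger = logging.getLogger("agent.tools")

MAX_ATTEMPTS = 3          # explicit exit condition for the retry loop
TIMEOUT_SECONDS = 5.0     # upper bound handed to every external call
SAFE_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")  # allow-list for LLM-produced ids


def validate_customer_id(raw: str) -> str:
    """Reject LLM output that is not a plain identifier before it reaches a query."""
    if not SAFE_ID.fullmatch(raw):
        raise ValueError(f"rejected unvalidated value: {raw!r}")
    return raw


def call_with_guards(tool, raw_customer_id: str):
    """Call `tool` with a timeout, bounded retries, validated input, and a log entry."""
    customer_id = validate_customer_id(raw_customer_id)
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        # Every attempt leaves a log entry, so no tool call goes unrecorded.
        logger.info("tool call attempt=%d customer_id=%s", attempt, customer_id)
        try:
            return tool(customer_id, timeout=TIMEOUT_SECONDS)
        except TimeoutError as exc:
            last_error = exc
            logger.warning("tool call timed out on attempt %d", attempt)
    raise RuntimeError(f"gave up after {MAX_ATTEMPTS} attempts") from last_error


# Hypothetical stand-in for an external API: times out once, then succeeds.
calls = []


def flaky_tool(customer_id, timeout):
    calls.append(timeout)
    if len(calls) < 2:
        raise TimeoutError("simulated timeout")
    return {"customer_id": customer_id, "status": "ok"}


result = call_with_guards(flaky_tool, "cust_42")
```

Each guard here is a property of the code's structure rather than of the prompt, which is why its presence or absence can be checked without running the agent.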

This is the gap Maxwell is built to close. We close it with static analysis rather than runtime monitoring, LLM-based review, or probabilistic evaluation, and that choice is not a matter of technical preference: of these approaches, static analysis is the only one that produces a proof rather than a probability.

The problem with everything else.

Runtime monitoring

Observability tools like LangSmith, Arize, and Datadog's LLM integrations are genuinely useful. They show you what your agent did, how long it took, and where it failed. If you have a production incident, they are where you start.

But they are reactive by design: they report on a problem after it has occurred. That distinction matters enormously when the failure burns $10,000 in API credits, delivers a hallucinated financial recommendation, or executes an irreversible action.

The deeper issue is that observability tools observe execution paths that have been taken. They cannot observe paths that have not been taken — the malformed input that has not yet arrived, the API timeout that occurs once in ten thousand calls, the edge case that only surfaces under load. Static analysis operates on the structure of the code, not on a sample of its executions. It sees every path, including the ones you have never run.

Runtime guardrails

Prompt injection filters and runtime AI firewalls operate as proxies: they intercept requests before they reach the model, filter inputs, and block detected attacks. They are the AI analogue of a web application firewall (WAF), and they solve a different problem. A WAF cannot tell you that your authentication architecture has a design flaw; it can only block inputs that exploit it. Similarly, a prompt injection filter cannot tell you that your agent has no error handler on its LLM call.

LLM-based code review

This is the most seductive trap, because it feels like it should work. You have a language model that understands code. You ask it to review your agent architecture. It gives you a confident, detailed response.

The issue is that LLM code review is itself non-deterministic. Run the same prompt against the same codebase twice and you may get different findings. The model may miss a timeout violation on one run and catch it on another. It has no formal notion of “every execution path” — it is pattern-matching on text, not traversing a graph. It cannot prove that a guard dominates a sink. It can only guess that it probably does.

“You cannot prove a negative with a probability. You can only prove it with a structure.”

Why static analysis is the right tool.

Static analysis was developed for exactly this class of problem: reasoning about all possible executions of a program without running it. The methodology predates LLMs by decades. It is how compilers catch type errors before execution, and it is part of how safety-critical systems (avionics, medical devices, nuclear control software) are verified for certification.

Control-flow graphs

A control-flow graph (CFG) is a directed graph where every node is a basic block of code and every edge represents a possible transfer of control. Every possible execution of a function corresponds to a path through its CFG. Maxwell builds a CFG for every function in your agent codebase. This is deterministic: the same source code always produces the same CFG. No sampling, no approximation.
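As an illustration of the idea, and not of Maxwell's internal representation, a CFG can be sketched in Python as a plain adjacency mapping. The function in the comment, the block names, and the `all_paths` helper are all hypothetical. The path enumeration only works for small, loop-free graphs; a real analyzer computes properties over the graph instead of listing paths.

```python
from typing import Dict, List

CFG = Dict[str, List[str]]

# Hand-built CFG for a hypothetical function:
#
#     def dispatch(task):
#         plan = make_plan(task)          # block "entry"
#         if plan.needs_llm:
#             reply = llm_call(plan)      # block "llm_call"
#         else:
#             reply = cached(plan)        # block "cached"
#         return reply                    # block "exit"
cfg: CFG = {
    "entry": ["llm_call", "cached"],
    "llm_call": ["exit"],
    "cached": ["exit"],
    "exit": [],
}


def all_paths(graph: CFG, start: str, goal: str) -> List[List[str]]:
    """Enumerate every path from start to goal in a loop-free graph."""
    if start == goal:
        return [[goal]]
    paths = []
    for successor in graph[start]:
        for tail in all_paths(graph, successor, goal):
            paths.append([start] + tail)
    return paths


paths = all_paths(cfg, "entry", "exit")
for path in paths:
    print(" -> ".join(path))
```

Both branches show up as paths whether or not any test has ever exercised them, and the same graph always yields the same list in the same order.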

Dominance proofs

A node A dominates a node B in a CFG if every path from the function entry to B passes through A. This is the mechanism behind Maxwell's invariant checks. Consider AG-001: no error containment on an LLM call. Maxwell identifies the LLM call as a “sink” and the error handler as a “gate”. It then checks: does the gate dominate the sink? If yes, every execution path that reaches the LLM call has passed through the error handler — not probabilistically, but provably, for all inputs.
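The dominance check can be sketched with the classic iterative dominator algorithm. This is a minimal illustration under stated assumptions, not Maxwell's implementation: the graphs are hand-written, the gate is modeled as a hypothetical `try_enter` node marking entry into the guarded region, and the helper names are invented for the example.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]


def dominators(graph: Graph, entry: str) -> Dict[str, Set[str]]:
    """Return, for each node, the set of nodes that dominate it."""
    nodes = set(graph)
    preds: Dict[str, Set[str]] = {n: set() for n in nodes}
    for node, successors in graph.items():
        for succ in successors:
            preds[succ].add(node)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in sorted(nodes - {entry}):
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def gate_dominates_sink(graph: Graph, entry: str, gate: str, sink: str) -> bool:
    """True if every path from entry to sink passes through gate."""
    return gate in dominators(graph, entry)[sink]


# Every route to the LLM call goes through the guarded region.
guarded: Graph = {
    "entry": ["try_enter"],
    "try_enter": ["llm_call"],
    "llm_call": ["exit"],
    "exit": [],
}

# One branch reaches the LLM call directly, bypassing the gate.
unguarded: Graph = {
    "entry": ["try_enter", "llm_call"],
    "try_enter": ["llm_call"],
    "llm_call": ["exit"],
    "exit": [],
}

print(gate_dominates_sink(guarded, "entry", "try_enter", "llm_call"))    # True
print(gate_dominates_sink(unguarded, "entry", "try_enter", "llm_call"))  # False
```

The result depends only on the shape of the graph, so it holds for every input that could drive execution down any of those edges.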

[CRITICAL] AG-001 · no error containment

src/dispatch.py:47 — LLM call has no dominant error handler

∅ counterexample path: entry → plan() → dispatch() → llm_call()

handler at line 62 does not dominate this path
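A counterexample like the one in this finding can be produced by searching for a path from the entry to the sink that never touches the gate: if such a path exists, the gate does not dominate the sink. The sketch below is one way to do that with a breadth-first search. The node names echo the report, but the edges and the `counterexample_path` helper are assumptions made for illustration, not Maxwell's actual graph or API.

```python
from collections import deque
from typing import Dict, List, Optional

Graph = Dict[str, List[str]]


def counterexample_path(
    graph: Graph, entry: str, gate: str, sink: str
) -> Optional[List[str]]:
    """Return a shortest entry-to-sink path that avoids gate.

    Returns None when no such path exists, i.e. the gate dominates the sink
    (or the sink is unreachable).
    """
    if entry == gate:
        return None
    parent: Dict[str, Optional[str]] = {entry: None}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        if node == sink:
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for succ in graph.get(node, []):
            # Never step onto the gate: we want a path that bypasses it.
            if succ != gate and succ not in parent:
                parent[succ] = node
                queue.append(succ)
    return None


# Hypothetical graph: plan() can reach the LLM call via dispatch() without the handler.
flawed: Graph = {
    "entry": ["plan"],
    "plan": ["dispatch", "handler"],
    "handler": ["llm_call"],
    "dispatch": ["llm_call"],
    "llm_call": [],
}

# Hypothetical fix: every route to the LLM call now goes through the handler.
fixed: Graph = {
    "entry": ["plan"],
    "plan": ["handler"],
    "handler": ["dispatch"],
    "dispatch": ["llm_call"],
    "llm_call": [],
}

path = counterexample_path(flawed, "entry", "handler", "llm_call")
print(" → ".join(path))  # entry → plan → dispatch → llm_call
print(counterexample_path(fixed, "entry", "handler", "llm_call"))  # None
```

Reporting the path alongside the verdict gives a reviewer something concrete to inspect instead of a bare pass or fail.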

What this means in practice.

Maxwell's output is fundamentally different from monitoring tools, evaluation frameworks, or LLM reviewers. It is not a score. It is not a recommendation. It is a binary finding: this invariant holds, or it does not, for every possible execution of this function.

It is reproducible. Run Maxwell on the same codebase twice and you get the same result. This is not true of LLM-based review, whose findings can change between runs, and it is not true of runtime monitoring, whose findings depend on which traffic happened to arrive. Reproducibility is a prerequisite for compliance evidence.

It covers all paths. Runtime tools only observe executed paths. Static analysis covers every path — the ones that surface under adversarial input, under load, or in edge cases your test suite never reaches.

It is auditor-legible. “Our CFG analysis proves the error handler dominates every LLM call sink” is an answer. “We monitored it in staging and it looked fine” is not.

There is one thing static analysis cannot do: it cannot tell you whether your agent will produce a correct or useful output. It cannot evaluate the quality of your prompts or the accuracy of your retrieval. Those are real problems, and evaluation frameworks exist to address them. But correctness of output and safety of architecture are orthogonal concerns. Maxwell addresses the architectural layer — the one that determines whether the system is safe to operate, not whether its answers are good.

The methodology that made compilers trustworthy can make AI agents trustworthy too. The regulatory moment for applying it is arriving now.

maxwell 0.4.1 · Maxwell Engineering · May 2026