Did the Messages Arrive, or Did They Agree?
The most expensive question in AI is one our tools don't even ask.
Picture a system doing everything right. A handful of AI agents, each pointed at its piece of a hard problem. One retrieves the documents, one reasons over them, one drafts the answer, one checks the draft. Messages pass cleanly between them. Nothing times out. Every box on the dashboard is green. The pipeline reports success.
And the answer is wrong — not because any single agent failed, but because two of them quietly assumed different things, and nobody was positioned to notice.
This happens constantly, and our tools are blind to it, because they were built to answer the wrong question. Ask any modern “orchestration” framework what it monitors and you’ll get a precise, confident answer: did the call succeed, did it retry, what was the latency, did the message arrive. These are real questions. They are also entirely about *transport*. They tell you the parts spoke to each other. They tell you nothing about whether the parts agreed.
We inherited this blindness honestly. The machinery for coordinating many components came from distributed systems — microservices, queues, retries, uptime. In that world, “it worked” means “the bytes got there.” That metaphor was quietly imported into AI, and it brought its definition of success along with it. So we now have elaborate infrastructure for guaranteeing that five language models *responded*, and almost none for asking whether what they said *coheres*.
Here is the distinction I care about, stated plainly. Every component in one of these systems is **locally competent**: in its own corner, on its own slice, it does fine. The open question — the only one that matters when the stakes are real — is whether all that local competence assembles into something **globally coherent**. Local competence is cheap now; we have a glut of it. Global coherence is the scarce thing, and we don’t even measure it.
Let me make “incoherence” concrete, because it’s sneakier than it sounds. The obvious failure is two agents flatly contradicting each other, and that one you might catch. The dangerous failure is subtler. Imagine three views on a question. A and B agree. B and C agree. A and C agree. Every pair is consistent — and yet the three together cannot all be true at once. The contradiction lives not in any pair but in the *loop*. No pairwise check will ever find it, because pairwise, everything is fine. You can stack up “they agree” reports all day and still be sitting on an irreconcilable view.
This is not an exotic edge case. It is the normal texture of disagreement among many sources, and it is exactly the kind of thing that slips through “did everyone respond?” untouched.
The instinct here is to reach for a familiar fix: just average them, or take a vote. Run five models, go with the majority. But averaging *hides* incoherence rather than resolving it — it manufactures a confident-looking number precisely by erasing the disagreement that was the most important signal in the room. (That deserves its own essay, and it’ll get one.) Smoothing over the seams is not the same as understanding them.
So what would it look like to actually measure the thing? This is where my own background stopped being a side interest and became the whole point. There’s a branch of mathematics — sheaf theory — built for exactly this: how local data on overlapping pieces glues into a global whole, and how to detect, precisely, when it *can’t*. It gives you two things you can compute on the outputs of many agents. The first is the **consistent core** — what they genuinely all agree on, the part you can actually trust. The second is the **obstruction** — a measured account of the irreducible disagreement, including the cyclic kind no pairwise comparison sees. You can put a number on it. You can point at *where* it lives.
I won’t pretend this is solved or easy; turning that mathematics into something that runs on real model outputs is most of what I do, and it’s hard. But the conceptual move is the important part, and it’s available to anyone right now, with no new tools: **stop asking whether your system ran, and start asking whether it cohered.**
Whether you can get away with ignoring this depends entirely on what’s at stake. If a chatbot gives a slightly incoherent answer about a movie, who cares — the cost of undetected incoherence is zero. But in finance, in law, in medicine, in any setting where a decision follows from the output, the undetected incoherence *is* the risk. “The models mostly agreed” is not an answer there. It’s a liability wearing the costume of an answer.
The shift I’m arguing for is small to state and large to absorb. We’ve spent years making sure the messages arrive. The next era is about making sure they agree — and being honest, with a number, about how much they don’t.
That number is most of what I think about. More on how you actually compute it soon. And more on the company that currently computes this number, within a pod of AIs, working together on a common goal. The company is called Sheaf and our enterprise API solution - Sheaf Pod.
— Jack

