Conformance vs Comprehension

2026-06-27T07:56:05Z

In 1991, a young student from Finland changed the world.

In the mid-1990s I led a team working on software configuration management tools for IBM: loosely, an on-prem, pre-internet GitHub, with version control, issue tracking, and continuous integration. Overly ambitious for its time; trivial by today's standards, in no small part because that same student from Finland came back in 2005 and turned version control itself into a primitive, with Git. Hold that thought; I'll come back to it.

Our target market was enterprise, so we supported all the major Unix systems of the era: Solaris, HP-UX, and of course IBM's own AIX. Our code was C, and every source file opened with a thicket of #ifdefs to paper over OS differences. I saw this new thing called Linux, and one weekend I ported our code to it. My manager appreciated the initiative and, as luck would have it, an executive was coming to town, so she had me present what I'd done to him.

Suffice it to say, he was not impressed. He saw no future in open source, least of all for enterprise customers. I could end the anecdote here by calling him an arrogant fool, or by noting that the distribution I'd happened to choose was Red Hat. But neither is my point.

Given the information available at the time, his call was eminently defensible. Open source was — and remains — chaotic and unpredictable. Enterprises require governance.

What I want to suggest is that he wasn't wrong about open source. He was wrong about where to look. He was governing the artifact (the code, the thing you inspect, certify, and support), and the artifact was not where the durable enterprise value was going to live. More than two decades later, IBM paid thirty-four billion dollars for Red Hat. Red Hat did not sell the kernel. It sold conformance, certification, and indemnification: a way to govern open source without inspecting every line of it. The asset had relocated, from the code to the thing the code answered to, and the old model couldn't see the new location because it was busy guarding the old one.

I've spent a fair amount of my career standing near that relocation, on the side that builds the thing code answers to. Before I retired I spent years in the standards world: co-chairing HTML at the W3C, serving as secretary for Atom at the IETF, sitting on TC39, and convening the C# and CLI working groups at Ecma. That last one matters most for what I want to say today, because standardizing a language and a runtime is not standardizing an artifact. It's standardizing a behavior: the precise set of things an implementation must reproduce to count as conformant. ECMA-334 and ECMA-335 existed so that Mono could be a legitimate implementation of C# and the CLI without a line of Microsoft's source. The spec was the neutral oracle. The implementations were downstream of it.

So conformance-by-testing is the water I swim in. When I notice something, that's usually the shape I notice it in. I'm telling you this not to claim authority over the conclusions — a man with one project and a strong prior should not be claiming authority over anything — but so you understand why a particular thing jumped out at me, and why it might be worth a second look even though my sample size is one.

The project

Here is the thing that jumped out. Over a few months, retired, working with Claude as a co-author, I built a compiler called Roundhouse. It reads a Rails application — untyped Ruby — and emits standalone projects in several statically typed targets: Rust, Crystal, TypeScript, Go, Python, Elixir, plus a Ruby round-trip. The emitted projects compile clean and pass their tests. The way I know they're correct is a conformance oracle: the same URL fetched from Rails and from each target produces byte-identical responses, checked three ways — emitted unit tests against fixed expected values, a differential compare gate against live Rails, and end-to-end browser tests for the dynamic behavior a static diff can't reach.

I want to be careful here, because there are two different claims tangled in that paragraph, and they land very differently in a room like this. Let me pull them apart.
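
As a rough illustration of the differential compare gate described in this section, here is a minimal Python sketch. It is not Roundhouse's code. The names compare_gate, Mismatch, reference_app, faithful_target, and sloppy_target are all hypothetical, and the "backends" are in-memory stand-ins so the example runs without a network or a Rails app.

```python
import hashlib
from typing import Callable, Dict, List, NamedTuple

# Any callable that maps a URL path to raw response bytes can act as a backend.
Fetch = Callable[[str], bytes]


class Mismatch(NamedTuple):
    target: str
    path: str
    expected_sha256: str
    actual_sha256: str


def digest(body: bytes) -> str:
    """Short, printable fingerprint of a response body."""
    return hashlib.sha256(body).hexdigest()[:12]


def compare_gate(paths: List[str], reference: Fetch,
                 targets: Dict[str, Fetch]) -> List[Mismatch]:
    """Fetch every path from the reference and from each target.

    A target passes a path only if its bytes equal the reference bytes exactly.
    """
    mismatches = []
    for path in paths:
        expected = reference(path)
        for name, fetch in targets.items():
            actual = fetch(path)
            if actual != expected:
                mismatches.append(
                    Mismatch(name, path, digest(expected), digest(actual)))
    return mismatches


# Hypothetical pages standing in for responses from a live reference app.
PAGES = {"/posts": b"<h1>Posts</h1>\n", "/posts/1": b"<h1>Hello</h1>\n"}


def reference_app(path: str) -> bytes:
    return PAGES[path]


def faithful_target(path: str) -> bytes:
    return PAGES[path]


def sloppy_target(path: str) -> bytes:
    # Differs only by one trailing space, which a byte-level gate still rejects.
    if path == "/posts/1":
        return PAGES[path].rstrip() + b" \n"
    return PAGES[path]


failures = compare_gate(
    list(PAGES), reference_app,
    {"faithful": faithful_target, "sloppy": sloppy_target})
```

In a real gate the Fetch callables would presumably wrap HTTP requests against running servers; the comparison itself stays this simple, which is the point of a byte-identical oracle.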

The first claim: the axis might be wrong

The retreat's findings land, in the section on languages for agents, on a sensible-sounding conclusion: languages that favor expressiveness over safety make both agent generation and human review harder. The room converged on strong static typing as a guardrail for agent output — make incorrect code unrepresentable.

Roundhouse is a small piece of evidence that the axis might be drawn in the wrong place. It performs whole-program type inference over an untyped Rails application and emits into typed targets that hold the safety property — without anyone writing a type annotation. The safety is real; the annotation tax is zero.

Now, I don't want to oversell this, because the typed-language people in this room will rightly hand me a counterexample if I do. Type inference dissolving the annotation tax is not news — the ML tradition has shown it for forty years. The honest version of my claim is narrower and, I think, more defensible: Rails was already typed. has_many :comments is a type declaration; the conventions of the framework are implicit type information that was simply never written down. Whole-program inference can recover it. So the axis I'd put back to the room is not "expressive versus safe." It's "annotated versus inferred" — and a related question the project raises by existing: does the artifact the agent generates into have to be the same artifact a human edits? In Roundhouse it isn't. I edit the dense, intent-describing source; the compiler emits the verbose, mechanism-describing target. Neither audience pays the other's verbosity cost.
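
To make "has_many :comments is a type declaration" concrete, here is a deliberately small Python sketch that reads association declarations out of a model's source and writes down the field types they imply. It is an illustration under heavy assumptions, not how Roundhouse performs inference: the regular expression, the naive_singular helper (a crude stand-in for Rails' inflector), and the Article model are all hypothetical, and options such as class_name or through, as well as nullability, are ignored.

```python
import re
from typing import Dict

# Matches lines such as "  has_many :comments" in a Rails model.
ASSOCIATION = re.compile(r"^\s*(has_many|has_one|belongs_to)\s+:(\w+)")


def camelize(snake: str) -> str:
    """cover_image -> CoverImage"""
    return "".join(part.capitalize() for part in snake.split("_"))


def naive_singular(word: str) -> str:
    # Crude stand-in for Rails' inflector: handles only a plain trailing "s".
    return word[:-1] if word.endswith("s") else word


def infer_fields(model_source: str) -> Dict[str, str]:
    """Map each declared association to the type it implies."""
    fields = {}
    for line in model_source.splitlines():
        match = ASSOCIATION.match(line)
        if not match:
            continue
        kind, name = match.groups()
        if kind == "has_many":
            # A collection of the singularized, camelized model.
            fields[name] = f"list[{camelize(naive_singular(name))}]"
        else:
            # belongs_to and has_one both point at a single record.
            fields[name] = camelize(name)
    return fields


# Hypothetical model source used only for this example.
ARTICLE = """
class Article < ApplicationRecord
  belongs_to :author
  has_many :comments
  has_one :cover_image
end
"""

article_fields = infer_fields(ARTICLE)
```

Nothing in the Ruby source is an annotation in the usual sense, yet each declaration pins down a field name and the shape of what it holds. That is the implicit type information the paragraph above refers to.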

That's the first claim, and it's a genuine, falsifiable disagreement with one of the retreat's conclusions. But it's not the one I think matters most today.

The second claim: the cost collapsed

The claim that matters is about what it cost to build that.

Whole-program type inference over a real framework, with a multi-target conformance oracle, is the kind of thing that used to require a funded compiler team, or a vendor, or a research lab, or — in the case I know best — a standards body with multiple member companies and a multi-year process. Standardizing the CLI so that Mono and Microsoft answered to the same spec took a working group and years. The conformance oracle in particular — the executable part that mechanically decides whether an implementation is correct — was always the expensive half of standardization, the part that often never got fully built. Test262 for JavaScript came long after the prose spec, and it was a real lift.

What Roundhouse demonstrates, as a sample of one, is that the expensive half has gotten cheap. One retired person and an agent, in months, produced the conformance oracle and the multi-target compiler that answers to it. That is the observation I'd ask this room to sit with — not "AI writes code faster," which everyone here has already metabolized, but "AI has collapsed the cost of authoring the artifacts that used to require institutions."

And the moment you say a thing got cheaper to produce, this room will reach for the same idea I did: Jevons. When you make a resource cheaper to use, total consumption goes up, not down — the more efficient steam engine didn't reduce coal use, it made coal economical for more things, so we burned more of it. Apply that here and the comforting half of the conclusion writes itself. Cheaper software production does not mean less software or fewer engineers; it means vastly more software, and more demand for the judgment that directs it. We will not author fewer oracles now that authoring one is cheap. We will author orders of magnitude more. That is the demand-side engine underneath everything I'm about to say — it's why the answer to "one person built this" is not "how quaint" but "then there will be a great many of them."
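
A short worked form of the Jevons argument may help show when it holds. The model below is an assumption introduced for illustration, not something the talk specifies: efficiency is \eta, the cost of a unit of useful output falls as 1/\eta, and demand for that output has constant price elasticity \varepsilon.

```latex
% S: useful output demanded, R: resource consumed, k: a positive constant
S(\eta) = k\,\eta^{\varepsilon}, \qquad
R(\eta) = \frac{S(\eta)}{\eta} = k\,\eta^{\varepsilon - 1}, \qquad
\frac{dR}{d\eta} = (\varepsilon - 1)\,k\,\eta^{\varepsilon - 2} > 0
\iff \varepsilon > 1 .
```

Under these assumptions, output always grows as efficiency improves, but total resource use grows only when demand is elastic (\varepsilon > 1). The claim that cheaper authoring means more total engineering judgment consumed is, in this reading, a bet that demand for software is elastic in that sense.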

And notice how that closes the loop with the standards lens. The durable thing in Roundhouse is not the compiler; I couldn't write a compiler, and I can't read the Rust codegen it produces. The durable thing is the oracle: the reference behavior every target is tested against, the thing I actually authored. When the next model writes a better emitter, it still has to pass my oracle. That is exactly the relationship a conformance suite has to its implementations, and exactly the relationship the IBM executive couldn't see between Red Hat's certification and the Linux kernel. The asset is the spec the implementations answer to. What changed between the 1990s and 2026 is not that this became true; it was always true. What changed is that authoring the asset stopped requiring an institution.

Where the rigor went: conformance, not comprehension

The retreat's largest question — the one it says surfaced in nearly every session — was where the engineering rigor goes once the agent writes the code. The room gave five answers: upstream into specifications, into test suites, into type systems, into risk tiering, and into "continuous comprehension." I'll sign two of those without hesitation, set one aside, and spend my time on the one I think is drawn backwards — because the project I built lands squarely on it.

Specs and tests first, because there I'm just agreeing — and in this room, agreeing with people who made the case for test-first development long before there was an agent on the other end of it. The report's sharpest single line is about TDD: tests written before the code stop "a particular mental error where the agent writes a test that verifies the broken behavior." That is exactly right. It's worth saying that this conviction is not universally shared outside these walls — Linus, the same student from Finland this talk keeps circling back to, is famously skeptical of test-first development, and on most days I'd rather stand with him than against him. On this one I'm with the room and against him. Every iteration of Roundhouse was driven test-first, and the discipline did precisely what many of you have argued for years it would: it made it impossible for the agent to declare victory by quietly lowering the bar. The test is the bar, and the agent doesn't get to move it.

But notice which tests, because this is the distinction I think the room undervalues, and it's the whole game. There are two kinds, and we lump them together. A spec test asserts a behavior because the specification says so — it descends from the written authority. A conformance test asserts a behavior because a reference implementation, or the real consumers downstream of it, actually depend on it — it descends from observed behavior. Test262 is a spec suite: it exists because ECMA-262 says so. The Roundhouse oracle is a conformance suite: it exists because Rails does a thing and a correct target must do the same thing, byte for byte. Most of the rigor in real systems lives in the second kind. Most of the prestige attaches to the first.
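
The two kinds of test differ only in where the expected value comes from, which a small Python sketch can show. Everything here is illustrative: target_escape is a hypothetical reimplementation under test, and the standard library's html.escape plays the role of the reference implementation. Neither comes from Roundhouse or Rails.

```python
import html
from typing import List


def target_escape(text: str) -> str:
    """Hypothetical reimplementation under test."""
    replacements = [("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"),
                    ('"', "&quot;"), ("'", "&#x27;")]
    # "&" must be handled first so later entities are not escaped twice.
    for raw, escaped in replacements:
        text = text.replace(raw, escaped)
    return text


def reference_escape(text: str) -> str:
    """Stand-in reference implementation: Python's own html.escape."""
    return html.escape(text, quote=True)


def spec_test() -> bool:
    # Authority: a written rule. The expected value is typed in by hand.
    return target_escape("<b>") == "&lt;b&gt;"


def conformance_test(corpus: List[str]) -> List[str]:
    # Authority: observed behavior. The expected value is whatever the
    # reference produces for each input in the corpus.
    return [s for s in corpus if target_escape(s) != reference_escape(s)]


# Hypothetical corpus; in practice this would be real inputs the system sees.
CORPUS = ["Tom & Jerry", 'say "hi"', "it's <ok>", ""]

spec_ok = spec_test()
conformance_failures = conformance_test(CORPUS)
```

The spec test needs someone to know the right answer in advance. The conformance test needs only a reference and a corpus, which is why it scales to cases nobody thought to write down.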

Let me make the case with the smallest project I can. Roundhouse needed a real Rails app, and the one I most wanted was Mastodon, and Mastodon's views are written in HAML — so "support Mastodon" quietly became "implement HAML." Here is the entire method. I surveyed the HAML features Mastodon actually uses — not the language, the subset Mastodon leans on. I implemented those. I tested them against Mastodon's own source as the corpus. The tests immediately surfaced the gaps — constructs I'd gotten subtly wrong or hadn't covered — and each gap became a small, fast iteration: failing case, fix, green, next.
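
The first step of that method, surveying which features a corpus actually uses, can be sketched in a few lines of Python. This is a rough line-level classifier written for illustration, not the survey the author ran: the FEATURES table covers only a handful of HAML constructs, the SAMPLE templates are invented rather than taken from Mastodon, and lines inside a filter body are miscounted as plain text.

```python
import re
from collections import Counter
from typing import Iterable

# Rough line-level patterns for a few HAML constructs. Order matters:
# "-#" (silent comment) must be tried before "-" (Ruby statement).
FEATURES = [
    ("silent_comment", re.compile(r"^-#")),
    ("ruby_statement", re.compile(r"^-")),
    ("ruby_output", re.compile(r"^=")),
    ("filter", re.compile(r"^:\w+")),
    ("html_comment", re.compile(r"^/")),
    ("element", re.compile(r"^%[\w:-]+")),
    ("implicit_div", re.compile(r"^[.#][\w-]+")),
]


def classify(line: str) -> str:
    """Name the first construct whose pattern matches the stripped line."""
    stripped = line.strip()
    for name, pattern in FEATURES:
        if pattern.match(stripped):
            return name
    return "plain_text"


def survey(templates: Iterable[str]) -> Counter:
    """Count construct usage across every non-blank line of every template."""
    counts = Counter()
    for source in templates:
        for line in source.splitlines():
            if line.strip():
                counts[classify(line)] += 1
    return counts


# Invented templates standing in for a real corpus of .haml files.
SAMPLE = [
    "%h1= title\n.card\n  %p Hello\n  = render 'footer'\n",
    "-# layout note\n- if admin?\n  #sidebar\n    :markdown\n      *hi*\n",
]

usage = survey(SAMPLE)
```

Sorting the resulting counts gives a work list: implement the constructs the corpus leans on most, and leave the ones with a count of zero unimplemented until a real template needs them.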

And here is the part I want this room to sit with, because it is a direct counter to one of your five answers — and, I suspect, to something a good number of you hold dearer than any of the five. The report's continuous-comprehension section quotes someone saying paired programming "solves all of this" — that if it's important to understand the system you should "do it all the time," not in little phases. That is not a fringe view in this room; for many of you, pairing and continuously shared understanding are close to founding commitments, earned honestly over long careers. So let me be exact about where I'm parting company, because it turns on a single word. The comfortable version of my claim is that conformance augments comprehension — that the tests let me get away with understanding the code a little less. That is not what has been happening. For three months, across the whole project, conformance has replaced comprehension. I did not build a thin mental model of the HAML lowering and lean on the suite for the rest; I built none. I never read the generated code, not once, and could not have judged it if I had, because I don't write compilers. There was no architecture I was keeping current with, no system I held in my head, no session in which understanding changed hands. The standing model of what the code is doing — the very thing that section wants to preserve — was never in the loop.

That sounds reckless only until you see that the understanding didn't vanish; it moved. What I understood completely was the problem — which HAML features Mastodon leans on, what a fair subset is, what correct output looks like byte for byte. What I authored was the oracle: the corpus, the expected values, the compare gate. The implementation is that oracle's output — the agent writes code to satisfy the spec I wrote — so reading that code to reassure myself it's right would amount to checking machine output against the input I handed the machine. My control and the agent's freedom meet at the oracle and nowhere else. The day this stopped being theory for me was the day I noticed I find out whether a thing works the moment I finish authoring the test that defines "works," not the moment I run it: a benchmark number I'd aimed the whole project at, I executed for the first time minutes before I wrote it up, and it did exactly what I expected — because the oracle had already checked it, on every layer that mattered, and my own run was a formality. That is what replacement feels like from the inside. First-person confirmation becomes redundant by construction.

It is also why I'm unmoved by the decision-fatigue worry the retreat raises elsewhere — agents producing work faster than humans can say yes to it. I said yes to almost nothing in the HAML work. There were no judgment calls to fatigue me, because the corpus had already decided every case I cared about: the output matched Mastodon's, or it didn't. Decision fatigue is what you get when a human is the oracle. Move the oracle into the test suite and the fatigue goes with it.

So conformance tests are undervalued — that's the affirmative claim. But I owe you the boundary, because a conformance suite has a real failure mode and pretending otherwise would be cheating. It encodes whatever the reference implementation happens to do — its bugs, its accidents, and nothing about the inputs you never thought to try. It tells you that you match the oracle on the cases you tested; it cannot tell you the oracle was right. That is exactly where the written spec earns its keep. When a conformance test is silent, or when conformance and spec disagree, the spec wins — not because it's prestigious, but because it's the only thing that adjudicates the cases the reference implementation left undefined. The honest hierarchy: conformance tests do the overwhelming majority of the work and carry almost none of the cognitive load; the spec is the court of last appeal, reached for exactly when they run out. The mistake the industry keeps making — and the one I think the room half-makes when it reaches for type systems and comprehension as the guardrails — is to lavish its attention on the appellate court while starving the trial court that handles every actual case.
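
The hierarchy described above can be written down as a small decision rule. The Python sketch below is one reading of that paragraph, not an implementation from the project: the adjudicate function and its labels are hypothetical, and None stands for "this authority is silent on the case".

```python
from typing import Optional, Tuple


def adjudicate(observed: Optional[str],
               specified: Optional[str]) -> Tuple[Optional[str], str]:
    """Pick the expected behavior for one case and say who decided it.

    observed:  what the reference implementation does, or None if untested.
    specified: what the written spec requires, or None if it is silent.
    """
    if observed is None and specified is None:
        # Neither authority covers the case; nothing can be asserted.
        return None, "undefined"
    if observed is None:
        # The conformance suite ran out; the spec is the only authority left.
        return specified, "spec: conformance silent"
    if specified is None or specified == observed:
        # The common path: the reference behavior settles it.
        return observed, "conformance"
    # Disagreement: the spec is the court of last appeal.
    return specified, "spec: overrides reference"


routine = adjudicate("<p>hi</p>", None)
appeal = adjudicate("<p>hi</p>", "<p>hi</p>\n")
```

Most cases take the third branch and never reach the spec at all, which matches the claim that the conformance suite does the bulk of the work while the spec handles the exceptions.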

What life is like in 2031

Let me come back to where I started.

In 1991, a young student from Finland changed the world. The scarce thing, in 1991, was not really the ability to write a kernel — plenty of people could write a kernel. The scarce thing was the will to do it in the open, the timing, and a world willing to converge on the one who did. The construction was hard, but the construction was never the whole story.

And here's the thought I asked you to hold. He did it twice. In 1991 he gave us the kernel; in 2005 he gave us Git, and in doing so he took the entire product category I'd spent those years at IBM building — version control as a thing you bought — and collapsed it into a primitive you assume. One person dissolved a commercial category into infrastructure. That is the pre-AI proof of the thing I'm claiming. It has always been possible for a single person to have that kind of impact. What was scarce was being that person.

By 2031, the construction will be cheap. Not just writing code — building the things that used to take institutions. Conformance oracles. Multi-target compilers. Domain standards. Whole-program analyses that used to be research. If one retired hobbyist and an agent can build, in a few months, the kind of artifact that took Ecma a working group and years, then by 2031 there will not be one student capable of that kind of impact. There will be hundreds.

So the question I'd leave you with is not "isn't that exciting" — though it is. It's that the scarce input is about to change. When construction was scarce, we organized everything around who could build. When hundreds can build, the scarce thing won't be construction — it will be judgment about what is even worth pinning down, which behaviors are worth making an oracle of. That is the one input the agent still doesn't supply. In 1991 the world had one Linus and rallied to the kernel he chose to build. In 2031 it will have hundreds, each able to author a conformance oracle over a weekend — and the open question, the one I've spent my career on one side of and now find myself on the other, is no longer who can build the thing code answers to. It's whether we still know which things are worth answering to.

I don't know the answer. But I know which room I'd want to be asking it in.

Roundhouse is open source: dual-licensed MIT / Apache-2.0. Issues and discussion welcome.