The Oracle Is the Asset

2026-06-12T12:55:13Z

Every post in this series so far has reported something you can rerun: a benchmark, a lowering you can read, an archive you can download. This one extrapolates from those measurements to a claim about where frameworks in general are headed. The observations are sourced; the arc connecting them is a bet. I'll mark the seam between the two as I go, and I'll name where the bet is weakest before I argue it's strong, because that's the order in which I'd want to read it.

The bet, in one sentence: frameworks that are runtime libraries today will increasingly become transpilers — and when they do, the durable asset turns out to be the reference behavior they're tested against, not the implementations that produce it.

Why now

A web framework is, operationally, an interpreter for applications written in its conventions. Routes, associations, validations, and templates are data the framework consults on every request. Specialize the interpreter with respect to the program it interprets and you get the compiled program — the partial-evaluation literature has called that the first Futamura projection since 1971. None of this is news, and that's the point: the idea has sat in plain sight for fifty years.
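
A minimal sketch of that specialization, in Python for brevity. The route table, handler names, and the hand-rolled `specialize` function are all hypothetical stand-ins (a real partial evaluator is far more general, and none of this is Roundhouse's code); the point is only that the table lookup done per request by `interpret` is folded into generated source once, ahead of time.

```python
from typing import Callable, Dict

# Hypothetical route table: the "program" the framework interprets.
ROUTES: Dict[str, str] = {
    "/articles": "articles#index",
    "/comments": "comments#index",
}

# Hypothetical handlers, keyed the way the route table names them.
HANDLERS: Dict[str, Callable[[], str]] = {
    "articles#index": lambda: "<h1>Articles</h1>",
    "comments#index": lambda: "<h1>Comments</h1>",
}


def interpret(routes: Dict[str, str], path: str) -> str:
    """Interpreter: consults the route table on every request."""
    target = routes.get(path)
    if target is None:
        return "404"
    return HANDLERS[target]()


def specialize(routes: Dict[str, str]) -> str:
    """Stand-in specializer: emits dispatcher source with the table folded in."""
    lines = ["def dispatch(path):"]
    for path, target in routes.items():
        lines.append(f"    if path == {path!r}:")
        lines.append(f"        return HANDLERS[{target!r}]()")
    lines.append("    return '404'")
    return "\n".join(lines)


# "Compile" once, before any request arrives.
namespace = {"HANDLERS": HANDLERS}
exec(specialize(ROUTES), namespace)
dispatch = namespace["dispatch"]
```

The emitted `dispatch` never mentions the route table: the data the interpreter consulted has become control flow in the residual program.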

Frameworks stayed interpreters anyway, and I'd argue it was never on the merits. A runtime library is cheap to build, trivially correct — it is its own spec — and keeps the open world open. A compiler for a framework's full surface was a decade of engineering with no incremental payoff. The exceptions prove the rule: where a team could actually afford compilation, they chose it. Tom Dale was writing "Compilers are the New Frameworks" in 2017; Svelte built its identity on the move; React — the canonical runtime library — eventually shipped React Compiler. Every one of those is a single-target compiler, built by a well-funded team or a singular obsessive, for a framework designed or redesigned around compilability. The direction was already chosen wherever it was affordable. It just wasn't affordable to many.

What changed is the floor, not the idea. I've written before about what LLM-assisted development did to my own exploration budget: architectural bets that were multi-day commitments became session-sized probes. Roundhouse is what that budget bought — a retrofit compiler, aimed at a framework that actively resists compilation, targeting nine languages, built by one retired person in weeks. It is one data point, and one data point can't carry a claim about frameworks in general. What it can do is establish what the construction cost has fallen to, because five years ago this particular data point could not have existed. The rest of this post asks what follows if that cost drop is real and not a fluke of one project.

The shape it takes

If the bet is right, the form is more specific than "compilers."

Transpilers, not LLVM backends. What got cheap is writing the compiler. What did not get cheap is owning a garbage collector, a scheduler, a standard library, and a debugger. A source-to-source compiler borrows all of it: the previous post in this series measured what happens when you emit plain Ruby and hand it to the JVM — the JIT pays a further 5–6× on top of what compilation bought, thirty years of someone else's engineering enlisted as your backend. Emit Kotlin, Rust, or TypeScript and the same applies, per target. "Modular compiler" turns out to mean the target toolchains are the modules.

Downstream of incumbents, not replacing them. This is the claim in this section I'm least sure of, so let me give it the room it needs rather than the sentence I first gave it. The observation is that LLMs are fluent in the frameworks people already write and poor in anything new — so at exactly the moment syntax becomes a compilation source format, incumbent syntax is the syntax with the deepest model support. That argues for compilers arriving downstream of existing frameworks, often third-party, with the framework's conventions slowly ossifying into a spec.

But I should be honest that I was building downstream-of-incumbent compilers — ruby2js, Juntos — before LLM fluency was anyone's forcing function, for the ordinary reason that the incumbent is where the users are. So the LLM-fluency argument may be over-explaining a thing that has a duller and older cause. I think both are true — the dull cause (users are on the incumbent) sets the direction, and the LLM cause (models are fluent in the incumbent) lowers the cost of going there and deepens the lock-in once you do. But if you make me rank them, the dull one is load-bearing and the LLM one is an accelerant, and I'd rather say that than dress the accelerant up as the engine.

A target set with a floor. The floor is JavaScript/TypeScript, and that's a description, not a prediction: every Rails application already ships JavaScript, and "single-language framework" has always quietly meant "plus the JS we don't talk about." I know the set-of-one case intimately — Juntos was that bet, shipped: when you target exactly one language, you target JS. What compilation changes is that the source goes monolingual while the target set keeps JS, with the seam typed instead of stringly.

And a ceiling set by deployment surfaces, not languages. A deployment target is really a triple: language × runtime profile × data topology. Offline-first moves the data layer into the client. An edge isolate forbids in-process state under a hard memory cap. Mobile means lifecycle and code a platform reviewer can audit. The cheap VPS is no gate at all — just an economics gradient, where the current benchmark round puts the premium for compiling past the floor at one to two orders of magnitude in requests-per-GB. This triple is also, I think, why "just use wasm everywhere" keeps not cohering: wasm answers the language coordinate — portable compute — which was already the easy coordinate. Only a compiler that knows the application has models, validations, and views can re-derive the data topology per surface: the same association lowered against Postgres on the server, local storage in the browser, a sync-aware store on a phone. The multi-surface world doesn't merely permit framework-level compilation. It selects the framework as the compilation unit, because the framework is the lowest layer that still knows what the application means.
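
To make the triple concrete, here is a small sketch under stated assumptions: the `Surface` and `HasMany` types, the topology labels, and the emitted strings are all invented for illustration and are not Roundhouse's representation or output. What it shows is one association declaration lowered three different ways, selected by the data-topology coordinate, not by the language coordinate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    """A deployment target as a triple (labels are hypothetical)."""
    language: str
    runtime_profile: str
    data_topology: str


@dataclass(frozen=True)
class HasMany:
    """A framework-level fact: the owner model has many child rows."""
    owner: str
    children: str
    foreign_key: str


def lower(assoc: HasMany, surface: Surface) -> str:
    """Pick a lowering from the data topology; output is illustrative only."""
    if surface.data_topology == "server-sql":
        return f"SELECT * FROM {assoc.children} WHERE {assoc.foreign_key} = ?"
    if surface.data_topology == "browser-local":
        return f"localStore('{assoc.children}').filter(r => r.{assoc.foreign_key} === id)"
    if surface.data_topology == "device-sync":
        return f"syncStore.query('{assoc.children}', {{ {assoc.foreign_key}: id }})"
    raise NotImplementedError(f"no lowering for topology {surface.data_topology!r}")


comments = HasMany(owner="Article", children="comments", foreign_key="article_id")
server = Surface("rust", "long-lived process", "server-sql")
browser = Surface("typescript", "offline-first", "browser-local")
phone = Surface("kotlin", "mobile lifecycle", "device-sync")
```

A portable bytecode could carry any one of those three strings; only something that knows `comments` is an association can choose among them.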

The interpreter's three jobs

The interpreter at the heart of today's frameworks does three jobs: it's the production runtime, the development feedback loop, and the semantic authority. Compilation takes those jobs apart and gives them to different owners.

Production goes to the emitted targets. That's the whole performance story this series has measured, and I won't re-litigate it here.

The feedback loop is the surprise. It never actually belonged to interpretation — what developers valued was loop latency, and interpretation was just how you got it in 1995. I measured the modern alternative in the Juntos work: full-stack Capybara-style system tests at 47ms each in a real browser, re-running on every save, against 425ms and flaky under Rails/Selenium. HMR patches a running app with state intact; the Rails dev loop reloads the page. The floor target's toolchain — Vite, Vitest, Playwright, thirty years of the entire industry's DX investment forced through the browser — now beats the interpreter at the thing the interpreter was supposedly for. The floor of the target set and the ceiling of the developer experience are the same place, and not by coincidence: the same gate that made JS obligatory concentrated the world's tooling investment on it.

So what's left for the interpreter? One job: authority. The reference implementation survives as the referee — the behavior every emitted target is tested against. But "tested against" is where I have to be careful, because the strength of this whole thesis lives in the structure of that testing, and it's easy — I did it myself in an earlier draft — to collapse it to the one layer that quotes well.

The oracle is three layers, not a DOM diff. Roundhouse pins each target down three ways, and they fail in different directions on purpose:

Emitted model and controller tests, against fixed expected values. Minitest-style tests for validations, associations, CRUD, redirects, and 422-on-invalid-params are written into the source and compiled alongside the app, so each target runs its own translated suite natively. These don't reference Rails at all: they assert specific behavior, so a target can't pass by coincidentally matching whatever the reference happened to emit. This is the floor, and it's the layer I'd most want a skeptic to look at, because it's the one that doesn't have the differential gate's blind spot.
Compare tests, against live Rails. Fetch the same URL from Rails and from a target, canonicalize both DOM trees, diff. This is the differential layer — its job is catching structural drift the fixed-value tests weren't written to anticipate.
End-to-end tests, against fixed expected values, in a real browser. Playwright drives the dynamic behavior a static DOM diff structurally cannot reach: Turbo Stream comment inserts, Action Cable broadcasts across tabs, validation re-renders, computed Tailwind. Fixed expectations again, not a live diff.
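
The differential layer is the easiest of the three to sketch. The following is a minimal illustration of "canonicalize both DOM trees, diff" using only Python's standard library; the token format, the `IGNORED_ATTRS` set, and the function names are assumptions made for the example, not a description of how Roundhouse's compare gate is implemented.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Hypothetical: attributes expected to differ between runs and so ignored.
IGNORED_ATTRS = frozenset({"nonce"})

Token = Tuple


class Canonicalizer(HTMLParser):
    """Flattens markup into tokens with sorted attributes and collapsed whitespace."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.tokens: List[Token] = []

    def handle_starttag(self, tag, attrs):
        kept = sorted(
            ((k, v) for k, v in attrs if k not in IGNORED_ATTRS),
            key=lambda kv: kv[0],
        )
        self.tokens.append(("open", tag, tuple(kept)))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:  # drop whitespace-only text nodes
            self.tokens.append(("text", text))


def canonicalize(html: str) -> List[Token]:
    parser = Canonicalizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def first_divergence(
    reference_html: str, target_html: str
) -> Optional[Tuple[int, Optional[Token], Optional[Token]]]:
    """Return (index, reference token, target token) at the first mismatch, else None."""
    ref, tgt = canonicalize(reference_html), canonicalize(target_html)
    for i in range(max(len(ref), len(tgt))):
        a = ref[i] if i < len(ref) else None
        b = tgt[i] if i < len(tgt) else None
        if a != b:
            return (i, a, b)
    return None
```

Attribute order and indentation are normalized away, so only structural or textual drift registers. Note what the sketch cannot see: anything that happens after the page loads, which is why the end-to-end layer exists.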

The interpreter is demoted from runtime to spec. That sounds like a loss of standing; I'll argue it's the most durable position in the arrangement. But before that I have to say the thing that makes "durable asset" too weak a phrase for what the oracle actually is, because in this project the oracle isn't a layer I added to check the work. It's the layer I authored, and the work is what passed it.

I didn't write Roundhouse. I can't write a compiler; I've never written one. I don't read the Rust codegen against V8 internals, and I haven't needed to. What I wrote was the oracle — the fixture, the three test layers, the compare gate, the framing of which Rails subset is fair to target — and Claude Code wrote whatever satisfied it. That's the Drucker Inversion in one project: when the agent holds the implementation depth the principal lacks, the principal directs by outcome, and the outcome is the oracle. "Constrain by outcome, not method" isn't a testing slogan in that arrangement. It's the whole interface. The agent's freedom and my control meet at the oracle and nowhere else, which means the oracle is not downstream of the implementation, checking it. The implementation is downstream of the oracle, generated to satisfy it.

This reframes "the oracle is the durable asset" into something sharper. The oracle isn't durable merely because implementations are cheap to regenerate. It's durable because it's the thing that was written. You don't own the compiler; you own the spec the compiler answers to — and "own" here means "authored," not "kept around." The implementations are its output.

Which is also why the soundness worry I built up in an earlier draft was aimed at the wrong target. What follows is the honest accounting of where even a three-layer oracle stops, which turns out to be a smaller and more ordinary place once you see the oracle as a spec rather than a net thrown over finished code.

Where the oracle stops

Here is the objection I'd lead with if I were reading this rather than writing it — stated at its real size, which is smaller than I first wrote it, because seeing the oracle as a spec rather than a net changes what the gap is.

No finite suite is total, and no spec is complete. The three layers cover each other's blind spots — fixed-value tests don't depend on the reference, the compare gate catches drift the fixed tests didn't enumerate, E2E reaches the dynamic behavior neither static layer can see — and that union is strong on the convention-dense core, because there the conventions are the behavior and all three layers exercise them. What the union doesn't cover is behavior nobody specified: the construct lowered plausibly that no layer asserts on. For a scaffold-shaped app that residue is small. For an app reaching into framework internals, monkeypatching, or leaning on incidental ordering, it's where two targets could diverge with all three layers green.

But notice what that gap actually is, once the oracle is the authored artifact rather than a check applied to an independently-written program. An unenumerated behavior isn't an unchecked region of someone's code — it's an underspecified region of the spec. That's the ordinary condition of all specification, and managing it is exactly what the principal's domain judgment is for. I can't write the Rust, but I know Rails: which features are load-bearing, what a fair subset is, what users actually rely on. That knowledge is what decides which behaviors the oracle must pin down, and a gap in it is a gap in my framing, not a silent failure hiding inside generated code I never read. The worry shrinks from "the implementation might be secretly wrong" to "the spec might be incomplete" — which is true, unremarkable, and the thing specification has always been.

And I have one piece of evidence that this mechanism does what I'm claiming, more convincing to me than any argument. The benchmark at the center of the previous post — emitted Ruby on JRuby, the 54× diagonal — I ran for the first time immediately before posting. Not because I was confident in a hand-written way; because I had nothing to be confident about. I hadn't watched that path work. What I'd done was specify the oracle, and the emit had passed every layer of it: the fixed-value tests in their own language, the compare gate against Rails, the Playwright suite in a browser. So when I finally ran it myself, it did exactly what I expected — not because I'd verified it by hand, but because the oracle had already verified it and my run was a formality. The first-person confirmation was redundant by construction. That is what it feels like when the oracle is the spec and the implementation is its output: you find out the thing works at the moment you author the test that says what "works" means, not at the moment you run it.

And the redundancy isn't only my word for it. The emitted projects ship as self-contained archives whose READMEs are executed verbatim by CI against the published artifact — the same bundle install, seed, boot, and benchmark commands a reader would run, run on every build before anything reaches the site. My manual run reproduced what the archive's own CI job had already established; if it hadn't, the archive wouldn't have published. So "I ran it for the first time and it worked" is not a claim about my luck or my judgment. It's a claim about an artifact that had already been run, by machine, against the spec that defines it — and that anyone can run again with three commands and no install of mine. The receipt is the archive.

The residue is real and gets its own falsifier below. But it's the ordinary incompleteness of a spec, not a crack in the foundation — and I'd have handed you the wrong version if I'd kept treating the compare gate as the whole oracle and the oracle as a net thrown over code I wrote. I didn't write the code. I wrote the net, and the code was woven to fit it.

I'll keep spec incompleteness separate from corpus representativeness, because conflating them mismeasures both. Corpus representativeness is a ranking problem (which gaps to close first), and more apps fix it. Spec incompleteness is a coverage problem: it shrinks as the suite grows, fastest when an incoming app brings its own tests, because then the behavior it cares about arrives already specified and joins the fixed-value layer.

The inversions

With that on the table: here's why the referee position is the durable one anyway.

When compilers are cheap, the implementations stop being the scarce asset. What's scarce is the spec and the oracle — the reference behavior, the differential tests, the documented list of known gaps. And the oracle is portable by the same mechanism as the application: the test suite is itself an emit artifact, one suite compiled alongside every target. That's the asset that doesn't evaporate when the next model writes a better emitter, because the next model's emitter still has to pass it.

That's the first of several places where a traditional cost of being a compiler inverts into an advantage, provided the compilation unit is a convention-rich framework. The others follow in turn, because collectively they're why I think this bet is live rather than romantic.

The save loop turns out to be framework-shaped. The classic objection to develop-interpreted, deploy-compiled is drift; the classic cost of develop-through-the-compiler is incremental compilation, the miserable subproblem that makes compilers expensive to live with. But a Rails app's conventions stratify it so that edit frequency and dependency depth are anticorrelated: the things you touch constantly — views, partials, controller actions — are leaves of the dependency graph, and the things with whole-program blast radius — schema, routes, associations — you touch twice a day. Convention over configuration turns out to be convention over dependence analysis: the framework hands the compiler its invalidation strategy as structure. Juntos exploited exactly this in its Vite integration — views hot-swap, model changes rebuild a small manifest, route changes reload — and the mechanism transfers to compilers with much stronger analysis. And the budget is real: I timed Roundhouse's full pipeline — ingest, whole-program type inference, lowering, emit — at roughly 18ms per thousand lines of source on my laptop, linear from a 1.2K-line app (22ms) to a 12K-line one (222ms). Recompiling everything on every save already fits inside the HMR budget. Incrementality becomes an opportunistic optimization over a batch fallback cheap enough to be the correctness net, rather than a second implementation that can drift.
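
A sketch of what "the framework hands the compiler its invalidation strategy as structure" can look like, plus the budget arithmetic. The three path-to-action rules mirror the ones described for the Juntos Vite integration; the function name, the action labels, and the choice to treat every other path as a full recompile are assumptions for illustration.

```python
from pathlib import PurePosixPath

# Measurements quoted above: (lines of source, full-pipeline milliseconds).
MEASURED = [(1_200, 22.0), (12_000, 222.0)]


def invalidation_for(changed_path: str) -> str:
    """Choose an invalidation action from Rails directory conventions alone."""
    parts = PurePosixPath(changed_path).parts
    if parts[:2] == ("app", "views"):
        return "hot-swap"          # leaf of the dependency graph
    if parts[:2] == ("app", "models"):
        return "rebuild-manifest"  # small, bounded rebuild
    if parts == ("config", "routes.rb"):
        return "reload"            # whole-program blast radius
    # Anything unrecognized falls back to the cheap batch path.
    return "full-recompile"


def ms_per_kloc(lines: int, millis: float) -> float:
    """Pipeline cost per thousand lines of source."""
    return millis / (lines / 1000)


rates = [round(ms_per_kloc(lines, ms), 1) for lines, ms in MEASURED]
```

Both measured points come out near 18ms per thousand lines (18.3 and 18.5), which is what makes the fallback branch safe to take whenever the path rules don't apply.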

Compiler bugs surface at the right time. Develop on the interpreter and compile at deploy, and the compiler runs cold until the worst possible moment. Develop through the compiler and it's exercised on every save by every developer, with lowering failures appearing in the error overlay with a file, line, and column. Dev/prod drift transmutes into target-to-target drift — which the three-layer oracle already polices in CI, each layer on the surface it's built for.

Partiality becomes shippable — and this is also the answer to the soft spot. The honest caveat in every Roundhouse post — your application almost certainly does not transpile today — looks like the thesis's fatal gap. A runtime library that lacks a method fails in production, at request time, in front of users; partial runtimes aren't shippable, which is why runtime frameworks had to implement everything before anyone could adopt them. A compiler that lacks a method refuses at build time, with a location and a count. That makes a partial framework-compiler shippable at every point on its coverage curve — and it makes the boundary self-mapping. The gaps sort into a handful of subsystem-scale design decisions (Active Storage, Action Text) that announce themselves at their declaration sites, and a long tail of small methods the compiler enumerates mechanically: deduplicate the diagnostics by construct and receiver type across a corpus of apps and the punch list ranks itself by evidence, each item at roughly constant cost because a fix lands once in the transpiled framework runtime and reaches every target. The 80/20 point gets discovered by compilation instead of estimated by survey.
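
The self-ranking punch list is mechanical enough to sketch. In this illustration the `Diagnostic` shape, the sample data, and the ranking rule (number of distinct apps first, then number of call sites) are all assumptions; the source only says diagnostics are deduplicated by construct and receiver type across a corpus.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

Key = Tuple[str, str]  # (construct, receiver type)


@dataclass(frozen=True)
class Diagnostic:
    """One build-time refusal (hypothetical shape)."""
    app: str
    construct: str
    receiver_type: str
    location: str


def punch_list(diagnostics: Iterable[Diagnostic]) -> List[Tuple[Key, int, int]]:
    """Deduplicate by (construct, receiver type); rank by apps hit, then call sites."""
    sites: Counter = Counter()
    apps: Dict[Key, Set[str]] = {}
    for d in diagnostics:
        key = (d.construct, d.receiver_type)
        sites[key] += 1
        apps.setdefault(key, set()).add(d.app)
    ranked = sorted(sites, key=lambda k: (-len(apps[k]), -sites[k], k))
    return [(k, len(apps[k]), sites[k]) for k in ranked]


# Invented sample corpus: two apps, four refusals, two distinct gaps.
corpus = [
    Diagnostic("blog", "squish", "String", "app/models/article.rb:12"),
    Diagnostic("blog", "squish", "String", "app/models/comment.rb:4"),
    Diagnostic("shop", "squish", "String", "app/models/product.rb:9"),
    Diagnostic("shop", "in_groups_of", "Array", "app/views/products/index.html.erb:3"),
]
```

Four refusals collapse to two work items, and the one that blocks more apps sorts first without anyone estimating anything.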

This is also part of the answer to the residue above, and it's worth being precise about how much. Compilation makes the representable boundary self-mapping: a construct the compiler can't lower announces itself at build time. The fixed-value emitted tests then catch a construct that's lowered wrongly but plausibly — provided it's a construct the suite asserts on — because the assertion fails in the target's own language regardless of what the reference did. What's left after both is the genuinely unenumerated behavior: lowered plausibly, asserted by nobody. That's the ordinary residue, and it's smaller than the can't-represent-it and wrong-on-a-tested-construct cases that the build refusal and the fixed-value layer already absorb. Partiality converts the first failure mode from a production surprise into a build-time refusal; the fixed-value layer converts the second into a failing assertion; only the third depends on coverage, and it depends on it the way all software does.
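
The three cases reduce to a small decision table. The enum and function below are a hypothetical restatement of the paragraph above, not part of any tool; the detail worth noticing is that for an unasserted construct the pipeline has no way to know whether the lowering was correct, so that input is deliberately ignored on that branch.

```python
from enum import Enum


class Outcome(Enum):
    BUILD_REFUSAL = "refused at build time, with a location"
    FAILING_ASSERTION = "caught by a fixed-value test in the target's own language"
    VERIFIED = "lowered, asserted, and passing"
    UNVERIFIED = "lowered plausibly, asserted by nobody"


def classify(can_lower: bool, asserted: bool, lowered_correctly: bool) -> Outcome:
    """Where a construct lands, given what the compiler and the suite know."""
    if not can_lower:
        return Outcome.BUILD_REFUSAL
    if not asserted:
        # Correctness is unobservable here; only more coverage changes that.
        return Outcome.UNVERIFIED
    return Outcome.VERIFIED if lowered_correctly else Outcome.FAILING_ASSERTION
```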

There's a pattern here, and noticing it is what pushed me from "interesting project" to "worth speculating in public": several traditionally painful properties of being a compiler (needing a dev loop, surfacing its own bugs, having to be complete before it ships) invert when the compilation unit is a framework and the brute-force path stays fast. The honest accounting is that the inversions are strongest where the three-layer oracle is strongest, on the convention-dense core, and thin out into the same unenumerated long tail every test suite thins out into. That's the boundary the next two sections are about.

Where the bet stops

Three boundaries, stated plainly. The first two are about scope; the third is the soundness problem, promoted to a boundary because it deserves to be one.

It stops at request-invariance. Compilation wins exactly where decisions cannot differ between requests — which, as the benchmark series has shown, is almost everything in a Rails-shaped application. Systems whose value is open-world runtime dynamism — plugin marketplaces, tenant-editable behavior — keep genuinely runtime decisions and stay interpreted, or become per-tenant build farms, which is a different product.

It stops at carrying cost. Cheap-to-place is not cheap-to-carry: a portfolio of targets survives only if framework semantics land once and per-target emitters stay thin. That architecture is the economic precondition, and its absence is what the losing attempts will look like — N parallel code generators, rotting in parallel. Haxe is the ghost to name here, and it deserves more than a dismissal: it proved multi-target transpilation technically viable fifteen years ago, with thin per-target emitters and shared semantics, and it stayed niche anyway. So "thin emitters and shared semantics" plainly isn't the differentiator — Haxe had those. What Haxe didn't have, and couldn't have, is a convention-rich single source framework underneath it. A language isn't convention-dense the way Rails is; it can't hand the compiler an invalidation strategy as structure, can't supply a built-in behavioral oracle, can't make its own coverage gaps self-announcing, because there's no canonical reference behavior a Haxe program is supposed to match. The framework supplies the spec, the dependence analysis, and the oracle for free; the language supplies none of them. If this bet beats Haxe's outcome, that's the reason — not the emitters, the substrate. And if I'm wrong that the substrate is enough, Haxe is what this looks like in fifteen years.
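
The "semantics land once, emitters stay thin" architecture can be sketched in a few lines. Everything here is illustrative: the `PresenceCheck` node, the emitter functions, and the emitted TypeScript and Kotlin snippets are invented for the example and are not Roundhouse's intermediate form or output. The one grounded detail is the message text, which is the Rails default for a failed presence validation.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class PresenceCheck:
    """Target-neutral node: the framework decision, made once."""
    model: str
    attribute: str
    message: str


def lower_presence(model: str, attribute: str) -> PresenceCheck:
    """Shared semantics: the error message is fixed here, not in any emitter."""
    return PresenceCheck(model, attribute, "can't be blank")


def emit_typescript(check: PresenceCheck) -> str:
    a = check.attribute
    return (
        f'if (!record.{a} || record.{a}.trim() === "") '
        f"errors.push([{json.dumps(a)}, {json.dumps(check.message)}]);"
    )


def emit_kotlin(check: PresenceCheck) -> str:
    a = check.attribute
    return (
        f"if (record.{a}.isNullOrBlank()) "
        f"errors.add({json.dumps(a)} to {json.dumps(check.message)})"
    )


# Thin per-target emitters: pure formatting, no framework decisions.
EMITTERS: Dict[str, Callable[[PresenceCheck], str]] = {
    "typescript": emit_typescript,
    "kotlin": emit_kotlin,
}


def emit_all(check: PresenceCheck) -> Dict[str, str]:
    return {target: emit(check) for target, emit in EMITTERS.items()}
```

Changing the message or the definition of "blank" is one edit in `lower_presence`; the failure mode described above is the version where each emitter re-derives that decision on its own.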

It stops where the suite stops. The correctness guarantee is exactly as wide as the union of the three test layers, and no test suite is total. On the convention-dense core that union is wide, because all three layers exercise the conventions and the conventions are the behavior. Out in the monkeypatched, internals-reaching, incidentally-ordered long tail, it narrows to whatever those particular layers happen to assert — the ordinary residue of testing, not a structural blindness peculiar to this approach. This is a boundary of the thesis, not a caveat on it: the durable-oracle claim holds over the behavior the oracle covers, widens as the oracle does, and makes no promise past there. A version that forgot the boundary would be selling totality it doesn't have — but a version that inflated the boundary into a soundness crack, which is the mistake I had to back out of writing this, would be selling fragility the three-layer design doesn't have either.

What would prove this wrong

Speculation should name its falsifiers. Mine:

Frameworks ship compilers and drop their reference interpreters without losing correctness — then I'm wrong that the oracle is the durable asset; the implementations were, after all.
Targets diverge in production on behavior all three test layers passed, often enough to matter and not cheaply fixable by adding tests — then the residual incompleteness is worse than ordinary and the durable-asset claim narrows to "durable on the tested surface," which is a smaller claim than the one I'm making.
A wasm story emerges that handles data topology and lifecycle at the framework layer, not just compute — then I'm wrong about the triple, and the floor-and-ceiling structure goes with it.
Multi-target compilation stays a population of one while single-target compilers thrive for years — then the floor-and-ceiling structure survives but the portfolio claim demotes to a curiosity, and Haxe's ghost wins.
Bespoke LLM-generated applications prove maintainable at fleet scale without anyone authoring a shared oracle — then the strongest version of the whole thesis fails, because the stable point between interpreter-at-runtime and LLM-writes-everything turns out not to be stable, and the spec-you-write stops being the thing worth owning.
Long-tail coverage costs compound rather than stay constant — features interacting rather than composing — then the frontier stays mapped but stops being cheap to push, and "the punch list ranks itself" stops being true.

What stays open

Governance, near-term, looks answered by precedent: JRuby tracked Rails for two decades with no formal relationship at all — the trailing implementation follows upstream behaviorally and absorbs the lag as the cost of existing. I'd expect framework compilers to operate the same way, with one change: the lag shrinks, because tracking cost is diff-shaped work the new economics compress, and the diagnostics scope each upstream release to the subset a corpus actually exercises. What the long term looks like — shared conformance suites, a gate upstream acknowledges as a spec, or nothing at all — I don't know, and the near-term model is stable enough that I don't need to.

The corpus is the honest gap, and now I can say precisely which problem it is and isn't. The frequency weights that would rank the coverage frontier come, today, from a handful of applications biased toward blogs and 37signals idiom. That's the ranking gap, and it's the benign one — more apps fix it. The coverage residue — behavior no layer was written to assert — is the ordinary incompleteness of testing, and the thing that shrinks it fastest is an incoming app that brings its own tests, because then the divergent behavior arrives already enumerated and joins the fixed-value layer for free. So the loop that matters most isn't "collect more apps," it's "collect more apps that bring their own behavioral tests." If you point Roundhouse at your app and it refuses, the refusal report ranks the frontier. If you point it at your app and it compiles but a target misbehaves against your own test suite, that report is worth more — it's a new assertion the oracle didn't have, which is the only thing that actually widens coverage rather than reweighting the punch list. Discussions is where both go; the second kind especially.

Twenty years ago the JRuby wager was that Ruby deserved the JVM. The wager here is bigger and correspondingly less certain: that the framework you already use is the source language of a compiler nobody could previously afford to build, that the affordability problem — the only reason it didn't happen decades ago — is over, and that what you end up owning at the end isn't the compiler but the reference behavior every compiler answers to.

There are two claims braided together in that sentence, and it's worth pulling them apart because they're the same claim seen from opposite ends. One is economic: when compilers are cheap, the oracle is the scarce, durable asset. The other is methodological: the oracle is the thing the principal actually authors, and the implementations are generated to satisfy it. The economic claim says implementations are cheap to replace; the methodological one says they were never the thing you wrote. Roundhouse is the existence proof of both at once — a compiler I could not have written, built to pass an oracle I could, by an agent that held the depth I don't. I provided the spec. The code was woven to fit it. And the first time I ran the headline result, it worked, because the oracle had already said it would.

That's one instance, and one instance can't prove the arc. Whether it generalizes is exactly what this post can't settle — and I've tried to be as clear about where it would break as about where it holds, because at this stage those are the same act of honesty. What I can say is that the thing I'd bet on isn't the compiler, or the targets, or the benchmark numbers. It's the net. Write the net well enough and the rest is downstream.

Roundhouse is open source: dual-licensed MIT / Apache-2.0. Issues and discussion welcome.