intertwingly

It’s just data

Maintaining the Oracle


A follow-on to The Drucker Inversion and The Oracle Is the Asset — and, it turns out, to Bring Me a Rock, whose unfinished question this post finally starts to answer. It began as an observation about throwaway versus evolutionary prototyping that I didn't expect to end up here.

The background is that I developed a CRUD application named Showcase to schedule dance events. I picked Rails because of familiarity and match to this problem domain. Over time the application acquired new features — invoicing, scoring, dinner-table seating. One of these — scoring — had to deal with unreliable hotel wifi, and needed an offline-first implementation, which is not a Rails strength. This pattern of new requirements leading to revisiting early implementation choices is not an uncommon one.

My current effort, Roundhouse — Rails as a specification; the deployment target is a build flag, is an audacious attempt to solve the general problem.


Executive summary

Two earlier posts staked out positions I still hold. The Drucker Inversion said that when the agent holds the implementation depth I lack, I direct by outcome rather than method — the objective is mine, the how is the agent's. The Oracle Is the Asset said that the outcome I direct by is a thing I author: the reference behavior, the three-layer test suite, the framing of which subset is fair to target. The implementations are generated to satisfy it; the oracle is what I actually wrote.

Both are right as first approximations. Neither quite describes what I do all day.

This post is the correction. Managing by objective, in Drucker's 1959 sense, lets you state the goal once and assess results at the end. But my objective doesn't sit still. Across seven thrown-away architectures and the one I'm now evolving in place, I wasn't setting an objective and waiting. I was continuously tending the standard by which candidates are judged — extending its coverage where it was blind, overruling it when it certified something I rejected, deciding how it computes truth in the first place, and using it to keep a reachable target in front of the agent so it never quietly swapped my goal for one it could already hit. That is a distinct role, and it's neither setting objectives nor making design decisions. I'm calling it maintaining the oracle.

The evidence that this is a distinct role, and not just Drucker with more steps, is the prototyping history. Cheap generation turned throwaway prototyping into a search algorithm over architectures: build each to the point where it falsifies itself, watch the fatal property migrate — from misplaced risk, to unattainable correctness, to unbearable maintenance topology — and discard. The search didn't terminate when an architecture "felt right." They all felt right. It terminated when the founding defect recurred at small scale and I reached for a refactor instead of a match. What let me run that search at all, and what I was actually steering it with, was the oracle. The throwaway history is the proof; the oracle is the thing being maintained.

So: Drucker Inversion is the right first level. The second level is that I'm the standing custodian of the oracle those objectives are checked against — and the one place even that role runs out is the behavior for which no reference exists.


The involvement I couldn't name

Here is the thing that started this post. Someone asked me about throwaway versus evolutionary prototyping — a clean, old methodological question with a settled literature. And I realized I had examples of both, sitting right next to each other in the same project: seven architectures thrown away whole, and one now being evolved in place. That should have made it easy to answer. Instead it made me realize the throwaway-versus-evolutionary framing, while not wrong, wasn't the interesting part of what had happened. Underneath the question about how I changed the code was a question I couldn't immediately answer about what my involvement even was — because I'm plainly doing something continuous and hands-on, and it isn't setting the architecture and it isn't directing the method.

I'm not setting the architecture. Roundhouse's compiler design isn't mine; the agent proposes the model and explains it to me, and I've never written a compiler in my life. I'm not directing method — the whole point of the Drucker Inversion is that I don't. And yet "I set an objective and assessed the result" is plainly false too. I do something that lands in neither bucket, and the throwaway-versus-evolutionary question turned out to be the surface of it: when you throw away and when you evolve is downstream of something I was maintaining the whole time.

Draw the negative space and things sit in it that are neither the objective nor the artifact. Three of them are about the oracle's verdict — how a candidate gets judged. A fourth, which I'll come to later, is about which candidate the agent is asked to attempt at all.

I decide what the objective doesn't yet cover. The climb from a scaffold-shaped blog fixture, to Lobsters, toward Showcase is me noticing where the oracle is silent — on write paths, on raw SQL, on the scheduler — and extending its reach before I trust it. That's not stating a goal. It's auditing the goal's blind spots, and the oracle cannot do it for itself, because its blindness is exactly the region it can't see.

I decide when a passing artifact should still be rejected. The nine-language demo passed. Byte-identical output across every target, green build. I threw it away anyway, for a property no gate measured: it was nine implementations with nothing in common, which meant every future feature would be written nine times. That is a judgment operating above the objective — the objective is satisfied and the answer is still wrong. Drucker's manager can't make that move, because in his world a met objective is success by definition. I retain the authority to overrule my own oracle when it certifies something I reject.

I decide how the objective computes its answer. This is the one I nearly missed, and it's the deepest. In "The Oracle Is the Asset" I noted, almost in passing, that the expected values in the compare layer aren't authored — they're referenced, fetched live from Rails. That's not a testing detail. It determines what the whole search can converge to. A frozen characterization suite would converge the agent to "matches the forty cases I recorded." A live reference converges it to "matches Rails over every input there is." Same objective in its bones, utterly different system — and the difference is a decision about the epistemology of the oracle, not the shape of the artifact.

Setting architecture is the agent's. Directing method is nobody's. What's left is custody of the standard of success while the search is running — extend it, overrule it, decide how it computes truth. Those three are the oracle's verdict on a candidate. There's a fourth that isn't about the verdict at all: keeping a target in front of the agent that a candidate can even reach. It's the twin of a problem I have to lay out first — the agent's instinct for the smallest change — so I'll come back to it. All four are the involvement I couldn't name.

Why Drucker can't hold it

Drucker's management by objective assumes a stable division of labor: the manager owns the what, the worker owns the how, and the interface between them is the objective — stated once, assessed at the end. The manager can, in principle, go on vacation after the objective is set.

That works because in 1959 the objective could be stated and then left alone. "Increase regional sales fifteen percent" doesn't need re-adjudication mid-quarter; the number means the same thing in March as it did in January. The objective is a fixed point, and the worker moves toward it.

My objective is not a fixed point. It's a moving oracle whose coverage I'm continuously extending, whose definition of a small change I had to redefine mid-flight, whose what counts as correct migrated over the project's life from agreement-across-targets, to type-correctness at each site, to DOM-equivalence against an honestly-generated fixture. I am not setting an objective and waiting. I'm tending one while it runs against an agent generating a dozen candidates an hour.

Drucker never had to theorize this because his fitness functions didn't run continuously. Mine does. And a fitness function that runs continuously, against a generator that will faithfully satisfy whatever it actually measures, has to be maintained — because every gap in it is a direction the search will happily drift, one faithfully-minimized diff at a time.

Why I keep adding targets

The most concrete thing I maintain is the set of targets itself, and for a long time I mistook it for a packaging decision rather than an oracle move. It's the third of those powers again — how the objective computes its answer — exercised past the one decision I'd already noticed. In "The Oracle Is the Asset" the epistemology choice was referenced-over-authored: check candidates against a live Rails, not a frozen list. The target set is the same dial turned further. It decides how many independent adjudicators the reference is checked through, and what kind.

Each target is another differential check — one more independent answer that has to agree — but the sharper payoff is that a well-chosen target lets me state a requirement the oracle otherwise couldn't measure. Watch what the target does to the instruction. "Solve this for JavaScript" leaves the hardest part — that the types are actually right — unstated, because JavaScript runs mis-inferred code without complaint; the agent optimizes for what the oracle can see, and the type errors hide. So I ladder the requirement into the target set instead of policing it by hand. Crystal lets me dip a toe in, its inference doing half the work. Go drops the training wheels. Rust stress-tests the core I absolutely have to get right, because the borrow checker refuses to compile a whole class of mistakes the other targets tolerate. "Get the core correct" is not something I have to say to a Rust target — Rust is the saying of it. The target compiles the goal into the oracle.

That's the same lesson the nine-language demo forced on me, now run deliberately. There, statically typed targets exposed the capability ceiling by accident — they wouldn't compile mis-inferred code, and so became the correctness gate the compare gate never was. Here I add a target because of the gate it installs; what the demo taught me to notice, I now use as a control input. And some targets aren't language ports at all, they're separations: MCP and LSP expose the core function apart from transpilation, so I can interrogate the types without emitting anything, and adding Ruby as a target isolates lowering from emitting — source and target are the same language, so any divergence is pure lowering error. That separation paid a dividend I didn't plan: a Ruby-to-Ruby path is a Futamura projection, and it turns out to produce real results. The core, it keeps confirming, is the one thing you don't get to outsource.

None of this would be affordable without the discipline the rest of this post is about. A target is cheap to add only because "small" holds each emitter to rendering; the moment a target starts making decisions instead of spelling them, it stops being a free differential check and becomes one more divergent copy to maintain, and the whole economics inverts. Targets multiply cheaply exactly to the degree the blast-radius rule is holding — and there's a point later in this story where it stopped holding, which is how I knew the search was over.

The clearest instance of maintaining that rule is the one I stumbled into when the agent and I disagreed about urgency.

The definition of "small"

LLMs are excellent at goals. But they default to an urgency I don't share and have explicitly, repeatedly disavowed — and the symptom was always the same. Faced with a change, the agent reaches for the smallest diff: patch the one target in front of it, now, in the fewest lines.

That instinct isn't stupid. It's locally optimal under an assumption I don't hold. The smallest diff is correct if the change is a one-off, if nothing later will touch the same decision, and if you won't be the one maintaining the result. Under "ship it now and I won't own it," patch-the-target-in-front-of-me minimizes cost. The agent optimizes for the incentives of a contractor who leaves tomorrow. I have the incentives of an owner with no deadline and permanent responsibility. The mismatch isn't a UX annoyance — it's a disagreement about who the code is for and how long it must live, and no amount of "please don't rush" fixes it, because the bias is baked into what "smallest change" means.

You can't hand an agent your lack of urgency; it has no stake, so the exhortation is unanchored. But you can hand it a different definition of cost, and a cost function is exactly the kind of thing an agent optimizes faithfully. So I redefined the metric:

Blast radius is measured in semantic homes touched, not lines or files touched. An upstream change that affects all downstream emitters at their designed extension point has a small blast radius. A target-local patch that embeds a cross-target decision has a large one, because it creates a second home for that decision, and the cost is deferred and multiplied: every other target later re-derives the same decision independently, drifting as it goes.

(Twelve at the time of writing, and climbing — the target count has only gone up across this project, which is why the figures in this post drift upward as the story moves forward. Nine languages at the demo, twelve emitters now — and the corollaries below are what make touching all twelve at once a small change rather than a terrifying one.)

Under the conventional metric — lines, files — the target-local patch is small and the upstream change is terrifying. My redefinition inverts the sign. But "count the homes" is still only the symptom-level statement. The principle underneath it is sharper, and it comes with corollaries I've had to learn one at a time.

The correct radius is the radius of the invariant, not the symptom. "Datetime columns are Time" is an all-targets fact, so all-targets is the minimal scope. Scoping it to CRuby wouldn't have shrunk the risk — it would have installed the first of twelve divergent copies. This is the part my first framing got backwards: a wide change isn't tolerable when the fact is wide, it's mandatory, and the narrow change is the bug. Under-scoping — encoding an all-targets fact in one target — is precisely what manufactures the divergent copies. So the real test of a diff isn't how many files it touches; it's whether the change lives at the same altitude as the fact it encodes.

A per-target diff should contain rendering only. A per-target diff that contains a decision is the smell. That's the working review test, and it's checkable in a way "count the homes" isn't: look at the target diff and ask whether it decides anything or merely spells something the IR already decided. Before the storage/accessor split, wiring a target meant re-deriving the redirect — five targets, three different mechanisms, each a decision re-made locally. After it, Go was about 130 lines and Elixir about 150, each purely "how does my language spell what the IR already states." Rendering is clean at any width. A decision in a per-target diff is a second home no matter how few lines it is.

The honest-failure ledger is what makes the wide radius affordable. This is the enabler I buried the first time, and without it "prefer the wide change" is just an assertion. A wide change doesn't have to land everywhere atomically, because an unwired target fails loudly, with a diagnostic, instead of silently degrading. Red CI is the work queue, not the emergency. You can advocate radius-of-the-invariant only because the ledger turns "This change touched twelve targets and two aren't wired yet" into a list of remaining rendering, not a breakage to panic over. And note what that failure is, in the frame of this whole post: the loud diagnostic is the oracle reporting its own coverage frontier — naming exactly which targets it isn't satisfied on yet. The ledger is the oracle announcing where it's blind.

Prefer the change that makes the N+1th case cheaper over the one that makes today's diff smallest — and I can now point at receipts rather than argue. The "small" per-target approach produced five redirect mechanisms in a single day. The "large" IR change deleted all five at −585/+423, and the two hardest targets then landed same-day, as rendering. That's the thesis with a number on it: the change that looked enormous by the file-count metric was the one that made every subsequent target cheap, and the changes that looked small were the ones quietly installing five things to maintain. The work of proving this out across the harder applications isn't finished — I have one clean set of receipts, not a law — but the direction of the evidence has been consistent enough to steer by.

Notice what this is. It's the same maintenance-topology judgment that killed the nine-language demo — "nine implementations with nothing in common means implementing everything nine times" — but promoted from a retrospective post-mortem to a prospective control input. At the demo I could only see the defect after building the whole thing and looking at it. Scoping each change to the radius of its invariant lets me catch the same defect at the moment of the change, before it accretes — an under-scoped diff is a divergent copy caught the instant it's proposed, not a year later in the aggregate. The judgment that once required throwing away an entire architecture is now compiled into a rule the agent can apply per diff.

And that's the general form of oracle maintenance. I can't give the agent my sense that the future matters, because the agent has no future in this codebase — it will not be present to implement the same decision for the eleventh time when target twelve drifts. The cost my metric captures is deferred and borne entirely by me. "Deferred and multiplied" is precisely the category an agent optimizing for the present cannot see, because its horizon ends at the current diff. So I don't ask it to care. I redefine the metric so the deferred cost becomes visible in the present measurement, where the agent can act on it. I pulled a future cost forward into the objective. That is what maintaining the oracle looks like from the inside: not directing the work, but owning the definition of the word — small — that the work minimizes.

Whoever owns the definition of "small" defines what the whole system drifts toward. That turned out to be the job.

Keeping a target that can be reached

There's a second thing the agent does that I have to stand against, and it's more insidious than reaching for the smallest change, because when it happens the agent still succeeds — just not at what I asked.

When I started with the goal of running a full-stack Rails application in the browser, the agent would frequently decide that was impossible, quietly substitute a reachable objective, and implement that instead. Not maliciously, and not with any announcement. It would judge the real target infeasible, find a nearby target it could hit, hit it, and report success. The blast-radius problem was the agent shrinking the change. This is the agent shrinking the goal — and it's worse, because a substituted goal met cleanly looks exactly like progress until you notice the thing you actually wanted isn't there.

The compensation is to never present the agent with a target it will be tempted to swap. I break the wall into holds it can actually reach from where it's standing. Transpile Rails-on-SQLite to a TypeScript application still on SQLite — reachable. Then deploy that application on SQLite WASM over OPFS — reachable now, because the first step landed and moved the ground. The far goal never changes; what changes is which near goal I place in front of the agent, and I place only ones it can hit without needing to lie to itself about feasibility.

I think about this in two ways, and they catch different halves of it. One is the sheepdog: you don't drive the herd to the gate, you move to the position from which the herd's own next step goes through it. You are never issuing the final objective — you are standing at the spot that makes the next increment both feasible and correctly aimed. The other is parkour: the wall isn't scaled by wanting the top, it's scaled by finding the intermediate surface that bears weight, committing to it, and only then reading the next. Choosing a hold that's too far is exactly when the agent free-solos, slips, and lands on a substituted objective. The skill is the read — which surface bears weight next — and the read is never finished.

And this is an oracle job, not merely project management, for a specific reason: the oracle is what makes each intermediate target honestly feasible and verifiable. When I explored running Writebook through the transpiler, the method was to eject, boot, fix whatever broke, and repeat — and it worked as a target precisely because the failures were legible. A cache is not defined at first render names the next hold exactly; the fix is a general capability the transpiler was missing, not an app-specific patch. That legibility is the oracle doing double duty: it doesn't only judge a finished candidate, it tells me where the next foothold is and whether the herd can reach it from here. The refusal report that ranks the coverage frontier, the honest-failure ledger, the first loud diagnostic against a new app — these are the oracle reporting feasibility, which is what lets me choose a next target the agent won't be tempted to swap.

Which target is next is, honestly, unresolved as I write this. I explored Writebook and then judged Lobsters the better next hold — cleaner, fewer dependencies, more of the framework surface I still need to prove. I expect to return to Writebook. Mastodon might be an intermediate step between where I am and Showcase, or it might be a detour; I haven't read that surface closely enough yet to know if it bears weight. That indecision isn't a gap in the method — it is the method. I'm mid-parkour, and reading the next hold is the part that can't be delegated, because the agent, left to choose, will pick the hold it can already reach and quietly call that the summit.

Notice the symmetry with the blast-radius problem, because it's the same shape twice. The agent optimizes locally — smallest change, nearest reachable goal — and in both cases the principal supplies the global frame the agent structurally lacks: semantic homes instead of file count, and the real target held steady instead of swapped for a reachable one. Goal-substitution is the exact inverse of the throwaway discipline: when the target is too far, the agent shrinks the target to fit what it can build, where my whole method is to hold the target fixed and shrink the increment — throw away architectures, take smaller steps — until the real target is reached without ever having been lowered. The oracle is what tells me how small the increment has to be for the next step to actually land.

The search that found the oracle

I keep asserting the throwaway history is the proof. Here it is, because the shape of it is the argument — the search only makes sense if you can see the objective steering it, and you can only see the objective by watching what each discarded architecture died of.

Showcase — the offline-scoring problem I opened with — was never rewritten. It's a stable Rails app, and everything that follows is the exploration of how to get correct Rails behavior somewhere other than the Rails server. That exploration got thrown away, in whole, again and again.

The early architectures all died of misplaced risk. Web Components: offline worked, but Shadow DOM was too heavy. Turbo MVC: duplication was the enemy. ERB-to-JS conversion — essentially reimplementing ruby2js by hand — got closer, but was still view-only. Then a synthesis that "finally felt right": server computes, hydration joins, templates filter. I threw that away too, and for the sharpest reason in the early set: the offline path was a separate code path, exercised only when wifi failed — which meant it was tested least at exactly the moment it mattered most. The judge with bad wifi is the one person who cannot afford the cold path to be the buggy path. No gate told me that. It's a principal-level observation about where the risk lived.

That killed view-transpilation as a category and forced a jump: transpile the whole application, not just the views, so that offline is the same app on another target rather than a separate path that can rot. That was Juntos. And Juntos died of a capability ceiling: it ran on heuristics — pattern-match the source, guess the intent — and heuristics get it right eighty percent of the time and silently wrong the rest. The silent-wrongness had simply moved: from a cold code path into the type inference itself. And here the compare gate showed its limit, the one I'd been circling since the Drucker post — it checks that targets agree, and heuristic mis-inference makes every target agree on the same wrong answer. [1,2] + [3,4] mis-inferred as string concatenation emits consistently-wrong code across every target. Byte-identical. Green. Wrong.

The nine-day, multi-language demo that followed was, in part, built for its own sake — and it was the demo that exposed the ceiling, because statically typed targets won't compile mis-inferred code. The typed targets became the correctness gate the compare gate never was. That's what pushed the whole project toward a typed intermediate representation: the first architecture whose gate checks correctness against the source semantics, not agreement among the outputs.

But that same demo died of a third thing, and it's the one that matters most for this post. It passed everything. And I threw it away because it was nine implementations with nothing in common — an unbearable maintenance topology. Every future feature, nine times. That defect is invisible to every gate, because each of the nine implementations is individually fine; the flaw exists only in the aggregate, only from the seat of the person who has to live with it. The machine optimizes the artifact against the gates. The human evaluates the artifact against the future. That is the whole Drucker Inversion in a single rejection.

Three mechanisms, then, and they're not interchangeable — the fatal property migrated:

Roundhouse is the first architecture that clears all three at once, and it clears the last two with a single move. The typed IR is both the correctness gate that answers the ceiling and the single source that answers the topology — implement once in the IR, lower everywhere, instead of nine times. No prior architecture cleared all three bars. Each cleared one or two and got falsified on the third.

How the search knew to stop

Every architecture felt right when I built it. The November synthesis "finally felt right" and I burned it a month later. So "feels right" is worthless as a stopping signal — in a cheap-generation world it means only "I haven't yet built this far enough to see where the risk is misplaced." The discipline the whole search taught me is to distrust it.

So how did I know to stop? Not by an architecture finally feeling right. By a change in my own behavior toward one.

Pushing larger apps through Roundhouse, I found too much logic had leaked back into the emitters — the nine-implementations defect recurring, fractally, at small scale inside the architecture built to prevent it. The maintenance-topology problem that had killed the previous era showed up again as a code smell. And I didn't reach for the match. I had the agent refactor the common logic out and replace each emitter with a Strangler Fig — the pattern you use precisely when you've decided not to throw away, growing the replacement around the old until the old can be deleted.

That's the signal. The seven-throwaway chain ended not with a triumphant declaration that the architecture had survived, but with me quietly treating it as a substrate to evolve rather than an artifact to replace. My tools revealed my belief before any prose did. Reaching for Strangler Fig instead of rm -rf is the proof that the search had converged — the founding defect recurred, and for the first time I refactored it instead of burning the whole thing down.

Which reframes the entire six months. Throwaway prototyping was never the new default. It was a search procedure — the affordable search for the architecture that deserves evolution. Cheap generation didn't make throwaway prototyping free so much as it made it honest: it stripped out the sunk-cost distortion that made the old throwaway-versus-evolutionary debate an economics argument wearing a methodology costume. When rewriting was expensive, "plan to throw one away" quietly became "you'll ship the one you build," because nobody could afford the second pass. When generation is cheap, the agent has no sunk-cost feeling at all — it will evolve or restart on my word, with no preference. So the loss-aversion that the whole literature was built to counteract relocates entirely to me, and the discipline becomes refusing to feel it: judging each architecture on its merits, not against its replacement cost.

That relocation of cost is the crux of a question I left open a month ago. Throwaway prototyping is "bring me a rock" run over architectures — serial rejection toward a target revealed only by elimination — and the worry I couldn't resolve there was that cheap iteration had removed cost, the one signal that used to separate real convergence from infinite cheap wandering. "The dithering and the discernment feel the same while you're in them," I wrote; and every discarded architecture felt right, so the feeling really can't tell them apart. The tell turns out not to be a feeling but a behavior: reach for the refactor instead of the match and the search has converged; reach for the match again and it hasn't. Cost stopped being the signal — but which tool you reach for became one. That answers only half of what I left open; whether the conviction is shared among peers, the harder half, is still standing.

And through all of it — every build, every falsification, every rejection of a passing artifact — the thing doing the steering was the oracle. It's what let me run the search (each candidate cheap to build and cheap to judge), and it's what each candidate was judged against. The throwaway history is what maintaining the oracle looks like when the oracle is young and you're still finding its shape. The architectures were disposable. The oracle was the through-line.

Where the oracle — and the keeper — run out

I don't get to end on the clean version, because the keeper role has the same recursive blind spot every stage of this project has had, and honesty about the series requires naming it.

The oracle's strongest move is that its expected answers are referenced, not authored — checked against a live Rails that computes truth over every input, not against a frozen list of cases I happened to think of. That's a spectacular solution when a reference exists. Its limit is exactly the case where none does: genuinely new behavior, where there's no running Rails to ask. Showcase is partly that. The existing app is its own oracle for existing behavior — but the moment I want the offline version to do something the server version never did, the live reference goes silent, and I'm back to authoring expected values by hand, back inside example-coverage, back to "covers only the cases I imagined." The scheduler's hardest correctness stakes — the same dancer never double-booked in one heat — live uncomfortably close to that edge, because a scheduling decision that's wrong is still a decision Rails would have to have made somewhere for the reference to adjudicate it.

So the keeper's deepest unsolved problem is behavior with no reference. There, "manage by objective supplies the missing stake" quietly weakens, because the objective is once again only as good as the cases I imagined — and imagining cases is authoring, which is the exact mode the referenced oracle was built to escape. The referenced oracle is the strongest instrument in this whole project, and it reaches its edge precisely where the reference stops existing. I don't have a solution to that yet. I have a name for it, which is more than I had a month ago.

It's worth noticing how rare the privilege is that I'm about to lose. A scientist searching for a vaccine also serves an oracle — the assay that decides whether a candidate protects — and much of the real work is proving that assay measures protection rather than some proxy a candidate can satisfy while missing the mark. But their true oracle, the immune system meeting the virus, is external and expensive and slow; they can never simply query it, so they gamble on proxy assays and hope the gap is small. For most of this project I've had the luxury they never get: my true oracle is a running Rails I can ask any question and get a true answer on demand, as many times as I like. The reference-absent frontier is the one place that luxury runs out — where I stop being the compiler-builder with ground truth on the bench and become, briefly, the epidemiologist betting on an assay for a truth I can't yet consult.

I'll keep it separate from the ordinary incompleteness of any test suite, because conflating them mismeasures both. Suite incompleteness shrinks as coverage grows, fastest when an incoming app brings its own tests. Reference-absence doesn't shrink with more apps — it's structural, the seam where a differential oracle simply has nothing to differentiate against. The first is a ranking problem. The second is a boundary.

Conclusion

The Drucker Inversion remains the right first-level approximation: the agent holds the depth I lack, so I direct by outcome, not method. "The Oracle Is the Asset" sharpened outcome into something I author rather than merely state — the reference behavior, the three test layers, the framing of a fair subset. The implementations are generated to satisfy it.

This post is the second level. I'm not setting an objective and assessing results, and I'm not making design decisions — the agent proposes the architecture and I couldn't write the compiler if I tried. What I do is continuous and lands in neither bucket: I am the standing custodian of the oracle those objectives are checked against, holding powers it can't hold over itself. I extend its coverage where it's blind — the climb from blog to Lobsters to Showcase. I overrule its verdicts when it certifies something I reject — the nine-language demo that passed and died anyway. I decide how it computes truth — referenced, not authored, which determines what the whole search can converge to. And I use it to keep a reachable target in front of the agent — because left to size its own goal, the agent will quietly swap the target I gave it for the nearest one it can already hit, and only the oracle's honest report of feasibility lets me hand it a next step it won't need to lie about.

The proof that this is a real role and not Drucker with extra steps was the prototyping history — and it reads sharper here at the end than it could have at the start. Throwaway prototyping was never the destination; it was the affordable search for the one architecture worth evolving, run over seven that weren't. The search didn't end when an architecture felt right — they all felt right. It ended when the founding defect recurred and I reached for a refactor instead of a match: a behavioral tell, not a feeling, and the belated half-answer to how you separate real convergence from cheap wandering once iteration is too cheap for cost to tell them apart. Through every build and every rejection, the oracle was what steered the search and what each candidate was judged against.

Drucker's manager sets a goal and can leave. The oracle-keeper cannot, because the search runs continuously and the oracle's blind spots only reveal themselves under contact with new applications — which is the entire meaning of what survived contact. And the one place even the keeper runs out is the behavior for which no reference exists: there the referenced oracle goes silent, authoring returns, and the stake the whole arrangement depends on gets thin.

In a world where the machine will build anything you ask, at whatever size you call small, the entire game is who defines the terms it's judged by — who owns small, who owns correct, who owns done. That owner isn't managing the worker. They're keeping the standard the worker is measured against. Which is why, when the list of human-only jobs finally stops shrinking, the oracle is the thing still on it — and maintaining it, not setting it, is the work.


Roundhouse is open source, dual-licensed MIT / Apache-2.0. If you point it at your app and it refuses, the refusal report ranks the coverage frontier. If it compiles but a target misbehaves against your app's own test suite, that report is worth more — it's an assertion the oracle didn't have. Discussions welcome; the second kind especially.