intertwingly

It’s just data

Numbers Without Conclusions


The previous post, Show Your Work, argued that the right answer to "that can't be real" reactions about benchmark numbers is to show the mechanism — the lowered code, browseable, reproducible — rather than to publish more numbers. The mechanism is still the point. But there's a follow-up question that the mechanism alone doesn't answer: what do those three lowerings actually buy, measured?

This post presents the measurements. It does not present conclusions. The numbers don't conclude anything by themselves, and the workloads I can measure are not your workloads. What follows is a position update — preliminary numbers across eight target surfaces, with an explicit accounting of what those numbers are evidence of, what they aren't evidence of, and which engineering decisions are still in flight.

What was measured

Every target starts from the same Rails source, the Roundhouse real-blog fixture: a small Rails 8 CRUD application with Articles and Comments, AR associations, validations, before_action callbacks, Turbo Streams broadcasts, jbuilder JSON, and ERB layouts and partials. The bench exercises three HTML endpoints (/articles, /articles/1, /articles/new) and two JSON endpoints (/articles.json, /articles/1.json) across eight target surfaces, named as in the tables below: crystal, rust, go, typescript, ruby, ruby-int, rails, and rails-int.

Methodology: wrk -t2 -c64 -d20s --latency, 3 measurement runs per endpoint per target, 20-second warmup before each, single Puma worker × 5 threads for Ruby/Rails targets, single process / multi-thread runtime for compiled targets. SQLite in-process. M4 MacBook Air, clean conditions (load avg < 1.5, 90%+ idle, sustained CPU clock state).

The bench script and the per-run measurement data are at bench/results/ in the repo. Methodology details and the running publication-roadmap discussion live at issue #3.

The numbers

HTML and JSON tables are presented separately because they tell different stories (which is most of the point of this post).

HTML endpoints

/articles — list view with N records + associations rendered

target req/sec p50 (ms) p99 (ms) RSS (MB)
crystal 18,024 3.60 4.30 28
rust 12,291 5.14 10.95 21
ruby 12,536 5.07 5.60 105
typescript 10,759 5.52 11.20 388
ruby-int 8,566 7.43 8.57 90
go 4,131 15.24 22.90 46
rails 1,375 44.58 166.72 320
rails-int 1,305 47.23 162.94 315

/articles/1 — single-record show view

target req/sec p50 (ms) p99 (ms) RSS (MB)
crystal 22,665 2.86 3.60 26
rust 18,950 3.33 6.75 22
ruby 15,213 4.16 4.88 111
typescript 14,610 4.23 8.44 392
ruby-int 5,984 7.85 39.19 96
go 4,814 13.01 19.63 47
rails 1,297 47.39 205.32 335
rails-int 1,254 49.33 76.90 329

/articles/new — form view, no DB

target req/sec p50 (ms) p99 (ms) RSS (MB)
crystal 53,456 1.18 1.64 23
rust 56,121 0.84 78.19* 25
typescript 26,381 2.38 4.24 392
go 22,982 2.65 12.53 58
ruby 21,382 2.88 15.22 117
ruby-int 12,319 5.20 6.30 103
rails 1,803 34.45 58.48 349
rails-int 1,736 33.44 197.61 305

* rust /articles/new p99 = 78ms despite p50 = 0.84ms — single thermal stall during the measurement window. Median is the more meaningful number.

JSON endpoints

/articles.json — list serialized via jbuilder

target req/sec p50 (ms) p99 (ms) RSS (MB)
rust 90,583 0.70 1.37 20
crystal 87,503 0.69 0.94 26
typescript 39,693 1.55 3.16 370
go 34,121 1.92 2.39 47
ruby 17,640 3.63 4.00 96
ruby-int 12,898 4.96 5.28 83
rails 2,309 27.53 32.28 321
rails-int 2,308 27.49 31.56 293

/articles/1.json — single-record serialized

target req/sec p50 (ms) p99 (ms) RSS (MB)
crystal 118,321 0.52 0.77 27
rust 113,476 0.53 1.50 21
typescript 47,609 1.29 3.45 376
go 40,880 1.55 2.31 47
ruby 19,338 3.28 3.66 103
ruby-int 13,977 4.53 4.99 87
rails-int 3,169 19.15 50.79 300
rails 3,073 19.68 52.03 328

What these numbers are evidence of

Four observations the numbers support relatively directly.

1. Same Ruby, same YJIT, ~9× difference. The rails row uses CRuby + Puma + YJIT. The ruby row uses the same Ruby, the same YJIT, the same Puma — running Roundhouse-emitted code instead of stock Rails. Per /articles: 1,375 vs 12,536, a 9.1× delta. Per /articles.json: 2,309 vs 17,640, 7.6×. Same language. Same JIT. Same hardware. The difference is what the JIT sees as input. This is what Show Your Work predicted in qualitative form; the magnitude is now visible.

2. JIT specialization headroom appears only on the lowered shape. Compare ruby (YJIT on) to ruby-int (YJIT off): 12,536 vs 8,566 on /articles — YJIT contributes ~46%. Compare rails to rails-int: 1,375 vs 1,305 — YJIT contributes ~5%, within noise. Same YJIT, same Ruby. The JIT specializes when the input shape lets it; stock Rails' polymorphic dispatch surfaces prevent that work. Roundhouse-emitted Ruby gives YJIT shapes it can do something with.

3. Compiled targets benefit disproportionately from removing the view-render pipeline. On /articles (HTML, with full layout/slot/escape/asset chain), rust hits 12k req/s, about the same as Roundhouse-emit Ruby. On /articles.json (no layout, just controller → AR → jbuilder), rust hits 90k, a 7.4× same-target jump. Go shows the same pattern: 4k HTML → 34k JSON, an 8.3× jump. The same-target HTML→JSON ratios for the Ruby and Rails rows are 1.4-1.8×. The view-render pipeline was eating most of the throughput headroom on the compiled targets; remove it, and the per-language ceiling shows up. The Ruby and Rails rows were already close to their own ceiling on HTML, so there is less to free up.

4. Memory footprint differs by ~16× between Rails and Rust on the same workload. RSS at workers=1, threads=5: Rails ~320MB, Roundhouse-emit Ruby ~100MB, Rust ~20MB, Crystal ~26MB. These are steady-state resident sizes — the cost a Puma worker pins for its lifetime regardless of whether it's actively serving or idle.
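The ratios quoted in the observations above can be recomputed from the table rows. The following is a minimal sketch: the req/sec values are copied from the /articles and /articles.json tables, while the constant and method names are illustrative and not part of the bench script.

```ruby
# Throughput (req/sec) copied from the /articles and /articles.json tables.
REQ_PER_SEC = {
  "/articles" => {
    "rust" => 12_291, "ruby" => 12_536, "ruby-int" => 8_566,
    "go" => 4_131, "rails" => 1_375, "rails-int" => 1_305
  },
  "/articles.json" => {
    "rust" => 90_583, "ruby" => 17_640, "ruby-int" => 12_898,
    "go" => 34_121, "rails" => 2_309, "rails-int" => 2_308
  }
}.freeze

# Throughput ratio of two targets on the same endpoint.
def ratio(endpoint, faster, slower)
  row = REQ_PER_SEC.fetch(endpoint)
  row.fetch(faster).to_f / row.fetch(slower)
end

# Percentage gain of the YJIT-on row over the YJIT-off row.
def yjit_gain_percent(endpoint, jit_on, jit_off)
  ((ratio(endpoint, jit_on, jit_off) - 1.0) * 100).round
end

# Same-target ratio of JSON throughput to HTML throughput.
def json_over_html(target)
  REQ_PER_SEC.fetch("/articles.json").fetch(target).to_f /
    REQ_PER_SEC.fetch("/articles").fetch(target)
end

puts format("ruby vs rails on /articles:      %.1fx", ratio("/articles", "ruby", "rails"))
puts format("ruby vs rails on /articles.json: %.1fx", ratio("/articles.json", "ruby", "rails"))
puts "YJIT gain, emitted Ruby: #{yjit_gain_percent('/articles', 'ruby', 'ruby-int')}%"
puts "YJIT gain, stock Rails:  #{yjit_gain_percent('/articles', 'rails', 'rails-int')}%"
%w[rust go ruby ruby-int rails rails-int].each do |target|
  puts format("%-10s JSON/HTML: %.1fx", target, json_over_html(target))
end
```

These are single-snapshot ratios; the run-to-run variance discussed later applies to every one of them.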

What these numbers are NOT evidence of

This is the longer list, and it matters more.

1. CPU-bound benchmarks; production wall-clock is mostly I/O. Every request in this bench touches local SQLite (in-process, no network roundtrip), renders a tiny dataset, and returns. Real Rails production typically spends 30-60% of per-request wall-clock waiting on remote database queries and another 20-40% on external API calls (payments, email, S3, analytics). For an I/O-heavy endpoint, the framework-CPU fraction these benchmarks measure may be 10-20% of total request time. The other 80-90% is wait time that all targets share equally; throughput differences shrink correspondingly.

The CPU-bound benchmark is representative of:

The CPU-bound benchmark is not representative of:

Your endpoints likely have some of each. The bench shows you the framework-cost component, which is one variable in a multi-variable production budget.
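How much a framework-level gain shrinks once I/O wait is added can be sketched with an Amdahl's-law style calculation. This is an illustration under stated assumptions, not a measurement: it treats per-request wall-clock time as the only quantity, assumes the I/O share is identical across targets, and uses the 9.1× ruby-vs-rails ratio from /articles as the framework gain. The method name is hypothetical.

```ruby
# Overall per-request speedup when only the framework-CPU share gets faster.
#   cpu_fraction:      share of wall-clock spent in framework CPU (0.0..1.0)
#   framework_speedup: how much faster that share becomes (e.g. 9.1)
# The remaining (1 - cpu_fraction) is I/O wait, assumed unchanged.
def end_to_end_speedup(cpu_fraction, framework_speedup)
  1.0 / ((1.0 - cpu_fraction) + cpu_fraction / framework_speedup)
end

[1.0, 0.5, 0.2, 0.1].each do |cpu_fraction|
  speedup = end_to_end_speedup(cpu_fraction, 9.1)
  puts format("framework CPU = %3d%% of request time -> %.2fx end to end",
              (cpu_fraction * 100).round, speedup)
end
```

Under these assumptions, a 9.1× framework gain becomes roughly 1.2× when framework CPU is 20% of request time and roughly 1.1× at 10%. Throughput under concurrency can behave differently, because threads waiting on I/O do not hold the CPU.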

2. Single machine, not a production deployment shape. These numbers are from one process per target on one M4 laptop. They do not measure:

3. The compiled-target numbers reflect implementation choices, not language ceilings. Several known performance gates are filed and unfixed:

Each of these is a known gap with a filed fix. The numbers above are a point-in-time snapshot with those gaps unaddressed; subsequent rounds will land them. If you read this post in three months, the rust and go rows should be substantially higher than what's printed here.

4. The fixture is one Rails app, and the implementation is currently fragile to anything other than that fixture. The real-blog fixture is a small Articles + Comments CRUD app — chosen for compactness and because it exercises the moves Show Your Work described. There are two distinct ways your application might not fit today.

Subset gaps (covered in the Show Your Work checklist) — application-shape patterns that the lowerer doesn't yet handle:

Emit-side fixture-fragility (catalogued in issue #16) — places where the current emit hardcodes real-blog-specific names rather than deriving them from the app's source. Examples: a forced-parens list in the go emitter currently enumerates the literal strings "comments" and "article" — the relation accessor names from real-blog's has_many :comments and belongs_to :article; other association names (posts, categories, user) would not match the list and would emit as bare field reads, failing go vet. The layout selector hardcodes application regardless of layout :foo directives. Asset list expansion bakes the specific stylesheets present at codegen time into the emitted layout. Fourteen items of this kind are filed in #16 today, and the list is explicitly partial.
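The difference between a hardcoded name list and one derived from the app's source can be sketched as follows. Everything here is hypothetical and written for illustration: none of these constants or methods are Roundhouse code, the regex is a crude stand-in for real source analysis, and the emitted Go-like strings only approximate the shape described above.

```ruby
# Hypothetical sketch; not Roundhouse's emitter.

# Fixture-fragile form: accessor names copied from one specific app.
HARDCODED_RELATION_ACCESSORS = %w[comments article].freeze

# Derived form: read accessor names from the app's own association declarations.
ASSOCIATION_MACRO = /^\s*(?:has_many|has_one|belongs_to|has_and_belongs_to_many)\s+:(\w+)/

def relation_accessors(model_source)
  model_source.scan(ASSOCIATION_MACRO).flatten.uniq
end

# Emit a call with parens for known relations, a bare field read otherwise.
def emit_access(receiver, name, relation_names)
  member = "#{receiver}.#{name.capitalize}"
  relation_names.include?(name) ? "#{member}()" : member
end

other_app_model = <<~MODEL
  class Post < ApplicationRecord
    belongs_to :user
    has_many :categories
  end
MODEL

derived = relation_accessors(other_app_model)

puts emit_access("post", "user", HARDCODED_RELATION_ACCESSORS) # bare field read
puts emit_access("post", "user", derived)                      # call with parens
```

With the hardcoded list, the user relation falls through to the bare-field branch, which is the failure mode described above; the derived list picks it up because it comes from the model being transpiled.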

Both of these constraints mean a user pointing bin/rh transpile at their own Rails source will, in most cases today, hit a lowering or emit failure before getting to bench numbers. Roundhouse covers a subset of Rails, and the current emit is calibrated to a specific fixture within that subset. The subset is widening at the lowerer level and the fixture-fragility is shrinking at the emit level, both as forcing functions surface the gaps.

The honest position: your application very likely does not transpile cleanly today, even if its patterns are theoretically within the supported subset. If you're curious whether yours could in principle, the Show Your Work checklist helps locate the subset-gap blockers; the emit-side fixture-fragility is harder to predict without trying. The "try with your own app" workflow is a forward goal, not a current state.

5. The hardware is a single laptop chip; numbers vary across deployment hardware by 2-3× and across runs on the same hardware by 5-30%. M4 boost clocks throttle aggressively under sustained load. The compiled-target rows in particular drift by 20-50% across cool-start vs warm-baseline thermal states. The numbers above were collected with a 5-minute wait for load average to settle below 1.5 — clean by the laptop's standards, noisy by data-center standards. A locked-clock Linux x86 server (cpupower frequency-set -g performance, governor pinned, SMT considered) would tighten variance to perhaps ±2%. Absolute numbers on Ryzen / Xeon / Graviton would be different from M4 — sometimes higher, sometimes lower, depending on per-core IPC and clock budget.

Map your situation onto these numbers

If you want to read what the numbers imply for your situation specifically, here are some questions to bring to the table.

What deployment shape are you in?

Different deployment shapes care about different metrics:

deployment | what bills you | which metric matters
Bare-metal Kamal box | hardware + power + cooling | req/sec at CPU saturation; req/watt
Tiered PaaS (Heroku, Render) | dyno-size step function | does the app fit in the next smaller tier?
Metered cloud (Fly.io, AWS GB-hours, Vercel GB-seconds) | continuous GB-hour billing | req/sec/GB (direct line item)
Edge platforms (Cloudflare Workers, Vercel Edge, Deno Deploy) | per-execution memory cap (128MB–1GB) | does the binary fit in the cap at all?

For a Kamal deployment on owned hardware with ample CPU and RAM, the binding constraint is CPU at saturation, not memory. A 16-core box running 50 Puma workers keeps serving concurrent requests until the DB connection pool or PG's max_connections is exhausted. RAM is a sunk cost, and the GVL forces a fork-per-core process model. A smaller framework helps insofar as it reduces the per-worker baseline that pins memory between requests, but the headline metric is "how many requests per box," not "req/sec/GB."

For a Heroku dyno, the metric is "does the app fit on a Standard-1X (512MB)?", a discrete step function. A Roundhouse-emit Rust binary at 20MB fits with about 96% headroom; a Rails app at 320MB per worker fits only one worker, and a second worker needs the tier above.

For Fly.io, every GB-hour is a line item. req/sec/GB maps directly to monthly cost.

For Cloudflare Workers, memory is capped at 128MB free / 1GB paid. A Rails app does not run there at all; a Roundhouse TypeScript emit does. This is a binary gate, not a gradient.

The metric your bill cares about determines which column you should be reading in the tables above.
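For the metered and tiered cases, the two derived metrics can be computed directly from the /articles table. This is a rough sketch: the req/sec and RSS values are copied from that table, 1 GB is assumed to be 1024 MB, the 512MB tier is the Standard-1X figure mentioned above, and the method names are illustrative. It treats RSS as the only constraint, which real deployments will not.

```ruby
# /articles rows copied from the HTML table above: [req/sec, RSS in MB].
ARTICLES = {
  "crystal" => [18_024, 28], "rust" => [12_291, 21], "ruby" => [12_536, 105],
  "typescript" => [10_759, 388], "ruby-int" => [8_566, 90], "go" => [4_131, 46],
  "rails" => [1_375, 320], "rails-int" => [1_305, 315]
}.freeze

MB_PER_GB = 1024.0 # assumption: binary gigabytes

# Throughput normalized by resident memory.
def req_per_sec_per_gb(req_per_sec, rss_mb)
  req_per_sec / (rss_mb / MB_PER_GB)
end

# How many single-worker processes of this size fit in a memory tier
# (integer division, ignoring any overhead beyond measured RSS).
def processes_per_tier(rss_mb, tier_mb)
  tier_mb / rss_mb
end

ranked = ARTICLES.sort_by { |_, (rps, rss)| -req_per_sec_per_gb(rps, rss) }
ranked.each do |target, (rps, rss)|
  puts format("%-10s %9.0f req/sec/GB  %2d process(es) in 512MB",
              target, req_per_sec_per_gb(rps, rss), processes_per_tier(rss, 512))
end
```

The ranking by req/sec/GB differs from the ranking by raw req/sec, which is why the billing model decides which column to read.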

What does your request profile look like?

These questions don't have right answers — they're meant to help you locate your application on the bench's measurement axis.

The benchmark numbers map onto your request profile only insofar as your request profile is CPU-bound. The framework cost shown here is the floor; production adds I/O on top of it.

Open questions

Things I genuinely don't know the answer to and would welcome input on.

1. What's the right workload mix for a representative bench? Today's bench is local SQLite + tiny fixture + no external services. A more representative workload might include: remote managed Postgres (1ms-5ms RTT), one third-party API call per request (50-200ms), a cache layer (Redis), realistic record sizes (10s of fields, 100s of bytes per row, kilobytes of HTML). I don't have a clear methodology for adding these realistically without introducing measurement noise that swamps the framework signal. Suggestions welcome at issue #3.

2. How should the bench account for production observability overhead? APM agents, structured logging, metrics scraping each cost CPU per request. None of the current numbers include any of these. A bench-with-APM would show smaller deltas between targets; the question is whether that's a more honest measurement or a confounder.

3. Multi-target deployment from one source: real, or curiosity? The architectural claim is that you can ship the same Rails source as a Rust binary to one deployment and a TypeScript bundle to another. Whether this is actually useful for production teams — whether anyone wants two deployment paths from one codebase — depends on workloads I don't have. If you have a real "this would help us" scenario or a real "this would never help us" scenario, both are useful data points.

4. Where does the prepared-statement cache (filed as #12) actually live architecturally? The cache could be hand-written per-target, transpiled from the framework runtime, or built into the per-target sqlite primitives. Each has tradeoffs. The tep collaboration (@OriPekelman) raised related questions in issue #3; they're open.

5. What perf gates land first? #7, #10, #11, #12 each move the needle on at least one row. Ordering depends on what surfaces forcing functions; suggestions in the linked issues are welcome.

What comes next

The forward direction:

This list is the publication roadmap from issue #3, recast as next-step actions. None of them are blocked; all of them are work.

What you should conclude

That's not a question I'm in a position to answer for your application. The benchmark numbers above are diagnostic — they tell me which lowerer to land next and where the implementation gaps sit. Whether they tell you anything actionable depends entirely on the questions in the "Map your situation" section above. If your workload is mostly I/O-bound on remote services, the framework-CPU floor these numbers measure is one variable in a budget the bench doesn't show. If your deployment is bare-metal Kamal with abundant RAM, the memory-footprint column matters less than CPU saturation per box. If you're aiming for Cloudflare Workers, the binary-size gate is what matters and the gradient is irrelevant.

The numbers don't conclude anything by themselves. They show where the mechanism described in Show Your Work has visible consequences in a specific measurement setup. Your conclusions are yours to draw, against your workload, your deployment, and your constraints.

The most natural next instinct, "can I just point this at my own Rails app and see what happens?", is the one I want to manage expectations on most carefully. Today, with very high probability, the answer is no, not yet. The lowerer handles a subset of Rails (named in the previous post) and the current emit is calibrated to one specific fixture within that subset (catalogued in issue #16). Both constraints loosen as forcing functions land, but neither is at the point where arbitrary Rails source produces a clean transpile today. Anyone trying it now will almost certainly find a wall before they find a number. The wall itself is data worth filing, but it isn't a benchmark.

What is useful right now:

Engagement, in order of friction

The most useful thing for the project right now is feedback — yours, against whatever shape of Rails application or deployment you actually have. Different kinds of feedback have different homes:

Discussions are the right place for the open-ended threads:

Issues are the right place for the things that map to concrete work:

Existing threads worth knowing about:

There's no requirement to use the "right" channel — comment in whatever venue fits, and I'll move things if needed. The bigger ask is that the conversation happens at all. The numbers above are interesting to me, but their value to anyone else is something only the eventual second-and-third reader can establish. A skeptical question or a contradicting data point is worth more right now than the unanimous nod the post wouldn't get anyway.

The compilers were ready. The input now exists. The work between "exists" and "runs against your app" is real, ongoing, and bounded, but it isn't done.


Roundhouse is open source: dual-licensed MIT / Apache-2.0. Issues and discussion welcome.