Numbers Without Conclusions

2026-05-25T13:47:52Z

The previous post, Show Your Work, argued that the right answer to "that can't be real" reactions about benchmark numbers is to show the mechanism — the lowered code, browseable, reproducible — rather than to publish more numbers. The mechanism is still the point. But there's a follow-up question that the mechanism alone doesn't answer: what do those three lowerings actually buy, measured?

This post presents the measurements. It does not present conclusions. The numbers don't conclude anything by themselves, and the workloads I can measure are not your workloads. What follows is a position update — preliminary numbers across eight target surfaces, with an explicit accounting of what those numbers are evidence of, what they aren't evidence of, and which engineering decisions are still in flight.

What was measured

Same Rails source (the Roundhouse real-blog fixture) — a small Rails 8 CRUD application with Articles and Comments, AR associations, validations, before_action callbacks, Turbo Streams broadcasts, jbuilder JSON, ERB layouts and partials. Three HTML endpoints (/articles, /articles/1, /articles/new) and two JSON endpoints (/articles.json, /articles/1.json). Eight target surfaces:

rails — stock Rails 8 + Puma + YJIT, as a reference
rails-int — same Rails, YJIT disabled, as a baseline
ruby — Roundhouse-emitted Ruby + Puma + YJIT
ruby-int — same Roundhouse-emit Ruby, YJIT disabled
typescript — Roundhouse-emitted TypeScript via Node + node:http
crystal — Roundhouse-emitted Crystal, AOT-compiled binary
rust — Roundhouse-emitted Rust, AOT-compiled binary (axum + tokio + rusqlite)
go — Roundhouse-emitted Go, AOT-compiled binary (net/http + database/sql + modernc.org/sqlite)

Methodology: wrk -t2 -c64 -d20s --latency, 3 measurement runs per endpoint per target, 20-second warmup before each, single Puma worker × 5 threads for Ruby/Rails targets, single process / multi-thread runtime for compiled targets. SQLite in-process. M4 MacBook Air, clean conditions (load avg < 1.5, 90%+ idle, sustained CPU clock state).

The bench script and the per-run measurement data are at bench/results/ in the repo. Methodology details and the running publication-roadmap discussion live at issue #3.
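For readers who want the shape of one measurement without opening the repo, here is a minimal Python sketch of the loop described in the methodology. It is not the repo's bench script. The helper names are hypothetical, the URL in the usage note is a placeholder, and the warmup is assumed to be a discarded wrk run of the same length. The only thing taken from wrk itself is the "Requests/sec:" summary line it prints.

```python
import re
import subprocess

# Flags copied from the methodology line above.
WRK_FLAGS = ["-t2", "-c64", "-d20s", "--latency"]


def wrk_command(url: str) -> list[str]:
    """Build the argv for one 20-second wrk run against `url`."""
    return ["wrk", *WRK_FLAGS, url]


def parse_requests_per_sec(wrk_output: str) -> float:
    """Pull the number out of the 'Requests/sec:' summary line wrk prints."""
    match = re.search(r"^Requests/sec:\s+([\d.]+)", wrk_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Requests/sec:' line in wrk output")
    return float(match.group(1))


def measure(url: str, runs: int = 3) -> list[float]:
    """Return req/sec for each of `runs` measurement runs.

    Assumption: the warmup is modelled as a discarded wrk run of the same
    length immediately before each measured run.
    """
    results = []
    for _ in range(runs):
        subprocess.run(wrk_command(url), capture_output=True, check=True)  # warmup
        measured = subprocess.run(
            wrk_command(url), capture_output=True, text=True, check=True
        )
        results.append(parse_requests_per_sec(measured.stdout))
    return results
```

Calling measure("http://127.0.0.1:3000/articles") would need wrk on PATH and a target already listening on that (placeholder) address. How the three per-run figures are combined into one table cell is deliberately left out, since the post does not specify it.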

The numbers

HTML and JSON tables are presented separately because they tell different stories (which is most of the point of this post).

HTML endpoints

`/articles` — list view with N records + associations rendered

target	req/sec	p50 (ms)	p99 (ms)	RSS (MB)
crystal	18,024	3.60	4.30	28
ruby	12,536	5.07	5.60	105
rust	12,291	5.14	10.95	21
typescript	10,759	5.52	11.20	388
ruby-int	8,566	7.43	8.57	90
go	4,131	15.24	22.90	46
rails	1,375	44.58	166.72	320
rails-int	1,305	47.23	162.94	315

`/articles/1` — single-record show view

target	req/sec	p50 (ms)	p99 (ms)	RSS (MB)
crystal	22,665	2.86	3.60	26
rust	18,950	3.33	6.75	22
ruby	15,213	4.16	4.88	111
typescript	14,610	4.23	8.44	392
ruby-int	5,984	7.85	39.19	96
go	4,814	13.01	19.63	47
rails	1,297	47.39	205.32	335
rails-int	1,254	49.33	76.90	329

`/articles/new` — form view, no DB

target	req/sec	p50 (ms)	p99 (ms)	RSS (MB)
rust	56,121	0.84	78.19*	25
crystal	53,456	1.18	1.64	23
typescript	26,381	2.38	4.24	392
go	22,982	2.65	12.53	58
ruby	21,382	2.88	15.22	117
ruby-int	12,319	5.20	6.30	103
rails	1,803	34.45	58.48	349
rails-int	1,736	33.44	197.61	305

* rust /articles/new p99 = 78ms despite p50 = 0.84ms — single thermal stall during the measurement window. Median is the more meaningful number.

JSON endpoints

`/articles.json` — list serialized via jbuilder

target	req/sec	p50 (ms)	p99 (ms)	RSS (MB)
rust	90,583	0.70	1.37	20
crystal	87,503	0.69	0.94	26
typescript	39,693	1.55	3.16	370
go	34,121	1.92	2.39	47
ruby	17,640	3.63	4.00	96
ruby-int	12,898	4.96	5.28	83
rails	2,309	27.53	32.28	321
rails-int	2,308	27.49	31.56	293

`/articles/1.json` — single-record serialized

target	req/sec	p50 (ms)	p99 (ms)	RSS (MB)
crystal	118,321	0.52	0.77	27
rust	113,476	0.53	1.50	21
typescript	47,609	1.29	3.45	376
go	40,880	1.55	2.31	47
ruby	19,338	3.28	3.66	103
ruby-int	13,977	4.53	4.99	87
rails-int	3,169	19.15	50.79	300
rails	3,073	19.68	52.03	328

What these numbers are evidence of

Four observations the numbers support relatively directly.

1. Same Ruby, same YJIT, ~9× difference. The rails row uses CRuby + Puma + YJIT. The ruby row uses the same Ruby, the same YJIT, the same Puma — running Roundhouse-emitted code instead of stock Rails. Per /articles: 1,375 vs 12,536, a 9.1× delta. Per /articles.json: 2,309 vs 17,640, 7.6×. Same language. Same JIT. Same hardware. The difference is what the JIT sees as input. This is what Show Your Work predicted in qualitative form; the magnitude is now visible.

2. JIT specialization headroom appears only on the lowered shape. Compare ruby (YJIT on) to ruby-int (YJIT off): 12,536 vs 8,566 on /articles — YJIT contributes ~46%. Compare rails to rails-int: 1,375 vs 1,305 — YJIT contributes ~5%, within noise. Same YJIT, same Ruby. The JIT specializes when the input shape lets it; stock Rails' polymorphic dispatch surfaces prevent that work. Roundhouse-emitted Ruby gives YJIT shapes it can do something with.

3. Compiled targets benefit disproportionately from removing the view-render pipeline. On /articles (HTML, with the full layout/slot/escape/asset chain), rust hits 12k req/s, about the same as Roundhouse-emit Ruby. On /articles.json (no layout, just controller → AR → jbuilder), rust hits 90k: a 7.4× same-target jump. Go shows the same pattern: 4k HTML → 34k JSON, an 8.3× jump. The four Ruby-runtime rows (rails, rails-int, ruby, ruby-int) show same-target HTML→JSON ratios of only 1.4-1.8×. The view-render pipeline was eating most of the throughput headroom on the compiled targets; remove it, and the per-language ceiling shows up. The Ruby-runtime rows were already close to their runtime's own ceiling on HTML, so there is less to free up.

4. Memory footprint differs by ~16× between Rails and Rust on the same workload. RSS at workers=1, threads=5: Rails ~320MB, Roundhouse-emit Ruby ~100MB, Rust ~20MB, Crystal ~26MB. These are steady-state resident sizes — the cost a Puma worker pins for its lifetime regardless of whether it's actively serving or idle.
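The ratios quoted in these observations can be rechecked from the table rows. A small Python sketch, with the req/sec figures copied from the tables and the RSS figures taken as the rounded values quoted in observation 4; the function names are mine, chosen for illustration:

```python
# req/sec figures copied from the tables above.
ARTICLES_HTML = {"rails": 1375, "rails-int": 1305, "ruby": 12536,
                 "ruby-int": 8566, "rust": 12291, "go": 4131}
ARTICLES_JSON = {"rails": 2309, "ruby": 17640, "rust": 90583, "go": 34121}
# Approximate RSS figures (MB) as quoted in observation 4.
RSS_MB = {"rails": 320, "rust": 20}


def times(numerator: float, denominator: float) -> float:
    """How many times larger `numerator` is than `denominator`, to 1 decimal."""
    return round(numerator / denominator, 1)


def percent_gain(with_jit: float, without_jit: float) -> int:
    """Throughput gain from enabling the JIT, as a whole percent."""
    return round((with_jit / without_jit - 1) * 100)


observations = {
    "ruby vs rails, /articles (x)": times(ARTICLES_HTML["ruby"], ARTICLES_HTML["rails"]),
    "ruby vs rails, /articles.json (x)": times(ARTICLES_JSON["ruby"], ARTICLES_JSON["rails"]),
    "YJIT gain on lowered Ruby (%)": percent_gain(ARTICLES_HTML["ruby"], ARTICLES_HTML["ruby-int"]),
    "YJIT gain on stock Rails (%)": percent_gain(ARTICLES_HTML["rails"], ARTICLES_HTML["rails-int"]),
    "rust HTML -> JSON (x)": times(ARTICLES_JSON["rust"], ARTICLES_HTML["rust"]),
    "go HTML -> JSON (x)": times(ARTICLES_JSON["go"], ARTICLES_HTML["go"]),
    "rails vs rust RSS (x)": times(RSS_MB["rails"], RSS_MB["rust"]),
}

for label, value in observations.items():
    print(f"{label}: {value}")
```

Swapping in your own rows gives the same ratios for any pair of targets you care about; nothing here goes beyond division on the published figures.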

What these numbers are NOT evidence of

This is the longer list, and it matters more.

1. CPU-bound benchmarks; production wall-clock is mostly I/O. Every request in this bench touches local SQLite (in-process, no network roundtrip), renders a tiny dataset, and returns. Real Rails production typically spends 30-60% of per-request wall-clock waiting on remote database queries and another 20-40% on external API calls (payments, email, S3, analytics). For an I/O-heavy endpoint, the framework-CPU fraction these benchmarks measure may be 10-20% of total request time. The other 80-90% is wait time that all targets share equally; throughput differences shrink correspondingly.

The CPU-bound benchmark is representative of:

Hot-path cached endpoints (the same /products/:id page served from Redis on every request — no DB roundtrip, just framework work)
Edge / in-region deployments where DB is local (SQLite with Litestream, in-AZ Postgres replicas, embedded DBs)
API-only services with thin models and external-service-free request paths
The CPU floor of any request — every Rails request pays at least this much, regardless of I/O

The CPU-bound benchmark is not representative of:

Traditional Rails apps making remote managed-Postgres calls with two or three third-party API calls per request
Background-heavy workloads where most work happens out-of-band
Reporting / analytics endpoints whose latency is gated by DB query plans, not framework dispatch

Your endpoints likely have some of each. The bench shows you the framework-cost component, which is one variable in a multi-variable production budget.
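To make the "one variable in a budget" point concrete, here is an Amdahl-style estimate in Python. It assumes the I/O wait portion of a request is unchanged by the target switch and that only the framework-CPU share gets faster. The 9.1× figure is the rails-to-ruby ratio on /articles; the function name and the choice of per-request wall-clock as the metric are my assumptions, not something the bench measures.

```python
def end_to_end_speedup(framework_fraction: float, framework_speedup: float) -> float:
    """Estimate the per-request wall-clock speedup when only the framework share shrinks.

    framework_fraction: share of request time spent in framework CPU (0.0 to 1.0).
    framework_speedup:  how many times faster that share becomes.
    """
    if not 0.0 <= framework_fraction <= 1.0:
        raise ValueError("framework_fraction must be between 0 and 1")
    if framework_speedup <= 0:
        raise ValueError("framework_speedup must be positive")
    unchanged = 1.0 - framework_fraction  # database and external-API wait
    return 1.0 / (unchanged + framework_fraction / framework_speedup)


for fraction in (0.10, 0.20, 1.00):
    estimate = end_to_end_speedup(fraction, 9.1)
    print(f"framework share {fraction:.0%}: about {estimate:.2f}x end to end")
```

Under these assumptions a 9.1× framework speedup yields roughly 1.1× to 1.2× end to end when the framework is 10-20% of the request, and the full 9.1× only when the request is entirely framework CPU.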

2. Single machine, not a production deployment shape. These numbers are from one process per target on one M4 laptop. They do not measure:

Multi-worker scaling (workers=1 throughout). Multi-process Rails with Puma workers=N would scale roughly linearly until memory or the DB connection pool caps it. Adding the same scaling to rust and go is mostly irrelevant because they are already multi-threaded, but it matters for typescript and crystal, where the filed perf gates (#8, #9) would unlock per-machine multi-core.
Cluster / orchestrated deployment (Kubernetes, ECS, Kamal). Adding load balancing, health checks, multiple instances, autoscaling — all of these add overhead and variance the single-process bench doesn't see.
Network latency from clients. Local loopback has no realistic network cost.
Production-grade observability (APM agent, structured logging, metrics scraping). Each instruments the request path and costs CPU.

3. The compiled-target numbers reflect implementation choices, not language ceilings. Several known performance gates are filed and unfixed:

#10 — rust uses a single shared Mutex<Option<Connection>> for sqlite. All concurrent requests serialize through it on DB endpoints. This is why rust /articles is "only" 12k. A real connection pool would likely lift it 3-5×.
#11 — go's statement-table is behind a single sync.Mutex. Stdlib sql.DB pools connections, but the layer above serializes lookups. go /articles at 4k is depressed by this; the no-DB endpoint at 23k is closer to the language ceiling.
#7 — go's layout-slot store is behind a sync.RWMutex (a recent race-condition fix). Every text/html response serializes through it. This is most of go's HTML throughput penalty.
#12 — no target except Rails caches prepared statements. Every request re-parses the same SQL. Rails inherits its statement_cache from ActiveRecord; Roundhouse-emit targets don't have one yet.
#8 — TypeScript runs on one Node event loop. No multi-core scaling without cluster.
#9 — Crystal runs on one fiber thread by default. -Dpreview_mt would enable multi-core; not yet wired.

Each of these is a known gap with a filed fix. The numbers above are a point-in-time snapshot with those gaps unaddressed; subsequent rounds will land them. If you read this post in three months, the rust and go rows should be substantially higher than what's printed here.
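The difference between "one connection behind one lock" (#10) and "a pool" is easier to see in code than in prose. The following is an illustrative Python sketch, not the Rust emitter's code and not the planned fix: the ConnectionPool class and demo function are hypothetical, and it uses Python's standard sqlite3 and queue modules. With size=1 it reproduces the serialized shape; with a larger size, requests only wait when every connection is busy. It shows the structure, and makes no claim about the size of the speedup.

```python
import os
import queue
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool of SQLite connections; size=1 degenerates to one shared lock."""

    def __init__(self, database: str, size: int) -> None:
        self._idle: queue.Queue = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False allows handing a connection to another
            # thread; the pool ensures only one thread holds it at a time.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def checkout(self):
        conn = self._idle.get()  # blocks only when every connection is busy
        try:
            yield conn
        finally:
            self._idle.put(conn)

    def close(self) -> None:
        """Close every idle connection; call once no requests are in flight."""
        while not self._idle.empty():
            self._idle.get_nowait().close()


def demo(pool_size: int = 4, requests: int = 16) -> list[int]:
    """Serve `requests` concurrent reads through the pool; return each result."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blog.sqlite3")
        setup = sqlite3.connect(path)
        setup.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
        setup.executemany("INSERT INTO articles (title) VALUES (?)",
                          [("one",), ("two",), ("three",)])
        setup.commit()
        setup.close()

        pool = ConnectionPool(path, size=pool_size)

        def count_articles(_request_id: int) -> int:
            with pool.checkout() as conn:
                return conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]

        with ThreadPoolExecutor(max_workers=8) as executor:
            counts = list(executor.map(count_articles, range(requests)))
        pool.close()
        return counts
```

Calling demo(pool_size=1) and demo(pool_size=4) returns the same answers; the only thing that changes is how many requests can hold a connection at once.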

4. The fixture is one Rails app, and the implementation is currently fragile to anything other than that fixture. The real-blog fixture is a small Articles + Comments CRUD app — chosen for compactness and because it exercises the moves Show Your Work described. There are two distinct ways your application might not fit today.

Subset gaps (covered in the Show Your Work checklist) — application-shape patterns that the lowerer doesn't yet handle:

Custom DSLs that generate methods at class-body time (acts_as_taggable, paper_trail-style audit logs)
Polymorphic associations and single-table inheritance at scale
Heavy Arel composition (dynamic query trees, ransack-style search)
Action Mailer, Action Text, Active Storage at production scale
Background job graphs (Sidekiq with chained jobs, ActiveJob with multiple queues)
delegated_type, abstract base classes
Method-missing-heavy gems

Emit-side fixture-fragility (catalogued in issue #16) — places where the current emit hardcodes real-blog-specific names rather than deriving them from the app's source. Examples: a forced-parens list in the go emitter currently enumerates the literal strings "comments" and "article" — the relation accessor names from real-blog's has_many :comments and belongs_to :article; other association names (posts, categories, user) would not match the list and would emit as bare field reads, failing go vet. The layout selector hardcodes application regardless of layout :foo directives. Asset list expansion bakes the specific stylesheets present at codegen time into the emitted layout. Fourteen items of this kind are filed in #16 today, and the list is explicitly partial.
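As an illustration of what "deriving them from the app's source" could mean for the forced-parens example, here is a hedged Python sketch that collects association accessor names from model source text instead of enumerating them. It is not how Roundhouse works internally: the function names are hypothetical, and a regex over source text is a stand-in for whatever the lowerer's real analysis would provide.

```python
import re

# Rails association macros whose first argument names an accessor.
ASSOCIATION_MACROS = ("has_many", "has_one", "belongs_to", "has_and_belongs_to_many")
_DECLARATION = re.compile(
    r"^\s*(?:" + "|".join(ASSOCIATION_MACROS) + r")\s+:(\w+)", re.MULTILINE
)


def association_names(model_source: str) -> set[str]:
    """Return the accessor names declared by association macros in one model file."""
    return set(_DECLARATION.findall(model_source))


def forced_parens(model_sources: list[str]) -> set[str]:
    """Union of association accessors across all models, derived rather than hardcoded."""
    names: set[str] = set()
    for source in model_sources:
        names |= association_names(source)
    return names


ARTICLE = """
class Article < ApplicationRecord
  has_many :comments, dependent: :destroy
  validates :title, presence: true
end
"""

COMMENT = """
class Comment < ApplicationRecord
  belongs_to :article
end
"""

print(sorted(forced_parens([ARTICLE, COMMENT])))
```

For the two real-blog-shaped models above this yields the same two names the hardcoded list contains; a model declaring has_many :posts would add "posts" with no emitter change.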

Both of these constraints mean a user pointing bin/rh transpile at their own Rails source will, in most cases today, hit a lowering or emit failure before getting to bench numbers. Roundhouse covers a subset of Rails and the current emit is calibrated to a specific fixture within that subset. The subset is widening at the lowerer level and the fixture-fragility is shrinking at the emit level, both as forcing functions surface them.

The honest position: your application very likely does not transpile cleanly today, even if its patterns are theoretically within the supported subset. If you're curious whether yours could in principle, the Show Your Work checklist helps locate the subset-gap blockers; the emit-side fixture-fragility is harder to predict without trying. The "try with your own app" workflow is a forward goal, not a current state.

5. The hardware is a single laptop chip; numbers vary across deployment hardware by 2-3× and across runs on the same hardware by 5-30%. M4 boost clocks throttle aggressively under sustained load. The compiled-target rows in particular drift by 20-50% across cool-start vs warm-baseline thermal states. The numbers above were collected with a 5-minute wait for load average to settle below 1.5 — clean by the laptop's standards, noisy by data-center standards. A locked-clock Linux x86 server (cpupower frequency-set -g performance, governor pinned, SMT considered) would tighten variance to perhaps ±2%. Absolute numbers on Ryzen / Xeon / Graviton would be different from M4 — sometimes higher, sometimes lower, depending on per-core IPC and clock budget.

Map your situation onto these numbers

If you want to read what the numbers imply for your situation specifically, here are some questions to bring to the table.

What deployment shape are you in?

Different deployment shapes care about different metrics:

deployment	what bills you	which row matters
Bare-metal Kamal box	hardware + power + cooling	req/sec at CPU saturation; req/watt
Tiered PaaS (Heroku, Render)	dyno-size step function	does the app fit in the next smaller tier?
Metered cloud (Fly.io, AWS GB-hours, Vercel GB-seconds)	continuous GB-hour billing	req/sec/GB (direct line item)
Edge platforms (Cloudflare Workers, Vercel Edge, Deno Deploy)	per-execution memory cap (128MB–1GB)	does the binary fit in the cap at all?

For a Kamal deployment on owned hardware with ample CPU and RAM, the binding constraint is CPU at saturation, not memory. A 16-core box running 50 Puma workers can serve N concurrent requests until the DB connection pool caps or PG's max_connections fills — RAM is sunk cost, GVL forces fork-per-core. A smaller framework helps insofar as it reduces the per-worker baseline that pins memory between requests, but the headline metric is "how many requests per box," not "req/sec/GB."

For a Heroku dyno, the metric is "does the app fit on a Standard-1X (512MB)?", a discrete step function. A Roundhouse-emit Rust binary at 20MB uses about 4% of that and leaves roughly 96% headroom; a Rails app at 320MB per worker fits once, but a second worker (640MB) needs the tier above.

For Fly.io, every GB-hour is a line item. req/sec/GB maps directly to monthly cost.

For Cloudflare Workers, memory is capped at 128MB free / 1GB paid. A Rails app does not run there at all; a Roundhouse TypeScript emit does. This is a binary gate, not a gradient.

The metric your bill cares about determines which column you should be reading in the tables above.
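For the metered-cloud row, req/sec/GB can be read straight off the /articles table. A minimal Python sketch, assuming 1 GB = 1024 MB and treating single-process RSS as the billed memory (both are simplifications; your platform's accounting may differ):

```python
# (req/sec, RSS in MB) copied from the /articles table above.
ARTICLES_ROWS = {
    "crystal": (18024, 28),
    "rust": (12291, 21),
    "ruby": (12536, 105),
    "typescript": (10759, 388),
    "rails": (1375, 320),
}


def req_per_sec_per_gb(req_per_sec: float, rss_mb: float) -> float:
    """Throughput per GB of resident memory, assuming 1 GB = 1024 MB."""
    return req_per_sec / (rss_mb / 1024)


ranked = sorted(
    ARTICLES_ROWS,
    key=lambda target: req_per_sec_per_gb(*ARTICLES_ROWS[target]),
    reverse=True,
)

for target in ranked:
    value = req_per_sec_per_gb(*ARTICLES_ROWS[target])
    print(f"{target:<11}{value:>12,.0f} req/sec/GB")
```

The ranking by this metric differs from the ranking by raw req/sec: typescript drops below ruby because of its larger resident size. That is the sense in which the billing model picks the column.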

What does your request profile look like?

These questions don't have right answers — they're meant to help you locate your application on the bench's measurement axis.

What fraction of your average request is spent waiting for the database? For external APIs? In Ruby itself? (If you have APM data — New Relic, Datadog, Skylight — this is a chart you've already seen.)
What's the ratio of cached-hit to cache-miss on your most-requested endpoints? Cached hits are CPU-bound; cache misses are I/O-bound.
Are your endpoints serving JSON to an SPA, or HTML to a browser? (Affects which row in the tables maps to you.)
How memory-constrained is your current deployment? Are workers OOM-killing, or do they coast at half capacity?
What concurrency are you actually running? Single-worker laptop tests show different limits than production multi-worker fleets.

The benchmark numbers map onto your request profile only insofar as your request profile is CPU-bound. The framework cost shown here is the floor; production adds I/O on top of it.

Open questions

Things I genuinely don't know the answer to and would welcome input on.

1. What's the right workload mix for a representative bench? Today's bench is local SQLite + tiny fixture + no external services. A more representative workload might include: remote managed Postgres (1ms-5ms RTT), one third-party API call per request (50-200ms), a cache layer (Redis), realistic record sizes (10s of fields, 100s of bytes per row, kilobytes of HTML). I don't have a clear methodology for adding these realistically without introducing measurement noise that swamps the framework signal. Suggestions welcome at issue #3.

2. How should the bench account for production observability overhead? APM agents, structured logging, metrics scraping each cost CPU per request. None of the current numbers include any of these. A bench-with-APM would show smaller deltas between targets; the question is whether that's a more honest measurement or a confounder.

3. Multi-target deployment from one source: real, or curiosity? The architectural claim is that you can ship the same Rails source as a Rust binary to one deployment and a TypeScript bundle to another. Whether this is actually useful for production teams — whether anyone wants two deployment paths from one codebase — depends on workloads I don't have. If you have a real "this would help us" scenario or a real "this would never help us" scenario, both are useful data points.

4. Where does the prepared-statement cache (filed as #12) actually live architecturally? The cache could be hand-written per-target, transpiled from the framework runtime, or built into the per-target sqlite primitives. Each has tradeoffs. The tep collaboration (@OriPekelman) raised related questions in issue #3; they're open.

5. What perf gates land first? #7, #10, #11, #12 each move the needle on at least one row. Ordering depends on which forcing functions surface first; suggestions in the linked issues are welcome.

What comes next

The forward direction:

Replication on locked-clock Linux x86 server hardware, to tighten variance and produce numbers comparable across runs over time. M4 laptop numbers are diagnostic; data-center numbers are the publication baseline.
Workload variants — JSON-only, write-heavy, longer queries — to see whether the cross-target ratios hold or shift.
Landing the filed perf gates (#7, #8, #9, #10, #11, #12) one at a time and re-running the bench between each, so each gate's contribution is observable in isolation.
A more representative fixture — one that exercises Action Mailer, polymorphic associations, heavier Arel, the patterns that today's real-blog doesn't.

This list is the publication roadmap from issue #3, recast as next-step actions. None of them are blocked; all of them are work.

What you should conclude

That's not a question I'm in a position to answer for your application. The benchmark numbers above are diagnostic: they tell me which lowerer to land next and where the implementation gaps sit. Whether they tell you anything actionable depends entirely on the questions in the "Map your situation" section above. If your workload is mostly I/O-bound on remote services, the framework-CPU floor these numbers measure is one variable in a budget the bench doesn't show. If your deployment is bare-metal Kamal with abundant RAM, the memory-footprint column matters less than CPU saturation per box. If you're aiming for Cloudflare Workers, the binary gate (does it fit under the memory cap at all) is what matters and the gradient is irrelevant.

The numbers don't conclude anything by themselves. They show where the mechanism described in Show Your Work has visible consequences in a specific measurement setup. Your conclusions are yours to draw, against your workload, your deployment, and your constraints.

The most natural next instinct ("can I just point this at my own Rails app and see what happens?") is the one I want to manage expectations on most carefully. Today, with very high probability, the answer is no, not yet. The lowerer handles a subset of Rails (named in the previous post) and the current emit is calibrated to one specific fixture within that subset (catalogued in issue #16). Both constraints loosen as forcing functions land, but neither is at the point where arbitrary Rails source produces a clean transpile today. Anyone trying it now will almost certainly find a wall before they find a number. The wall itself is data worth filing, but it isn't a benchmark.

What is useful right now:

Read the lowered code at rubys.github.io/roundhouse/browse/. The mechanism is browseable. Whether the moves described in Show Your Work look applicable to your codebase shape is something you can assess from the source alone, without running anything.
Compare the per-row characteristics in the tables above against your application's profile. If your hot endpoint is JSON-heavy on a remote DB, the JSON rows + the I/O-bound caveat together approximate what you'd see. If your hot endpoint is HTML-heavy on a hot cache, the HTML rows are more directly comparable.
If you have a specific deployment shape with a specific cost-economics question (a Heroku dyno bill that grew faster than your QPS, a Cloudflare Workers ambition that Rails can't satisfy, a Kamal box at CPU saturation), the deployment-shape table above maps which row in which table is relevant.
If you've identified a specific blocker that prevents your Rails app from fitting the subset, file it. The lowerer roadmap is driven by user-named blockers more than by speculation about what might matter.

Engagement, in order of friction

The most useful thing for the project right now is feedback — yours, against whatever shape of Rails application or deployment you actually have. Different kinds of feedback have different homes:

Discussions are the right place for the open-ended threads:

"Here's the shape of my Rails app — does any of this apply?" — a representative sketch of your models, controllers, deployment, and workload is more useful than asking abstractly. The mismatch between what you have and what real-blog exercises is the most direct signal about where the lowerer needs to grow.
"Here's our deployment situation — would Roundhouse change the math?" — Heroku bills, Kamal capacity questions, edge-deployment ambitions, multi-target curiosities. The cost-economics framing in this post is general; mapping it to your specific bill is something I'm happy to think through with you.
"These numbers don't match my production" or "these match exactly" — both are calibration data. The bench measures the framework-CPU floor on one fixture; how that maps to your wall-clock is something only you can report.
"I disagree with the framing of X" — the post has positions; some are more confident than others. Argue with any of them.

Issues are the right place for the things that map to concrete work:

Reproducible bugs — bin/rh transpile <target> panics or produces invalid output for a specific input pattern. The minimal Rails source that triggers it is the gold standard; even a description of the pattern + the error is filable. The omnibus issue #16 is the catch-all for emit-side fixture-fragility; if your case fits there, append to it; if it deserves its own issue, file it.
Lowerer subset gaps — a Rails pattern your app uses that doesn't transpile today. Even if you don't know why it fails, naming the pattern (acts_as_taggable, polymorphic associations with two-level inheritance, etc.) is filable. The lowerer roadmap is paced by which gaps get named most often.
Specific perf observations — if you've run the bench against a fixture with a specific shape and the numbers diverge from what's published here in a way you can characterize, the per-target perf-gate issues (#7 through #12) are the right place to add data points.
Methodology improvements — issue #3 is the open thread on bench design (workload mix, observability overhead, multi-target deployment value, hardware-baseline choices). Concrete proposals there shape the next round of measurements.

Existing threads worth knowing about:

Issue #3 — publication roadmap + cross-project synergy with tep and Spinel.
Issues #7-#12 — six per-target perf gates, with expected bench-impact estimates and fix sketches.
Issue #16 — fourteen-item omnibus of emit-time real-blog specializations; the list is explicitly partial.

There's no requirement to use the "right" channel — comment in whatever venue fits, and I'll move things if needed. The bigger ask is that the conversation happens at all. The numbers above are interesting to me, but their value to anyone else is something only the eventual second-and-third reader can establish. A skeptical question or a contradicting data point is worth more right now than the unanimous nod the post wouldn't get anyway.

The compilers were ready. The input now exists. The work between "exists" and "runs against your app" is real, ongoing, and bounded, but it isn't done.

Roundhouse is open source: dual-licensed MIT / Apache-2.0. Issues and discussion welcome.