A companion to what MakerBench
measures: this post reads the first public board the other way round —
starting from the design rationale, then the headline numbers, then the
single most striking pattern in the data. Every figure here is computed
from the live v0.1.0 leaderboard and carries its caveats.

MakerBench grades exported geometry with deterministic math, never an LLM
judge. Pass criteria are derived from each task's realized parameters, so
there is no answer key to memorize. Briefs are short and
outcome-oriented, the substrate is open and headless, and every result is
pinned to a benchmark_version and benchmark_profile
so a core OpenSCAD score is never mixed with a future Fusion or FEA pack.
The full reasoning lives in
DESIGN.md;
the short version is that geometry gives a physical-design benchmark the
unforgiving, computable correctness that text benchmarks have to fake.

Scoring is a cumulative ladder. A cell's score is the highest level it clears in order: a level only counts if every level below it also passed, so a number on the board is a stage of progress, not a checklist tally.

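
The ladder rule can be sketched in a few lines. This is a minimal illustration, not the harness's own code: the function name `ladder_score` and the list-of-booleans input are assumptions made for the example, and the four levels follow the 4.00/4 scale used on the board.

```python
from typing import Sequence


def ladder_score(level_passed: Sequence[bool]) -> int:
    """Return the highest level cleared in order.

    level_passed[i] says whether level i + 1 passed (hypothetical input
    shape). Counting stops at the first failure, so a pass that sits above
    a failed level contributes nothing.
    """
    score = 0
    for passed in level_passed:
        if not passed:
            break
        score += 1
    return score


print(ladder_score([True, True, True, True]))   # all four rungs cleared
print(ladder_score([True, False, True, True]))  # stops at Level 1
print(ladder_score([False, True, True, True]))  # never gets on the ladder
```

A checklist tally would give the second case 3 out of 4; the ladder gives it 1, which is why a board score reads as a stage of progress.
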
On the blind track of v0.1.0, 20 models are measured across
four task families — vented_plate,
enclosure_fastened, sheet_metal_bracket, and
laser_tab_slot_panel. The top blind score is a perfect
4.00/4, and it is a three-way tie rather than a single
leader.

| Model (blind) | Blind mean /4 |
|---|---|
| antigravity-gemini-default | 4.00 |
| claude-code-sonnet | 4.00 |
| codex-gpt-5.5 | 4.00 |
| then a cluster at | 3.25 |

Three models sitting on the ceiling across only four families is exactly the saturation signal the design anticipates: a small, early profile gets cleared quickly, which is why MakerBench tracks continuous quality metrics and a roadmap of harder Frontier packs rather than treating a 4.00 as the end of the story. Read every row with its track, its profile, its version, and its N.

Pool every graded blind cell from the non-control rows and bucket it by the level it stopped at. The distribution does not taper; it splits. Almost everything lands at one of the two extremes of the ladder: a full Level-4 manufacturable part, or a Level-1 mesh that compiles but gets the geometry wrong. The middle rungs are nearly empty.

Of 239 gradable blind cells, 154 (64%) reach Level 4 and 76 (32%) stop at Level 1. Only 8 cells land at Level 3 and none at Level 2. The two poles together hold 230 cells, roughly 96% of the total.

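
The percentages follow directly from the quoted counts, and a short script makes the arithmetic easy to re-check. The counts below are copied from the paragraph above; the variable names are illustrative only. Note that the four quoted counts sum to 238, so the script also reports the remainder, which the text does not assign to a level.

```python
# Counts quoted in the text for the v0.1.0 blind track.
total_cells = 239
stopped_at = {4: 154, 3: 8, 2: 0, 1: 76}


def share(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`."""
    return 100.0 * count / total


for level in sorted(stopped_at, reverse=True):
    count = stopped_at[level]
    print(f"Level {level}: {count:3d} cells ({share(count, total_cells):.1f}%)")

# The two poles of the ladder: full Level-4 parts and Level-1 stops.
poles = stopped_at[4] + stopped_at[1]
print(f"Poles (L4 + L1): {poles} cells ({share(poles, total_cells):.1f}%)")

# Cells in the total that the four quoted counts do not cover.
unlisted = total_cells - sum(stopped_at.values())
print(f"Cells not covered by the quoted counts: {unlisted}")
```
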

This is the empirical motivation for the score-spread calibrators called for in the design notes: when a profile produces a bimodal board, a single mean hides more than it shows, and the benchmark needs tasks whose binding constraint sits at L3 or L4 to spread the field out. The bimodality is a finding about the tasks as much as the models: these four families currently reward an all-or-nothing geometric solve.

A Level-1 score is not a crash. The model wrote OpenSCAD, it compiled, and it produced a single watertight solid; the harness-enforced structural check passed. What failed next was the first per-task, parameter-derived check: the part interfered with itself, or its outer dimensions missed the realized brief. In other words, the spatial reasoning, not the syntax, is where these runs come apart.

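
To make the "outer dimensions missed the realized brief" failure concrete, here is a rough sketch of what a parameter-derived size check could look like. It is not the MakerBench grader: the function names, the 0.5 mm tolerance, and the assumption that the part is axis-aligned with the brief are all hypothetical, and the self-interference check is not covered.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def bounding_box_extents(vertices: Sequence[Vec3]) -> Vec3:
    """Axis-aligned extents (x, y, z) of a mesh's vertex list."""
    xs, ys, zs = zip(*vertices)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))


def outer_dims_match(
    vertices: Sequence[Vec3],
    expected: Vec3,
    tol_mm: float = 0.5,  # hypothetical tolerance, not taken from the benchmark
) -> bool:
    """True if every extent is within tol_mm of the expected dimension."""
    extents = bounding_box_extents(vertices)
    return all(abs(got - want) <= tol_mm for got, want in zip(extents, expected))


def box_corners(x: float, y: float, z: float) -> list:
    """Eight corners of an origin-anchored box, standing in for a mesh."""
    return [(i * x, j * y, k * z) for i in (0, 1) for j in (0, 1) for k in (0, 1)]


# Expected dimensions would come from the task's realized parameters.
realized = (80.0, 50.0, 3.0)
print(outer_dims_match(box_corners(80.0, 50.0, 3.0), realized))  # matches the brief
print(outer_dims_match(box_corners(80.0, 50.0, 5.0), realized))  # plate too thick
```

The point of the sketch is that the expected values are computed per task rather than stored as a fixed answer, so a mesh can be valid and watertight and still fail here.
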
The numbers above are true of v0.1.0 and will not stay true.
When harder Frontier families land, the bimodal split should soften and
the top of the board should come off the ceiling — and when that happens,
these v0.1.0 results are not overwritten. Each
benchmark_version is captured as an archived snapshot, and
the leaderboard's version selector loads those historical
boards on demand. Because the design forbids mixing versions or profiles
in one comparison, the selector swaps the entire table rather than blending
rows. That is what lets the benchmark get harder without erasing the
record of what earlier models could do.
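
The swap-not-blend behaviour can be illustrated with a small sketch. Everything here is an assumption for the sake of the example: the row shape, the function names, the profile label `core-openscad`, and the placeholder `0.2.0` row are hypothetical and do not describe the real leaderboard code or any real result.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
SnapshotKey = Tuple[str, str]  # (benchmark_version, benchmark_profile)


def group_snapshots(rows: List[Row]) -> Dict[SnapshotKey, List[Row]]:
    """Bucket result rows so each (version, profile) pair is its own table."""
    snapshots: Dict[SnapshotKey, List[Row]] = {}
    for row in rows:
        key = (row["benchmark_version"], row["benchmark_profile"])
        snapshots.setdefault(key, []).append(row)
    return snapshots


def select_board(
    snapshots: Dict[SnapshotKey, List[Row]], version: str, profile: str
) -> List[Row]:
    """Return one whole table; rows from other snapshots are never mixed in.

    Raises KeyError if no snapshot exists for the requested pair.
    """
    return snapshots[(version, profile)]


rows = [
    {"model": "claude-code-sonnet", "benchmark_version": "0.1.0",
     "benchmark_profile": "core-openscad", "blind_mean": 4.00},
    {"model": "codex-gpt-5.5", "benchmark_version": "0.1.0",
     "benchmark_profile": "core-openscad", "blind_mean": 4.00},
    # Placeholder row for a future version; no score is implied.
    {"model": "example-model", "benchmark_version": "0.2.0",
     "benchmark_profile": "core-openscad", "blind_mean": None},
]

snapshots = group_snapshots(rows)
board = select_board(snapshots, "0.1.0", "core-openscad")
print([row["model"] for row in board])
```

Keying on the pair rather than filtering a shared table means a request for a version with no snapshot fails loudly instead of silently returning a partial mix.
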

This is a v0.1 alpha covering four narrow families and a
handful of dev seeds per cell. The counts are real and reproducible, but
they describe these tasks under these harnesses — not general engineering
skill, and not anything that transfers across versions or profiles. A
4.00 means "cleared this small ladder", a bimodal histogram means "these
tasks reward an all-or-nothing solve", and both are invitations to make
the benchmark harder, not verdicts on the models. Figures reflect the
leaderboard at benchmark_version 0.1.0; rebuild from
results/ to refresh them.