Writing & methodology

Notes on what MakerBench measures, why hardware-agent benchmarks are hard, how grading actually works, and what the current results do — and do not — prove.


Articles

Posts

Long-form companions to the docs in the repository. Each post is plain, factual, and links back to the source contract it summarizes.

What MakerBench measures (and what it doesn't)

methodology

post 01

The four-level grading ladder, the blind vs. perception split, parameter-derived (un-memorizable) pass criteria, honest token/cost telemetry, agent self-verification as an audit signal, and the limits of an alpha leaderboard.

grading anti-cheat versioning

How models fail: findings from v0.1.0

methodology

post 02

The design rationale, the headline numbers, and the one pattern that matters: failure is bimodal. Across the blind board, models either reach full Level-4 DFM or collapse to a structural-only mesh, leaving the geometric and physical-constraint rungs nearly empty. Grounded in the live v0.1.0 leaderboard.

findings how-models-fail versioning

Failure gallery: what a 2/4 actually looks like

methodology

gallery

Representative failure artifacts (welded lids, flipped axes, non-manifold meshes, impossible cut paths) that show why a math-graded hardware benchmark is legible, along with the safety workflow that keeps oracle answers out of every example.

failures perception launch