Dwarf-Bench is a small LLM benchmark that tests how well models recall obscure, specific facts about dwarves across fantasy media — Tolkien, Warhammer, Dungeons & Dragons, Discworld, Dragon Age, World of Warcraft, and friends.
Each model is prompted with tools, search, and browsing disabled, and is told to answer from internal knowledge only or to reply "I don't know". Free-form responses are then graded against a gold-standard answer key by an LLM-as-judge using a 1.0 / 0.5 / 0.0 rubric (fully correct / partially correct / wrong-or-unknown).
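As a rough illustration, the judging step might look something like the sketch below, assuming an OpenAI-compatible client; `grade_response`, `JUDGE_PROMPT`, and the judge model choice are hypothetical stand-ins, not the repo's actual names.

```python
# Illustrative sketch of the LLM-as-judge grading step. Function and
# prompt names are hypothetical, not Dwarf-Bench's actual API.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a quiz answer against a gold answer.
Score 1.0 if fully correct, 0.5 if partially correct, 0.0 if wrong
or the model answered "I don't know".
Question: {question}
Gold answer: {gold}
Model answer: {answer}
Reply with JSON only: {{"score": <1.0|0.5|0.0>, "reason": "<one sentence>"}}"""

def grade_response(question: str, gold: str, answer: str) -> float:
    """Ask a judge model to score one free-form answer on the 1.0/0.5/0.0 rubric."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model is an assumption, not the benchmark's choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, gold=gold, answer=answer)}],
        temperature=0,
    )
    # Assumes the judge complies and returns bare JSON.
    return float(json.loads(completion.choices[0].message.content)["score"])
```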
It's a half-joke benchmark, but it's also a real one — mostly an excuse to build an eval pipeline from scratch (provider calls, async runner, judge, reporting) without leaning on a heavy framework.
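For flavor, here is a minimal sketch of what the "provider calls + async runner" half could look like, again with hypothetical names (`ask`, `run`, the exact system prompt wording) and an assumed OpenAI-compatible async client.

```python
# Hypothetical shape of the async runner: fan questions out to a provider
# with bounded concurrency, collect answers for the judge. Illustrative only.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

SYSTEM = ('Answer from internal knowledge only. No tools, search, or browsing. '
          'If unsure, reply exactly "I don\'t know".')

async def ask(question: str, model: str, sem: asyncio.Semaphore) -> str:
    """Query one model for one question, with a concurrency cap."""
    async with sem:
        resp = await client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": question}],
            temperature=0,
        )
        return resp.choices[0].message.content

async def run(questions: list[str], model: str, concurrency: int = 8) -> list[str]:
    """Run the whole question set against one model concurrently."""
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(ask(q, model, sem) for q in questions))
```

Usage would be along the lines of `answers = asyncio.run(run(questions, "some-model"))`, with the answers then handed to the judge.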
## Latest results
| Model | Accuracy | Correct | Partial | Wrong | N |
|---|---|---|---|---|---|

*Leaderboard rows are populated at runtime from the JSON linked under Source below.*
## Source
Code, dataset, judge prompt, and the leaderboard JSON this page consumes all live on GitHub: github.com/nocount/dwarf-bench.
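For reference, a hedged guess at the reporting step: collapsing per-question judge scores into one leaderboard entry whose fields mirror the table columns above. The field names and the `leaderboard_row` helper are assumptions, not the actual JSON schema.

```python
# Hypothetical reporting step: aggregate 1.0/0.5/0.0 judge scores into one
# leaderboard row. Field names mirror the results table but are assumed.
import json

def leaderboard_row(model: str, scores: list[float]) -> dict:
    """Collapse per-question scores into a single leaderboard entry."""
    return {
        "model": model,
        "accuracy": round(sum(scores) / len(scores), 3),
        "correct": scores.count(1.0),
        "partial": scores.count(0.5),
        "wrong": scores.count(0.0),
        "n": len(scores),
    }

print(json.dumps(leaderboard_row("example-model", [1.0, 0.5, 0.0, 1.0]), indent=2))
```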