From Benchmarks to Cognitive Landscapes: Building ReasonScape
When I last wrote, I was neck-deep in the beautiful chaos of trying to make AI models think efficiently. I had Ruminate steering reasoning processes, excess accuracy corrections separating signal from noise, and a power bill that was starting to develop opinions about my life choices.
That was six months ago. Since then, what started as "Can AI Think?" has evolved into something I never quite intended to build: a complete framework for understanding how language models process information at a fundamental level.
The Problem with Benchmarks (Again)
Remember how my coding tests got contaminated? Turns out the reasoning tests had an even more insidious problem: the format itself was lying to me.
Every benchmark I'd seen treated models like magic boxes: prompt goes in, answer comes out, we count the score. But models aren't magic boxes - they're information processing systems. They have architectures, they have cognitive patterns, they have input processing, they have systematic ways they succeed and fail.
I wasn't just measuring the wrong thing - I was measuring things wrong.
Enter the Third Dimension
Traditional benchmarks give you a number: "This model scored 73.2% on task X." Cool. What does that actually mean? Is the model consistently mediocre at everything? Does it ace easy cases and crater on hard ones? Does adding one more item to a list cause a graceful degradation or a catastrophic collapse?
You can't tell. Because that 73.2% is averaging over every possible variation of the problem, smoothing out all the interesting cognitive architecture details into a single meaningless number.
So I did something weird: I made every test parametric.
Instead of "can you sort a list?" I ask "can you sort a list of length N with case mutations C?" Instead of "track these state changes," it's "track D state changes across L objects with I interference instructions." Every test became a function across multiple difficulty axes, generating deterministic test cases from coordinate-based seeds.
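To make that concrete, here's a minimal sketch of the idea in Python - the function names, parameters, and toy "sort" generator are illustrative stand-ins, not ReasonScape's actual API. Each difficulty coordinate plus a sample index hashes to a deterministic seed, and the generator builds the test case entirely from that seed:

```python
import hashlib
import random

def seed_for(domain: str, coords: dict, index: int) -> int:
    """Derive a deterministic seed from a difficulty coordinate and a sample index."""
    key = f"{domain}|{sorted(coords.items())}|{index}"
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % (2**32)

def generate_sort_case(coords: dict, index: int) -> dict:
    """Toy 'sort' generator: list of length N with C case mutations."""
    rng = random.Random(seed_for("sort", coords, index))
    words = ["".join(rng.choices("abcdefghij", k=5)) for _ in range(coords["length"])]
    for _ in range(coords["case_mutations"]):           # apply C random case mutations
        i = rng.randrange(len(words))
        words[i] = words[i].upper()
    return {"prompt": "Sort these words alphabetically: " + ", ".join(words),
            "answer": sorted(words, key=str.lower)}

# Same coordinates + same index always produce the exact same test case:
case = generate_sort_case({"length": 8, "case_mutations": 3}, index=0)
```

The key property is that test content is a pure function of its coordinates, which is what makes everything downstream possible.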
Now instead of a single score, I get a difficulty landscape - a 3D surface showing exactly how performance changes as you vary the parameters. And these surfaces? They're beautiful. And by beautiful, I mean they reveal cognitive architecture patterns that are completely invisible to traditional benchmarks.
There's quite a lot going on in these plots, which can be made at will via the Explorer. It's like going from measuring "how tall is this mountain?" to having a full topographic map - you can suddenly see the terrain unfolding in front of you.
Truncations (where the model failed to complete its answer within its 8K-token thinking budget) are indicated by the red stalactites. The "Competence Zone" (the region of difficulty where the model successfully performs the task) is indicated by the green spheres.
Some models have sharp cliffs where adding one more nested parenthesis causes accuracy to fall off a ledge. Others show smooth degradation curves. Some excel at short, deep problems. Others prefer long, shallow ones. Some models are agnostic to boolean format representation, while others show clear preferences.
The Spectral Analysis Plot Twist
Here's where things got really weird.
I'm generating millions of test cases, and I need to verify my generators aren't introducing systematic biases. Standard practice is to check statistical properties of the outputs - means, distributions, that kind of thing. But these are reasoning tasks. The content matters more than simple statistics.
How can we be sure that the thousands of tests we generate are really 1) the same when they should be the same, and 2) more difficult when they should be more difficult?
Then I remembered: problems are just token sequences! And token sequences are just signals, right? Signals can be analyzed in the frequency domain!
What if I just... ran an FFT on the post-template, post-tokenized streams?
It worked absurdly well.
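A minimal sketch of the idea, assuming you already have the post-template, post-tokenization ID stream in hand (the normalization choices here are mine, not necessarily what ReasonScape does):

```python
import numpy as np

def spectrum(token_ids: list[int]) -> np.ndarray:
    """Treat a post-template, post-tokenization ID stream as a 1D signal
    and return its normalized one-sided magnitude spectrum."""
    x = np.asarray(token_ids, dtype=np.float64)
    x = x - x.mean()                            # remove the DC offset so raw ID magnitude doesn't dominate
    mag = np.abs(np.fft.rfft(x))                # one-sided FFT magnitude
    return mag / (np.linalg.norm(mag) + 1e-12)  # normalize so different prompt lengths stay comparable

# Identical problems should produce near-identical spectra;
# "harder" variants should shift energy in systematic, visible ways.
```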
Different types of reasoning problems have different spectral signatures. Mathematical expressions show high-frequency components from repeated operator symbols, but suffer compression from whitespace confounding. Natural language problems have characteristic low-frequency patterns from grammatical structure that get amplified as problem depth grows.
Even better: different model architectures demonstrate unique spectral characteristics in their representations of otherwise identical problems. The interaction of chat template and tokenization produces architectural differences in the frequency domain, which manifest as the positions of peaks, the presence of nulls, the response to whitespace randomization, and a host of other factors across the test suite.
Cognitive architecture analysis through signal processing was unplanned; it emerged from following the data.
The Statistical Rigor Rabbit Hole
With parametric tests generating thousands of samples across multiple difficulty dimensions, statistical rigor stopped being optional and became existential.
I needed:
- Excess accuracy correction that works across binary, multiple-choice, and write-in questions
- Wilson confidence intervals with truncation awareness (models that max out at 100% or bottom out at random guessing)
- Dynamic sampling that keeps generating tests until hitting significance thresholds or safety limits
- Hierarchical sampling where smaller samples are perfect subsets of larger ones (for efficient scaling)
The last one is particularly clever: every test is generated from coordinates in the difficulty manifold. Want more samples? Just extend the sequence. Want to downsample? Take the first N cases. Need to compare a 1000-sample run from last month with today's 5000-sample run? They share the first 1000 tests exactly.
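Building on the toy generator sketched earlier, the prefix property falls out of simply making the sample index part of the seed:

```python
def sample_batch(coords: dict, n: int) -> list[dict]:
    """Samples are indexed 0..n-1, so a smaller run is always a strict prefix of a larger one."""
    return [generate_sort_case(coords, index=i) for i in range(n)]

small = sample_batch({"length": 8, "case_mutations": 3}, 1000)
large = sample_batch({"length": 8, "case_mutations": 3}, 5000)
assert large[:1000] == small   # last month's 1000 cached responses still apply verbatim
```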
This means I can cache responses forever, merge datasets across time, and never waste compute re-running identical tests. In a world where my reasoning models are burning through 5000+ tokens per test case, this cache-everything approach has saved me approximately [checks power bill] one entire RTX 4090's worth of electricity costs.
The methodology documentation covers these points in more detail.
M12X: The Evaluation That Emerged
What started as "test some reasoning tasks" metastasized into M12X: twelve cognitive domains, three difficulty degrees, configurable sampling densities, and adjustable precision levels.
The domains span the cognitive reasoning waterfront:
- Math and logic (arithmetic, boolean)
- Language and selective attention (objects, letters, movies)
- Spatial and temporal reasoning (shapes, dates, cars)
- Structural parsing (brackets, sort, sequence)
- State tracking and planning (shuffle, cars)
Each domain has carefully designed parametric generators that create difficulty manifolds - continuous surfaces where you can smoothly vary the challenge level and watch exactly where and how models break.
The flexibility is what makes it powerful: You can run a quick corner-sampling pass to find the interesting regions, then do a high-density sweep of just those areas. Or run low-precision scans across everything, then boost precision only on domains or difficulty dimensions that show statistical uncertainty.
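In pseudocode-ish Python, that workflow is roughly the following - the helper names, sample counts, and threshold are invented for illustration; the real tooling drives this through its own configuration:

```python
import math

def wilson_halfwidth(correct: int, total: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval: a proxy for how uncertain a cell still is."""
    if total == 0:
        return 1.0
    p = correct / total
    denom = 1 + z**2 / total
    return (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))

def refine(cells, run_tests, coarse_n=50, boost_n=400, target=0.05):
    """Coarse pass over every difficulty cell, then boost sampling only where the CI is still wide."""
    results = {c: run_tests(c, coarse_n) for c in cells}        # each result is (correct, total)
    for c in [c for c, (k, n) in results.items() if wilson_halfwidth(k, n) > target]:
        k, n = run_tests(c, boost_n)                            # spend compute only where it matters
        results[c] = (results[c][0] + k, results[c][1] + n)
    return results
```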
It's not so much an evaluation system as an exploration framework for probing both the depth and breadth of reasoning models' capabilities.
The Dataset That Ate My Life
As of this writing, the M12X dataset contains:
- 41 unique models (and counting)
- 2+ million individual tests
- 5.5+ billion tokens of model responses
- Complete difficulty landscapes across all 12 domains
- Spectral analysis data for every test case
- Full response caching for zero-cost re-analysis
All data was generated in my basement on 4x very tired RTX 3090s over the span of roughly two months.
The Leaderboard tool offers a bird's-eye view of the results:
The smallest model evaluation (a tiny granite variant) burned through 7 million tokens. The largest (Ring Mini 2.0 at hard difficulty with thinking enabled) consumed 85 million tokens - that's longer than the entire Game of Thrones book series, except it's a language model having an existential crisis about bracket matching.
The Explorer tool lets you fly through these landscapes in real-time 3D, comparing models, analyzing projections, examining spectral patterns. The Leaderboard shows exactly where each model's cognitive architecture excels and where it collapses, with truncation indicators showing guessing inflation and confidence intervals showing statistical certainty.
The Tools That Built Themselves
Along the way, I accidentally built an entire toolkit:
The parametric test generators became a framework where adding a new domain is just defining difficulty parameters and a generation function. Want to test spatial rotation? Temporal arithmetic? Semantic categorization? Just map it to coordinates and plug it in.
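In spirit - and with names that are purely illustrative, not the actual ReasonScape interfaces - plugging in a new domain looks something like this:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Domain:
    name: str
    axes: dict[str, range]                  # the difficulty parameters and their ranges
    generate: Callable[[dict, int], dict]   # (coords, index) -> {"prompt": ..., "answer": ...}

DOMAINS: dict[str, Domain] = {}

def register(domain: Domain) -> None:
    DOMAINS[domain.name] = domain

def gen_rotation(coords: dict, index: int) -> dict:
    ...  # build prompt/answer deterministically from (coords, index), as in the earlier sketch

# A hypothetical spatial-rotation domain: R rotations with D distractor instructions.
register(Domain(
    name="rotation",
    axes={"rotations": range(1, 9), "distractors": range(0, 5)},
    generate=gen_rotation,
))
```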
The analysis pipeline evolved into a full statistical framework with excess accuracy correction, confidence intervals, hierarchical sampling, and spectral analysis. It handles truncation, corrects for guessing, and produces publication-ready statistical rigor by default.
The visualization tools grew from "show me some numbers" to interactive 3D explorers with FFT analysis, multi-model comparison, and projection studies. The leaderboard displays performance heatmaps where color shows accuracy and deepening darkness indicates truncation collapsing toward random guessing. At a glance you can see exactly where models break.
None of this was planned. Each tool solved one problem, which revealed the next problem, which required the next tool.
What Actually Matters (Besides My Power Bill)
After analyzing billions of tokens across millions of tests, some patterns emerged:
1. Format effects are huge and systematic. The same reasoning capability can look 30 percentage points different between multiple-choice and write-in formats. If you're not correcting for excess accuracy, you're not measuring reasoning - you're measuring the model's ability to pattern-match answer formats (see the sketch after this list).
2. Difficulty is multi-dimensional. "Harder" isn't a single axis. Models have completely different performance profiles depending on whether you scale length vs depth vs interference vs complexity. The landscapes reveal this; single scores hide it.
3. Cognitive architectures are real. You can see systematic patterns in how models process information. It's not just "better" or "worse" - it's fundamentally different processing strategies that succeed and fail in predictable ways.
4. Token efficiency matters more than raw accuracy. A model that scores 85% while using 2000 tokens per response is often more useful than one scoring 90% at 6000 tokens. Especially when your power company starts sending concerned emails.
5. Spectral signatures are cognitive fingerprints. The frequency domain reveals architectural differences in problem representation, and those differences are often observable at the outputs: if the input poorly represents the signal, the processing will struggle.
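Point 1 is why every number in the dataset is chance-corrected before anything else happens. The standard correction is simple; this is my own minimal rendering of it, and ReasonScape's exact handling of write-in formats may differ:

```python
def excess_accuracy(observed: float, chance: float) -> float:
    """Rescale accuracy so that random guessing scores 0.0 and perfection scores 1.0.
    chance is 0.5 for binary, 0.25 for 4-way multiple choice, ~0.0 for write-in."""
    if chance >= 1.0:
        return 0.0
    return max(0.0, (observed - chance) / (1.0 - chance))

excess_accuracy(0.73, 0.25)   # 0.64 - a 73% raw score means much less on 4-way multiple choice...
excess_accuracy(0.73, 0.00)   # 0.73 - ...than the same 73% does on a write-in question
```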
Where This Goes (Besides Bankrupt From Electricity)
ReasonScape is live at reasonscape.com with:
- Full documentation of the methodology
- Interactive Explorer for 3D landscape navigation
- Leaderboards with multiple difficulty tiers and task-level visualization
- Complete M12X dataset (5.5B tokens) for download
- All the tools, methodology, and data are open source, licensed under MIT
This is where "Can AI Code?" ultimately led: not to better benchmarks, but to a framework for understanding information processing in language models through the lens of parametric evaluation, statistical rigor, and spectral analysis.
The question evolved from "Can AI think?" to "How does AI think?" to "Can we map the cognitive architecture through difficulty landscapes and spectral patterns?"
Turns out the answer is yes. Also turns out mapping cognitive architectures requires approximately one Death Star's worth of GPU compute and a very understanding spouse. At least my power bill is now featured in local news articles about residential electricity consumption anomalies. That's basically peer review, right?
Follow the chaos at github.com/the-crypt-keeper/reasonscape, where I continue asking questions I didn't know I had about problems I didn't know existed, using tools I never planned to build.