Paper explainer

Test-Time Scaling is not “make the model bigger.” It’s “spend more thought at inference.”

The AutoTTS paper is easier to understand once you see the field’s basic move: keep the base model fixed, then use extra inference-time compute to sample, extend, check, prune, or vote over multiple reasoning attempts.

Part 1

The background: what test-time scaling is

Training-time scaling is the familiar story: use more data, more parameters, more pretraining compute, and get a stronger model. Test-time scaling is different. You keep the model fixed, but when a hard question arrives, you spend more computation while answering it.

That can mean “ask the model 64 times and vote.” It can mean “let one chain of thought run longer.” It can mean “sample many branches, inspect partial answers, throw away the bad branches, then keep pushing the promising branches.” Same model. More inference-time search.

The simplest mental model is a human doing math. If the problem is easy, you answer quickly. If it’s hard, you try multiple approaches, check intermediate results, abandon dead ends, and only then give the final answer. Test-time scaling asks whether LLMs can get the same kind of benefit from spending more thought.

Core distinction

Training-time scaling: make the model stronger before deployment.
Test-time scaling: use the deployed model more intelligently at inference.

This is why TTS matters commercially: it turns inference compute into a quality knob.

Part 2

Most TTS methods are different ways to spend the same budget

The paper’s useful simplification is the width–depth view. Width is how many independent reasoning branches you start. Depth is how far you let each branch go. A lot of named methods are just different paths through that space.

Self-consistency

Sample many full answers, then majority-vote. It’s blunt but surprisingly strong, especially on math. The obvious downside is waste: all 64 branches run to completion even if the answer is already clear.
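A minimal sketch of the voting step, assuming a hypothetical `sample_answer` callable that runs one full chain of thought and returns a final answer string:

```python
from collections import Counter

def self_consistency(sample_answer, n=64):
    """Sample n full answers and majority-vote.

    `sample_answer` is a stand-in for one complete model run;
    every branch runs to completion regardless of early consensus.
    """
    answers = [sample_answer() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n  # answer plus its vote share
```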

Adaptive stopping

Sample answers until confidence is high enough, then stop. This saves tokens when consensus appears early, but it can stop too early when the first few answers agree for the wrong reason.
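The same sketch with an early exit. `sample_answer` again stands in for one full model run; the threshold and minimum-sample guard are illustrative values, not the paper's:

```python
from collections import Counter

def adaptive_stopping(sample_answer, max_samples=64,
                      threshold=0.8, min_samples=4):
    """Keep sampling until the leading answer's vote share clears
    a threshold. Saves tokens when consensus appears early, but can
    lock onto an early wrong consensus -- the failure mode noted above.
    """
    votes = Counter()
    for n in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        answer, count = votes.most_common(1)[0]
        if n >= min_samples and count / n >= threshold:
            return answer, n  # stopped early
    return votes.most_common(1)[0][0], max_samples
```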

Parallel probing

Run branches in chunks, probe intermediate answers, and prune or continue based on the partial evidence. This is more like search than voting, because you’re steering compute as evidence arrives.
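A rough sketch of the chunk-probe-prune loop, with hypothetical `extend` and `probe` callables standing in for the inference backend:

```python
from collections import Counter

def parallel_probe(branches, extend, probe, budget):
    """Extend every live branch by one chunk, probe intermediate
    answers, and prune branches that disagree with the plurality.
    Compute is steered as partial evidence arrives."""
    while budget >= len(branches) and len(branches) > 1:
        for b in branches:
            extend(b)                      # advance one reasoning chunk
        budget -= len(branches)
        answers = {id(b): probe(b) for b in branches}
        leader = Counter(answers.values()).most_common(1)[0][0]
        survivors = [b for b in branches if answers[id(b)] == leader]
        branches = survivors or branches   # never prune to zero
    return branches
```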

Controller policies

A controller decides what to do next: branch, continue, probe, prune, or answer. That controller is where most of the interesting TTS design lives.

Part 3

What AutoTTS changes: don’t hand-design the controller. Discover it.

The paper’s claim is not that it invented “more thinking at test time.” The claim is that TTS strategy design itself should become an agentic search problem. Humans define the environment. A coding agent writes candidate controller programs. The environment evaluates those programs cheaply, gives feedback, and the agent edits the controller again.

That’s the real move. The authors shift the human work from “invent another heuristic” to “build a replay environment where good heuristics can be found.” This is why I described it as adjacent to continuous learning but not the same thing: the base model is fixed. The thing improving is the inference controller around the model.

In their setup, each candidate controller is evaluated against cached reasoning traces. That means the search loop doesn’t need to keep calling Qwen for every candidate. It replays already-collected branches, probes, and intermediate answers, then scores the controller’s accuracy–cost trade-off.

1. Collect reasoning traces once
2. Freeze a replay environment
3. Claude Code proposes controller.py
4. Replay evaluates accuracy and cost
5. Trace feedback guides the next edit
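In code, the outer loop might look roughly like this. `propose_edit` stands in for the coding agent and `replay_eval` for the frozen trace environment; the accuracy-cost score is an invented placeholder, not the paper's objective:

```python
def discovery_loop(propose_edit, replay_eval, rounds=50):
    """Outer agentic search loop, heavily simplified."""
    best, best_score = None, float("-inf")
    feedback = None
    for _ in range(rounds):
        controller = propose_edit(best, feedback)   # agent edits controller.py
        accuracy, tokens = replay_eval(controller)  # cheap, no live model calls
        score = accuracy - 1e-6 * tokens            # hypothetical trade-off
        feedback = {"accuracy": accuracy, "tokens": tokens, "score": score}
        if score > best_score:
            best, best_score = controller, score
    return best
```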

The mechanics

The AutoTTS environment is basically an inference-time MDP

At any step, the controller sees the active branches, how deep they are, which probe answers have been revealed, and how much budget remains. Then it chooses one action.

BRANCH: open a new reasoning attempt
CONTINUE(i): extend branch i by one interval
PROBE(i): read branch i's current intermediate answer
PRUNE(i): drop branch i from the active set
ANSWER: stop and aggregate the evidence
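A toy rendering of that action space and state, to make the MDP concrete. The stub controller below just deepens the shallowest branch; a real controller.py is the searched program, not this:

```python
from dataclasses import dataclass
from enum import Enum, auto

class ActionType(Enum):
    BRANCH = auto()    # open a new reasoning attempt
    CONTINUE = auto()  # extend branch i by one interval
    PROBE = auto()     # read branch i's current intermediate answer
    PRUNE = auto()     # drop branch i from the active set
    ANSWER = auto()    # stop and aggregate the evidence

@dataclass
class State:
    depths: list[int]         # how deep each active branch is
    probed: dict[int, str]    # revealed intermediate answers, by branch
    budget_remaining: int     # intervals (or tokens) left to spend

def controller(state: State) -> tuple[ActionType, int | None]:
    """One decision step of a deliberately dumb baseline policy."""
    if state.budget_remaining <= 0:
        return ActionType.ANSWER, None
    if not state.depths:
        return ActionType.BRANCH, None
    i = min(range(len(state.depths)), key=lambda j: state.depths[j])
    return ActionType.CONTINUE, i
```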

Why replay matters

If every candidate controller had to call the base model live, discovery would be expensive and noisy. Replay makes controller evaluation cheap, deterministic, and frequent. This is the same pattern I’d watch across agent research: pay once to build a sandbox, then let agents search inside the sandbox.
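A sketch of what replay evaluation could look like. The `make_env` wrapper and trace structure are my assumptions, but the key property is visible: the loop touches only cached text, never the live model:

```python
def replay_eval(controller, cached_traces, budget):
    """Score a controller against frozen traces -- no live model calls.

    Each trace is assumed to hold the pre-collected branches, chunked
    continuations, probe answers, and the gold answer, so the
    controller can be replayed deterministically inside the cache.
    """
    correct, tokens_spent = 0, 0
    for trace in cached_traces:
        env = trace.make_env(budget)   # assumed replay wrapper
        state = env.reset()
        while not env.done:
            action = controller(state)
            state = env.step(action)   # serves cached branch text
        tokens_spent += env.tokens_used
        correct += int(env.final_answer == trace.gold_answer)
    return correct / len(cached_traces), tokens_spent
```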

The discovered strategy

AutoTTS finds the Confidence Momentum Controller

The best discovered controller, CMC, has a very human-sounding rule: don’t stop just because confidence is high right now. Stop when confidence is high and the confidence trend isn’t getting worse.

That matters because TTS systems can get fooled by early agreement. A few branches may converge on the same wrong answer. CMC smooths confidence with an exponential moving average, then uses the trend to decide whether to stop, widen, deepen, or prune.

It also couples width and depth. If existing branches are producing better evidence, it doesn’t need to spawn more branches. If progress stalls or reverses, it widens. That feedback loop is the important discovered structure.

CMC’s four ideas

  1. Trend-based stopping: stop on stable confidence, not a spike.
  2. Coupled width–depth control: widen when deepening stops helping.
  3. Alignment-aware depth: spend more on branches matching the current winner.
  4. Conservative pruning: abandon branches only after persistent disagreement.
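Here's idea 1 in miniature. The smoothing constant and threshold are illustrative, not the paper's values:

```python
def should_stop(confidences, alpha=0.3, threshold=0.9):
    """Trend-based stopping: smooth per-step confidence with an
    exponential moving average, then stop only when the smoothed
    value is high AND the trend is not decaying."""
    ema = confidences[0]
    prev_ema = ema
    for c in confidences[1:]:
        prev_ema, ema = ema, alpha * c + (1 - alpha) * ema
    trend = ema - prev_ema
    return ema >= threshold and trend >= 0
```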

Intuition

β is the paper’s single “spend more or spend less” knob

AutoTTS forces each controller to expose one scalar parameter, β. Lower β pushes cheaper behavior. Higher β pushes more expensive, accuracy-seeking behavior. This is partly an anti-overfitting trick: don’t let the coding agent invent ten brittle thresholds that only work on the search set.

Example settings at one balanced β value:
Initial branches: 5
Max branch ceiling: 34
Stop confidence threshold: 0.91
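One plausible way a controller could expose β, invented purely for illustration; the paper only requires that lower β means cheaper behavior and higher β means more accuracy-seeking behavior:

```python
def configure(beta):
    """Map the single knob beta to concrete controller settings.

    The mapping below is a made-up example (it reproduces the
    balanced settings above at beta = 1.0), not the paper's formula.
    """
    return {
        "initial_branches": max(1, round(5 * beta)),
        "max_branches": max(2, round(34 * beta)),
        "stop_confidence": min(0.99, 0.85 + 0.06 * beta),
    }
```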

Results, in plain English

The headline is a better accuracy–token frontier, not magic

Qwen3-8B / AIME24

At β = 1.0, AutoTTS reports 85.8% accuracy with 467.4k tokens, versus 81.5% and 730.8k tokens for Parallel-Probe.

Held-out average

At β=0.5, the project page reports 45.3 average held-out accuracy across models versus 45.2 for SC@64, using roughly 69.5% fewer tokens.

Generalization

The discovered policy is searched on AIME24, then evaluated on held-out AIME25 and HMMT25 across Qwen3 model sizes.

My read: this is a strong systems paper, but not a “model improves itself” paper in the weight-update sense. It’s an agent discovering a better inference-time algorithm around a fixed model.

What to remember

The durable idea is environment-driven algorithm discovery

AutoTTS is interesting because it makes a small version of a larger pattern concrete. If you can build a cheap, faithful environment with good feedback, a coding agent can search over programs that would be annoying for a human to design manually.

For TTS, the searched program is an inference controller. For coding agents, it might be a harness. For memory systems, it might be a retrieval policy. The field is going to keep finding places where the “algorithm” around the model is easier to improve than the model itself.