Paper explainer
The AutoTTS paper is easier to understand once you see the field’s basic move: keep the base model fixed, then use extra inference-time compute to sample, extend, check, prune, or vote over multiple reasoning attempts.
Part 1
Training-time scaling is the familiar story: use more data, more parameters, more pretraining compute, and get a stronger model. Test-time scaling is different. You keep the model fixed, but when a hard question arrives, you spend more computation while answering it.
That can mean “ask the model 64 times and vote.” It can mean “let one chain of thought run longer.” It can mean “sample many branches, inspect partial answers, throw away the bad branches, then keep pushing the promising branches.” Same model. More inference-time search.
The simplest mental model is a human doing math. If the problem is easy, you answer quickly. If it’s hard, you try multiple approaches, check intermediate results, abandon dead ends, and only then give the final answer. Test-time scaling asks whether LLMs can get the same kind of benefit from spending more thought.
This is why TTS matters commercially: it turns inference compute into a quality knob.
Part 2
The paper’s useful simplification is the width–depth view. Width is how many independent reasoning branches you start. Depth is how far you let each branch go. A lot of named methods are just different paths through that space.
Self-consistency
Sample many full answers, then majority-vote. It’s blunt but surprisingly strong, especially on math. The obvious downside is waste: all 64 branches run to completion even if the answer is already clear.
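The sample-and-vote move is a few lines of code. A minimal sketch, where `sample_answer` is a stand-in for one full model rollout that returns a final answer string (the function name and signature are illustrative, not from the paper):

```python
from collections import Counter

def self_consistency(sample_answer, n=64):
    """Sample n complete answers and return the majority vote.

    Every branch runs to completion regardless of how early
    consensus appears -- that is the waste noted above.
    """
    answers = [sample_answer() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n  # winning answer plus its vote share
```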
Sample answers until confidence is high enough, then stop. This saves tokens when consensus appears early, but it can stop too early when the first few answers agree for the wrong reason.
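The early-stopping variant only changes when sampling ends. A sketch under assumed parameter values (the threshold, minimum sample count, and names are illustrative):

```python
from collections import Counter

def adaptive_sampling(sample_answer, max_n=64, threshold=0.9, min_n=4):
    """Stop sampling once the leading answer's vote share clears
    a confidence threshold. Saves tokens when consensus is early,
    but can lock onto an early wrong consensus."""
    answers = []
    for _ in range(max_n):
        answers.append(sample_answer())
        if len(answers) >= min_n:
            top, count = Counter(answers).most_common(1)[0]
            if count / len(answers) >= threshold:
                return top, len(answers)  # early consensus reached
    return Counter(answers).most_common(1)[0][0], len(answers)
```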
Run branches in chunks, probe intermediate answers, and prune or continue based on the partial evidence. This is more like search than voting, because you’re steering compute as evidence arrives.
A controller decides what to do next: branch, continue, probe, prune, or answer. That controller is where most of the interesting TTS design lives.
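The controller framing can be made concrete as a policy over an explicit state. Everything below is illustrative scaffolding, not the paper's interface: the action set matches the verbs above, while the state fields and the baseline policy are my own toy choices.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    BRANCH = "branch"      # start a new reasoning branch (width)
    CONTINUE = "continue"  # extend an existing branch (depth)
    PROBE = "probe"        # reveal an intermediate answer
    PRUNE = "prune"        # drop the weakest branch
    ANSWER = "answer"      # commit to the current best answer

@dataclass
class ControllerState:
    branch_depths: list    # tokens spent per active branch
    probe_answers: list    # intermediate answers revealed so far
    budget_left: int       # remaining token budget

def naive_controller(state: ControllerState) -> Action:
    """A toy baseline policy: widen first, then deepen.

    The paper searches over programs shaped like this function.
    """
    if state.budget_left <= 0:
        return Action.ANSWER
    if len(state.branch_depths) < 8:
        return Action.BRANCH
    return Action.CONTINUE
```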
Part 3
The paper’s claim is not that it invented “more thinking at test time.” The claim is that TTS strategy design itself should become an agentic search problem. Humans define the environment. A coding agent writes candidate controller programs. The environment evaluates those programs cheaply, gives feedback, and the agent edits the controller again.
That’s the real move. The authors shift the human work from “invent another heuristic” to “build a replay environment where good heuristics can be found.” This is why I described it as adjacent to continual learning but not the same thing: the base model is fixed. The thing improving is the inference controller around the model.
In their setup, each candidate controller is evaluated against cached reasoning traces. That means the search loop doesn’t need to keep calling Qwen for every candidate. It replays already-collected branches, probes, and intermediate answers, then scores the controller’s accuracy–cost trade-off.
The mechanics
At any step, the controller sees the active branches, how deep they are, which probe answers have been revealed, and how much budget remains. Then it chooses one action.
If every candidate controller had to call the base model live, discovery would be expensive and noisy. Replay makes controller evaluation cheap, deterministic, and frequent. This is the same pattern I’d watch across agent research: pay once to build a sandbox, then let agents search inside the sandbox.
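The replay evaluation loop is simple to sketch. Assumptions are flagged in the comments: the `run_on` replay interface, the trace format, and the scalarized accuracy-minus-cost score are all my illustrative stand-ins, not the paper's exact metric.

```python
def evaluate_controller(controller, cached_traces, alpha=1.0):
    """Score a candidate controller against pre-collected traces.

    No live model calls: the controller replays cached branches,
    probes, and intermediate answers. `run_on` returning a final
    answer and a token count is an assumed interface; `alpha`
    weights an assumed accuracy-vs-cost trade-off.
    """
    correct, tokens = 0, 0
    for trace in cached_traces:
        answer, spent = controller.run_on(trace)  # replay, no API calls
        correct += int(answer == trace["gold"])
        tokens += spent
    accuracy = correct / len(cached_traces)
    # Penalize average token spend (in millions) against accuracy.
    return accuracy - alpha * tokens / (len(cached_traces) * 1_000_000)
```

Because this is deterministic and cheap, the agent can score every edit to the controller program immediately, which is what makes the search loop practical.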
The discovered strategy
The best discovered controller, CMC, has a very human-sounding rule: don’t stop just because confidence is high right now. Stop when confidence is high and the confidence trend isn’t getting worse.
That matters because TTS systems can get fooled by early agreement. A few branches may converge on the same wrong answer. CMC smooths confidence with an exponential moving average, then uses the trend to decide whether to stop, widen, deepen, or prune.
It also couples width and depth. If existing branches are producing better evidence, it doesn’t need to spawn more branches. If progress stalls or reverses, it widens. That feedback loop is the important discovered structure.
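The stopping half of that rule fits in a few lines. A sketch under assumed parameter values (threshold and smoothing factor are mine, not CMC's), showing the "high and not getting worse" test on an exponentially smoothed confidence series:

```python
def should_stop(confidences, threshold=0.85, smoothing=0.3):
    """CMC-style stopping rule: stop only when smoothed confidence
    is above threshold AND the smoothed trend is not declining.

    `confidences` is the per-step confidence history; the EMA damps
    the early-agreement spikes that fool naive threshold rules.
    """
    if len(confidences) < 2:
        return False
    ema = confidences[0]
    prev = ema
    for c in confidences[1:]:
        prev = ema
        ema = smoothing * c + (1 - smoothing) * ema
    return ema >= threshold and ema >= prev  # high, and not getting worse
```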
Interactive intuition
AutoTTS forces each controller to expose one scalar parameter, β. Lower β pushes cheaper behavior. Higher β pushes more expensive, accuracy-seeking behavior. This is partly an anti-overfitting trick: don’t let the coding agent invent ten brittle thresholds that only work on the search set.
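To make the single-knob constraint concrete, here is one made-up example of the kind of mapping a controller might expose. The function, its base values, and the scaling rule are all hypothetical; the only thing taken from the paper is that a single scalar β must control the cost profile.

```python
def budget_for(beta, base_branches=8, base_depth=2000):
    """Illustrative only: one scalar beta scales the whole cost profile.

    Lower beta -> fewer branches and tighter depth caps (cheaper);
    higher beta -> wider, deeper, accuracy-seeking search.
    """
    return {
        "max_branches": max(1, round(base_branches * beta)),
        "depth_cap": int(base_depth * beta),
    }
```

Forcing every knob through one scalar keeps the search from memorizing the search set with a pile of independently tuned thresholds.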
Results, in plain English
The discovered policy is searched on AIME24, then evaluated on held-out AIME25 and HMMT25 across Qwen3 model sizes.
At β=1.0, AutoTTS reports 85.8% accuracy with 467.4k tokens, compared with Parallel-Probe at 81.5% and 730.8k tokens.
At β=0.5, the project page reports 45.3 average held-out accuracy across models versus 45.2 for SC@64, using roughly 69.5% fewer tokens.
My read: this is a strong systems paper, but not a “model improves itself” paper in the weight-update sense. It’s an agent discovering a better inference-time algorithm around a fixed model.
What to remember
AutoTTS is interesting because it makes a small version of a larger pattern concrete. If you can build a cheap, faithful environment with good feedback, a coding agent can search over programs that would be annoying for a human to design manually.
For TTS, the searched program is an inference controller. For coding agents, it might be a harness. For memory systems, it might be a retrieval policy. The field is going to keep finding places where the “algorithm” around the model is easier to improve than the model itself.