Researchers say they’ve discovered a new method of ‘scaling up’ AI, but there’s reason to be skeptical

Have researchers discovered a new AI “scaling law”? That’s what some buzz on social media suggests — but experts are skeptical.

AI scaling laws, a bit of an informal concept, describe how the performance of AI models improves as the size of the datasets and computing resources used to train them increases. Until roughly a year ago, scaling up “pre-training” — training ever-larger models on ever-larger datasets — was the dominant law by far, at least in the sense that most frontier AI labs embraced it.

Pre-training hasn’t gone away, but two additional scaling laws, post-training scaling and test-time scaling, have emerged to complement it. Post-training scaling is essentially tuning a model’s behavior, while test-time scaling entails applying more computing to inference — i.e. running models — to drive a form of “reasoning” (see: models like R1).

Google and UC Berkeley researchers recently proposed in a paper what some commentators online have described as a fourth law: “inference-time search.”

Inference-time search has a model generate many possible answers to a query in parallel and then select the “best” of the bunch. The researchers claim it can boost the performance of a year-old model, like Google’s Gemini 1.5 Pro, to a level that surpasses OpenAI’s o1-preview “reasoning” model on science and math benchmarks.
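To make the sample-then-select idea concrete, here is a minimal Python sketch. It is not the researchers' implementation: `toy_model`, `toy_verifier`, and the 30% per-sample success rate are hypothetical stand-ins chosen for illustration, and the paper's approach relies on the model verifying its own answers rather than on a hand-written check.

```python
import random
from typing import Callable, List


def inference_time_search(
    generate: Callable[[random.Random], int],
    score: Callable[[int], float],
    n_samples: int,
    rng: random.Random,
) -> int:
    """Sample n_samples candidate answers and return the highest-scoring one."""
    candidates: List[int] = [generate(rng) for _ in range(n_samples)]
    # max() returns the first candidate with the top score.
    return max(candidates, key=score)


def toy_model(rng: random.Random) -> int:
    """Hypothetical stand-in for a model asked "what is the square root of 144?".

    Assumption for illustration: it is right about 30% of the time and
    otherwise returns an arbitrary integer between 0 and 100.
    """
    return 12 if rng.random() < 0.3 else rng.randint(0, 100)


def toy_verifier(answer: int) -> float:
    """Hypothetical evaluation function: 1.0 if the answer checks out, else 0.0."""
    return 1.0 if answer * answer == 144 else 0.0


rng = random.Random(0)  # fixed seed so the run is repeatable
best = inference_time_search(toy_model, toy_verifier, n_samples=200, rng=rng)
print(best)
```

The sketch only works because `toy_verifier` can check an answer cheaply and reliably. With a scorer that cannot tell good answers from bad ones, `max` would just return the first candidate.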

“[B]y just randomly sampling 200 responses and self-verifying, Gemini 1.5 — an ancient early 2024 model — beats o1-preview and approaches o1,” Eric Zhao, a Google doctorate fellow and one of the paper’s co-authors, wrote in a series of posts on X. “The magic is that self-verification naturally becomes easier at scale! You’d expect that picking out a correct solution becomes harder the larger your pool of solutions is, but the opposite is the case!”

Several experts say that the results aren’t surprising, however, and that inference-time search may not be useful in many scenarios.

Matthew Guzdial, an AI researcher and assistant professor at the University of Alberta, told TechCrunch that the approach works best when there’s a good “evaluation function,” meaning the best answer to a question can be easily ascertained. But most queries aren’t that cut-and-dried.

“[I]f we can’t write code to define what we want, we can’t use [inference-time] search,” he said. “For something like general language interaction, we can’t do this […] It’s generally not a great approach to actually solving most problems.”

Mike Cook, a research fellow at King’s College London specializing in AI, agreed with Guzdial’s assessment, adding that it highlights the gap between “reasoning” in the AI sense of the word and our own thinking processes.

“[Inference-time search] doesn’t ‘elevate the reasoning process’ of the model,” Cook said. “[I]t’s just a way of us working around the limitations of a technology prone to making very confidently supported mistakes […] Intuitively if your model makes a mistake 5% of the time, then checking 200 attempts at the same problem should make those mistakes easier to spot.”
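Cook's intuition can be checked with some back-of-the-envelope arithmetic. The Python sketch below assumes each of the 200 attempts is wrong independently with probability 5%, and it uses a simple majority vote as a stand-in for "spotting" mistakes. Both are simplifying assumptions made here for illustration and are not taken from the paper; `prob_majority_wrong` is a hypothetical helper.

```python
from math import comb


def prob_majority_wrong(n: int, error_rate: float) -> float:
    """Probability that more than half of n independent attempts are wrong.

    Assumes every attempt fails independently with the same error_rate,
    so the number of wrong attempts follows a binomial distribution.
    """
    threshold = n // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n, k) * error_rate**k * (1 - error_rate) ** (n - k)
        for k in range(threshold, n + 1)
    )


n_attempts = 200
error_rate = 0.05

expected_wrong = n_attempts * error_rate  # about 10 wrong answers out of 200
p_outvoted = prob_majority_wrong(n_attempts, error_rate)

print(f"expected wrong answers: {expected_wrong:.0f}")
print(f"chance wrong answers form a majority: {p_outvoted:.3e}")
```

Under these assumptions, the roughly 10 wrong answers are heavily outnumbered by correct ones. The picture changes if a model's mistakes are correlated, for instance when it repeats the same confident error across many samples.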

That inference-time search may have limitations is sure to be unwelcome news to an AI industry looking to scale up model “reasoning” compute-efficiently. As the co-authors of the paper note, reasoning models today can rack up thousands of dollars in computing costs on a single math problem.

It seems the search for new scaling techniques will continue.
