Hugging Face shows how test-time scaling helps small language models punch above their weight

In a new case study, Hugging Face researchers have demonstrated how small language models (SLMs) can be configured to outperform much larger models. Their findings show that a Llama 3 model with 3B parameters can outperform the 70B version of the model on complex math problems.

Hugging Face has fully documented the entire process and provides a roadmap for enterprises that want to create their own customized reasoning models.

The work is inspired by OpenAI o1, which uses extra "thinking" to solve complex math, coding and reasoning problems.

The key idea behind models like o1 is to scale "test-time compute," which effectively means using more compute cycles during inference to test and verify different responses and reasoning paths before producing the final answer. Scaling test-time compute is especially useful when there is not enough memory to run a large model.

Since o1 is a private model and OpenAI has remained tight-lipped about its internal workings, researchers have been speculating about how it works and trying to reverse engineer the process. There are already several open alternatives to o1.

Hugging Face's work is based on a DeepMind study released in August, which investigates the tradeoffs between inference-time and pre-training compute. The study provides comprehensive guidelines on how to balance training and inference compute to get the best results for a fixed budget.

In addition to using extra inference-time compute, the success of the technique hinges on two key components: A reward model that evaluates the SLM's answers, and a search algorithm that optimizes the path it takes to refine its answers.

The simplest way to use test-time scaling is "majority voting," in which the same prompt is sent to the model multiple times and the most frequent answer is chosen. Majority voting can be useful on simple problems, but its gains quickly plateau on complex reasoning tasks or on problems where errors are consistent across generations.
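A minimal sketch of majority voting is below. It assumes a `generate` callable that samples one answer string from the SLM per call; the function name and interface are illustrative placeholders, not Hugging Face's actual API.

```python
from collections import Counter
from typing import Callable


def majority_vote(generate: Callable[[str], str], prompt: str, n_samples: int = 16) -> str:
    """Sample the model n_samples times and return the most frequent answer."""
    answers = [generate(prompt) for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)] for the single most frequent answer
    return Counter(answers).most_common(1)[0][0]
```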

A more advanced reasoning method is "Best-of-N." In this technique, the SLM generates multiple answers, but instead of majority voting, a reward model is used to evaluate the answers and choose the best one. "Weighted Best-of-N," a more nuanced version of this method, factors in consistency, favoring answers that both score highly and occur frequently.
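The sketch below shows one common formulation of both variants, assuming a `generate` sampler and a `score` callable standing in for the reward model (both are illustrative placeholders). Weighted Best-of-N sums the reward scores of identical answers, so an answer that is both high-scoring and frequent wins.

```python
from collections import defaultdict
from typing import Callable, Dict


def best_of_n(generate: Callable[[str], str],
              score: Callable[[str, str], float],
              prompt: str, n: int = 16) -> str:
    """Plain Best-of-N: keep the single answer the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))


def weighted_best_of_n(generate: Callable[[str], str],
                       score: Callable[[str, str], float],
                       prompt: str, n: int = 16) -> str:
    """Weighted Best-of-N: sum reward scores over identical answers,
    so answers that are both high-scoring and frequent are preferred."""
    totals: Dict[str, float] = defaultdict(float)
    for _ in range(n):
        ans = generate(prompt)
        totals[ans] += score(prompt, ans)
    return max(totals, key=totals.get)
```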

The researchers used a "process reward model" (PRM) that scores the SLM's response not only on the final answer but also on the intermediate steps it takes to reach it. Their experiments showed that Weighted Best-of-N and PRMs brought the Llama-3.2 1B near the level of Llama-3.2 8B on the difficult MATH-500 benchmark.
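The sketch below illustrates how a PRM can score a multi-step solution, assuming a `prm_score_steps` callable that returns one score per reasoning step. The choice of aggregation (a product over step scores here) is an assumption for illustration; in practice the minimum or last-step score are also used.

```python
import math
from typing import Callable, List


def score_solution(prm_score_steps: Callable[[str, List[str]], List[float]],
                   prompt: str, steps: List[str]) -> float:
    """Score a multi-step solution with a process reward model (PRM).

    The PRM returns one score per intermediate step rather than a single
    score for the final answer. Aggregating with a product penalizes any
    weak step; other aggregations (min, last step) are possible.
    """
    step_scores = prm_score_steps(prompt, steps)
    return math.prod(step_scores)
```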

To further improve the model's performance, the researchers added search algorithms to the model's reasoning process. Instead of generating the answer in a single pass, they used "beam search," an algorithm that guides the model's answer process step by step.

At each step, the SLM generates multiple partial answers. The search algorithm ...
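A rough sketch of step-level beam search guided by a PRM is shown below. The `extend`, `prm_score`, and `is_complete` callables are illustrative placeholders for sampling candidate next steps, scoring a partial solution, and detecting a final answer; the beam width and expansion counts are arbitrary example values.

```python
from typing import Callable, List, Tuple


def step_beam_search(extend: Callable[[str, List[str]], List[str]],
                     prm_score: Callable[[str, List[str]], float],
                     is_complete: Callable[[List[str]], bool],
                     prompt: str,
                     beam_width: int = 4,
                     expansions: int = 4,
                     max_steps: int = 10) -> List[str]:
    """Beam search over reasoning steps, guided by a process reward model.

    Each beam is a list of reasoning steps produced so far. At every round,
    each unfinished beam is extended with several candidate next steps, all
    candidates are scored by the PRM, and only the top-scoring beams survive.
    """
    beams: List[List[str]] = [[]]  # start from an empty partial solution
    for _ in range(max_steps):
        candidates: List[Tuple[float, List[str]]] = []
        for steps in beams:
            if is_complete(steps):
                # finished solutions compete on their current score
                candidates.append((prm_score(prompt, steps), steps))
                continue
            for next_step in extend(prompt, steps)[:expansions]:
                new_steps = steps + [next_step]
                candidates.append((prm_score(prompt, new_steps), new_steps))
        # keep only the highest-scoring partial solutions
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [steps for _, steps in candidates[:beam_width]]
        if all(is_complete(b) for b in beams):
            break
    return beams[0]
```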
