
The Revolution of Token-Level Rewards


Training large language models (LLMs) to master complex tasks, especially those requiring structured outputs like generating precise code or engaging in multi-step reasoning, is challenging even for current state-of-the-art (SOTA) models. Reinforcement Learning (RL) offers a powerful theoretical framework for teaching models to do "what works", but applying these techniques to LLMs has been messy in practice.

We’ve run into this problem at our startup, Levro. We want to be the easiest way to conduct international commerce (including high-yield deposits and low FX fees!). That requires tools for the places where business and banking intersect. For example, if you want to update basic information about who owns your business, having the correct documentation to satisfy banking regulations is important and nuanced. (To mention one case: while many consumers think of Wise as a bank, Wise statements rarely qualify as official bank statements for anti-money-laundering / KYC purposes.) A truly ‘personal’ banker should advise you on how to meet all these requirements so you don’t run into unpleasant surprises later.

Since this domain is highly structured but also highly complex, we wanted to build agents to handle tasks like technical customer support and structured reasoning. Specifically, we wanted models that generate Python code to handle user queries, call tools and APIs, and process structured data to resolve a range of user inquiries. This is complex enough that the LLMs weren’t getting everything right out of the box, so we turned to RL.

To get RL to actually help, though, we had to solve a persistent problem: how do you give your model feedback that is specific enough to improve it without crushing the parts it already does well? The DeepSeek team ran into one version of this challenge while building their product: if your model generates 8 imperfect responses (which often happens!) and your judge penalizes all of them for not being perfect, the model does not improve, because every answer was flawed. As a practical example, suppose we ask the LLM to make the tool calls that generate 1099s for every vendor a business paid more than $600 USD during 2024, and the LLM’s only mistake is ignoring non-USD payments. A per-token reward model still rewards the important parts the LLM got right, which in turn helps the LLM converge on the “true” answer much more quickly.
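To make that contrast concrete, here is a minimal, hypothetical sketch of how a single sequence-level reward differs from token-level credit on the 1099 example. The token boundaries, helper names, and judge output below are invented for illustration, not our actual training setup:

from typing import List

def sequence_level_reward(response_tokens: List[str], is_fully_correct: bool) -> List[float]:
    """Every token in the response shares one scalar reward.
    If the answer has any flaw, every token is penalized equally."""
    reward = 1.0 if is_fully_correct else -1.0
    return [reward] * len(response_tokens)

def token_level_reward(response_tokens: List[str], bad_spans: List[range]) -> List[float]:
    """Only tokens inside flawed spans (e.g. the filter that dropped
    non-USD payments) are penalized; correct tool calls keep positive credit."""
    rewards = [1.0] * len(response_tokens)
    for span in bad_spans:
        for i in span:
            rewards[i] = -1.0
    return rewards

# Toy example: the model's generated code, tokenized coarsely for illustration.
tokens = ["payments", "=", "fetch_payments(year=2024)", ";",
          "vendors", "=", "filter(amount>600, currency=='USD')", ";",
          "issue_1099(vendors)"]

# Suppose a judge localizes the error to the currency filter (token index 6).
print(sequence_level_reward(tokens, is_fully_correct=False))  # every token gets -1.0
print(token_level_reward(tokens, bad_spans=[range(6, 7)]))    # only index 6 gets -1.0

Under a sequence-level scheme, the correct tool calls are punished along with the bug; under token-level credit, only the flawed span is pushed down.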

Techniques Used in Reinforcement Learning

So how do models get better today?

1. You generate multiple outputs for a given prompt.

2. You use a reward model to score these outputs.

3. You fine-tune the model (often using techniques like GRPO) to increase the likelihood of the model producing high-scoring outputs in the future (sketched below).
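As a concrete illustration of step 3, here is a simplified sketch of the group-relative scoring at the heart of GRPO. It assumes a single scalar reward per sampled output and omits the clipped policy-ratio update and KL penalty that the full objective also includes:

import statistics
from typing import List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Given reward-model scores for a group of outputs sampled from the same
    prompt, normalize each score against the group mean and std. Outputs that
    beat their siblings get positive advantage; weaker ones get negative."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled completions for one prompt, scored by a reward model.
scores = [0.2, 0.9, 0.4, 0.4]
print(group_relative_advantages(scores))
# The 0.9 completion is reinforced and the 0.2 one is pushed down, even though
# no completion was judged "perfect" in absolute terms.

Because advantages are computed relative to the group rather than an absolute bar, the model still gets a useful learning signal when every sampled answer is imperfect.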

This process involves two main “moving pieces”:
