
Overclocking LLM Reasoning: Monitoring and Controlling LLM Thinking Path Lengths


This work investigates how large reasoning models internally track their thinking progress, and how that process can be monitored and controlled. We focus on reasoning models that explicitly segment their computations using `<think>` and `</think>` tokens (e.g., DeepSeek-R1), allowing us to study the internal dynamics of the "thinking phase."
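As a concrete illustration of this setup, the sketch below generates a response from a distilled R1-style checkpoint and reads out the final-layer hidden states for the tokens between `<think>` and `</think>`. The checkpoint name, prompt, and generation budget are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (assumptions: a Hugging Face checkpoint whose tokenizer has
# <think> and </think> as single vocabulary tokens; names are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24?"}],
    tokenize=False, add_generation_prompt=True,
)
ids = tok(prompt, return_tensors="pt").input_ids
gen = model.generate(ids, max_new_tokens=512)

# Locate the thinking span: tokens strictly between <think> and </think>.
seq = gen[0].tolist()
start = seq.index(tok.convert_tokens_to_ids("<think>")) + 1
end = seq.index(tok.convert_tokens_to_ids("</think>"))

# Forward pass with hidden states enabled; keep the final layer only.
with torch.no_grad():
    out = model(gen[:, :end], output_hidden_states=True)
h_think = out.hidden_states[-1][0, start:end]  # (N, d) thinking-token states
```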

1. Monitoring the Thinking Phase

We hypothesize that hidden states encode a token's relative position within the thinking phase. To test this, we collect hidden representations from the final layer of the model for each token in a thinking trajectory $T_k = w_1 w_2 \dots w_{N_k}$. Each token $w_j$ is paired with a normalized position:

$$p_j^{(k)} = j / N_k$$

This creates a dataset $D = \{ (h_j^{(k)}, p_j^{(k)}) \}$, where $h \in \mathbb{R}^d$ is the hidden state and $p \in (0, 1]$ is the relative position. We learn a regression function:

$$\theta^* = \arg\min_\theta \sum_{(h, p) \in D} \left( f_\theta(h) - p \right)^2$$
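In the linear case this least-squares problem has a closed-form solution. The sketch below fits it with NumPy; the function names and the explicit bias term are our own conventions for illustration, not the paper's.

```python
# Hedged sketch: ordinary least squares from pooled hidden states to
# relative positions, i.e. the linear probe the text calls the TPV.
import numpy as np

def fit_tpv(hidden: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """hidden: (M, d) final-layer states pooled over all trajectories;
    pos: (M,) relative positions in (0, 1].
    Returns theta of shape (d + 1,): a progress direction plus a bias."""
    H = np.concatenate([hidden, np.ones((len(hidden), 1))], axis=1)
    theta, *_ = np.linalg.lstsq(H, pos, rcond=None)
    return theta

def predict_progress(theta: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Project hidden states onto the TPV to estimate relative progress."""
    return h @ theta[:-1] + theta[-1]
```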

We compare a linear regressor (TPV: Thinking Progress Vector) with a 2-layer FFN and find no improvement from the latter, favoring the simpler TPV model. For improved temporal modeling, we also train a single-layer GRU on full token sequences:

$$D' = \left\{ \left( (h_1^{(k)}, \dots, h_{N_k}^{(k)}),\; (p_1^{(k)}, \dots, p_{N_k}^{(k)}) \right) \right\}_k$$

The GRU outperforms TPV, especially when generalizing from MATH-500 to GSM8K, in both fine-tuned and zero-shot setups.
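A hedged PyTorch sketch of such a sequence-level probe is below; the squared-error objective mirrors the regression setup above, while the hidden size and training-loop details are illustrative assumptions.

```python
# Single-layer GRU probe: maps each hidden state in a trajectory to a
# per-token progress estimate, trained with the same squared-error loss.
import torch
import torch.nn as nn

class ProgressGRU(nn.Module):
    def __init__(self, d_model: int, d_hidden: int = 256):  # d_hidden assumed
        super().__init__()
        self.gru = nn.GRU(d_model, d_hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(d_hidden, 1)

    def forward(self, h_seq: torch.Tensor) -> torch.Tensor:
        # h_seq: (batch, N, d_model) -> progress estimates (batch, N)
        out, _ = self.gru(h_seq)
        return self.head(out).squeeze(-1)

def train_step(model, opt, h_seq, p_seq):
    # p_seq: (batch, N) holds the targets p_j = j / N for each trajectory
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(h_seq), p_seq)
    loss.backward()
    opt.step()
    return loss.item()
```

Unlike the per-token TPV, the recurrence lets the probe accumulate context from earlier thinking tokens, which is the improved temporal modeling the sequence dataset $D'$ is meant to enable.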

2. Controlling the Thinking Phase
