We are excited to introduce Grok 4.1, which brings significant improvements to the real-world usability of Grok. Our 4.1 model is exceptionally capable in creative, emotional, and collaborative interactions. It is more perceptive to nuanced intent, more compelling to speak with, and more coherent in personality, while fully retaining the razor-sharp intelligence and reliability of its predecessors. To achieve this, we used the same large-scale reinforcement learning infrastructure that powered Grok 4 and applied it to optimize the style, personality, helpfulness, and alignment of the model. To optimize against these non-verifiable reward signals, we developed new methods that let us use frontier agentic reasoning models as reward models to autonomously evaluate and iterate on responses at scale.
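As a rough illustration of the idea (and not our actual training stack), the sketch below shows how a reasoning model's rubric grade could be converted into a scalar reward for reinforcement learning; the rubric wording and the `judge` callable are hypothetical placeholders.

```python
# Illustrative sketch: a reasoning model grades a response against a rubric,
# and the grade becomes a scalar reward for RL on non-verifiable qualities.
# The rubric text and `judge` callable are placeholders, not a real API.
from typing import Callable

RUBRIC = (
    "Rate the response from 0 to 10 for helpfulness, emotional intelligence, "
    "and coherence of personality. Reply with a single number."
)

def judge_reward(judge: Callable[[str], str], prompt: str, response: str) -> float:
    """Map an LLM judge's rubric grade to a reward in [0, 1]."""
    verdict = judge(f"{RUBRIC}\n\nPrompt:\n{prompt}\n\nResponse:\n{response}")
    try:
        return max(0.0, min(1.0, float(verdict.strip()) / 10.0))
    except ValueError:
        return 0.0  # unparsable verdicts earn no reward

if __name__ == "__main__":
    stub_judge = lambda _prompt: "8"  # stand-in for a real model call
    print(judge_reward(stub_judge, "Cheer me up.", "Rough day? Tell me about it."))
```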
We conducted a gradual silent rollout of preliminary Grok 4.1 builds to a progressively larger share of production traffic across grok.com, X, and mobile apps. During the two-week silent rollout, we ran continuous blind pairwise evaluations on live traffic.
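To make the evaluation setup concrete, here is a minimal sketch of how a blind pairwise preference rate can be aggregated from live-traffic votes; the vote records and model labels below are invented for illustration, and treating ties as excluded is an assumption rather than a description of our pipeline.

```python
# Minimal sketch of aggregating blind pairwise preference votes.
# Raters never see which model produced which response; the data here is made up.
from collections import Counter

votes = ["grok-4.1", "previous", "grok-4.1", "tie", "grok-4.1", "previous"]

counts = Counter(votes)
decided = counts["grok-4.1"] + counts["previous"]  # ties set aside (an assumption)
preference_rate = counts["grok-4.1"] / decided
print(f"Grok 4.1 preferred in {preference_rate:.2%} of decided comparisons")
```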
Grok 4.1 vs. previous Grok
Compared to the previous production model in traffic, Grok 4.1 is preferred 64.78% of the time.
Grok 4.1 establishes a new standard in blind human preference evaluations.
[Chart: LMArena Text Leaderboard, Overall Elo with Style Control]
In LMArena's Text Arena, Grok 4.1 Thinking (code name: quasarflux) holds the #1 overall position with 1483 Elo, a commanding margin of 31 points over the highest non-xAI model. Grok 4.1 in its non-reasoning mode (code name: tensor), which uses no thinking tokens and responds immediately, ranks #2 at 1465 Elo. Grok 4.1 non-thinking surpasses every other model's full-reasoning configuration on the public leaderboard. Grok 4.1 significantly surpasses Grok 4, which had an overall rank of #33.
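For context on what an Elo gap means, the standard Elo model maps a rating difference to an expected win probability; the snippet below applies that general formula (not LMArena's exact methodology) to the 31-point margin.

```python
# Standard Elo expected-score formula (general, not LMArena-specific).
def elo_win_probability(delta: float) -> float:
    """Expected win probability for the higher-rated model given an Elo gap."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

print(f"{elo_win_probability(31):.1%}")  # roughly 54%: expected head-to-head win rate from a 31-point lead
```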
To measure progress on our model’s personality and interpersonal ability, we evaluated Grok 4.1 on EQ-Bench3. EQ-Bench is an LLM-judged test of active emotional intelligence: understanding, insight, empathy, and interpersonal skill. The test set contains 45 challenging roleplay scenarios, most of them pre-written prompts spanning three turns. The benchmark grades each model’s responses against several rubric criteria, and additionally conducts pairwise comparisons between models to report a normalized Elo score for each model on the leaderboard.
We report the rubric score and normalized Elo score obtained by running the official benchmark repository. The scores were computed with the default sampling parameters, the prescribed judge (Claude Sonnet 3.7), and no system prompt, in accordance with the benchmark.
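For readers unfamiliar with the setup, the sketch below shows one simplified way pairwise judge verdicts can be folded into an Elo-style rating and then normalized; the official EQ-Bench repository implements its own procedure, and the match data here are invented.

```python
# Simplified sketch: fold pairwise judge verdicts into Elo-style ratings,
# then min-max normalize. Not the official EQ-Bench implementation.
from collections import defaultdict

matches = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]

ratings = defaultdict(lambda: 1000.0)
K = 32  # conventional Elo update step

for winner, loser in matches:
    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1 - expected)
    ratings[loser] -= K * (1 - expected)

lo, hi = min(ratings.values()), max(ratings.values())
normalized = {m: 100 * (r - lo) / (hi - lo) for m, r in ratings.items()}
print(normalized)
```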
EQ-Bench Emotional Intelligence Benchmark - Elo (Normalized)