Just now, Elon Musk quietly released Grok 4.1, whose general capabilities outperform all other models.
Almost without warning, Elon Musk's artificial intelligence company, xAI, has released its latest model, Grok 4.1.
xAI announced that Grok 4.1 is now available to all users on the official Grok website, on X, and in the iOS and Android apps.
Grok 4.1 is rolling out immediately in Auto mode and can also be chosen manually in the model selector.
According to xAI, Grok 4.1 brings significant improvements in real-world usability, particularly in creative, emotional, and collaborative interactions. It is better at perceiving subtle intent, is more engaging to talk to, and has a more coherent personality, while retaining the intelligence and reliability of its predecessors.
Elon Musk promotes his own model on X.
To achieve these improvements, xAI optimized the model's style, personality, helpfulness, and alignment on the same large-scale reinforcement learning infrastructure that powers Grok 4. Because these reward signals cannot be verified directly, xAI developed a new method that uses frontier agentic reasoning models as reward models, allowing it to evaluate and iterate on responses autonomously at scale.
In comparative evaluations against the previous production model, users preferred Grok 4.1 responses 64.78% of the time.
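To put the 64.78% preference rate in the same units as the leaderboard scores discussed later, it can be converted into an implied rating gap. The sketch below assumes the standard Elo logistic formula with a 400-point scale and ignores ties; the function name `elo_gap_from_win_rate` is illustrative, and xAI has not published such a conversion.

```python
import math

def elo_gap_from_win_rate(p: float, scale: float = 400.0) -> float:
    """Rating gap at which the standard Elo formula predicts win probability p.

    Assumes the usual base-10 logistic curve with a 400-point scale and no ties.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return scale * math.log10(p / (1.0 - p))

# 64.78% is the preference rate quoted in the article.
gap = elo_gap_from_win_rate(0.6478)
print(f"Implied Elo gap over the previous production model: {gap:.0f} points")
```

Under these assumptions the preference rate corresponds to a gap of a little over 100 points. This is a back-of-the-envelope figure, not a number reported by xAI.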
Next, let's look at the capabilities and features of Grok 4.1.
State-of-the-Art General Capabilities
Grok 4.1 has set a new benchmark in blind human preference evaluations.
On LMArena's Text Arena leaderboard, Grok 4.1 in reasoning mode (code name: quasarflux) ranks first overall with an Elo score of 1483, a full 31 points ahead of the highest-ranked non-xAI model.
Grok 4.1 in non-reasoning mode (code name: tensor), which responds immediately without using thinking tokens, ranks second with an Elo score of 1465. Even with reasoning disabled, Grok 4.1 outperforms every other model running in its full reasoning configuration.
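For a sense of what these rating gaps mean in head-to-head terms, the following sketch converts Elo differences into expected win probabilities. It assumes the standard Elo logistic curve with a 400-point scale; the leaderboard's exact fitting procedure may differ, and `expected_win_probability` is an illustrative helper, not part of any LMArena tooling.

```python
def expected_win_probability(rating_a: float, rating_b: float,
                             scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

# Scores quoted in the article: reasoning mode 1483, non-reasoning mode 1465.
p_modes = expected_win_probability(1483, 1465)

# The 31-point lead over the best non-xAI model, expressed as a pure rating gap.
p_lead = expected_win_probability(31, 0)

print(f"Reasoning vs. non-reasoning mode: {p_modes:.1%}")
print(f"31-point lead: {p_lead:.1%}")
```

Under this model, gaps of a few dozen points translate into win rates only a few percentage points above 50%, so the top of the leaderboard is closely contested.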
This is a substantial jump over Grok 4, which previously ranked only 33rd overall.
Emotional Intelligence
To measure progress in personality and interpersonal skills, xAI tested Grok 4.1 on EQ-Bench3.
EQ-Bench is an LLM-judged benchmark that evaluates active emotional intelligence: emotional understanding, insight, empathy, and interpersonal skills. The test set contains 45 challenging role-play scenarios, most of which are pre-written three-turn dialogue prompts. Responses are scored against a multi-criterion rubric, and the leaderboard additionally computes a normalized Elo score for each model from pairwise comparisons.
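As an illustration of how pairwise comparisons can be turned into a normalized Elo score, here is a minimal sketch. It uses plain sequential Elo updates followed by a linear rescaling between two anchor models. This is a stand-in technique, not the benchmark's actual solver; the model names, the match results, and the anchor values of 200 and 1500 are all hypothetical.

```python
from collections import defaultdict

def elo_from_pairwise(results, k=16.0, start=1000.0, passes=50):
    """Estimate ratings from (winner, loser) pairs with repeated Elo updates."""
    ratings = defaultdict(lambda: start)
    for _ in range(passes):
        for winner, loser in results:
            # Expected score of the winner before this comparison.
            expected = 1.0 / (1.0 + 10.0 ** ((ratings[loser] - ratings[winner]) / 400.0))
            delta = k * (1.0 - expected)
            ratings[winner] += delta
            ratings[loser] -= delta
    return dict(ratings)

def normalize(ratings, low_anchor, high_anchor, low=200.0, high=1500.0):
    """Linearly rescale so the two anchor models land on fixed values."""
    span = ratings[high_anchor] - ratings[low_anchor]
    return {m: low + (r - ratings[low_anchor]) * (high - low) / span
            for m, r in ratings.items()}

# Hypothetical judged match-ups between three made-up models.
results = (
    [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
    + [("model_b", "model_c")] * 7 + [("model_c", "model_b")] * 3
    + [("model_a", "model_c")] * 9 + [("model_c", "model_a")] * 1
)
raw = elo_from_pairwise(results)
normalized = normalize(raw, low_anchor="model_c", high_anchor="model_a")
for model, score in sorted(normalized.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.0f}")
```

Normalizing against fixed anchors keeps scores comparable as new models join the pool, since raw Elo values only have meaning relative to one another.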
xAI ran the benchmark from the official repository and reports both the rubric score and the normalized Elo score. All scores follow the benchmark's requirements: default sampling parameters, the specified judge model (Claude Sonnet 3.7), and no system prompt.
The results show that the reasoning and non-reasoning modes of Grok 4.1 take the top two places on the leaderboard.
The following example shows how Grok 4.1 responds to emotion-related prompts:
Creative Writing
xAI also evaluated the Grok 4.1 models on the Creative Writing v3 benchmark.
In this benchmark, the model generates responses to 32 different writing prompts, with 3 iterations per prompt. As with EQ-Bench, scoring combines a rubric score with a normalized Elo score derived from head-to-head model match-ups.
The results show that the reasoning and non-reasoning modes of Grok 4.1 rank second and third on the benchmark, behind only an early version of GPT 5.1.
The following example shows how Grok 4.1 responds to creative writing prompts:
Reducing Hallucinations
Fast (non-reasoning) models equipped with search tools can answer instantly. However, because their reasoning depth and number of tool calls are limited, they are more prone to factual errors.
During post-training of Grok 4.1, xAI focused on reducing factual hallucinations on information-seeking prompts. It subsequently observed a significant drop in the hallucination rate on a sample of such prompts from production.
To measure the hallucination rate, xAI drew real information-seeking requests from production traffic using stratified sampling by category. It also evaluated the model on FActScore, a public benchmark of 500 biography questions about individuals.
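The stratified evaluation described above can be sketched as follows. This is a minimal illustration, not xAI's pipeline: the record layout, the `is_hallucinated` callback (which in practice would be a human or model judge), and the choice to weight each category by its share of traffic are all assumptions.

```python
import random
from collections import defaultdict

def stratified_hallucination_rate(records, per_category, is_hallucinated, seed=0):
    """Estimate an overall hallucination rate from a per-category sample.

    records: list of dicts, each with a "category" key (hypothetical layout).
    per_category: maximum number of records to sample from each category.
    is_hallucinated: callable returning True if a record's answer has a factual error.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec["category"]].append(rec)

    total = len(records)
    rate = 0.0
    for category, items in strata.items():
        sample = rng.sample(items, min(per_category, len(items)))
        stratum_rate = sum(is_hallucinated(r) for r in sample) / len(sample)
        # Weight each category by its share of traffic (an assumption).
        rate += (len(items) / total) * stratum_rate
    return rate

# Toy data with pre-labeled outcomes; real labels would come from a judge.
records = (
    [{"category": "sports", "hallucinated": False} for _ in range(100)]
    + [{"category": "history", "hallucinated": True} for _ in range(50)]
)
rate = stratified_hallucination_rate(
    records, per_category=20, is_hallucinated=lambda r: r["hallucinated"]
)
print(f"Estimated hallucination rate: {rate:.1%}")
```

Sampling within each category guarantees that rare query types are represented in the evaluation set, while the weighting step keeps the overall estimate proportional to real traffic.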
For more technical details of Grok 4.1, please refer to the model card:
Model card address: https://data.x.ai/2025-11-17-grok-4-1-model-card.pdf
Official blog: https://x.ai/news/grok-4-1#silent-rollout-november-114-2025
This article is from the WeChat official account “Machine Intelligence”. Author: Machine Intelligence Editorial Department. Republished by 36Kr with permission.