While “brute-force scaling works miracles” remains an iron law, the large-model battlefield has welcomed two new kings. Grok 4 reportedly used 200,000 H100s, 1.7 trillion parameters, and four collaborating agents to push the Humanity’s Last Exam benchmark to 44%, yet promptly stumbled into ethical controversy. Kimi K2, a 1-trillion-parameter open-source behemoth, is the first to bring “the model is the agent” into reality: it can help you book Coldplay tickets, write code, and compare airfares, but it was quickly throttled by compute limits.
Large-model companies around the world like to release new products in clusters.
In the past week, two super-large-scale models have shipped. First, Musk’s AI company xAI officially launched Grok 4, declaring it “the world’s most powerful AI model.” Then, late at night on July 11, Moonshot AI open-sourced Kimi K2, currently the best-performing open-source model on the three fronts of programming, agents, and tool calling.
Facts have shown that, at least at this stage, “brute-force scaling works miracles” is still the law governing leaps in large-model capability: although unconfirmed, Grok 4 is widely speculated to have been trained on 200,000 H100s, and Kimi K2’s 1 trillion parameters make it the largest-scale open-source large model in the world.
So what are the killer features of these two “strongest” models?
01 Kimi K2: taking the first step toward agent calling
After a long silence, Moonshot AI finally delivered a major release: Kimi K2. According to official figures, Kimi K2 is a trillion-parameter mixture-of-experts (MoE) model with 32B activated parameters. On benchmarks such as SWE-bench Verified (a code-agent evaluation benchmark), Tau2 (which evaluates the performance and reliability of AI agents in real-world scenarios), and AceBench (which evaluates large language models’ ability to use tools), Kimi K2 achieved SOTA (state of the art) among open-source models.
Kimi K2’s README particularly emphasizes that the model excels at frontier knowledge, reasoning, and coding tasks, and claims it is optimized for agentic capabilities: designed for tool use, reasoning, and autonomous problem-solving.
What is the difference between a large model and an agent? Before testing Kimi K2’s agent capabilities, this question must be answered.
Put simply, a large language model is like an encyclopedia: rich in knowledge, but you have to look things up and apply them yourself. An agent is like a secretary: it not only knows the answer but also proactively books restaurants and schedules meetings. In other words, it has strong “hands-on” ability and can call other apps across platforms. The previously popular Manus and the built-in assistants in various brands’ AI phones are all agents.
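The “encyclopedia versus secretary” distinction can be sketched in code: a bare LLM call returns text once, while an agent runs a loop in which the model proposes tool calls and a runtime executes them and feeds the results back. The following is a minimal illustrative sketch; the model and the flight-search tool are hard-coded stubs, not any real API.

```python
# Minimal agent loop: the "model" proposes tool calls, the runtime
# executes them and feeds results back until a final answer emerges.
# stub_model stands in for a real LLM API call.

def stub_model(messages):
    """Pretend LLM: first asks for a tool, then gives a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "search_flights",
                              "arguments": {"route": "SHA-TYO"}}}
    return {"content": "Cheapest flight found: 1200 CNY."}

def search_flights(route):
    # Stand-in for a real flight-search backend.
    return f"{route}: lowest fare 1200 CNY"

TOOLS = {"search_flights": search_flights}

def run_agent(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        reply = stub_model(messages)
        call = reply.get("tool_call")
        if call is None:  # plain text means the agent is done
            return reply["content"]
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": result})

answer = run_agent("Find the cheapest Shanghai-Tokyo flight in August")
print(answer)
```

A plain “encyclopedia” model stops after the first reply; the loop is what turns knowledge into action.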
Judging from the official demos, Kimi K2, as a base model, has taken its first step toward agency. “I want to follow the band Coldplay on tour with a budget of $5,000 per trip, all expenses included. Can you help me plan everything? …” After a long prompt like this, Kimi K2 not only produced a complete itinerary, covering hotels and travel in each concert city, but also automatically added the schedule to the user’s Google Calendar.
This reporter also asked it to plan a Shanghai-to-Tokyo trip in August at the most cost-effective price. It not only planned a concrete itinerary but also returned the lowest-priced option, along with links to the airline and a fare-comparison site. However, perhaps because no explicit “booking” instruction was given, Kimi K2 did not directly open and operate another website as in the demo.
Still, compared with other base models, this is already progress. Given the same request, DeepSeek, Yuanbao, and Doubao also produced complete plans, but none gave an executable answer. They stayed at the level of trend-based advice, such as “best to book in mid-to-late July,” rather than exact answers like which days are cheapest or which airline to buy from; the fare DeepSeek quoted was even well above the normal price.
According to the official documentation, Kimi K2 now parses complex instructions reliably and can automatically decompose a request into a series of ToolCall structures (the dictionaries a general-purpose model emits to invoke external tools) in a fixed, directly executable format. It can be integrated seamlessly into various agent/coding frameworks to complete complex tasks or automate coding, and its agent capabilities are already available via API.
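Concretely, a ToolCall structure is typically a JSON dictionary naming a function and its arguments, which the host framework validates and dispatches. The schema, tool name, and fare numbers below are illustrative assumptions, not Kimi K2’s actual API format.

```python
import json

# Hypothetical tool schema an agent framework might register with the model.
FARE_TOOL = {
    "name": "compare_fares",
    "description": "Return the lowest fare for a route in a given month",
    "parameters": {"route": "string", "month": "string"},
}

def compare_fares(route, month):
    # Stand-in for a real fare-comparison backend.
    fares = {("Shanghai-Tokyo", "August"): 1180}
    return fares.get((route, month), -1)

REGISTRY = {"compare_fares": compare_fares}

def execute_toolcall(raw):
    """Parse a model-emitted ToolCall (a JSON string) and dispatch it."""
    call = json.loads(raw)
    fn = REGISTRY[call["name"]]          # look up the named tool
    return fn(**call["arguments"])       # run it with the model's arguments

# A ToolCall as the model might emit it, ready for direct execution.
toolcall = ('{"name": "compare_fares", '
            '"arguments": {"route": "Shanghai-Tokyo", "month": "August"}}')
print(execute_toolcall(toolcall))
```

The point of the fixed format is exactly this: the framework never has to interpret free text, only to look up a name and pass arguments through.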
Comments:
Clearly, what Kimi K2 is aiming for is “the model is the agent”; put another way, it is still on the road to AGI. Its current abilities are immature, but this may be the beginning of a different path for Kimi.
However, Kimi K2’s biggest problem right now is likely compute: after this reporter had tested fewer than ten questions, the dialog box showed “the current model has reached its conversation limit; you can switch to another model to continue.”
Perhaps this is also one reason Moonshot AI chose to open-source Kimi K2; after all, not everyone has the abundant compute of majors like xAI, ByteDance, and Tencent. It also suggests that serving consumer (C-end) users directly is no longer Moonshot AI’s main direction. Better to build an easy-to-use open-source base model, grow its technical ecosystem with the community’s help, and push itself to ship better models under higher technical standards.
02 Grok 4: “far ahead” in math, physics, and chemistry, but failing the “ethics exam”?
“Crushing PhDs in every discipline!” Grok 4, which Musk calls “the smartest AI in the world,” is the poster child of the scaling law and the “rich kid” of the field: a rumored 200,000 Nvidia H100s, 1.7 trillion parameters (also rumored to be 2.4 trillion), 100 times Grok 2’s training data, benchmark results that crush every other large model, plus a $300 (roughly 2,150 yuan) monthly fee for the top-tier version (SuperGrok Heavy). Expectations were filled to the brim.
But just two days later, Grok 4 was caught in one “rollover” after another. On July 8, media reported that Grok, drawing on content posted by users of Musk’s social platform X, generated a series of antisemitic remarks, including praise for Hitler. Simon Willison, the well-known web-technology writer and co-creator of the Django web framework, also found that on sensitive questions Grok searches Musk’s tweets; Jeremy Howard, founding researcher at fast.ai and honorary professor at the University of Queensland, reproduced Willison’s experiment and found that 54 of the 64 sources cited were Musk’s own views.
Some say Grok 4’s marketing strategy is “like Tesla’s early self-driving playbook: promise first, deliver later.” Others argue these so-called rollovers are isolated cases and that, overall, Grok 4’s capabilities sit above other mainstream base models, putting the pressure on Google’s Gemini 3 and OpenAI’s long-absent GPT-5.
In any case, let’s look at Grok 4’s benchmark numbers first.
The most eye-catching is naturally HLE (Humanity’s Last Exam), a multimodal benchmark of 3,000 hard questions jointly created in early 2025 by nearly 1,000 scientists worldwide. Previous SOTA models such as OpenAI’s o3 and Google’s Gemini 2.5 Pro scored around 22%; Grok 4 scored 25.4% without tool calling, rose quickly to 38.6% with tools enabled, and SuperGrok Heavy soared to 44.4%.
On more conventional tests, such as GPQA (graduate-level science Q&A), AIME25 (math), LCB (LiveCodeBench, programming), and USAMO25 (math), Grok 4 posted crushing scores, even a perfect score on AIME25.
However, judging from hands-on tests, Grok 4’s shortcomings are also very obvious.
The first is that its coding ability falls well short of its math ability. A Zhihu user gave GPT-4, Claude 4, and Grok 4 the same programming task: GPT-4’s code was clearly structured and logically complete; Claude 4’s code was high quality with detailed comments; Grok 4 implemented the basic functionality, but the code was redundant, with plenty of room for optimization.
Second, the 256K-token context window is unremarkable, far below Gemini 2.5 Pro’s 1,000K-token window. Even so, some netizens say Grok 4 and SuperGrok Heavy can fully replace o3-pro, which hallucinates more. Grok 4 feels like Gemini 2.5 Pro wired up to o3’s search and tool-calling: normal output style, solid search, and access to the latest posts on X, though of course “the price is also 50% higher.”
That said, Musk announced at the launch event that a dedicated coding model is expected in August, and its coding results may bring some surprises. A multimodal agent is due in September and a video-generation model in October, both worth looking forward to.
Comments:
The most important innovation Grok 4 showed this time is undoubtedly multi-agent collaboration, or “multi-agent internalization.”
Unlike the traditional “train first, then bolt on tools” approach, Grok 4’s multi-agent collaboration mechanism embeds tool-calling ability into the model’s underlying architecture during training. An agent can invoke tools such as a code executor, a web-retrieval tool, or a data-analysis module the way humans use phone apps, allowing multiple independent AI agents to process a task in parallel, cross-verify one another, and integrate the results into a more accurate and efficient solution.
Currently, the SuperGrok Heavy version supports up to four independent agents working on the same task simultaneously. Each agent analyzes the problem from a different angle, generates its own solution, and then cross-verifies with the others, converging on the best answer through comparison and evaluation. In quantum-physics problems, for example, there are reported cases of “three agents deriving via string theory, quantum field theory, and classical mechanics, then merging the results into a more concise, unified formula.”
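The “attack from different angles, then cross-verify” pattern can be sketched with ordinary parallelism: several independent solvers work the same task, and a verifier keeps the answer the majority agrees on. The solvers below are stubs (the coordination pattern is the point, not the physics), and majority voting is just one simple cross-verification rule.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Three stub "agents", each approaching the same task a different way.
def agent_string_theory(task):
    return "E = m*c^2"

def agent_qft(task):
    return "E = m*c^2"

def agent_classical(task):
    return "E = 0.5*m*v^2"   # the odd one out

AGENTS = [agent_string_theory, agent_qft, agent_classical]

def solve_with_cross_verification(task):
    """Run all agents on the task in parallel, keep the majority answer."""
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        answers = list(pool.map(lambda agent: agent(task), AGENTS))
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes

answer, votes = solve_with_cross_verification("derive the energy formula")
print(answer, votes)
```

Real systems replace majority voting with richer evaluation (agents critiquing each other’s derivations), but the structure of parallel solve plus merge is the same.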
However, this approach is a rich player’s game: multi-agent collaboration demands enormous compute. Grok 4’s training compute is 100 times Grok 2’s and 10 times Grok 3’s. With usage costs this high, even Musk is no longer “generous”: unlike the free access offered after Grok 3’s release, Grok 4 has been a paid service from day one, at $30 per month for the regular version and $300 per month for the Heavy version.
From slamming OpenAI for “forgetting its original intention” to now selling the “most expensive model,” Musk’s talk of “AI for everyone” is, more often than not, just talk.