From the evolution of biological intelligence to the iteration of computing architecture, the AI Agent is moving from a technical concept to a new industrial paradigm. How does it break through the boundaries of its capabilities, build its technical foundation, and reshape the human-machine relationship? The thread of this intelligent revolution lies hidden in the collaborative evolution from the individual to the society.
If 2023 was the year of the explosion of generative AI, then 2025 is undoubtedly becoming the “first year of AI Agents.” From tech giants to startups, from software applications to smart hardware, almost all products are being reconstructed by the concept of “Agent”. AI Agent is rapidly evolving from a technology concept to a new paradigm in the tech industry.
But behind the heated discussions, there are different opinions: Where is the boundary of AI Agent’s capabilities? In the technical architecture, should it play the role of “super application” or “new generation operating system”? How will its emergence fundamentally reshape the human-machine relationship? To answer these questions, we can’t look at the AI Agent itself in isolation, but try to put it in a broader frame of reference.
Just as electrical equipment is the carrier that puts electrical energy to work, the AI Agent is the best carrier for unleashing machine intelligence. The development path of its capabilities can be traced in the evolution of biological intelligence from species to individual and then to society.
These capabilities need to be realized by relying on intelligent computing architectures. Just like traditional instruction-based computing releases computing power as a production factor through the three-layer structure of “computing element-operating system-application software”, the intelligent computing architecture is also gradually building a system for releasing intelligent production factors along the path of “model base-agent operating system-vertical intelligent agent”.
The final software and hardware form of the AI Agent is shaped by the evolving human-machine relationship. The core is to match the rising abstraction of human needs, from the instruction level to the task level, the intent level, and finally the role level, and to build a more efficient, natural, and autonomous mode of intelligent collaboration.
This article will review the evolution of biointelligence, computing architecture, and human-machine relationships, and understand the current progress and possible future development directions of AI Agents from three perspectives: capabilities, technical architecture, and software and hardware forms.
From the evolution of biological intelligence: the development of AI Agent capabilities
Professor Ma Yi proposed a four-stage path of intelligent evolution: species intelligence -> individual intelligence -> social intelligence -> machine intelligence, treating machine intelligence as a continuation of the first three stages of naturally evolved intelligence.
Intelligence is inherently interconnected. If machine intelligence is regarded as a new form of intelligence, its evolution is likely to repeat the three-stage path of biological intelligence: species -> individual -> society.
1. Species intelligence
The source of biological intelligence is single-celled organisms that can only respond reflexively to stimuli, following a fixed “stimulus-response” pattern. Similarly, the earliest mechanical computers strictly followed 0/1 machine instructions on punched paper tapes, based on a linear process of “instruction-execution”.
In the vertebrate stage, the central nervous system coordinates the perception and movement of the whole body, enabling regulation of and adaptation to environmental changes. Later, the mammalian cerebral cortex became more complex, and its hierarchical structure could gradually integrate, process, and abstract massive sensory information, forming more advanced representations of the internal world. On this basis, the frontal lobe region of primates expanded significantly, which subsequently gave rise to tool use, imitation learning, and social interaction, opening the door to modern humans.
In machine intelligence, this process appears as the shift from program execution logic based on manual coding and explicit rules to data-driven, probability-based model inference. AI models give machines the ability to generalize: to understand data and draw inferences. Multilayer neural networks further implement hierarchical representation mechanisms similar to the biological cortex, extracting and learning complex features from raw data. As the scale of data grows, the learned features become general-purpose and transfer across tasks.
When the scale of the model and data crosses a certain critical point, the potential of these general features is released through “pre-training-fine-tuning”, and machine intelligence gradually forms an infrastructure that can unify perception, memory, generation and reasoning.
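To make the "pre-training, then fine-tuning" pattern concrete, here is a minimal PyTorch sketch; the tiny model, the random toy data, and the hyperparameters are all invented for illustration and do not describe how any production model is actually trained:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy stand-in for a pre-trained model: embeddings + a small Transformer encoder."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        return self.head(self.backbone(self.embed(tokens)))

def toy_batches(n, batch=8, seq=16, vocab=1000):
    # Random tokens stand in for a corpus; targets mirror inputs purely for illustration.
    for _ in range(n):
        tokens = torch.randint(0, vocab, (batch, seq))
        yield tokens, tokens

def train(model, batches, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for tokens, targets in batches:
        loss = loss_fn(model(tokens).flatten(0, 1), targets.flatten())
        opt.zero_grad(); loss.backward(); opt.step()

model = TinyLM()
train(model, toy_batches(100), lr=3e-4)  # phase 1: "pre-training" on broad, generic data
train(model, toy_batches(10),  lr=3e-5)  # phase 2: "fine-tuning" on a small task-specific set
```

The point of the two phases is the one made above: the bulk of the general features is learned once, up front, and the second pass only adapts them to a narrow task.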
2. Individual intelligence
Entering the stage of individual intelligence, the development focus of biology and computing has shifted from “hardware” to “software”.
In terms of biological intelligence, the evolution of species-level physiological structures determined by genes has slowed down and shifted to the development of individual intelligence driven by learning and experience.
Machine intelligence has followed a similar path. The focus of the species intelligence stage was designing excellent model architectures, such as SVMs, probabilistic graphical models, and CNNs. After the Transformer-based model architecture matured and stabilized, improvements in model intelligence have come mainly from the data side: from the existing Internet corpus in the pre-training stage, to new reasoning data generated by model sampling in the post-training stage, and then to behavioral data generated through interaction with the environment at inference time.
In biological intelligence, the emergence of language is a cognitive revolution. Language is not only a tool for communication and a carrier of thinking, but also realizes the transmission of knowledge across time and space.
Corresponding to this in machine intelligence, the memory mechanisms of current AI Agents focus on a single agent and are devoted to optimizing access to internal short-term and long-term memory. Another important problem for memory research is how to build an external memory network so that agents can share context and long-term experience.
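As a rough illustration of these memory layers, here is a minimal Python sketch of a per-agent short-term buffer, per-agent long-term storage, and a shared external store; all class and field names are invented for this example and do not describe any particular agent framework:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    short_term: deque = field(default_factory=lambda: deque(maxlen=20))  # recent turns, i.e. the context window
    long_term: dict = field(default_factory=dict)                        # persistent per-agent knowledge

    def observe(self, event: str) -> None:
        self.short_term.append(event)

    def remember(self, key: str, value: str) -> None:
        self.long_term[key] = value

class SharedMemory:
    """External memory network: lets cooperating agents exchange context and long-term experience."""
    def __init__(self):
        self._store: dict[str, list[str]] = {}

    def publish(self, topic: str, note: str) -> None:
        self._store.setdefault(topic, []).append(note)

    def read(self, topic: str) -> list[str]:
        return self._store.get(topic, [])

# Two agents sharing what they learned about one task.
shared = SharedMemory()
planner, executor = AgentMemory(), AgentMemory()
planner.observe("user asked for a 3-day Kyoto itinerary")
shared.publish("trip/kyoto", "user prefers trains over flights")
executor.remember("trip/kyoto", "; ".join(shared.read("trip/kyoto")))
```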
With the agricultural revolution, biological intelligence gradually established a relatively stable infrastructure, including farming knowledge, calendar systems, and settlements. On top of this simple "social operating system", agricultural production developed a professional division of labor, such as land reclamation, sowing, and irrigation, and cooperation among these roles further improved production efficiency.
In machine intelligence, vertical intelligent agents correspond to this professional division of labor: building them requires combining domain data with professional tools and designing workflows or setting reward functions. Developing these vertical agents also requires an abstracted, shared base layer, namely the agent operating system, which is responsible for resource scheduling, permission control, memory management, and so on.
This layer of abstraction can greatly free vertical agent developers to focus on high-level business logic, accelerating the expansion of vertical agent applications in both breadth and depth.
3. Social intelligence
Depending on its abilities and role, the term AI Agent can be rendered either as "intelligent agent" or as "agent" in the sense of a proxy, corresponding respectively to the individual intelligence and social intelligence stages.
In the individual intelligence stage, AI Agents are passive agents whose aim is to complete tasks assigned by humans. After entering social intelligence, the AI Agent gains identity, credit, and the ability to exchange value; it becomes a social actor in its own right and can independently initiate goals based on its social identity.
To better understand the development of the AI Agent in the social intelligence stage, let us first review how human urban civilization formed. The agricultural revolution brought a specialized division of labor, which in turn created the need to exchange products and services. As production efficiency kept rising, this exchange expanded to strangers not related by blood. Currency, contracts, law, and other institutions were therefore invented; they constitute the underlying protocols of urban civilization and large-scale social cooperation.
In the stage of social intelligence, different AI agents not only collaborate to complete a certain human task, but can act as independent social nodes, forming a social collaboration network. Each AI Agent can independently select partners and exchange value based on rules, incentives, and goals, forming a decentralized collaboration model.
As this collaborative network evolves and expands, machine intelligence will enter a stage similar to the human “industrial revolution”.
In the Industrial Revolution, machine power replaced human manual labor at scale, leading to exponential growth in productivity and fundamentally reshaping production relations and social structures.
Correspondingly, as collaborative networks and credit systems mature, the social intelligence of machines will enter a stage of local autonomy, similar to the fifth level of AGI ("Organizations") defined by OpenAI. At this stage, human mental labor and organizational management work will gradually be replaced at scale. AI Agents will be able to autonomously form "companies", conduct R&D and innovation, manage supply chains, and provide services.
The human-machine relationship will shift from "human-in-the-loop" to "human-on-the-loop", and in some scenarios even to "human-out-of-the-loop", where AI calls on human capabilities on demand. The role of humans shifts further toward value definition, institutional design, and the guidance of ethical boundaries.
The evolution of biological intelligence points out the path for the development of machine intelligence and AI agent capabilities: from the evolution of the physiological structure of species, to the acquired learning of individuals, and then to the social collaboration of groups. The realization of these capabilities requires the continuous evolution of intelligent computing architecture, which is the technical base.
Looking back at the development of the computing architecture we are familiar with, from vacuum tubes to large-scale integrated circuits, from stand-alone operating systems to global cloud computing, every technological evolution has released and organized computing power more efficiently by building a new technology layer.

Next, we will compare this development history with today's intelligent computing and see how the technical architecture, through layer-by-layer abstraction, gradually builds a foundation that can support complex intelligence.
From the evolution of computing architecture: the development of the AI Agent technical architecture
From the perspective of technology implementation, the development of machine intelligence can be roughly divided into two stages:
Imperative computing: from mechanical computers to modern computers based on the von Neumann architecture, machines strictly execute explicit instructions written by humans. Given the same input, the output is always unique and exactly reproducible.

Intelligent computing: represented by traditional machine learning, large models, and AI Agents, its nature is probabilistic: inference is generative and carried out in high-dimensional space, and the same input can produce diverse, contextually relevant results.
Although imperative computing and intelligent computing differ greatly in their operating mechanisms, they share a highly consistent underlying goal: continuously reducing the marginal cost of the "core production factor" and maximizing the release of its capability. Imperative computing reduced the cost of computing power and ushered in the information age; intelligent computing is reducing the cost of intelligence and has become the core driving force of the intelligent era. This shared goal makes them exhibit similar hierarchical evolution paths in technical architecture.
From the perspective of computational theory, Solomonoff showed that the optimal computable prediction corresponds to the program with the shortest description length on a universal Turing machine. Whether we write explicit instructions or train probabilistic models, the essence is to approach the same limit: finding the optimal expression of information and the optimal execution path within the computable boundary. We can therefore expect intelligent computing to evolve along a technological route very similar to that of imperative computing.
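For background, one standard formulation of Solomonoff's universal prior (the specific notation is an addition here, not a quotation from this article) makes the preference for short programs explicit: every program p that a universal prefix machine U could run contributes weight 2^(-|p|) to the prediction of a string x:

```latex
M(x) \;=\; \sum_{p \,:\, U(p)\ \text{outputs a string beginning with}\ x} 2^{-|p|}
```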
The "computing element - operating system - application software" hierarchy of imperative computing finds a new correspondence in the era of intelligent computing: large models are the equivalent of computing elements such as the CPU, general-purpose agents are evolving toward an AgentOS, and vertical agents play the role of application software. The large model provides raw inference computing power; AgentOS is responsible for resource management, task scheduling, memory persistence, and permission isolation, exposing a consistent calling interface to the vertical agents above it, which in turn deliver value in specific scenarios.
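A minimal sketch of this three-layer call chain, with all names and behaviors invented purely for illustration rather than taken from any real product, might look like this:

```python
def model_infer(prompt: str) -> str:
    """Model base (the 'computing element'): raw generative inference, stubbed out here."""
    return f"[completion for: {prompt}]"

def agent_os(task: str, tools: dict) -> str:
    """AgentOS middle layer: scheduling, memory and tool dispatch, all drastically simplified."""
    plan = model_infer(f"plan the steps for: {task}")
    browser = tools.get("browser")
    return browser(plan) if browser else plan

def travel_agent(request: str) -> str:
    """Vertical agent: only domain logic lives here; everything else is delegated downward."""
    return agent_os(f"book a trip: {request}",
                    tools={"browser": lambda plan: f"[executed in browser: {plan}]"})

print(travel_agent("Shanghai, tomorrow"))
```

The design point is the division of labor: the vertical agent holds only domain logic, while scheduling and raw inference live in the layers below.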
1. Compute Elements (Model Architecture)
The development of traditional computing architecture basically follows Moore's Law: the exponential increase in the number of transistors integrated on a single chip drove a continuous rise in performance per watt and a continuous fall in cost. The process progressed from early bulky, power-hungry vacuum tubes, to miniaturized and reliable transistors, and finally to the era of integrated circuits.

In 1971, Intel introduced the world's first commercial microprocessor, the 4004, which brought compute, control, and register logic together on a single chip, marking the arrival of the microprocessor era built on large-scale integration. Since then, CPUs have kept integrating more transistors and multiple computing cores on a single chip to improve parallel processing power, driving the development of the PC and the information revolution.
In the corresponding intelligent computing architecture, the core thread of development is the "Scaling Law": growing computing power can be effectively converted into stronger model intelligence by expanding the scale of model parameters and data.
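For concreteness, a commonly cited empirical form of this relationship (the Chinchilla-style parameterization, given here as background rather than drawn from this article) models the loss L in terms of parameter count N and training tokens D, with E the irreducible loss and A, B, alpha, beta fitted constants:

```latex
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```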
The tipping point where quantitative change produced qualitative change came in 2020, when OpenAI launched GPT-3. GPT-3 validated the potential of large-scale pre-trained models, playing a role equivalent to the 4004 in imperative computing: the first large-scale integration of general-purpose language capability. Since then, the parameter counts of large models have continued to grow, and language models have developed into reasoning models with stronger planning and logical capabilities. At present, the two are converging into hybrid models with both "fast thinking" and "slow thinking" capabilities.
In terms of trend, traditional computing is moving toward heterogeneous computing, using CPU+GPU+NPU to handle different types of tasks and keep optimizing the performance-to-power ratio. In terms of scale, it is developing toward two extremes: miniaturization (mobile SoCs, low-power chips) and scaling up (supercomputers, data-center-scale chips).
This also corresponds to the current development trend of intelligent computing. On the one hand, autoregressive and diffusion models are developing in tandem: autoregression excels at sequence prediction and logical planning, while diffusion excels at modeling global distributions and high-fidelity generation, with fast generation speed.
On the other hand, large models are also developing toward miniaturization and scaling up at the same time. Miniaturization aims to popularize applications: model lightweighting techniques allow large models to be deployed in resource-constrained settings such as mobile phones and wearable devices. It remains doubtful whether the Transformer architecture can keep shrinking along a Moore's Law-like curve the way the transistor did, but it is clear that, by combining model lightweighting with chip capability, the level of intelligence that can run on terminal devices will keep improving steadily.

Scaling up aims to explore the limits: by continuing to expand model and computing scale, it probes the upper bound of intelligence. Taking Stargate as an example, ever more concentrated and massive resources will be invested in the grand problems of human society, including new drug discovery, materials science, and controlled nuclear fusion.
2. Operating System (AgentOS)
With a powerful computing element (the CPU) or model base (the large model) in place, the next layer up is the operating system: responsible for resource scheduling, shielding underlying complexity, and supporting upper-level applications. In the evolution of both imperative computing and intelligent computing, we can see a similar "middle layer" developing, playing the key role of unlocking underlying capabilities and supporting upper-level applications.
In the intelligent computing architecture, the role of AgentOS is being assumed by general-purpose agents (or is at least their development goal): as an intermediate layer connecting large models and vertical agents, they are gradually taking over the core functions of a traditional operating system. The structural correspondence between the two can be seen in the following six aspects (a schematic code sketch follows the list):
- Resource management: Traditional OS schedules hardware resources such as CPU and memory; AgentOS uniformly deploys large models, tool calls and memory systems. For example, ChatGPT calls code interpreters and search plugins, and Manus supports connecting shells, crawler APIs, and other external tools. Agent communication protocols such as MCP and A2A also belong to this layer.
- Task scheduling: Similar to process scheduling, AgentOS needs to plan tasks and disassemble subtasks. For example, ChatGPT relies on the model’s own capabilities to plan the execution chain, and Manus assigns tasks to sub-agents for parallel processing based on workflows and prompt templates.
- Memory management: In addition to the model’s context window, AgentOS also needs to maintain a more durable memory structure. ChatGPT provides “SavedMemories”, and Manus builds an editable and structured knowledge base that supports task continuity and knowledge reuse.
- Device drivers: While traditional drivers connect to hardware, AgentOS drivers are oriented towards digital environments, such as controlling file systems and browsers. For example, Manus can simulate user operation of web pages to realize form filling and page clicks.
- User interface: AgentOS provides an interactive interface with natural language as the core. For example, ChatGPT’s ChatUI and Canvas panels, Manus provides a “Manus’s Computer” visual interface to display the execution process in real time.
- Permission management: Like the sandbox mechanism of traditional OS, AgentOS ensures execution security through data isolation and permission configuration. ChatGPT Enterprise supports organization-level data control, while Manus uses a cloud virtual machine isolated execution environment.
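The skeleton below condenses these six correspondences into one hypothetical interface; the class name, method names, and docstrings are illustrative assumptions, not the API of ChatGPT, Manus, or any other product:

```python
class AgentOS:
    """Hypothetical interface condensing the six correspondences above (illustrative only)."""

    def allocate(self, task):          # resource management  <->  CPU / memory scheduling
        """Pick a model, tools (e.g. MCP / A2A endpoints) and a memory backend for the task."""

    def schedule(self, goal):          # task scheduling      <->  process scheduling
        """Decompose the goal into subtasks and order them for sequential or parallel execution."""

    def persist(self, key, value):     # memory management    <->  virtual memory / paging
        """Keep durable memory beyond the context window (saved memories, knowledge bases)."""

    def drive(self, target, action):   # device drivers       <->  hardware drivers
        """Operate digital environments: file systems, browsers, form filling, page clicks."""

    def interact(self, utterance):     # user interface       <->  GUI / shell
        """Natural-language front end, optionally rendering progress (chat, canvas, live view)."""

    def sandbox(self, agent):          # permission management <-> process isolation / sandboxing
        """Apply data isolation and permission policies before the agent touches tools or data."""
```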
It is worth noting that unlike the physical boundaries between computing components and traditional operating systems in traditional computing architectures, the boundaries of each level in intelligent computing architectures are dynamically changing. Large models are gradually integrating many capabilities of the general agent layer, from task scheduling, GUI operation, to memory and permissions.
Currently, several types of companies are well positioned to build general-purpose agents and develop them toward an AgentOS: (1) large-model companies, such as OpenAI with ChatGPT; (2) companies with front-end users and a back-end tool ecosystem, such as WeChat Yuanbao; (3) companies with operating-system or hardware entry points, such as Apple and Microsoft.
The functional similarities between AgentOS and the traditional OS arise because both keep evolving to manage and schedule increasingly complex underlying resources. In the imperative computing architecture, the development of operating systems follows "Andy and Bill's Law": gains in CPU performance are absorbed by updates and iterations of the software layer. This law drove the operating system from the early single-task command line, to graphical user interfaces and multitasking, to support for multi-machine communication and concurrent processing, and finally to cloud-native platforms supporting elastic scaling, container scheduling, and resource pooling. The core thread is managing ever more powerful hardware below while providing a more robust running environment for the applications above.
The development of AgentOS under the intelligent computing architecture follows a similar "law of intelligence consumption": the intelligent resources (such as tokens) required to complete a single task keep growing. In the earliest tool-use stage, the model's inference tokens were simply converted into function-call instructions. After entering the task orchestration stage, the agent can break a high-level goal into multiple subtasks and complete each step by scheduling the model and tools sequentially or in parallel.

In the current multi-agent collaboration stage, multiple specialized agents communicate with each other, divide roles, and collaborate dynamically. Each agent's own inference consumption, plus the consumption of inter-agent interaction needed to maintain contextual consistency, further increases the total number of tokens.
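A toy sketch can show how this "intelligence consumption" compounds across planning, execution, and reflection passes; the call_model stub and its token estimate are fabricated for illustration and measure nothing real:

```python
def call_model(prompt: str) -> tuple[str, int]:
    """Stand-in for one model call; returns (output, tokens_consumed) with a crude fake estimate."""
    return f"[answer to: {prompt}]", len(prompt.split()) * 4

def run_task(goal: str, subtasks: list[str]) -> int:
    total = 0
    _, t = call_model(f"plan: {goal}")                       # planning pass
    total += t
    for step in subtasks:                                    # each subtask triggers its own calls
        _, t = call_model(step)
        total += t
        _, t = call_model(f"check the result of: {step}")    # reflection / verification pass
        total += t
    return total

print(run_task("compile a market report",
               ["gather sources", "summarize findings", "draft the report", "format citations"]))
```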
In the future, AgentOS will need to abstract and pool multi-model capabilities, basic tool interfaces, and knowledge and memory modules into a base capability layer that can be called automatically. Developers will not need to care which model to use, which vertical agents to combine, or which tool to call; they will only need to define the business logic and end goal, and AgentOS will dynamically and automatically orchestrate and schedule the resources required to complete the task. Agents with roles and long-term goals will make decisions and act autonomously, continuously running inference and consuming tokens. Microsoft recently proposed the "Agentic Web" concept, which aims to become an operating system that connects and coordinates intelligent agents: it natively supports MCP at the operating-system level and relies on the Azure cloud platform to provide the infrastructure for AI Agents to run, communicate, and be managed.
3. Application Software (Vertical Intelligent Agent)
Finally, to the application software layer.
The operating system provides the running environment of the application software, and the construction of the application software also requires a development engine. The operating system layer provides a unified interface for abstraction and invocation of hardware resources, while the application development engine supports a complete set of processes from coding, debugging to deployment.
Under the intelligent computing architecture, agent development platforms such as Coze and LangChain are trying to play a similar role. A significant difference, however, is that thanks to the natural-language interaction and context-understanding capabilities of large models themselves, agents can be built with low-code or even zero-code methods, so the necessity of an independent development tool or platform seems to be reduced.
For example, Coze currently mainly supports application construction within its own ecosystem. More vertical agent developers choose to directly connect with model capabilities and build using native development interfaces provided by large model vendors such as Anthropic. From this perspective, Claude Code is more like a development tool based on the Claude API that can quickly verify the boundaries of model capabilities and build vertical agent prototypes.
The development thread of traditional application software has been making standardized functionality ever easier to use: from installation packages to web pages to SaaS, software has moved from offline to online and from on-premises to the cloud.

The development of vertical intelligent agents proceeds in step with the intelligent operating system (AgentOS), and its thread is increasing flexibility and customization. In the multi-agent collaboration stage, multiple vertical agents with different professional capabilities can carry out complex collaborative operations based on unified protocols and AgentOS scheduling. Collaboration can be workflow-driven or driven by model-native planning (rather than prompt-triggered). AgentOS needs to support both programming methods: defining precise operational logic in preset workflows, and dynamic orchestration based on the model's native reasoning to solve open-ended tasks.
Developing further into the "Agent as a Service" stage, autonomous service agents will take software forms that have no counterpart in the imperative computing architecture. Vertical intelligent agents will not be limited to passively executing preset tasks; they will be able to discover tasks on their own, schedule resources, and continuously interact with the environment. In addition, unlike traditional software, which can only call predefined functions through a fixed UI, an Agent can use AI coding capabilities to dynamically create the new tools a task requires, even build new vertical agents on the fly, and generate the corresponding user interface in real time according to the needs of the specific task.
Through the above comparison, we can see that intelligent computing is building a three-layer technical architecture of “large model-AgentOS-vertical intelligent agent” along a path that is highly parallel to imperative computing. This architecture will provide a solid technical foundation for more complex and autonomous intelligent capabilities.
From the evolution of the human-machine relationship: the development of AI Agent software and hardware forms
In the first two parts, we used the evolution of biological intelligence and of computing architecture as reference frames to outline the development of AI Agent capabilities and the direction of their technical implementation. This part starts from the evolution of the human-machine relationship and discusses more concretely what forms AI Agents will take.
The future is difficult to predict. Therefore, we first determine the basic principles of human-machine relationship evolution, construct a thinking framework based on this, and then start from this framework to carry out specific discussions on terminal devices, operating systems, and application software forms.
Principle: meeting human needs at an ever more abstract level
Jobs summed up the development of computers 40 years ago: "In the last 20 years, we have used computers at an increasingly high level of abstraction." Machine language, originally entered via punched paper tape and buttons, sits at the very bottom, fully adapted to the machine's binary code. Assembly language corresponds to machine code one-to-one but carries some semantics, making programming relatively easier. High-level languages come closer to human natural language, with greater expressiveness and efficiency.
This summary still holds true today and serves as a principle for understanding the continued evolution of AI Agents:
From machine code to high-level languages, humans had to learn the machine's language and command the computer step by step to carry out specific instructions, that is, to specify "how to do it". At this stage, the human-machine relationship is one of humans invoking tools.

In the large model stage, humans can for the first time ignore the underlying implementation and simply state a clear task in natural language, that is, "what to do". This marks a shift in the machine's position from "tool" to "assistant": humans delegate cognitive activities such as understanding and analysis to the machine.

The current AI Agent stage goes a step further: instead of delegating a single isolated task, users can express complex intents, that is, "what I want". The machine understands the intent, plans the task, then calls resources and completes the execution.
Following this trend, AI Agents will continue to meet human needs at ever higher levels of abstraction. When a need is abstract enough to be expressed as "who you are", such as a "travel butler", the human-machine relationship reaches a qualitative "singularity", moving from delegation to entrustment: humans entrust the machine with a role within which it can decide and act autonomously; the machine can make its own decisions, initiate actions, continuously interact with the environment, and even assign tasks to humans when necessary. This marks the beginning of an era of human-machine symbiosis, in which AI Agents continuously create value for humans in the digital and even the physical world.
Taking travel planning as an example: the need that task-level AI can handle is "help me book a ticket to Shanghai tomorrow", a one-off task with clear boundaries. The need for intent-level AI is "I want to plan a family trip to Europe for the summer", which requires the machine to break the goal into tasks, though the goal is still specific and bounded. For role-level AI, we can grant the machine a continuous role: "From now on, you are my family travel butler." The AI enters a state of continuous service, proactively initiating travel suggestions and planning itineraries for the human to decide on whenever it spots a good opportunity, such as a wedding anniversary or a discount on a target route.
Thinking Framework: Better Understanding & Better Execution of Abstract Intent
In response to the gradual abstraction of human needs, the focus of human-computer interaction shifts from "operation" to "expression": from precisely controlling execution details (instruction level), to describing goals (task level), to expressing more abstract intentions (intent level), and finally to directly defining the machine's role (role level). Interaction methods change accordingly: from physical instructions (punched paper tape, buttons), to program commands and graphical interfaces (mouse, multi-touch), to more natural language and multimodal interaction, and finally to ambient interaction that integrates gestures, position, and other full-context signals. In the ambient interaction stage, the system may no longer rely on an explicit interface at all, interacting instead through continuous awareness of the environment and the user's state.
Correspondingly, the positioning of the machine rises from “execution” to “understanding + planning + execution”, and finally moves towards “autonomous decision-making and continuous action”. In order to support this positioning upgrade, the form of terminal devices, operating systems and application software is also constantly changing: terminal devices have evolved from the earliest mechanical computers, to personal computers, smartphones, and then to AI-native terminals, and may eventually develop into ubiquitous spatial computing platforms. The operating system has shifted from a command-line OS, desktop OS, and mobile OS for hardware resource scheduling to an AgentOS that organizes and services intelligent resources such as models and memories, and finally evolves into a social AgentOS that manages social relationships with multiple agents. Application software has evolved from an application that meets clear needs to an intelligent agent that can complete complex delegated tasks, and finally develops into a social agent authorized with social identity.
The thread running through the development of AI Agent software and hardware is "better understanding of abstract intent + better execution of abstract intent". At the "understanding" level, the system needs the most complete and real-time task context possible; at the "execution" level, it needs to better integrate hardware resources, large-model capabilities, and various tools and services to respond accurately to user intent. This provides a basic framework for discussing the forms of intelligent terminal devices and operating systems. Based on this framework, we explore one possible development path.
1. The form of intelligent terminal devices
Smart terminal devices play a role in bringing AI into the physical world. While laying out large models, leading technology companies are also developing their own hardware ecosystems: Apple has iPhone and Vision Pro, Google has Pixel and glasses, Meta is developing glasses and gesture hardware, Amazon is connecting smart homes through Echo, etc.
OpenAI recently acquired io, the AI hardware startup co-founded by former Apple chief design officer Jony Ive, and is building its own AI-native terminal devices. Sam Altman has painted an interesting scenario: "If you subscribe to ChatGPT, we will send you a dedicated terminal device that you use to access ChatGPT."
From the perspective of how hardware develops, the smartphone will remain the main terminal device for a long time: it has irreplaceable advantages in screen display, mobile computing, and network connectivity. At the same time, new AI-native terminal devices will appear, but rather than competing with phones and PCs they will complement them.

The phone itself, and especially its operating system, will gradually be optimized to support AI Agents. Future phones may, for example, focus more on intent recognition, task scheduling, and cross-device collaboration. The phone will not disappear in the short term; instead, it will develop into an intelligent hub at the "edge".
Better understand abstract intent: Expose sensors, sense context
Understanding the user’s abstract intent requires a combination of physical context and digital behavior. For example, when a user says, “I’m a little tired,” the information needed to understand this abstract intent might include:
(1) Physical context: current time (9 p.m.), user’s location (home or office), ambient noise (quiet or not), user’s physiological state (e.g., high heart rate detected by the device, abnormally low number of steps), lighting conditions, etc.
(2) Digital context: whether there are any unfinished important tasks in the schedule, recent records of continuous overtime work, the habit of “I want to adjust my schedule when I am tired” in user preferences, and the system’s default treatment of “I am tired” in history.
The physical context is mainly obtained by the sensors of the end device. Only by consistently getting context from both dimensions can the agent respond reasonably, such as postponing tonight’s schedule, playing meditation music, turning off message notifications, and reminding the user of tomorrow’s morning schedule.
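As a minimal illustration of fusing the two kinds of context, consider the toy sketch below; every field, value, and threshold is invented for this example rather than taken from any real system:

```python
physical_context = {"time": "21:00", "location": "home", "heart_rate": 92, "steps_today": 1200}
digital_context  = {"open_tasks": ["quarterly report"], "overtime_days_this_week": 3,
                    "preference": "reschedule when tired"}

def resolve_intent(utterance: str, physical: dict, digital: dict) -> list[str]:
    actions = []
    if "tired" in utterance.lower():
        if physical["location"] == "home" and physical["time"] >= "20:00":
            actions.append("postpone tonight's remaining schedule")
            actions.append("mute non-urgent notifications")
        if digital["overtime_days_this_week"] >= 3:
            actions.append("suggest a later start tomorrow morning")
    return actions

print(resolve_intent("I'm a little tired", physical_context, digital_context))
```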
Although current smartphones have a variety of built-in sensors (such as accelerometers, gyroscopes, microphones, cameras, etc.), they cannot always be exposed to the environment due to the limitations of size and wearing style, making it difficult to continuously capture changes in the physical context.
New AI-native terminals need to have two characteristics:

(1) "Comprehensive": full perception. They can perceive the user's gestures, voice intonation, expressions, surrounding context, physiological signals, and more.

(2) "Continuous": always online. Lightweight and easy to wear, with low power consumption and long battery life, they can run continuously, respond at any time, and support long conversations and sustained interaction.
One possible form of device is a brooch, clip, or button:
(1) Lightweight and easily fixed on clothing, so that sensors such as microphones and cameras are always facing the external environment, and continuously collect information such as voice, movement, and ambient light;
(2) No screen and no visual interaction; it focuses on contextual awareness and relies on phones/PCs to present results. Most of the time, AI Agents can communicate and collaborate with each other directly without a GUI; information is shown on the phone's screen only when a human needs to confirm or view results.
By contrast, the AI-native hardware products that failed not long ago, such as the AI Pin and the Rabbit R1, tried to exist independently of the phone-centered device ecosystem. The AI Pin provided a standalone projection GUI, which made the display module bulky and pushed power consumption and heat dissipation out of control; the Rabbit R1 tried to replace the phone outright, ignoring users' dependence on the existing phone ecosystem, habits, and functions.
Therefore, new AI-native terminals may not subvert existing devices, but focus on the goal of “better understanding user intentions”, and develop in tandem with existing devices such as mobile phones and PCs and complement each other’s advantages.
Better execution of abstract intent: end-edge-cloud collaboration
In order to better implement the abstract intention of users, the terminal architecture will develop in the direction of “end-edge-cloud” collaboration. The “end” is the AI-native terminal, which serves as the entrance to perception and interaction; “Edge” is a smartphone/PC or other edge device that undertakes task coordination and moderately complex inference calculation, and provides capabilities such as display and network connection; The “cloud” serves as the cognitive center, responsible for running basic large models, calling external tools and services, and handling complex task chains.
As an edge node, the smartphone is no longer just a communication tool or content-consumption device; it becomes a hub connecting the "end" and the "cloud", and therefore needs stronger heterogeneous computing and multi-device collaboration capabilities. On the one hand, phone chips will integrate more powerful AI capabilities to provide extended computing power for on-device AI-native terminals. On the other hand, phones will need higher-bandwidth network connection modules to ensure stable, real-time communication with end-side devices.
In addition, the I/O modules such as cameras, screens, and speakers of mobile phones are no longer operated only by users, but may be redesigned for AI Agents, which can be scheduled according to task needs. For example, it provides visual or auditory auxiliary feedback during voice interaction to achieve a more natural and efficient human-machine collaboration experience.
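A toy routing policy can make this division of labor concrete; the task fields, thresholds, and tier labels below are assumptions made up for illustration, not a real scheduler:

```python
def route(task: dict) -> str:
    """Decide where a task runs: the always-on 'end' device, the phone/PC 'edge', or the cloud."""
    if task["kind"] == "sensing" and task["latency_ms"] <= 50:
        return "end"    # continuous perception: wake words, gesture and context capture
    if task["compute"] == "moderate" and not task["needs_external_tools"]:
        return "edge"   # on-device inference, display, cross-device coordination
    return "cloud"      # frontier-model reasoning, long task chains, external tools and services

print(route({"kind": "sensing",  "latency_ms": 20,  "compute": "low",      "needs_external_tools": False}))
print(route({"kind": "dialogue", "latency_ms": 300, "compute": "moderate", "needs_external_tools": False}))
print(route({"kind": "research", "latency_ms": 800, "compute": "heavy",    "needs_external_tools": True}))
```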
PC, mobile phone, and AI native terminals will constitute a complete intelligent ecosystem for a person:
(1) PC: Handle relatively complex productivity tasks;
(2) Mobile phones: as the center of mobile computing and communication;
(3) AI-native terminals: As a continuous bridge to the physical world, it is always aware of the environment and understands the context, allowing other devices to serve more intelligently and proactively.
2. The form of intelligent operating system
With the change of terminal device form, the interaction and execution logic of intelligent operating systems must also change. Especially on terminals with complete interfaces, such as mobile phones and PCs, the operating system is no longer just a scheduling platform for applications, but becomes the center of the intelligent agent system: it is responsible for understanding user intent and coordinating models, tools, and vertical intelligent agents to execute intentions.
Better understanding abstract intent: from “behind the scenes” to “in front of the stage”
In the imperative computing architecture, the main entry point for user interaction is the application layer: web pages, software, apps. In the intelligent computing architecture, the OS becomes the core interface through which users express intent: if not the only entry point, then the main starting point.
Specifically, task initiation can take two main forms:
(1) Expressing intents at the Agent OS layer, the OS Agent is responsible for understanding the intent, planning the task, and coordinating multiple vertical agents or directly calling tools to complete the task.
(2) Use the vertical agent as the entry point, which will judge whether it needs to call other tools or collaborate with other agents.
This change will also reshape the operating system's UI. The application layer will retain only a few core apps as independent entry points for vertical agents, which can include user-customized vertical agents for specific needs, such as an agent that grades children's homework: photographing the homework, identifying the errors, and marking and explaining why they are wrong. Most other applications will be demoted to service interfaces that the OS Agent, fronted by a ChatUI, calls when needed.
To better understand abstract intent, the operating system also needs to have strong contextual integration capabilities. Intelligent operating systems need to provide a solution that breaks through the “data silos” of the application ecosystem, with the ability to access, organize, and reference various types of data in the digital world, such as calling calendars, emails, file systems, and third-party app information at the same time to determine the priority and execution path of a task.
At the same time, the OS also needs to open up perception data from the physical world and have cross-terminal perception capabilities, uniformly processing data from sources such as AI-native terminals, wearable devices, and smart home devices. This realizes the integration of physical and digital context across all scenarios and supports more complete and accurate intent understanding.
In addition, in order to support the ability of terminal devices to continuously perceive and respond at any time, the agent operating system layer needs to support resident agents. These agents run in the background and have state memory, context tracking, and event triggering capabilities.
Better execution of abstract intents: refactoring for AI Agents
In order to translate users’ abstract intentions into executable behaviors, intelligent operating systems need to coordinate multiple intelligent resources, including long-term memory banks, knowledge graphs, large models, vertical agents, and various tool interfaces. In Part 2, we discussed the main responsibilities of this layer in the evolution path of AgentOS. Building on this, we further focus on the two main building paradigms for the current Agent: workflow-based vs. model-based. Reviewing the development of traditional operating systems can help better understand the differences and applicable scenarios between these two methods.
In the imperative computing architecture, hardware-oriented operation instructions were initially written directly through assembly language, and each step was manually designed and explicitly invoked, similar to today’s workflow-based agent construction. The developer clearly specifies the trigger conditions, call order, and control structure for each execution step. This approach is highly controllable and interpretable, but has a low level of abstraction and lacks flexibility.
Model-based agents are built more like programs written in a high-level language. This approach no longer relies on explicit process definitions: a large model understands the user's intent, automatically generates a task sequence, and dynamically calls the appropriate tools or sub-agents to achieve the goal. It has a higher level of abstraction, can handle vague and changing user requests, and is better suited to complex interactions in open environments.

Of course, just as a few high-performance, low-level control scenarios still rely on assembly language (such as chip drivers, security modules, and resource-constrained computing tasks), workflow-based agent construction remains indispensable in scenarios with high accuracy requirements, limited resources, or strict security demands. Industrial automation, compliance approval processes, and critical business nodes, for example, require clear and stable execution paths that are well suited to explicit workflow descriptions.
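The contrast between the two paradigms can be sketched in a few lines of Python; the tool names, the llm callable, and the ticket format are hypothetical stand-ins rather than any framework's real API:

```python
def run_workflow_agent(ticket: dict, tools: dict) -> str:
    """Workflow-based: triggers, order and branching are fixed in advance by the developer."""
    data = tools["extract"](ticket)
    if data["risk"] > 0.8:                 # explicit, auditable branch
        return tools["escalate"](data)
    return tools["auto_reply"](data)

def run_model_agent(goal: str, llm, tools: dict) -> str:
    """Model-based: the plan itself is produced by the model at run time, then dispatched to tools."""
    plan = llm(f"List, in order and separated by commas, the tools needed to achieve: {goal}")
    result = goal
    for step in (s.strip() for s in plan.split(",")):
        if step in tools:
            result = tools[step](result)
    return result
```

The first function is fully auditable but rigid; the second trades reproducibility for the ability to handle open-ended goals, which is exactly the trade-off described above.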
Finally, let’s discuss the result presentation of the intelligent operating system. The interface of traditional operating systems is designed for human operation: windows, icons, buttons, touch screen gestures, and other UI elements are designed to help humans complete specific instruction actions. In AgentOS, the core interaction logic shifts to AI-oriented task collaboration, and the UI mainly assumes two functions: the entrance to express intentions and the exit to display results.
As AI coding capabilities continue to grow, intelligent operating systems will be able to dynamically generate the most suitable UI for the current task, just as a browser automatically lays out and renders a page when it loads. Such a UI is task-driven and result-oriented: rendered on demand, created temporarily, and discarded when the task is done.
When the operating system UI no longer focuses on complex human-machine operations but focuses on accurately conveying intent inputs and result outputs, it can become a medium for efficient communication between humans and AI agents, supporting increasingly abstract intent understanding and execution.
3. Application software form
In his talk at YC's AI Startup School, Andrej Karpathy divided the development of software into three stages: Software 1.0, written explicitly by hand; Software 2.0, neural networks trained from data; and today's Software 3.0, based on large models, that is, natural-language programming via prompts.

Building on Karpathy's summary, and combining it with the latest progress of AI Agents and their possible future directions, we extend the scheme with two further stages: Software 3.5 and Software 4.0.
Software 1.0: explicit programming at its core. Developers explicitly encode instruction-level requirements in machine code, assembly, or high-level languages; compilers build the applications, which finally run in a process execution environment.

Software 2.0: task-level abstraction expressed by preparing training samples. For example, to train a sorting model you prepare pairs of unsorted and sorted samples. The trained neural network runs on deep learning frameworks such as TensorFlow and PyTorch. Note that Software 1.0 can only perform tasks, like sorting, that programmers can specify exactly; starting with 2.0, models allow computers to handle fuzzy problems that cannot be directly programmed, such as face recognition.
Software 3.0: with large models at the core, users can express their needs directly in natural language. The software carrier takes two forms: either the large model is used as a programming tool to explicitly generate code, which is then compiled into an application, or the task is completed directly through in-context learning, with the instantly configured large model itself serving as the software carrier.
Software 3.5: the intelligent agent stage. In terms of the human-machine relationship it is still "delegation", as in 3.0, but the abstraction of the requirement rises to the intent level. Users can turn complex personal intents into customized vertical agents through an agent development platform. The runtime environment, AgentOS, provides the necessary agent runtime capabilities such as task planning, tool calling, and memory management.
We can clearly see the difference between software services in phase 3.5 and traditional software. From the perspective of meeting needs, traditional software can often only cover high-frequency, standardized, and static demand scenarios, while vertical agents can deeply handle long-tail, personalized, and dynamic problems, and even deal with complex tasks temporarily proposed by users in context. In terms of usage, the software is changing from a “process-oriented” software interface that requires manual operation and process control to a “result-oriented” one: the user only needs to express their intentions, and the agent can automatically plan, execute, and deliver the final result.
Software 4.0: The social agent stage. At this time, AI is no longer just an agent performing tasks, but a subject authorized to make autonomous decisions and actions in a specific field under the definition of role-level requirements. Users use the social agent modeling platform to construct the environment, role, and boundary rules of the social agent’s operation.
The software form at this stage also corresponds to the operating system and terminal form discussed earlier in this part. The corresponding operating system is social AgentOS, which not only supports the operation of a single agent, but also needs to provide group management functions such as identity credit management, environmental sharing, and social rule engine. The corresponding interactive carrier may not be a terminal device, but a spatial computing platform that can integrate multiple terminals and realize global environment awareness.
Based on the above division of software stages, next, we will focus on the main form of current application software – SaaS, and take a look at the ideal development trajectory that SaaS products in vertical fields may present in the process of moving from software 3.0 to 3.5 and even 4.0.
The evolution of SaaS products toward AI Agents can be seen as a process of providing standard tools, then enabling customized services, and finally building a domain ecosystem. It can be divided into three main stages:
- Agentization: The main change is the shift from traditional, click-based graphical interfaces to more natural, conversational interfaces. At the same time, SaaS manufacturers have begun to embed preset intelligent assistants, upgrading the functions of manual operation and information query by users to intelligent services of “goal-oriented + automatic execution”. For example, SaaS in the field of investment advisory can develop “research report analysis agent”, “asset allocation agent” or “market sentiment tracking agent”. These preset agents have a natural language conversation interface that can understand high-level objectives, complete task planning, tool calls, and deliver final results.
- Platformization: As the complexity of requirements increases, the preset agent can no longer cover all scenarios. At this time, the core capabilities used to build standard agents can be transformed into an open capability base to provide services to the outside world. This includes preset professional workflow templates, toolsets that encapsulate standard functions and data sources, and more. Based on this open platform, users can access their own knowledge base, proprietary data sources, and professional tools (such as strategy models) to build highly customized and exclusive agents.
- Ecosystem-building: when customized agents are numerous and varied enough, the platform can turn to building an agent operating system for its vertical field, supporting different participants in the ecosystem as they share, distribute, and exchange agent resources. This ecosystem can take two complementary forms:
(1) A 2B marketplace, modeled on the App Store, giving professional developers and institutions an agent store in which to publish, subscribe to, and sell their professional agents and tools;

(2) A 2C community, something like an AI Agent version of Xiaohongshu or GitHub, where ordinary users can publish, share, subscribe to, and remix lightweight agents. The two forms can share the same underlying infrastructure, including model bases, tool interfaces, and credit systems, while offering differentiated UIs to different user groups and remaining interoperable through cross-platform sharing. It is foreseeable that this stage will see closed ecosystems led by core vendors emerge alongside community-driven open-source agent ecosystems, jointly advancing the intelligent transformation of vertical fields.
From the perspective of the evolving human-machine relationship, we observe that terminal devices are evolving from smartphones to AI-native terminals and spatial computing platforms, operating systems are also being reconstructed into AgentOS with intelligent scheduling and intent understanding as the core, and application software is moving towards a vertical agent-based agent system with stronger autonomy and collaboration. These changes in software and hardware forms are precisely to understand human intentions and release machine intelligence at a higher level of abstraction.
Looking back over the whole discussion of the AI Agent's development path, behind the layer-by-layer evolution from capabilities to technical architecture to software and hardware forms there seems to be a more basic evolutionary law, which might be called the "law of intelligence scale": the scale of complexity that an intelligent system can effectively cope with determines its level of intelligence.
From the perspective of biological intelligence, this law shows up as the continuous expansion of three dimensions: representation (the amount of data and information that can be processed), execution (tool use and logical reasoning), and collaboration (the scale of social relationships). From the perspective of computing architecture, it shows up as growing "execution depth" at the back end: the chains of operations machines complete grow ever longer, and the execution logic and processes become ever more complex. From the perspective of the human-machine relationship, it shows up as a rising level of abstraction in front-end interaction: from "how to do it" to "who you are", humans use less and less information to mobilize intelligent resources on a larger and larger scale.
The evolution of biological intelligence, computing architecture, and the human-machine relationship provides a frame of reference for understanding the development of AI Agents. It is always easy to find patterns and similarities in the "rearview mirror". Looking forward, however, we must on the one hand judge keenly which "rhymes" the future will actually follow, and on the other hand fully consider the underlying differences between biological and machine intelligence, and between the imperative and intelligent computing architectures.
Biological intelligence evolution is driven by natural selection and is full of chance. The development of machine intelligence is currently mainly driven by human intention and engineering implementation, with stronger targets and faster iterations, so it is possible to skip some stages of biological intelligence development. Similarly, imperative computing is driven by determinism and logic, emphasizing the uniqueness and reproducibility of results. Intelligent computing is probabilistic, context-driven, and its results are often generative and non-unique, focusing more on reasoning, self-adjustment, and feedback in uncertainty.
In addition, from the perspective of human-machine relationship, this paper still focuses on the “human-centric” intelligent agent stage of computing architecture and software and hardware form. When we truly enter the “AI-centric” stage, such as human-machine symbiotic social agents and autonomous agents, the form and technical implementation of AI agents will become more ambiguous and unpredictable. When discussing this further future, we should remain open enough: we are not just facing a smarter and more powerful tool, but a new type of intelligent subject that may have autonomous behavioral logic and higher-order goals.