Andrej Karpathy YC Incubator Wanzi Insight: Entrepreneurial Opportunities in the Software 3.0 Era, Iron Man Suit is the ultimate form of human-agent collaboration

At the YC AI Startup School in San Francisco, Andrej Karpathy delivered a speech that delved into the evolution path of software from 1.0 to 3.0, pointing out that large language models are reshaping software development and interaction, emphasizing the future direction of human-machine collaboration, enhancing human capabilities like an Iron Man suit, and analyzing current entrepreneurial opportunities.

At the YC AI Startup School held in San Francisco on June 16, Andrej Karpathy delivered a keynote speech, systematically sorting out the deep changes in the software paradigm and proposing key directions for future AI system design.

He pointed out that software is entering an era of rapid rewriting and structural restructuring, and the core driving force of this change is the rise of large language models.

Based on Karpathy’s evolution path from Software 1.0 to Software 3.0, the talk delves into key topics such as natural language programming, agent applications, product autonomy sliders, and humans in the loop. The presentation not only reviewed the progress of AI and software convergence over the past decade, but also envisioned a new software world based on human-machine collaboration.

Andrej Karpathy points out that the software industry is in a phase of profound change. Now is the perfect time to enter the software industry, not only because the industry is growing so fast, but also because software itself is undergoing fundamental changes like never before.

In the past 70 years, the basic paradigm of software has hardly changed substantially, but in recent years, there have been two consecutive underlying changes, resulting in the task of large-scale rewriting and refactoring of the existing software system. Using the “software map” as a metaphor, Karpathy references a visualization tool called GitHub Maps that shows the structure of the entire software world. These code repositories represent a collection of all the instructions people write to computers and are the underlying logic for completing tasks in the digital space.

A few years ago, new software forms began to emerge, and Karpathy proposed the concept of “Software 2.0”. Different from the “handwritten code” method in the Software 1.0 era, the core of Software 2.0 is the weight of the neural network, and developers no longer directly write program logic, but adjust the dataset and train the appropriate model parameters with the optimizer. The essence of this process is to train a network capable of performing complex tasks through data, rather than a pile of precise instructions. At that time, neural networks were often seen as classification tools, such as models such as decision trees, but the term “Software 2.0” more accurately describes this paradigm shift.

How can product managers do a good job in B-end digitalization?
All walks of life have taken advantage of the ride-hail of digital transformation and achieved the rapid development of the industry. Since B-end products are products that provide services for enterprises, how should enterprises ride the digital ride?

View details >

It corresponds to GitHub in the era of Software 1.0, and Software 2.0 also has a platform foundation. Hugging Face is seen as the GitHub of this era, and another platform, Model Atlas, visualizes the model “code” itself. On this platform, for example, the parameters of the image generator Flux are presented as image nodes, and each time you fine-tune the LoRA parameters, it’s like “committing a Git update”, and a new image generation model appears in the system.

Therefore, the evolution of the three generations of software can be summarized as:

Software 1.0: Programming computers with code;

Software 2.0: “Programming” the neural network with weight parameters;

Software 3.0: Program language models with English prompts.

Take AlexNet as an example, this image recognition model is a representative work of Software 2.0. The key change is that the emergence of large language models (LLMs) has enabled neural networks to be programmed, a new form of computing that Karpathy has named “Software 3.0.”

In the software 3.0 paradigm, prompts become programs, and language models become general-purpose executors. The most groundbreaking thing is that these programs no longer use specialized language, but are expressed in natural language, such as English. This shift significantly lowers the barrier to boundaries for software interaction and disrupts traditional development processes.

For example, if you also classify emotions, you can choose:

Writing Python scripts (Software 1.0);
training neural network models (Software 2.0);
Provide a prompt directly to the LLM (Software 3.0).

This evolution from code to weight to language represents a gradual abstraction of the computing paradigm and a fundamental innovation in the way humans interact.

The emergence of few-shot prompts gives language models greater flexibility and can behave completely differently with slight modifications, and prompts are becoming a new era of programming language. The current stage is a period of parallel coexistence of software 1.0, 2.0, and 3.0 paradigms, and the code on GitHub has long ceased to be a pure programming language, but is heavily mixed with English natural language, which marks the birth of a new form of code, a new programming paradigm based on human mother tongue has emerged.

Karpathy recalls being blown away when he first became aware of this trend. He immediately wrote and pinned a comment on Twitter: “We are programming computers in English. ”

This trend became more pronounced during Tesla’s development of autonomous driving systems. The system input for autonomous driving comes from sensor data, which is processed by a complex software stack and outputs steering and acceleration instructions. The system was initially filled with traditional software code written in C++, known as Software 1.0, mixed with neural network modules for image recognition. However, as the Autopilot system continues to iterate, neural networks gradually take over more functions, traditional code is constantly deleted, image processing that originally relied on hard coding is replaced by neural network models, and the entire system is gradually “swallowed” by software 2.0.

This wave of “devouring traditional code” is once again staged, but this time it is the logic that software 3.0 has begun to replace part of 2.0. In an environment where three paradigms coexist, new requirements are placed on engineers: not only are they proficient in the three tools, but they also need to be able to switch between them and choose the most appropriate way to build the system as needed.

Part1: How to understand large models?

The paradigm change brought about by the rise of language models has also prompted people to rethink the definition and structure of “computer”. Andrew Ng once likened AI to the “new electricity,” and Karpathy thinks this metaphor is very apt. Laboratories such as OpenAI, Gemini, and Anthropic are building smart grids with huge capital expenditures (CapEx) to train models, while users use the model’s intelligent services on a pay-as-you-go basis through APIs, which is essentially an “on-demand energy” model.

This energy delivery model has all the characteristics typical of public infrastructure: low latency, high availability, good consistency, and high scalability. Just as traditional power grids allow switching to solar, battery, or diesel power generation when the main power supply fails, tools like OpenRouter have emerged in the LLM ecosystem to enable seamless switching between multiple models. Because language models do not compete for physical resources, this “intelligent multi-vendor” architecture is possible.

This is further confirmed by the recent downtime of many mainstream models. When GPT, Claude, and Gemini interrupted services at the same time, AI users around the world instantly fell into a state of “intelligent power outage”, which is a typical “brain blackout” crisis, showing the high dependence of modern people on these models.

However, Karpathy further pointed out that these models are not just tools at the “utility” level, and their training complexity and cost also give them some characteristics of “fabs”. Training an LLM requires far more resources than a power plant, involving not only huge data pipelines, but also cutting-edge hardware, exclusive accelerators, and long-term algorithm accumulation, and the whole process is a highly concentrated reflection of deep technology, R&D capabilities, and intellectual property.

Even so, these metaphors are not appropriate. Because software has reproducibility and low moat barriers, it is easier to be challenged or refactored than hardware systems. If you compare the fab and chip industry:

Companies that have their own chips (like Google’s TPU) and train their own models are similar to Intel;

Companies that only use a common hardware model are similar to fabless.

But in Karpathy’s view, the most appropriate analogy is the “operating system”. Language models are not static commodities like electricity and water, but an increasingly complex technology stack with platform attributes. Just as operating systems include Windows, macOS, and Linux, there are also closed-source and open-source models in the LLM ecosystem: closed systems such as GPT, Claude, and Gemini constitute the “mainstream platform”, while open-source models such as LLaMA represent the Linux of the AI world.

The underlying structure of the language model begins to take on the form of an operating system-like organization: the model itself is the “CPU”, the context window is the “memory”, and the model running mechanism assumes the role of scheduling, such as managing a complete set of task systems. Similar to traditional computing platforms, an application such as VS Code can run on different operating systems; LLM applications like Cursor can also run on top of multiple models by simply switching the model engine.

This pattern is very similar to the computing environment of the 1960s. In those years, hosts were deployed in data centers, and users requested resources through remote terminals. Nowadays, LLMs, as a high-cost centralized computing resource, can only be deployed in the cloud, and people interact with them through APIs, which is essentially a “time-sharing computing” system.

Personal intelligence has not really arrived yet. Not because no one has tried, but because computing power and model resources do not yet have sufficient localization deployment conditions. Some devices, such as the Mac Mini, can already run inference tasks on a small scale (batch=1), limited by memory rather than computing power. However, Karpathy pointed out that the real “personal LLM” is yet to come, and it needs new hardware architectures, interface paradigms, and product forms to achieve “desktop-level intelligence”.

The interaction method represented by ChatGPT is now more like using system functions through a “command line terminal”. While some tasks already have graphical interfaces (GUIs), there is no unified GUI for all LLM applications. This “missing graphical interface” means that the language model has not yet completed its evolution from tool to platform.

In Karpathy’s view, what is most fascinating is that large language models have subverted the historical path of technological proliferation. Every major technology in history—from electricity to computers, from GPS to the Internet—was initially exclusive to the government or military, and then gradually passed on to businesses and individual users. But LLMs do the opposite, and this time, the earliest and most in-depth application scenarios are daily tasks such as “boiling eggs” and popular applications such as writing, researching information, and planning trips.

This is a diffusion process from the underlying social structure, and the government and enterprises have become “catch-ups”. Tools such as ChatGPT have almost “fallen from the sky” on the screens of users around the world, and the popularity of technology and the breadth of application are unprecedented.

Summarize the core characteristics of this paradigm shift:

LLM labs are the new “smart factories”;
The model platform constitutes the operating system ecosystem;
It is currently in the 1960s-style centralized computing pattern;
LLMs are used as public infrastructure;
The path of technology diffusion has been reversed, and Volkswagen has become the first batch of users;

Models quickly penetrate the global desktop environment in the form of terminals.

It is in this context that a new generation of developers is beginning to program these “new computers” into a completely different world of engineering.

Part 2: LLM Psychological Mechanisms

Before officially starting to write prompts and build agents, it is more important to have a deep understanding of the psychological mechanism and structural nature of language models.

Karpathy defines LLMs as “stochastic simulations of people.” It is not a computing system in the traditional sense, but a collection of pan-human language experience and reasoning patterns, implemented by the Transformer architecture, and the prediction of each token consumes a similar amount of computation, simulating the dialogue and reasoning process.

These models form “pan-human language simulators” through a large number of network text training, and naturally show some psychological “personality defects”: such as showing superhuman abilities in some reasoning tasks, but may make low-level mistakes in spelling or logical judgment.

For example, Karpathy says that these models may insist on “9.11 > 9.9” or misspell “strawberry”, making it difficult to tell if they contain two “r”. This is the dual nature of language models – both an omniscient “digital rain man” and an occasional error beginner.

This “jagged intelligence” is extremely difficult to predict, and it behaves like a genius in some ways and a muddy bug in others. Language models also have anterograde amnesia, like a colleague who restarts every day, unable to accumulate experience or consolidate memory. It does not have the ability to learn naturally, does not become intelligent on its own, and must rely on external artificially managed its “working memory”, the context window.

This is also the root cause of many people’s misuse of language models. It does not remember user preferences, styles, or historical communications, nor can it build cognitive and competency systems incrementally like humans. Therefore, Karpathy highly recommends watching “Memento” and “50 First Dates”, where the protagonist wakes up every day and loses his memory, a setting that is highly similar to the operating mechanism of a language model: fixed weights, short-lived memories, and every interaction is a “restart” from scratch.

In addition to memory limitations, language models also have significant security risks. It is highly vulnerable to prompt injection attacks, which can unknowingly leak information or expose sensitive data. We must always keep in mind that this is a “super colleague” with superpowers but cognitive defects, and how to guide his abilities and avoid his shortcomings is the core issue of engineering and product design.

Part3: Karpathy saw the opportunity

Moving on to a more realistic topic: How to use language models effectively? What are the emerging directions worth paying attention to? Karpathy proposes a product form that deserves attention: “partial autonomy apps”.

Taking code writing as an example, although you can directly open ChatGPT, copy and paste bugs, ask questions, and get answers and then post back, this method is inefficient, context management is complex, and the interaction experience lacks structure. A more efficient solution is to design a dedicated application for a specific task, combining a graphical interface with the task flow to obtain a better solution.

Cursor is one of the most widely used LLM programming applications and represents the prototype of this type of “partially autonomous application”. Its key features include:

Automatically manage context: identify what the developer is currently doing and proactively load relevant file content;
Orchestrate multi-model calls: Background calls are made to build code indexes by embedding models, and then call language models to process logic and patches to achieve multi-agent collaboration.

Cursor not only retains the interaction habits of traditional IDEs but also overlays the intelligent capabilities of LLMs to achieve a more natural engineering flow. Karpathy emphasizes that graphical user interfaces (GUIs) are critical in this type of application and should not rely solely on text interaction to complete tasks.

In the case of code patches, instead of having the language model output natural language to explain the changes, use visual red-green diff tags to show the changes in the changes. The at-a-glance interface, shortcut support, and clear accept/reject mechanisms allow humans to efficiently participate in decision-making and oversee the “hands-on capabilities” of the model.

He further proposed the concept of an “autonomy slider,” which allows users to flexibly control the level of intervention by LLMs. In Cursor, you can:

Use only tap completion;
Use the shortcut key to rewrite the selected paragraph (Command+K);
Rewrite the entire file (Command+L);
or give the agent full permission to operate throughout the repository (Command+I).

This “permission adjustment mechanism” allows users to freely switch between human-led and AI-led tasks according to the complexity of the task, effectively balancing the efficiency and controllability of the agent.

Another successful example is Perplexity, which has a feature structure that is highly similar to Cursor:

strong information integration ability and clear interface;
Ability to schedule multiple LLM models to work together;
have a GUI that allows users to retroactively reference and review content;
Provides an autonomy slider to generate a full research report after ten minutes, covering a wide range of use cases, from low intervention to high autonomy.

These product forms inspire a deeper question: most software will evolve into “partially autonomous systems” in the future, so what should they look like?

For product managers and developers, the core challenge is: How do you make a product have controllable intelligent autonomy?

Questions to answer individually include:

Can LLMs “see” all the information that users can see?
Can LLMs act in a way that is acceptable to users?
Can users intervene, supervise, and correct AI behavior at any stage?

These problems constitute the starting point of “new software product design”. In traditional software, behavior paths are defined by hard-coding; In the era of software 3.0, product behavior is constructed by prompts, tool calls, context awareness, and multi-model orchestration, while GUI and “autonomous sliders” have become key mechanisms for undertaking human-machine co-creation logic.

Language models are not perfect systems, they need humans to be “always in the ring” to operate. Just like Photoshop’s diff presentation, a large number of human-oriented buttons, parameters, and panels in traditional software will face reconstruction due to the introduction of AI. GUIs must be designed around how humans can collaborate with AI more efficiently.

The speed of the build-validation closed loop determines the output efficiency of LLM applications. There are two main ways to speed up this closed-loop: one is to optimize the verification process, especially through the graphical interface. GUIs activate the brain’s visual system, making images easier to understand and review more efficiently than processing text. The second is to tie a “rope” for AI to prevent it from acting overly proactive and beyond human control.

Karpathy points out that while the agent’s capabilities are exciting, it’s not advisable to have it commit 1000 lines of diff in the code repository at once. Even with the speed of generation, review is still a bottleneck. He tends to “take small steps” and introduce minimal controllable changes each time, ensuring that each step is correct and reasonable. When AI is too proactive, the system is harder to use and more prone to errors.

Some blogs are starting to summarize best practices for collaborating effectively with LLMs. For example, when writing prompts, it must be specific and clear, otherwise failed validation will result in multiple invalid iterations. The clearer the prompt, the more efficient the verification and the smoother the overall collaboration process. In the future, each user will develop their own style of prompt writing and AI collaboration.

He further explores the impact of the AI era on the education system. Instead of having ChatGPT teach knowledge directly, the teaching task is split into two systems: one for teachers to generate course content; One is for students and is used to convey the lesson. The course itself, as an “intermediate product”, can be reviewed, unified style, and prevent biased questions, which is an effective way to “tethes AI”. This structural design is easier to implement and more in line with the practical requirements of AI teaching systems.

Karpathy compares LLM applications to autonomous driving systems, noting that the core concepts and challenges of the two are very similar. He worked at Tesla for 5 years and was deeply involved in the construction of Autopilot’s “partially autonomous system”. The dashboard is a GUI, and the autonomous slider controls the degree of intervention, and these design concepts can be migrated directly into the AI agent system.

Recalling his first experience with autonomous driving in 2013, he still remembers the feeling of riding in a Waymo test car and wearing Google Glass. The shock of that experience made people think that autonomous driving was about to become universal. However, 12 years later, autonomous driving has not yet been conquered, and a lot of remote takeover and manual intervention are still required. This shows that systems that truly change the world often go through a much longer technological evolution cycle than expected.

The complexity of language models is comparable to that of autonomous driving. The statement “2025 is the year of AI Agents” is too hasty, and it is more realistic to say that this is the ten-year cycle of AI Agents, which requires long-term iteration, continuous human presence and gradual progress. This software form is not a one-time demo project, but an infrastructure evolution process that must be taken seriously.

He again cites his favorite analogy, the Iron Man suit. The suit is an augmentation tool, an agent, and a concrete representation of the autonomy slider. The ideal product should be an “intelligent combat suit” for human-machine co-driving, rather than a fully autonomous “AI robot”. The focus should be on building partially autonomous products, with built-in GUI, UX, interaction logic, and permission adjustment mechanisms, so that human-led, AI-assisted models become smooth and efficient.

The goal of the product should be to build tools that accelerate the closed-loop of generation-validation, while preserving the path to future decentralized automation. The product architecture should have adjustable “autonomy sliders” and design mechanisms to support the gradual release of permissions over time, making the system more and more intelligent.

There are a lot of potential opportunities in this type of product form, and it is a development direction worth focusing on at present.

Talking about the revolutionary significance of natural language as a programming language, Karpathy said that in the past, people needed to learn a programming language for 5~10 years to participate in software development, but now they only need to master natural language. Everyone is a programmer, as long as they can speak, they can control the system, which is an unprecedented change.

He proposed the concept of “vibe coding” – what started as a tweet that flashed in the shower and unexpectedly went viral and was even made into a wiki entry. Vibe Coding is a new programming method based on natural language that collaborates with intelligent systems, transforming the programming process from writing code to “adjusting the system vibe” and describing intent.

He mentioned that Hugging Face’s Thomas Wolf once shared a video of children doing vibe coding. Without any code knowledge, children describe the construction experience through natural language, showing a strong desire for creativity and immersion. He called this video one of his favorites, “How is it possible to feel pessimistic about the future after seeing this video?” ”

Vibe Coding is the “entry drug” for a generation into the software world, which will evoke the creative impulse of billions of people with a very low threshold, bringing an unprecedented wave of developers. He was excited about this generation and personally participated in the attempt.

He built an iOS app with vibe coding, allowing even those who don’t know Swift to collaborate with prompts and LLMs to create a working app. The entire development process took only one day, and despite the simplicity of the application, he was surprised: intelligent collaboration tools have lowered the barrier to development to a very low level, allowing software development to enter an era of true democratization.

Karpathy concluded his talk by sharing his real-world experience building products based on LLMs, further supporting the idea that “everyone is a programmer.” He didn’t learn Swift, but he was still able to build a working iOS app in a day. No longer requires five days to chew through tutorials to get started, Vibe Coding opens up a new creative portal for “people who can’t code”.

He also introduced the MenuGen app he developed, which originated from a real pain point – not being able to understand restaurant menus. He wanted the menu to be accompanied by pictures, so he started developing the tool: users open the web page, take pictures of the menu, and the system automatically generates images of the dishes. $5 free credit per user. Although the app is currently a “huge loss center”, he still enjoys it.

The simplest is the model prototype, and the most complex is the deployment of operations. MenuGen’s web features are completed in a few hours, and the real time-consuming process is the tedious process of going live, integrating authentication, processing payments, configuring a domain name, etc. He cited the integration of Google sign-in as an example, which requires the introduction of the Clerk library and the multiple jump, configuration steps as instructed. This experience made him reflect: if the system already has agent capabilities, why do these repetitive tasks still need to be done manually? The GUI and process logic are still designed for humans, not agents.

This leads to a central question: Can we start building software systems for agents?

There are already standard interfaces such as GUI and API for human interaction with web pages. However, AI Agent, as a “human-like calculator”, also needs to obtain an adaptation layer. He envisioned whether the website could provide a robots.txt-like llm.txt file that clearly explains the page’s purpose, structure, and behavioral logic in a Markdown structure, providing a standardized and efficient interaction portal for LLMs.

At present, most technical documents are written for people to see, containing decorative information such as images and typography, which is not conducive to LLM understanding. Companies such as Vercel and Stripe have begun to use Markdown to write documents, which is an important direction for LLM readability.

He gave an example of his experience learning from the animation framework Manim: by copying the entire document to the language model and describing the animation he wanted to do, the LLM generated code exactly as expected at once, without having to consult the documentation or understand the structure of the library. This means that when documents are properly formatted, non-programmers can also use LLMs for professional creation, unlocking huge creative potential.

However, Karpathy points out that the restructuring of the document is only the first step, and more importantly, changing the way the content is expressed. Descriptions such as “click here” in traditional documentation have no semantics and cannot be executed by LLMs. For example, Vercel is replacing all “click” commands with executable curl commands, allowing the Agent to call the API instead of relying on the GUI. This is an information architecture update that is proactively adapted to the agent.

Anthropic’s Model Context Protocol further defines how agents interact with digital information as consumers. He is very optimistic about this type of protocol, and the agent is no longer just a passive execution model, but becomes an active role of the web, the first type of information citizen.

To support this structural shift, some tools are starting to transform web page data into LLM-usable structures. For example, changing github.com to get-ingest.com can automatically generate a complete directory structure and integrate content text, adapting to LLM processing logic. Another example is Deep Wiki, which not only displays repository code but also calls AI agents for structural analysis and automatically generates documentation pages, greatly improving the reading efficiency and understanding capabilities of language models.

He particularly appreciates these gadgets that allow you to obtain LLM-readable data by simply “changing the URL,” calling them elegant and practical. Although LLMs can perform clicks and interfaces, these operations are extremely expensive and error-prone, and they should be proactively lowered to interact and optimize paths at the infrastructure layer in advance.

At present, a large number of web systems are not designed for agents, and the development activity is low and the structure is chaotic. The long-tail system will never be able to adapt automatically. Establishing an intermediary layer for LLMs is the only way to be compatible with the old system. For modern systems, it is fully capable of building an interactive interface that adapts to the agent from the source.

Finally, Karpathy once again returns to his favorite analogy – the Iron Man suit. This is not a fully automated bot, but an augmentative tool, the ultimate form of agent-human collaboration. In the next 10 years, the entire industry will be like this suit, starting from assistance and gradually delegating power to build an intelligent system with a high degree of automation and high controllability.

Andrej Karpathy YC Incubator Wanzi Insight: Entrepreneurial Opportunities in the Software 3.0 Era, Iron Man Suit is the ultimate form of human-agent collaboration

Part1: How to understand large models?

Part 2: LLM Psychological Mechanisms

Part3: Karpathy saw the opportunity

JD.com vs. Meituan, Cudi won

Several variables affecting JD.com’s takeaway appeared at the same time

Exceeded expectations! Taobao flash sale opened up nationwide in advance, and joined forces with Ele.me to reverse the takeaway war

JD.com VS Meituan: The final deduction of the “takeaway war”

Why is a Hello bicycle more expensive than a bus?

Xiaohongshu Entertainment live broadcast sprints urgently, appearing in the background in early May, and the voice hall may appear, are you ready?

o3 In-depth Interpretation: OpenAI Finally Uses Tool Use, Is Agent Products Dangerous?

The Truth Behind AI App Hits: From Cursor to Arc, PMF’s Key Insights That Determine Life and Death

In-depth Interview Practical Guide: Say goodbye to awkward chats and superficial information, and dig into user treasures

How does AI programming choose the right large model? 4 stages + 6 recommendations

Recently, I wrote two more plug-ins with AI, self-media artifacts

Open and close, Ali is still unwilling

How to balance the stability and scalability of the architecture in the SaaS transformation of HIS system

Xiaomi Mall APP homepage comprehensive revision – in-depth analysis (above)

1.6 million daily active users, slashing $30 million ARR! This AI companion is popular, and its money-making efficiency is far more than C.AI