The era when you can code just by speaking has arrived. Here are the highlights of a 40-minute speech by AI guru Karpathy.
On June 21st, it was reported that on June 18th, Andrej Karpathy, a co-founder of OpenAI and a deep learning expert, delivered a 40-minute keynote speech titled "Software Is Changing (Again)" at the AI Startup School event hosted by Y Combinator (YC) at the Moscone Convention Center in San Francisco. He systematically explained how large language models are shifting software development from "writing code and tuning parameters" to "commanding AI with natural language."
Karpathy argued in his speech that software development has entered the "Software 3.0" stage: the era of traditional handwritten code (Software 1.0) and the era of trained neural network weights (Software 2.0) are being superseded by Software 3.0, where "prompts are programs" and natural language becomes the new programming interface for directly controlling computers.
Meanwhile, Karpathy described three core attributes of large language models: they combine the infrastructure-service character of a power grid, the capital-intensive, hundred-billion-dollar investment profile of chip foundries, and the complex, layered ecosystem of an operating system.
On the cognitive deficits of large language models, Karpathy said they mainly suffer from two: one is "jagged intelligence," marked by outstanding performance on complex tasks alongside frequent errors in basic reasoning such as numerical comparison and spelling; the other is that information cannot be retained once it falls outside the context window.
To address the challenge of controlling the autonomy of large language models, Karpathy proposed a dynamic control framework inspired by Iron Man's suit. Its core is an autonomy slider that allocates decision-making authority in tiers, much like Tesla Autopilot's progression from L1 to L4.
Just as with Iron Man's suit, people can dial the AI's level of autonomy up or down according to a task's complexity and risk, from simple assistive suggestions to fully autonomous decisions, so that humans always retain ultimate control over the system.
The following is a full compilation of Karpathy's speech (Zhidongxi has made some additions, deletions, and edits to improve readability without altering the original meaning):
01 The Evolution of Software: From Writing Code and Teaching Computers to Commanding AI by Speaking
I'm very excited to be here today to talk to you about software in the AI era. I heard that many of you are students, undergraduates, master's students, doctoral students, etc., and are about to enter this industry. Now is actually an extremely unique and very interesting time to enter the industry.
The fundamental reason is that software is once again undergoing a deep transformation. I say "again" because it keeps changing dramatically, which always gives me material for new talks.
Roughly speaking, I think software hasn't changed much at the fundamental level for 70 years, but it has changed rapidly twice in the past few years. This has brought about a huge amount of software writing and rewriting work. I observed a few years ago that software was changing, and a new type of software emerged, which I called Software 2.0.
My idea is: Software 1.0 is the computer code you write; Software 2.0 is essentially the weights of a neural network. You don't write them directly; you create these parameters by curating a dataset and running an optimizer.
At the time, neural networks were often regarded as just another classifier, but I think this framing is more appropriate. Now the Software 2.0 world has its own equivalent of GitHub: I think Hugging Face is the GitHub of Software 2.0, and its Model Atlas also plays an important role.
As an enormously influential platform, Hugging Face gives developers rich resources and convenient tools, promoting exchange and innovation in the Software 2.0 field just as GitHub does for traditional software development. The Model Atlas is an open-source tool for visualizing model repositories, built for Software 2.0; it acts like a huge map of models, letting developers more easily find and use them.
For example, the large central circle in that visualization represents the parameters of the Flux image generator. Every time someone fine-tunes on top of it, it is like a git commit that creates a new image generator.
So Software 1.0 programs the computer by writing code, while Software 2.0 programs a neural network, such as AlexNet, through its weights.
Until recently, these neural networks had fixed functions. The fundamental change, I think, is that neural networks have become programmable through large language models. This is something novel and unique, a new kind of computer, and it deserves to be called Software 3.0.
In Software 3.0, your prompts are the programs for programming large language models. It's worth noting that these prompts are written in English, which is a very interesting programming language.
For example, if you want the computer to perform sentiment classification and determine whether a comment is positive or negative, there are different methods.
The old method, Software 1.0: like an experienced craftsman, you write a lot of code yourself, telling the computer which words count as praise and which count as criticism.
The more evolved Software 2.0: like a coach, you gather many comment examples labeled "praise" or "criticism" and let the computer figure out the rules on its own.
Software 3.0: like a boss, you simply give the large language model an order: "See if this comment is praise or criticism! Only answer 'praise' or 'criticism'!" With that one sentence, the AI understands and immediately gives you the answer. If you change the order to "Analyze whether this comment is positive or negative," its way of answering changes accordingly.
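To make the contrast concrete, here is a minimal sketch of the same sentiment task in all three paradigms. It is not from the talk: the word lists, the tiny training set, the scikit-learn classifier, and the `llm` callable are all illustrative assumptions.

```python
# Sketch: one sentiment task, three programming paradigms (illustrative only).

# --- Software 1.0: hand-written rules ---
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "hate", "terrible"}

def classify_1_0(comment: str) -> str:
    words = set(comment.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "praise" if score >= 0 else "criticism"

# --- Software 2.0: curate labeled data, run the optimizer, keep the weights ---
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

examples = ["I love this", "terrible service", "excellent work", "really bad"]
labels = ["praise", "criticism", "praise", "criticism"]
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(examples, labels)  # the "program" is now the learned weights

def classify_2_0(comment: str) -> str:
    return clf.predict([comment])[0]

# --- Software 3.0: the prompt is the program ---
def classify_3_0(comment: str, llm) -> str:
    prompt = (
        "See if this comment is praise or criticism. "
        "Only answer 'praise' or 'criticism'.\n\n"
        f"Comment: {comment}"
    )
    return llm(prompt)  # `llm` is any text-in, text-out completion function
```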
We can see that code on GitHub is no longer just code; it is interwoven with English. This is a new and growing category of code. It is not only a new paradigm; what surprised me is that it is written in English. That shocked me a few years ago and prompted me to post a tweet about it.
When I was developing Autopilot at Tesla, the stack initially took sensor input at the bottom, processed it with a large amount of C++ (1.0) code and some neural networks (2.0), and output driving commands. As Autopilot improved, the neural networks grew in capability and scale, C++ code was deleted, and many functions originally implemented in 1.0 migrated to 2.0. The Software 2.0 stack literally "ate" the 1.0 stack.
▲Observation of the 2.0 stack "eating" the traditional code stack during the development of Tesla's Autopilot
We're seeing the same thing happen again: Software 3.0 is "eating" the entire stack. Now we have three completely different programming paradigms. I think it's wise to be proficient in all three when entering the industry, as they each have their own advantages and disadvantages. You need to decide: Should a certain function be implemented with 1.0, 2.0, or 3.0? Should you train a neural network or prompt a large language model? Should it be explicit code? We need to make these decisions and may need to switch smoothly between paradigms.
02 Large Language Models Become the New Operating System, and Computing Adopts a Time-Sharing Model
Software is undergoing a fundamental change; nothing this drastic has happened in the past 70 years. For roughly 70 years the underlying paradigm of software barely changed, yet in the past few years it has gone through two structural shifts in quick succession. We are now riding a wave of software rewriting, with a great deal of work to do and a great deal of software to write and even rewrite.
A few years ago, I noticed software beginning to evolve into a new form, which I named Software 2.0 at the time. Software 1.0 is hand-written code in the traditional sense, while Software 2.0 refers to the parameters of neural networks: we no longer write "code" directly but instead adjust data and run optimizers to generate the parameters.
Now the Software 2.0 world also has things similar to GitHub, such as Hugging Face and the Model Atlas. They store different models the way code repositories store code; every time someone fine-tunes on top of the Flux model, it is like creating a commit in that space.
And now the emergence of large language models has brought an even more fundamental change. I think this is a brand-new computer, worthy of being called Software 3.0. Your prompts are now the programs that program large language models, and these prompts are written in English, which is a very interesting programming language.
Andrew Ng once said that "AI is the new electricity," which hits the nail on the head. Companies like OpenAI, Google, and Anthropic invest capital (capex) to train models, then spend operating expenses (opex) to deliver intelligence to developers through APIs. The models are priced per token and metered like electricity. Our requirements for these models are also those of infrastructure: low latency, high availability, and stable output.
▲Companies like OpenAI, Gemini, and Anthropic invest capital to train models, similar to building a power grid
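As a back-of-the-envelope illustration of what "metered like electricity" means in practice, here is a tiny cost estimator. The per-token prices below are placeholders, not real vendor pricing.

```python
# Sketch: LLM usage is metered by tokens, the way electricity is metered by kWh.
# The rates below are illustrative placeholders, not actual vendor prices.

PRICE_PER_1K_INPUT_TOKENS = 0.005   # hypothetical USD per 1,000 prompt tokens
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # hypothetical USD per 1,000 generated tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the metered cost of one API call, in USD."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

# Example: a 2,000-token prompt that produces a 500-token answer.
print(f"${estimate_cost(2000, 500):.4f}")  # -> $0.0175
```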
However, large language models are not just public utilities; they behave more like complex software operating systems. OpenAI and Anthropic are like Windows and macOS, while open-source models are more like Linux. The role of an operating system is not to run one particular function but to provide a platform that carries many more functions.
▲Closed-source vendors such as Windows and macOS have an open-source alternative in Linux
More precisely, a large language model does not complete tasks on its own; it functions as a "runtime system" that carries components such as prompts, tools, and agents. These components plug into the large language model framework and are coordinated by the model's reasoning ability to jointly handle complex tasks.
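Here is a minimal sketch of that "LLM as runtime" idea, not taken from the talk: the model's inference drives a loop, and tools are plugged in around it. The `llm` callable, the tool set, and the "TOOL:" convention are all illustrative assumptions rather than a real protocol.

```python
# Sketch: the LLM as a runtime that coordinates plugged-in tools (illustrative).

def calculator(expression: str) -> str:
    # A deliberately restricted evaluator, standing in for an external tool.
    if not set(expression) <= set("0123456789+-*/(). "):
        return "error: unsupported expression"
    return str(eval(expression))

TOOLS = {"calculator": calculator}

def run_agent(task: str, llm, max_steps: int = 5) -> str:
    # `llm` is assumed to be any function mapping a prompt string to a reply string.
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = llm(context + "\nAnswer directly, or write 'TOOL: <name> <input>'.")
        if reply.startswith("TOOL:"):
            _, name, tool_input = reply.split(" ", 2)
            result = TOOLS.get(name, lambda _x: "unknown tool")(tool_input)
            context += f"{reply}\nRESULT: {result}\n"  # feed the tool output back in
        else:
            return reply  # the model produced a final answer
    return "stopped: step limit reached"
```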
In terms of the computing model, large language model computing is at roughly the 1960s stage. Inference is still expensive, so the compute is deployed centrally in the cloud, and we access it remotely over the network like thin clients.
This resembles the time-sharing model of computing: multiple users queue up to use the same model, and the cloud executes their tasks in batches, just as multiple people once took turns on a mainframe and received computing resources in order.
Interestingly, large language models have reversed the usual direction of technology diffusion. New technologies are normally adopted first by governments and corporations and only later spread to consumers. Large language models are different: they serve ordinary people first, for example helping a user figure out how to boil an egg, while governments and corporations lag behind in adoption.
▲Large language models helping an ordinary user boil an egg
This completely reverses the traditional path, and it suggests that the real killer applications may well emerge from individual users.
In summary, large language models are essentially complex software operating systems. We are "reinventing computing" much as in the 1960s, and for now they are provided in a time-sharing fashion and distributed like public utilities.
What's really different is that they're not in the hands of the government or a few enterprises but belong to each of us. Everyone has a computer, and large language models are just software. They can spread across the entire planet overnight and enter the devices of billions of people.
Now, it's our turn to enter this industry and program this "new computer." This is an era full of opportunities. We need to be proficient in the three programming paradigms of Software 1.0, 2.0, and 3.0 and use them flexibly in different scenarios to maximize their value.
03 Super Memory, Yet Amnesia-Like Forgetting and Cognitive Errors
When studying large language models, we need to spend some time thinking about what they really are. I especially want to talk about their "psychology." In my view, large language models are a bit like human spirits: stochastic simulations of people. The simulator here is an autoregressive transformer, a neural network that processes information token by token and spends almost the same amount of computation on each token.
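A minimal sketch of what "autoregressive, token by token" means, with roughly the same computation spent per generated token. The `model` callable below is a placeholder assumption that returns next-token probabilities, not a real API.

```python
# Sketch: autoregressive generation, one forward pass per emitted token.
import random

def generate(model, prompt_tokens, max_new_tokens=50, eos_token=0):
    # `model` is assumed to map a token sequence to a dict {token_id: probability}.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)  # one forward pass: ~constant compute per token
        next_token = random.choices(
            list(probs.keys()), weights=list(probs.values())
        )[0]                   # sample the next token from the distribution
        tokens.append(next_token)
        if next_token == eos_token:  # stop at end-of-sequence
            break
    return tokens
```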