Will it still be humans who change the world in the future? OpenAI's Chief Scientist puts it bluntly: AI is the key force.
On August 16th, in the latest episode of the OpenAI podcast, host Andrew Mayne (a former OpenAI engineer) was joined by two special guests: Chief Scientist Jakub Pachocki and researcher Szymon Sidor.
The pair looked back on their journey from high-school classmates in Poland to colleagues at OpenAI. They also delved into key issues in AI development, including how Artificial General Intelligence (AGI) should be defined and measured, landmark technological breakthroughs, the challenges facing benchmark testing, and AI's actual impact on education, scientific research, and society. The core viewpoints are as follows:
● Evolution of AGI's definition and measurement: AGI has evolved from an abstract concept into a multi-dimensional set of capabilities. Milestones such as winning a gold medal at the International Mathematical Olympiad (IMO) are meaningful, but isolated breakthroughs are no longer sufficient; going forward, more attention should be paid to AI's impact on automated scientific research and real-world applications.
● Trajectory of AI breakthroughs: From the limitations of early sentiment analysis to successive iterations of the GPT series, models can now compete in contests like the IMO, the International Collegiate Programming Contest (ICPC), and Japan's AtCoder, demonstrating strong reasoning and creative thinking.
● Challenges and "saturation" in benchmarking: Many benchmarks have become "saturated," with models approaching or exceeding human levels, yet such tests struggle to reflect intelligence comprehensively. Measurement needs to shift toward real-world utility and the ability to discover new insights.
● Impact of AI on education and talent development: AI can serve as an educational aid, but the emotional support teachers provide is irreplaceable. Education needs reform to cultivate soft skills such as structured and critical thinking, and programming is an effective way to build these abilities.
● Future breakthrough directions and the trust threshold: A model's persistence, its ability to stay focused on a problem for extended periods, is an important direction of development. AI must clear a trust threshold, balancing value against security when accessing personal data and guarding against abuse.
● Wide-ranging impact and the pace of AI development: As with the Internet's effect on the economy, AI's role is hard to capture in a single indicator. Although its progress can appear to "bottleneck," over the long run the gains are significant, and AI will profoundly reshape industries and society.
Here is the essence of this podcast:
Andrew Mayne: Hello, everyone. I'm Andrew Mayne. Welcome to the OpenAI podcast. Today's guests are Jakub Pachocki, the Chief Scientist of OpenAI, and Szymon Sidor, a senior researcher. We'll discuss how to measure the progress of AI, the definition of AGI, and the possible directions of the next breakthrough.
Our models can accurately identify areas where they haven't made progress. At the same time, we're seriously considering whether, as an organization, we're ready to handle such a rapid pace of development. When planning OpenAI's research projects, our goal is to create a highly general intelligence.
I'd like to know your specific responsibilities first. Pachocki, are you the Chief Scientist of OpenAI?
Jakub Pachocki: Yes, I hold the position of Chief Scientist.
Andrew Mayne: What does the Chief Scientist do specifically?
Jakub Pachocki: My main responsibility is to set the research roadmap for the company, which means determining the technological paths we'll bet on and the underlying research directions we'll pursue in the long term.
Andrew Mayne: Sidor, what's your job?
Szymon Sidor: My work is quite diverse. I mainly work as an individual contributor (a role that creates value directly through one's own professional skills rather than through management responsibilities), and I occasionally take on leadership roles. Overall, I do whatever is most valuable.
Andrew Mayne: You two knew each other before joining OpenAI, right?
Jakub Pachocki: Yes, we attended the same high school.
Szymon Sidor: That's right, the same high school.
Andrew Mayne: Were you friends in high school?
Szymon Sidor: I think we became close friends after high school. Coming to the United States was, in a way, an emotional journey that deepened our friendship. In high school, we were more like academic peers.
Andrew Mayne: What kind of high school could produce talents like you?
Jakub Pachocki: Our high school was in Poland. What drew us both there was a computer science teacher named Ryszard Sobolewski. Even before we enrolled, he had a remarkable track record of cultivating computer scientists and programmers, with a particular focus on programming competitions and on helping students excel at them.
Szymon Sidor: That experience had a profound impact on our growth. He was an excellent mentor. His programming teaching went far beyond the ordinary high-school curriculum, covering topics like graph theory and matrices. I hope that with ChatGPT today, people can engage in that kind of in-depth learning more easily, because without a suitable tutor and a great deal of effort, the experience is hard to replicate.
1 Evolution and Measurement Criteria of AGI
Andrew Mayne: How should we define and measure the abilities of something like ChatGPT, which can instantly generate interactive multimedia and solve teaching problems? And if we talk about AGI, how would you explain it from a technical perspective, and from a layman's?
Jakub Pachocki: Taking the teaching scenario you just mentioned as an example, ChatGPT can indeed play an important role: it can explain concepts more clearly, offer diverse teaching methods, and complement educators like Sobolewski. It's important to emphasize, though, that AI can't replace teachers. Teachers don't just impart knowledge; they provide emotional support and a learning atmosphere, which is still difficult for AI to achieve on its own.
Andrew Mayne: This is very important. People often say AI will replace education, but that view overlooks exactly this point. I've met teachers whose knowledge wasn't always accurate, but whose dedication and care were sincere and who would patiently answer any question. So these tools are really educational aids that let teachers work more efficiently. Back to AGI: I'd like to hear a non-technical description first. How would you explain it to your siblings?
Jakub Pachocki: When we talked about AGI a few years ago, the concept was still quite abstract and seemed far away, despite the great potential of deep learning. At the time, human-level intelligence, natural conversation, solving math problems, and conducting research all seemed to belong to the same category. As the technology has developed, we've found that these are actually very different abilities.
Today, AI can hold natural conversations on a wide range of topics and solve math problems. Winning a gold medal at the IMO, long recognized as a milestone on the road to AGI, has already been achieved. Solving every problem at the IMO is harder still and marks another milestone.
However, I increasingly feel that this kind of point-based measurement is no longer sufficient, so we've started to focus on AI's actual impact on the world. Personally, when I think about how AI can have a truly meaningful impact, the first thing that comes to mind is its potential to automate the discovery and production of new technologies.
We usually attribute new ideas and fundamental technological advances to human creativity, and we measure progress through major inventions and technological revolutions. But imagine that much of this process could be automated: a large-scale computer system might come up with ideas that fundamentally change our understanding of the world. I think that day is not far off. So thinking about how far we are from that goal, and about the impact such technology could have, is my top priority.
2 OpenAI's Mission: Prioritize General Intelligence
Andrew Mayne: I just ordered a Mac Studio and want to run an open-source model with GPT4All (an open-source project from Nomic AI that lets users run large language models locally on personal devices, without relying on cloud services), having it generate content and process tasks around the clock. The idea really appeals to me. But you mentioned large-scale automated scientific research. What kinds of discoveries or results might we see first?
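What that looks like in practice is simple; here is a minimal sketch assuming the gpt4all Python bindings (the model filename below is a placeholder, and the exact API can vary between versions):

```python
# Minimal sketch of fully local inference via the gpt4all Python bindings.
# The model name is a placeholder; gpt4all downloads the file on first use.
from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")  # placeholder model file

with model.chat_session():
    # Everything runs on the local machine; no cloud API calls are involved.
    reply = model.generate("Draft a short summary of today's notes.", max_tokens=128)
    print(reply)
```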
Jakub Pachocki: In planning OpenAI's research projects, we regard creating highly general intelligence as our core mission. We prioritize building AI systems capable of automated research rather than narrowly confining the technology to specific domains. Focusing on a specific domain can yield local results relatively quickly, but the history of technology suggests that the truly transformative discoveries, the breakthroughs that matter most for human progress, usually come from intelligent systems with strong generality.
The applicability of AI varies significantly across fields. Fields that demand deep reasoning and a tight combination of specialized knowledge and intuitive judgment are the best fit for current AI systems.
Take medicine as an example. We've already seen many exciting results: AI is playing an increasingly important role in medical image analysis, diagnostic assistance, and drug research and development. It can help doctors identify diseases more accurately and formulate personalized treatment plans, greatly improving the quality and efficiency of care, which makes us very optimistic about AI's future in medicine.
As a company focused on AI research, we also think about how to automate our own research work. Imagine AI reaching the level of independently conducting AI research: it would greatly accelerate the research process and bring immeasurable value, undoubtedly a major leap in the field's development.
Similarly, exploring how to automate AI alignment and safety research is of great practical significance. Automation would let us detect and prevent potential risks more efficiently, keep AI systems consistent with human values, and steer development in a safe and reliable direction, laying a solid foundation for large-scale deployment.
Andrew Mayne: The IMO results are obviously impressive. I'd add that a few years ago, when we talked about Pachocki's own participation in the IMO, we were still struggling to define AGI. One criterion we considered was whether a model could solve every problem at the IMO. That seemed appropriate, because a model with such strong mathematical reasoning ought to revolutionize every field that can be modeled mathematically.
This podcast is a great opportunity to share some insider perspective. I'm struck by the speed of AI's development. Sometimes we see reports saying AI's impact on the economy is only 3% or 5%, followed by commentary about AI slowing down and being over-hyped. When I read those reports, I think back to working with deep learning for natural language processing about ten years ago, when the technology barely worked. I remember Pachocki coming to test a sentence-level sentiment recognition system we had built. "This movie is terrible" was correctly classified as negative; "This movie is good" was correctly classified as positive; but "This movie is not terrible" was misclassified as negative.
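That failure mode is easy to reproduce. Here is a minimal, purely illustrative sketch of a keyword-counting classifier (the word lists and scoring are invented, not the system described above); because it ignores negation, it makes exactly the same mistake:

```python
# Naive keyword-counting sentiment classifier (illustrative only).
NEGATIVE = {"terrible", "awful", "bad"}
POSITIVE = {"good", "great", "excellent"}

def naive_sentiment(sentence: str) -> str:
    # Count polarity words while ignoring context such as negation.
    words = sentence.lower().rstrip(".!?").split()
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return "positive" if score > 0 else "negative"

print(naive_sentiment("This movie is terrible"))      # negative (correct)
print(naive_sentiment("This movie is good"))          # positive (correct)
print(naive_sentiment("This movie is not terrible"))  # negative (wrong: negation ignored)
```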
Szymon Sidor: That was ten years ago. Since then, such tasks have gradually been solved, from telling whether a word is a noun or a verb to the so-called "Sentiment Neuron" phenomenon, in which a model trained only to predict the next character developed a single unit that tracked sentiment on its own.
Then GPT-1 and GPT-2 came out. They could generate meaningful paragraphs of text, a major breakthrough at the time even if it seems simple now. Next came GPT-3 and GPT-4. For me, GPT-4 was my personal "AGI moment," because it sometimes said genuinely surprising things, making me wonder whether a model really could surprise us. Until then, ChatGPT had seemed to me just a slightly better tool than Google, nothing particularly important. But with deep research it could suddenly answer questions accurately and rarely fabricated content, which made it genuinely useful.
Now our models can compete in programming competitions, a hard-won achievement for me personally and for the whole team. From the perspective of the people building this technology, the pace is astonishing. So when you see the 3% figure, remember that ten years ago the proportion might have been around 0.00001%. These numbers have to be read in context. There's no reason not to believe the proportion could reach 10% in a year, 20% in two, and so on.
3 Challenges and Future Directions in Benchmark Testing
Andrew Mayne: I've heard that if you look at a chart of the global economy since the early 1990s, it's hard to find a clear inflection point marking the Internet's impact. There's no single moment where you can say, "This is when Tim Berners-Lee announced the birth of the Web." I think it's similar with AI: it's hard to measure because it's hard to track who is using it and how.
Your point about long-term observation really resonates with me. I remember training a simple next-character predictor on my own computer; the results were poor, partly because of the limited computing power at the time. Even so, things later improved with sentiment analysis and the BERT model. When GPT-2 came out, I read through its outputs on GitHub line by line, because I realized something big was happening.
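A "simple next-character predictor" can be as basic as counting which character most often follows which. The sketch below is a hypothetical illustration of the idea, not the model Mayne describes:

```python
# Tiny character-level next-character predictor built on bigram counts.
# Purely illustrative; real systems learn distributed representations instead.
from collections import Counter, defaultdict

def train(text: str) -> defaultdict:
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1  # record how often nxt follows cur
    return counts

def predict_next(counts: defaultdict, ch: str) -> str:
    # Most frequent successor seen in training, or a space as a fallback.
    return counts[ch].most_common(1)[0][0] if counts[ch] else " "

model = train("the quick brown fox jumps over the lazy dog. " * 100)
print(predict_next(model, "t"))  # 'h': in this corpus 't' is usually followed by 'h'
```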
This is also why I eventually joined OpenAI. Once I got hands-on with GPT-3, I was even more convinced this was the right direction. But the situation now is a bit strange: if there's no new breakthrough within six weeks, people feel "progress has stalled." The problem is that benchmarks reflect only limited progress, and some test answers are even inaccurate, so a model may not get full marks even when it answers correctly. Internally we've also discussed "saturation": models have reached or are approaching human level on standardized tests, and further optimization can't significantly improve the scores. What do you think?
Jakub Pachocki: We do face some problems with benchmarks today. One obvious problem is saturation: models have reached human level on many standardized tests of intelligence or ability. In extremely difficult high-school competitions, for example, models can now compete with the world's top contestants, which makes strict quantitative measurement quite difficult.
The other problem is specialization. In the development paradigm from GPT-1 to GPT-4, benchmarks were really measuring across-the-board improvement. The field has since developed much more efficient training methods that target specific abilities, producing models that are far better in one area (such as math) than in others (such as writing). Such models score higher on math benchmarks, but that doesn't reflect their overall intelligence. Given these two problems, we really need to focus on models' actual utility, especially their ability to discover new insights.
Andrew Mayne: Exactly. A model may be good at passing tests without being practical. Ideally a model performs well on tests, but a high score doesn't equal real-world value. When people evaluate a model, they often expect it to do well across hundreds of different application scenarios, which is very hard. Some models excel at creative writing but not math, and vice versa. That's the challenge we face. Earlier we discussed math competitions like the IMO: why are these benchmarks important, and what's the significance of having models enter human competitions?
Jakub Pachocki: Competitions like the IMO and the International Olympiad in Informatics are interesting test cases. They have clear constraints and don't require a large knowledge base, yet they genuinely test the ability to think deeply and reason over a few hours. The problems are extremely difficult, and many talented people train hard to compete in them, which makes these contests important milestones for AI.
Andrew Mayne: And a model that reaches gold-medal level at the IMO uses no calculators, external tools, or frameworks; it completes the problems through reasoning alone.
Jakub Pachocki: Yes. Two years ago, such models couldn't even multiply two four-digit numbers correctly. Now, on IMO problems, they show creative thinking rather than rote formula application.
Andrew Mayne: But the challenge is that once we move beyond math, things get more complicated. You can design a test like "Humanity's Last Exam" (a benchmark of extremely difficult, expert-written questions). However ingenious it is, you'll find that some models do better on it after learning to use specific tools. So I wonder what kind of benchmarks we need. How would you measure an ability objectively?
Szymon Sidor: I remember a small moment. When I was excited about the IMO progress, my colleague Anna Makandu asked, "What's that?" It made me realize we might be a bit insular. The IMO and the ICPC are very important in my life and in many of my colleagues' lives