From the Transformer to GPT-5: first-principles thinking about large models from Lukasz Kaiser, a scientist at OpenAI.
In 2017, a paper with a seemingly simple and even somewhat arrogant title appeared online: "Attention Is All You Need."
In the AI research community of the time, this was an earth-shattering declaration. It proposed to abandon the long-dominant recurrent neural network (RNN) entirely and to process language using only a mechanism called "attention." Initially, many people were skeptical. But the 15-page paper soon spread like wildfire. The Transformer architecture it proposed reshaped the landscape of artificial intelligence with overwhelming force. Today, from the predictive text behind your phone's keyboard, to the stunning image generator DALL-E, to the world-changing ChatGPT, the underlying core traces back to that paper. As of the time of publication, it has been cited 197,159 times on Google Scholar.
The paper's popularity also drew the research community's attention to the authors behind it: eight scientists then at Google, namely Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, Illia Polosukhin, Noam Shazeer, Llion Jones, Lukasz Kaiser, and Aidan Gomez. They became well known in AI circles for this groundbreaking work and came to be called the "Eight Transformer Pioneers."
Several years later, as the Transformer's influence continued to expand, the field of artificial intelligence witnessed an entrepreneurial boom. Seven of the eight pioneers set out on entrepreneurial journeys and became business tycoons in the AI wave. Only one chose a completely different path. He gave up the chance to build a business empire and joined OpenAI, whose ultimate mission is AGI. He was deeply involved in, and helped lead, the core R&D behind GPT-4, GPT-5, and the reasoning models codenamed "o1" and "o3," remaining a persistent explorer at the frontier of human knowledge. He is Lukasz Kaiser.
This October, this legendary figure will return to the center of the stage to share his vision of the future.
From Paris to Mountain View
The story does not start in a Silicon Valley garage full of caffeine and code. It begins in the quiet of Europe's classical academic halls, in the pure world of logic, mathematics, and games. From the very beginning, Lukasz Kaiser's academic DNA was engraved with an ultimate exploration of systems, structures, and rules.
He earned a double master's degree in computer science and mathematics from the University of Wroclaw in Poland, then went to Germany to pursue a doctorate at the prestigious RWTH Aachen University. There he chose an extremely difficult and abstract field: "Logic and Games on Automatic Structures." This is not just about code; it is a philosophical inquiry into the most fundamental rules of the computing world. He tried to answer how machines can understand and manipulate infinite, complex structures defined by finite automata. It is like designing, in advance, the most fundamental operating system for a future AI brain: a set of meta-rules about "how to understand the world."
In 2008, he completed his doctoral dissertation. The following year, a piece of news stirred the logic community: Kaiser had won the E.W. Beth Dissertation Prize. The award is one of the highest academic honors in the worldwide field of logic, language, and information, given specifically to the most groundbreaking doctoral dissertations. The judging criteria are exacting: "technical depth, strength, and originality." Kaiser's win was like a coronation, proof that he had reached the summit of one of the purest fields of theoretical science.
The honor brought more than a 3,000-euro prize. More importantly, it revealed the underlying logic of how Kaiser thinks about problems: he is accustomed to starting from first principles and building a grand, self-consistent, and elegant system to solve them. That paradigm resonates, almost fatefully, with his later role in building the Transformer, an architecture that is itself grand, self-consistent, and elegant.
After finishing his doctorate, he followed the "standard path" of a top European scholar: he continued with postdoctoral research in Aachen and was then hired in 2010 by the LIAFA laboratory at Paris Diderot University, becoming a tenured researcher at the French National Center for Scientific Research (CNRS).
In Paris, he held one of the most enviable positions in European academia: a stable post, sufficient funding, and complete academic freedom. His trajectory seemed set: he would become a respected theorist, spending the rest of his life at the blackboard, exploring the profound universe of logic and games.
Yet history shows striking parallels at critical moments. Just as the physics prodigy Stephen Wolfram, after shaking the theoretical physics community in his early twenties, ultimately chose to leave the ivory tower and build a new computational world, Mathematica, so Kaiser felt, deep down, another stronger and irresistible call.
It was an impulse to move from "proving" to "building." He sensed that a global technological storm was brewing across the ocean in California, and he had to be there.
The Siege of RNN and the Glimmer of "Attention"
In 2013, Kaiser made a decision that shocked all his colleagues: he resigned from his tenured researcher position in France and joined Google Brain.
This was a choice filled with uncertainty. He gave up a clear, distinguished, and stable path and plunged into a field that still seemed "illusory" to many at the time: deep learning. In a later interview, he half-jokingly explained the move: "It's much easier to be a theoretical computer scientist because you can do the same thing for 20 years. You may prove different theorems, but in the big picture, it's the same thing." ("It's much easier because you do the same thing for 20 years...it's in the big picture it's the same thing." - Future of LLMs, Pathway Meetup, 2024).
Behind these seemingly lighthearted words lie a first-rate intellect's boredom with "repetition" and an intense desire for "change." He went on: "Deep learning is completely different. Every two years, you have to do something completely different." ("Deep learning is not like that, every two years you do a totally different thing." - Future of LLMs, Pathway Meetup, 2024).
He keenly sensed that a new era was coming. When he walked into Google's offices in Mountain View, the field of natural language processing (NLP) was encircled by a towering wall, and the name on that wall was "recurrent neural network" (RNN).
In the NLP field of that era, the RNN and its variant, the LSTM, were the absolute rulers. They processed text sequentially, like a human reading word by word. But the mechanism had a fatal flaw: forgetfulness. As sentences grew long, the model tended to forget information from the beginning, the so-called "long-range dependency problem." The entire AI community was busy reinforcing this wall, for instance by designing ever more complex gating mechanisms; almost no one considered that it might simply be torn down.
Kaiser and his team were among the earliest "besiegers." He pointed squarely at the root of the problem: "When neural networks first came out, it was built for image recognition... but sentences and images are completely different." ("When neural networks first came out, it's built for image recognition to process inputs with the same dimension of pixels. Sentences are not the same as images." - AI Frontiers Conference, 2017).
Images arrive in parallel, all at once, whereas the RNN forced language processing into a linear, step-by-step "pipeline."
Even more fatal, the serial nature of RNNs ran against the direction hardware was heading. "RNNs are very slow. They can only process one sentence at a time, very sequentially. This doesn't match well with the GPUs and TPUs being built at that time." ("These RNNs they were quite slow... they were very sequential. So so it was not a great fit for the GPUs and TPUs that were being built at the time." - AI for Ukraine Talk, 2023).
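To make the bottleneck concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. The names, sizes, and initialization are illustrative only, not taken from any system mentioned in this article; the point is simply that step t cannot begin before step t-1 finishes, so the loop over time cannot be parallelized.

```python
# Illustrative sketch (not the original authors' code): why an RNN cannot be
# parallelized across time. Each hidden state depends on the previous one,
# so the T steps must run one after another, even on a GPU or TPU.
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """x_seq: (T, d_in); returns the final hidden state of a vanilla RNN."""
    h = np.zeros(W_hh.shape[0])
    for x_t in x_seq:                                  # strictly sequential loop over time
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)       # step t needs the result of step t-1
    return h

rng = np.random.default_rng(0)
T, d_in, d_h = 50, 16, 32                              # toy sizes, chosen arbitrarily
x = rng.normal(size=(T, d_in))
h_final = rnn_forward(x,
                      rng.normal(size=(d_h, d_in)),    # input-to-hidden weights
                      rng.normal(size=(d_h, d_h)) * 0.1,  # hidden-to-hidden weights
                      np.zeros(d_h))
print(h_final.shape)  # (32,): information from early words must survive 50 sequential updates
```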
Just then, a glimmer of light appeared. In 2014, Ilya Sutskever and others proposed the Seq2Seq model, a breakthrough in its own right. But Kaiser and others soon found it still struggled with long sentences, because the entire input had to be squeezed into a single fixed-length vector. So researchers introduced a mechanism called "attention." The essence of the idea is to let the model "look back" at every part of the input sentence while translating or generating text and dynamically decide which words matter most, rather than relying only on the last hidden state.
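The toy sketch below, written purely for illustration, shows that "look back" step in its simplest dot-product form. The original 2014-15 attention mechanisms used a small learned scoring network rather than a raw dot product, so treat this as a simplification of the idea, not the original design.

```python
# Illustrative sketch of the "look back" idea: for one decoding step, score every
# encoder state against the current decoder state, turn the scores into weights,
# and mix ALL input positions into a context vector.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """decoder_state: (d,); encoder_states: (T, d).
    Returns a context vector built from all input positions,
    instead of relying only on the last hidden state."""
    scores = encoder_states @ decoder_state   # (T,) relevance of each input word
    weights = softmax(scores)                 # which words matter most right now
    return weights @ encoder_states           # (d,) weighted summary of the input

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))   # 6 input words, 8-dim states (toy sizes)
dec = rng.normal(size=(8,))
context = attend(dec, enc)
print(context.shape)            # (8,)
```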
This glimmer was at first just an "enhancement patch" on top of the RNN, but Kaiser and his colleagues realized its potential might be far greater. A subversive question began to brew within the team: what would happen if we tore down the RNN wall entirely and kept only this beam of "attention" light?
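As a rough illustration of what "keeping only attention" means, here is scaled dot-product self-attention over a whole sequence, computed with plain matrix multiplications and no recurrence at all. The shapes and parameter names are illustrative; the real Transformer adds multiple heads, positional encodings, residual connections, and feed-forward layers, so this is a sketch of the core idea rather than the full architecture.

```python
# Hedged sketch: scaled dot-product self-attention with no recurrence.
# Every position attends to every other position, and the whole (T x T)
# interaction is a batch of matrix multiplies, which maps well to GPUs/TPUs.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (T, d_model). Returns new (T, d_k) representations of all positions at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                # queries, keys, values: (T, d_k) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (T, T) all pairwise relevances at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                 # mix values according to the weights

rng = np.random.default_rng(0)
T, d_model, d_k = 10, 16, 8                            # toy sizes
X = rng.normal(size=(T, d_model))
out = self_attention(X, *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
print(out.shape)  # (10, 8): no step-by-step loop over time anywhere
```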
The Eight Pioneers Assemble and Achieve Legendary Status in One Battle
This crazy idea brought together the top minds at Google Brain: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Illia Polosukhin, and Lukasz Kaiser.
They faced an unprecedented engineering and research challenge. To iterate quickly on a model built entirely on attention, they needed a powerful experimental platform. That task fell to Kaiser and Aidan N. Gomez, then an intern, who set out to build a new open-source library: Tensor2Tensor (T2T).
This was about more than writing code. T2T reflects Kaiser's deep concern for "inclusive AI." He had long felt that the barrier to entry for deep learning was too high: "We found it is still quite hard for people to get into machine learning, start their first model, get their system working." (AI Frontiers Interview, 2018).
In 2017, the paper was completed. The title, proposed by Jakob Uszkoreit, was brimming with confidence, even a touch of "arrogance": "Attention Is All You Need." It summed up their core idea perfectly: the attention mechanism is not a supporting actor; it is everything.
In the footnote of the paper, there was a humble and touching sentence: "Equal contribution. Listing order is random."
This not only reflects the team's spirit of collaboration but also adds a strong legendary color to this story.
"Attention is All You Need" is not just an academic paper. It is a foundational article for large - model theory, the key to opening a new era of artificial intelligence, and it opens an unprecedented door to artificial general intelligence (AGI).
When it was posted on arXiv, a powerful shock ran through the AI community. Ilya Sutskever, then a co-founder of OpenAI, later recalled that when he read the paper, he immediately realized that "this was all we needed."
That arc from doubt to shock to complete conviction spread rapidly. With its unmatched parallelism and its excellent handling of long-range dependencies, the Transformer architecture demolished the RNN wall, quickly became the new paradigm in NLP, and soon radiated into almost every subfield of AI, including computer vision, speech recognition, and bioinformatics.
The eight authors achieved legendary status in one battle.
While everyone was cheering for the success of the Transformer, Kaiser's eyes were already looking further ahead.
"One Model to Rule Them All"
In the same year that "Attention Is All You Need" was published, Kaiser, as lead author, together with several of the eight pioneers, published another paper that seemed less "mainstream" at the time but was even more ambitious: "One Model To Learn Them All."
In it, they proposed a single model, called MultiModel, that could handle eight completely different tasks at once, including image classification (ImageNet), multilingual translation (WMT), image captioning (MS-COCO), speech recognition, and syntactic parsing. Although its performance on each individual task did not surpass the "specialist" models, this was the first time researchers seriously demonstrated that a unified deep-learning architecture could jointly learn knowledge from multiple domains.
This paper was Kaiser's first public "whisper" of his pursuit of artificial general intelligence (AGI). The core question he posed: "Could we create one deep-learning model to solve tasks from multiple domains?" (AI Frontiers Interview, 2018).
In an interview at that time, he candidly reflected: "Does this model understand the world? Does it really give us something more general than the specific intelligence that we have now? It is hard to say, but we're on the way. And maybe, in a few years, we can tell more."
The sentence reads like a prophecy. It foretold the trajectory of Kaiser's career, which would lead, inevitably, from Google Brain, where he solved "specific" problems, to the organization whose ultimate mission is "generality."
Meeting Legends and Witnessing the Future
The Transformer's enormous success set off an entrepreneurial boom in AI, and the eight authors' paths began to diverge. Aidan Gomez founded Cohere, Noam Shazeer founded Character.ai, Ashish Vaswani and Niki Parmar founded Adept AI Labs... They became CEOs and CTOs, commanded attention in the capital markets, and turned Transformer technology into business empires.
Lukasz Kaiser, however, once again made a different choice. In 2021, he left Google after eight years and joined the most radical organization on the path to AGI at the time: OpenAI.
He became the only one of the "Eight Transformer Pioneers" who has not founded a company, choosing instead to remain at the frontier of technical research.
It is a fateful choice, rooted in Kaiser's pure curiosity about the ultimate questions of AI, a curiosity that outweighs the pursuit of wealth and commercial success. With his actions, he seems to be answering his own question from years before: he chooses to keep walking the road toward "general intelligence," however long and difficult it may be.
At OpenAI, Kaiser's talent has been fully unleashed. He was deeply involved in the R&D of large models such as ChatGPT, GPT-4, and GPT-5, and co-invented the reasoning models codenamed "o1" and "o3." This work sits at the forefront of today's large language models.
Kaiser's story is an epic of intellect, perseverance, and foresight. He is a poet of logic, a dream-builder of AI, and a solitary traveler who, amid the tide of the times, always chooses to follow the flame within. Every choice he makes is not a shortcut to fame and fortune but points to that