The person who works on Chinese at OpenAI
OpenAI research scientist Boyuan Chen posted an article on Zhihu with a very direct opening:
“Hello everyone, I'm Boyuan Chen, a research scientist on the GPT Image team. I was the main person responsible for training the GPT image generation model released last week!”
He also mentioned that this time the team finally fixed the model's Chinese text rendering, and that Chinese users with feedback can reply to him directly.
After the release of ChatGPT Images 2.0, many people's first reaction was: this model's Chinese ability is remarkably strong.
Past image models were somewhat “illiterate”. They could draw landscapes and people, but as soon as Chinese text was involved, the characters collapsed into an illegible jumble. GPT-image-2 is different: it can not only write Chinese characters correctly, but also handle typesetting and paragraph breaks, and generate logically structured Chinese infographics.
The old trick of judging whether an image is AI-generated by looking at its text no longer works on this generation of models.
Boyuan Chen is one of the people who genuinely stepped into the spotlight during the training and demonstration of GPT Image 2. At the launch event, he demoed the text rendering ability alongside Altman. After the release, he shared on Zhihu the stories behind the images on the official website: during the LMArena double-blind test, GPT Image 2 ran under the codename “duct-tape”, and many of the images in the official blog were created by him with the model, including Chinese comics, inscriptions on rice grains, multilingual text, visual proofs, and automatically generated QR codes. Images that look like promotional material are in fact a series of deliberately designed capability tests.
He gave a very interesting explanation for the “duct-tape” code name:
“As for why it's named duct-tape... of course, it's because you can use duct tape to stick a banana on the wall!”
01 He's Asking a More Fundamental Question
Boyuan Chen is not the kind of researcher who commands attention at first glance. He rarely gives public talks, nor does he carefully curate a public persona. He writes blog posts and shares light-hearted content, but these read more like personal records than attempts to build influence.
Instead, his presence comes mainly from the models themselves.
He is currently a researcher at OpenAI, working on the training of image models. Before that, he completed a Ph.D. in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, with a minor in Philosophy, and worked on multimodal model research at Google DeepMind.
The credentials are impressive, but what matters more are the questions he has pursued over the long term.
From DeepMind to OpenAI, Boyuan Chen's research direction has barely changed. While most people are still debating whether a model can write better or draw more realistically, he is concerned with something more fundamental: what exactly does the model “understand”?
Concretely, this breaks down into three questions: How does a model understand images? What is the relationship between images and language? And when a model faces the real world, is it merely generating outputs, or is it simulating the world?
These questions may sound abstract, but they almost determine the boundaries of today's generation of models.
On his personal homepage, he states his research directions plainly: world models, embodied intelligence, and reinforcement learning.
A world model can be understood as one thing: enabling an AI to form an internal model of how the world behaves.
It needs not only to know what is happening in front of it, but also to predict what will happen next.
This is somewhat different from today's familiar LLMs (large language models). An LLM primarily processes language, while a world model is closer to a structure: it has to understand space, time, causality, and the consequences of actions.
To give a simple example, if an AI really “understands” the world, it should know that a plastic cup will bounce when it falls on the ground, while a glass cup will break.
Embodied intelligence and reinforcement learning can be seen as extensions of this question: if a model truly understands the world, it shouldn't just answer questions; it should be able to act, and keep correcting its own judgment in the process.
The work he participates in is often not about optimizing a single task but trying to connect generative models, visual understanding, and decision-making systems.
One of his most representative works is a research project called Diffusion Forcing.
This research tries to solve a very fundamental question: Does the model generate step by step or all at once?
LLMs take the former approach: flexible, token-by-token generation, but errors compound in long content. Diffusion models are closer to the latter: more stable, but they commit to the whole output at once and lack step-by-step flexibility.
Boyuan Chen's approach is to combine the two methods in the same model, letting it generate step by step while still imposing sequence-level constraints.
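A minimal sketch of the core mechanism, as described in the Diffusion Forcing paper: instead of one noise level shared by the whole sequence, every token is assigned its own independent noise level, and a causal model learns to denoise each token given its noisy history. The toy code below only illustrates that idea under simplified assumptions (a linear noise schedule, a tiny GRU denoiser); none of the names come from the paper's or OpenAI's actual code.

```python
import torch
import torch.nn as nn

# Toy causal denoiser: predicts clean tokens from noisy tokens plus
# their per-token noise levels. Illustrative only.
class ToyDenoiser(nn.Module):
    def __init__(self, dim=32, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(dim + 1, hidden, batch_first=True)  # causal by construction
        self.out = nn.Linear(hidden, dim)

    def forward(self, noisy_tokens, noise_levels):
        # noise_levels: (batch, seq_len, 1), one level per token
        x = torch.cat([noisy_tokens, noise_levels], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)

def diffusion_forcing_step(model, clean_tokens, num_levels=1000):
    """One training step: independent per-token noise (the key idea)."""
    b, t, d = clean_tokens.shape
    # Sample a *different* noise level for every token in the sequence.
    k = torch.randint(0, num_levels, (b, t, 1))
    alpha = 1.0 - k.float() / num_levels          # simple linear schedule
    noise = torch.randn_like(clean_tokens)
    noisy = alpha.sqrt() * clean_tokens + (1 - alpha).sqrt() * noise
    pred = model(noisy, alpha)
    return nn.functional.mse_loss(pred, clean_tokens)

model = ToyDenoiser()
loss = diffusion_forcing_step(model, torch.randn(4, 16, 32))
loss.backward()
```

Because every token carries its own noise level, the same model can be sampled flexibly at inference time, for example keeping near-future tokens noisier than the current one, which is what helps stabilize long step-by-step rollouts.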
If Diffusion Forcing is about unifying in the time dimension, then another project he participated in, SpatialVLM, is about complementing capabilities in the spatial dimension.
This project addresses a long-standing gap: a vision-language model can describe what it sees in a picture, but it doesn't really understand spatial relationships. It doesn't know about distance, size, or the relative positions of objects.
To close this gap, his team built a three-dimensional spatial reasoning pipeline, enabling the model not only to “see” but also to “reason”.
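To make that concrete, here is a hedged sketch of the kind of training data such a pipeline can synthesize: lift detected objects into metric 3D coordinates (for example via depth estimation), then automatically write question-answer pairs about their spatial relations. This illustrates the general recipe rather than SpatialVLM's actual code; the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Object3D:
    name: str
    x: float  # metric coordinates in meters, e.g. from a 2D
    y: float  # detection lifted into 3D with estimated depth
    z: float

def distance(a: Object3D, b: Object3D) -> float:
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))

def make_spatial_qa(a: Object3D, b: Object3D) -> tuple[str, str]:
    """Synthesize one question-answer pair about metric distance."""
    q = f"How far is the {a.name} from the {b.name}?"
    ans = f"The {a.name} is about {distance(a, b):.1f} meters from the {b.name}."
    return q, ans

cup = Object3D("cup", 0.2, 0.0, 1.1)
laptop = Object3D("laptop", -0.3, 0.1, 1.4)
print(make_spatial_qa(cup, laptop))
```

Fine-tuning a vision-language model on large volumes of such grounded pairs is what turns “seeing” into quantitative spatial “reasoning”.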
A similar idea also appears in other works, such as the History-Guided method that uses historical information to guide generation or the research that unifies visual, action, and language modeling. These works may seem scattered, but they all point in one direction: making the model form a stable internal representation rather than just outputting results.
Beyond his serious research directions, Boyuan Chen also occasionally shows a very vivid personal interest.
For example, the article he published on Zhihu this time, or the fact that his personal homepage specifically introduces his hobby of making boba tea. Even his Zhihu username is “MIT Boba Shop Owner”.
He also wrote a blog post ranking top computer science schools in the United States, not based on their research strength but on the quality of boba tea.
He ranked UC Berkeley first because the campus is “almost surrounded by high-quality boba tea shops”, while MIT got a relatively low score because “there are too few boba tea shops nearby, and the quality is unstable”.
The tone is light-hearted, but it reflects a research habit: break a complex problem down, find comparable dimensions, then make a judgment.
His work does something similar; the object just happens to be the model.
02 He Avoided the Easier Path
If we look at the development path of image models, the logic in the past was actually quite clear: larger datasets, higher resolution, and a more stable generation process. Most improvements focused on “drawing more realistically”.
However, as models start to handle more complex content, this path has hit a bottleneck: when an image contains not only visual elements but also text, structure, and even logical relationships, the problem is no longer how closely it resembles reality, but whether these different kinds of information can coexist coherently.
The problem has shifted from generation quality to structural consistency.
Not all researchers are willing to work on this kind of problem. It doesn't map onto a single evaluation metric, and it is hard to translate into product impact in the short term. By contrast, improving resolution, style, and detail yields far more visible gains.
Boyuan Chen's path happens to avoid those “easier” directions: since his academic days, he has focused not on the capabilities of a single modality, but on how different capabilities connect.
For a long time, visual models, language models, and decision-making systems developed independently. They can be wired together through interfaces, but internally they remain separate, which is why a model can “call capabilities” yet struggles to show consistent understanding across them.
Boyuan Chen's work is trying to change this situation.
Many of the model's capability demonstrations this time took place at the intersection of “images, text, memes, real objects, and cultural contexts”.
Boyuan Chen said that many of the images in the official blog were made by him, and that the entire blog post was composed of generated images, with no ordinary text at all. In other words, many of the examples users see on the official website are not just promotional material; they are themselves demonstrations of the model's capabilities.
Take the Chinese Easter egg comic as an example.
He wanted the comic to be genuinely funny, so he worked in the “catching the meme” joke and the banana meme. To show off the text ability, he specifically asked the model to include multilingual text in the image, and had it generate very small Chinese characters in the lower-right corner of the hometown poster to test how fine a level of detail the model could handle.
More importantly, the image was not pieced together: according to him, the entire picture, including the pictures within pictures and the pictures within those, was generated in a single pass. Worried that people would assume it had been spliced, he added a note at the bottom of the image.
This is exactly where the difficulty of GPT Image 2 lies. In the past, an image model that could write a few large, error-free Chinese characters was already considered good. GPT Image 2 has to handle an entire hierarchy: it needs to know that this is a picture of a comic, that the comic contains pictures, and that those pictures contain pictures of their own; it needs to place text in different languages at different levels; and it needs to make the relationship between text and image hold, rather than scattering text randomly across the frame.
Another example is the inscriptions on rice grains.
Boyuan Chen said that at first he felt ordinary text rendering wasn't impressive enough, so at a teammate's suggestion he created a 4K image: a pile of rice grains, with an inscription carved into one of them.
This tested the model's text control ability at a very small scale.
There's also the blackboard visual proof.
Boyuan Chen said: “If I asked it to solve ordinary math equations, that would be too easy. Nano Banana seems able to do that with a thinking mode plus text rendering. So I thought of a visual proof I really like, to truly test GPT Image 2's distinctive visual reasoning. The prompt in the image asks it to prove visually (rather than algebraically), on a blackboard, that the sum of odd numbers starting from 1 is a perfect square. Ordinary models can easily derive the algebraic solution, but only a visual model can produce the graphical one.”
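For reference, the mathematics behind that blackboard demo is the classic identity that the sum of the first n odd numbers is a perfect square, and the picture proof works layer by layer:

```latex
% Sum of the first n odd numbers:
\[
  1 + 3 + 5 + \cdots + (2n - 1) \;=\; \sum_{k=1}^{n} (2k - 1) \;=\; n^2
\]
% Visual argument: the n-th odd number, 2n-1, is exactly the
% L-shaped layer (a "gnomon") that grows an (n-1) x (n-1) square
% into an n x n square:
\[
  n^2 = (n - 1)^2 + (2n - 1)
\]
```

Drawing those nested L-shaped layers on a blackboard is precisely the kind of graphical solution that text-only reasoning cannot produce.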
This is also one of the most notable changes in the release of GPT Image 2: it can start to turn an abstract relationship into an image structure and then express this structure visually.
So, rather than saying GPT Image 2 is “generating pictures”, it's more accurate to say it's generating a structured visual expression.
Comics, posters, visual proofs... these things are not just pure pictures in essence. They simultaneously contain text, typesetting, hierarchy, object relationships, task goals, and aesthetic judgments.
Past image models often failed here because they treated an image as a pixel-level result. This generation of stronger image models has to treat an image as a structured expression.
03 He's Not Alone
Inside OpenAI, not many people are actually involved in training the model. After the release of GPT-image-2, research lead Gabriel Goh publicly thanked the team members on social media.
The list is not long, only about a dozen people.
This is more like a small team rather than a large engineering system.
The team members are working in different directions. Some are working on vision, some on the generation mechanism, and some on the system structure, but they all aim at the same thing: enabling the model to have the ability to handle images, language, and structure simultaneously.
The illustration in that tweet reads almost like a metaphor: a group of people sitting around, each responsible for one part, together assembling the same picture.
The model's structure, its capability boundaries, even the question of “what an image should be”, are all worked out gradually inside a team like this.
It's worth noting that there are quite a few Chinese names in this core team of about a dozen people.
In addition to Boyuan Chen, it also includes Jianfeng Wang, who works on vision-language