
Text is dead; vision shall reign. Karpathy raves about the new DeepSeek model, heralding the end of the tokenizer era.

新智元 · 2025-10-21 15:20
Karpathy, itching to start something new, can no longer stand the tokenizer.

A new AI breakthrough! DeepSeek-OCR processes text at the pixel level, with compression ratios routinely below 1/10, and leads on benchmark tests. The open-source release gained 4.4k stars overnight, and Karpathy can hardly wait, envisioning vision as a universal input.

DeepSeek has amazed the world once again!

Its latest release, DeepSeek-OCR, fundamentally changes the game:

text is not the universal input. Vision may take its place!

Moreover, on optical character recognition (OCR) tasks, DeepSeek-OCR truly lives up to its name and is a feat of engineering:

🚀 On a single A100-40G GPU, it reaches roughly 2,500 tokens per second, blazingly fast.

🧠 While maintaining 97% OCR accuracy, it can compress visual context to 1/20 of its original size; even in everyday use, a compression ratio below 1/10 comes easily.

📄 On the OmniDocBench benchmark, it outperforms GOT-OCR2.0 and MinerU2.0 while using fewer vision tokens.

How striking are the results?

An entire page of dense text can be compressed into as few as 100 vision tokens, reaching up to 60x compression on OmniDocBench!

DeepSeek-OCR literally turns text into pixels, like compressing a 100-page book into a single photo that the AI can still read.

Few parameters, high compression, fast inference, and coverage of some 100 languages... DeepSeek-OCR has it all.

Not only is it of theoretical value; it is also highly practical, and it has received overwhelming praise.

The open-source DeepSeek-OCR project gained 4.4k stars on GitHub overnight 🌟:

DeepSeek-OCR demonstrates in practice that physical pages (such as microfilm and books) are a better data source for training AI models than low-quality internet text.

Karpathy, a "Born Computer Vision Researcher," Former AI Director at Tesla, and a Member of the OpenAI Founding Team, Can't Hide His Excitement and Strongly Supports DeepSeek's New Model.

Karpathy can't wait, and he's fed up with tokenizers

Karpathy really likes the new DeepSeek-OCR paper.

But the more interesting question, to him, is this: for large language models, is pixel input better than text input? Is tokenizing text a wasteful, even terrible, way to handle the input side?

DeepSeek-OCR is shaking the central position of text in AI, and vision may become mainstream again!

Karpathy describes himself as a "born computer-vision person" who is only temporarily sojourning in natural language processing, so these questions interest him in particular.

Perhaps all inputs to large language models should simply be images; that may make more sense. Even for pure text input, it may be better to first render it as an image and then feed that to the model (a minimal sketch follows the list below):

Higher information compression => shorter context windows, higher efficiency.

A far more general information stream => no longer limited to plain text; it can carry bold or colored text, and even arbitrary images.

The input can now easily, and by default, be processed with bidirectional attention rather than autoregressive attention, which is much more powerful.

Get rid of the tokenizer (at the input end)!!
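To make the "render text as an image" idea concrete, here is a minimal sketch, assuming Pillow, NumPy, and PyTorch are available; the canvas size, default font, and 16-pixel patches are arbitrary illustration choices, not anything from DeepSeek-OCR's actual pipeline:

```python
# Minimal sketch: rasterize plain text onto a canvas, then cut the canvas
# into ViT-style patches. Hypothetical sizes; not DeepSeek-OCR's pipeline.
from PIL import Image, ImageDraw
import numpy as np
import torch

def render_text(text: str, width: int = 512, height: int = 512) -> Image.Image:
    """Draw black text on a white grayscale canvas (default PIL bitmap font)."""
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).multiline_text((8, 8), text, fill=0)
    return img

def patchify(img: Image.Image, patch: int = 16) -> torch.Tensor:
    """Split the image into non-overlapping patch x patch tiles, flattened
    to vectors (the standard ViT preprocessing step)."""
    x = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0)  # (H, W)
    tiles = x.unfold(0, patch, patch).unfold(1, patch, patch)        # (H/p, W/p, p, p)
    return tiles.reshape(-1, patch * patch)                          # (num_patches, p*p)

page = render_text("The same paragraph, now carried by pixels instead of tokens.")
patches = patchify(page)
print(patches.shape)  # torch.Size([1024, 256]): 32x32 patches of 16x16 pixels each
```

Whether a vision encoder fed these patches actually beats a tokenizer on plain prose is exactly the open question Karpathy raises; the sketch only shows that the plumbing is simple.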

The last point especially: Karpathy has put up with the tokenizer for a long time and has complained many times about how terrible it is:

The tokenizer is ugly and stands apart; it is not an end-to-end component.

It "imports" all the ugliness of Unicode and byte encodings, drags along heavy historical baggage, and introduces security/jailbreak risks (such as continuation-byte issues).

It makes two characters that look exactly the same to the naked eye become two completely different tokens inside the network.

To the LLM, a smiling emoji 😄 is just a strange token, not an actual smiling face made of pixels, rich with information for transfer learning.
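The look-alike problem is easy to demonstrate. Below is a quick check using OpenAI's tiktoken library (any BPE tokenizer behaves similarly); the word choice is arbitrary:

```python
# Latin "e" (U+0065) and Cyrillic "е" (U+0435) are visually identical,
# yet a BPE tokenizer turns the two strings into unrelated token sequences.
# Rendered as pixels, a vision encoder would see exactly the same image.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

latin = "resume"           # all Latin letters
mixed = "resum\u0435"      # final letter is Cyrillic, indistinguishable on screen

print(latin == mixed)      # False
print(enc.encode(latin))   # a short token sequence
print(enc.encode(mixed))   # a different sequence; the Cyrillic bytes break the merges
```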

In short, Karpathy thinks the tokenizer has a long rap sheet and must finally go.

Beyond that, he envisions vision becoming the universal input:

OCR is just one of many vision-to-text applications. Any text-to-text task can be recast as a vision-to-text task, but not the other way around.

So perhaps the user's message is an image, while the decoder output (the "intelligent assistant's" reply) remains text.

How to actually output pixels, or whether you would even want to, is far less clear.

Karpathy now says he has to restrain himself from starting a side project: a version of nanochat that accepts only image input.

Why is image input friendlier to AI?

A netizen asked:

First, why do images get bidirectional attention so easily while text does not?

Second, images may lack a text-style "tokenization" step, but when we cut the input image into patches, don't we end up with something similar, or even worse?

Karpathy replied that bidirectional attention over text is possible in principle; it is just that, for efficiency, text generation is usually trained in a simple autoregressive manner.

One can imagine an intermediate training stage that fine-tunes with bidirectional attention over the conditioning tokens, such as those of the user message, which never need to be predicted or generated.

In principle you could even bidirectionally encode the entire context window just to predict the next token, but the price is that training no longer parallelizes.
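A toy version of the scheme he describes is what is often called a prefix-LM attention mask: bidirectional within the conditioning prefix, causal over the generated response. The lengths below are arbitrary; this sketches only the masking pattern, not any model Karpathy named:

```python
# Prefix-LM mask: prefix positions (user message / image tokens) attend to
# each other bidirectionally; response positions stay strictly causal.
# True = attention allowed.
import torch

def prefix_lm_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))  # causal base
    mask[:prefix_len, :prefix_len] = True  # the prefix sees the whole prefix
    return mask

print(prefix_lm_mask(prefix_len=4, total_len=8).int())
# Rows 0-3 (the prefix) see all of columns 0-3; rows 4-7 (the response)
# remain causal, so the loss over the response still trains in parallel.
```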

As for the second question, he thinks it is, strictly speaking, not a matter of "pixels vs. tokens." The core distinction is that pixels are usually encoded, while tokens are decoded.
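In code, that distinction looks roughly like this; the dimensions are placeholders, and the point is only that the input side is a continuous projection while the output side forces a discrete choice over a fixed vocabulary:

```python
# Input side: continuous. A pixel patch enters the model through a linear
# projection; no vocabulary, no discrete lookup.
import torch
import torch.nn as nn

d_model, patch_dim, vocab_size = 512, 256, 50_000  # placeholder sizes

patch_proj = nn.Linear(patch_dim, d_model)   # "encode": pixels -> vectors
patches = torch.rand(1024, patch_dim)        # e.g. the patches from the earlier sketch
hidden = patch_proj(patches)                 # (1024, d_model), fully continuous

# Output side: discrete. Producing text forces a choice over a fixed
# vocabulary, which is unavoidable as long as the *output* is text.
lm_head = nn.Linear(d_model, vocab_size)     # "decode": vectors -> token logits
next_token = lm_head(hidden[-1]).argmax()    # a hard, discrete choice
print(hidden.shape, int(next_token))
```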

On Karpathy's "nanochat side project" framing, netizens pushed back:

DeepSeek-OCR proves it's not just about compression; it's also about semantic distillation.

The tokenizer era meant literacy; the pixel era means perception.

nanochat shouldn't be a side project. It's the beginning of "optical cognition."

Under the post, netizens begged Karpathy to go ahead and build a nanochat that uses only image input!

Karpathy's former boss and "good buddy" Musk offered an even more sci-fi speculation:

In the long run, more than 99% of the inputs and outputs of AI models will be photons.