The father of Redis stepped in and built a dedicated inference engine for DeepSeek V4.
Henry from Aofeisi, QbitAI | WeChat official account QbitAI
DeepSeek V4 has already forced overseas developers to build an exclusive highway for it.
Just two weeks after its release, the first batch of V4-native infrastructure has emerged in the open-source community.
And it's not a "minor patch" that wraps a shell around an existing framework.
It's not a general-purpose GGUF loader, not a wrapper around llama.cpp, and it doesn't support any other model at all.
It only does one thing:
Run DeepSeek V4 Flash to the extreme on Mac.
This "exclusive highway" is called ds4.c. And the person who built it is quite remarkable —
Salvatore Sanfilippo. Programmers in the circle are more familiar with his other name: antirez.
He single-handedly created Redis (74,000 stars on GitHub) and led the world's most popular in-memory database for a full 11 years.
Now, his new project, ds4.c, is a local inference engine specially designed for DeepSeek V4 Flash.
Over on the timeline, people have already got it running on a 128GB Mac.
You could say DeepSeek has cleared out Mac inventory once again.
Well worth it for the "whale" (DeepSeek).
A Local Inference Engine Specially Designed for V4 Flash
On April 24th, DeepSeek released the V4 series. V4 Flash is the efficiency model of the lineup: 284B total parameters, 13B active parameters, and a 1-million-token context.
In the past, a model of this scale was almost always assumed to run in the cloud.
What antirez wants to do is to fit it into a Mac. Thus, ds4.c was born.
This is an inference engine written from scratch using C + Metal.
The entire project consists of just a handful of files: 55.4% C, 30.2% Objective-C, and 13.8% Metal. No runtime, no framework dependencies, no abstraction layers.
Metal-only.
Metal is Apple's own graphics and compute API, used to drive the GPU on Mac, iPhone, and iPad. It is, in effect, the CUDA of the Apple ecosystem.
Because ds4.c uses only Metal, the engine runs exclusively on Apple Silicon; Nvidia and AMD GPUs are simply out of the picture.
The entire project has only one goal:
Make V4 Flash not only "runnable" but truly "usable" on local Apple machines.
The current test results are quite astonishing:
On a 128GB MacBook Pro M3 Max, with 2-bit quantization and a 32K context, prefill for short prompts runs at 58.52 tokens/s and generation at 26.68 tokens/s.
On a 512GB Mac Studio M3 Ultra, prefill for a long prompt (11,709 tokens) reaches 468.03 tokens/s, with generation at 27.39 tokens/s.
For an MoE model with 284B parameters, that is genuinely usable speed on a local machine.
How is it achieved?
The key lies in three things.
First, asymmetric quantization.
ds4.c doesn't compress every parameter to 2-bit. It only quantizes the routed MoE expert layers: the up/gate projections use IQ2_XXS and the down projections use Q2_K. These layers account for the vast majority of the model's size.
Everything else, including the shared experts, the projection layers, and the router, is kept at full Q8 precision.
antirez wrote a very straightforward sentence in the README:
These 2-bit quantizations are not a joke. They hold up well under a coding agent and can call tools reliably.
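To make the split concrete, here is a minimal sketch of what per-tensor quantization selection could look like. The enum, the helper function, and the tensor-name patterns are illustrative assumptions, not ds4.c's actual code; only the policy itself (routed experts at 2-bit, everything else at Q8) comes from the README.

```c
/* Sketch only: pick a quantization type per tensor by name.
 * The naming patterns and this helper are hypothetical, not ds4.c code;
 * the policy (routed experts in 2-bit, the rest in Q8) is from the README. */
#include <stdbool.h>
#include <string.h>

typedef enum { QUANT_Q8_0, QUANT_IQ2_XXS, QUANT_Q2_K } quant_type;

static quant_type pick_quant(const char *tensor_name) {
    /* Routed MoE experts dominate the model's size, so they get squeezed hardest. */
    bool routed_expert = strstr(tensor_name, "ffn") && strstr(tensor_name, "exps"); /* assumed naming */
    if (routed_expert) {
        if (strstr(tensor_name, "up") || strstr(tensor_name, "gate"))
            return QUANT_IQ2_XXS;   /* up/gate projections -> IQ2_XXS */
        if (strstr(tensor_name, "down"))
            return QUANT_Q2_K;      /* down projections -> Q2_K */
    }
    /* Shared experts, other projections, and the router keep full Q8 precision. */
    return QUANT_Q8_0;
}
```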
Second, moving the KV cache to disk.
Current LLM agent clients are stateless, and the entire conversation is resent for each request.
A general-purpose engine typically redoes the prefill from scratch every time.
ds4.c instead writes the KV state to disk. When the next request arrives, it matches against the token prefix; on a hit, it loads the state straight from disk and skips the prefill.
The cache key is the SHA1 hash of the token-ID sequence.
This is especially useful in agent scenarios like Claude Code, which sends a 25K-token initial prompt every time it starts. Once the first prefill has been done, subsequent sessions can be restored directly from disk.
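As a rough illustration of that mechanism, the sketch below derives a disk-cache key from a prompt's token IDs with SHA1, using macOS's CommonCrypto since the engine is Apple-only. The function name and the idea of using the hex digest as a filename are assumptions for illustration, not ds4.c's actual implementation; a real implementation would also search for the longest cached prefix of the incoming tokens.

```c
/* Sketch only: hash a token-ID sequence into a hex string that could name
 * a KV-cache file on disk. Hypothetical helper, not ds4.c's actual code. */
#include <stdint.h>
#include <stdio.h>
#include <CommonCrypto/CommonDigest.h>   /* CC_SHA1 (macOS) */

static void kv_cache_key(const int32_t *tokens, size_t n_tokens,
                         char out_hex[2 * CC_SHA1_DIGEST_LENGTH + 1]) {
    unsigned char digest[CC_SHA1_DIGEST_LENGTH];
    /* Hash the raw token IDs; identical prompts map to the same key. */
    CC_SHA1(tokens, (CC_LONG)(n_tokens * sizeof(int32_t)), digest);
    for (int i = 0; i < CC_SHA1_DIGEST_LENGTH; i++)
        sprintf(out_hex + 2 * i, "%02x", digest[i]);   /* hex-encode; last call null-terminates */
}
```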
Third, built-in compatibility layers for the OpenAI and Anthropic APIs.
/v1/chat/completions speaks the OpenAI protocol, /v1/messages speaks the Anthropic protocol, and tool calling is adapted as well. The README provides configuration examples for three agent clients: opencode, Pi, and Claude Code.
As for why he did this:
antirez's answer is that there are plenty of excellent projects in local inference, but as new models keep arriving, attention is immediately pulled toward the next model to implement.
To support every model, a general-purpose engine has to abstract, and abstraction means compromise. What he wants is a deliberately narrow path: bet on one model at a time, verify against official logits, run long-context tests, and integrate enough agents to confirm that it is genuinely usable.
As soon as the engine was released, many people reported that they already had it running on their Macs.
Are you ready to run V4 locally?
One Model, One Inference Framework
The project has sparked a bigger discussion in the developer community:
Will the future be one inference framework per model?
A highly upvoted comment on Hacker News proposed an interesting direction: what if we started building hyper-optimized inference engines for specific GPU-and-model combinations?
GPUs keep getting more expensive. Strip away enough abstraction layers and code directly against specific hardware and a specific model, and a great deal of optimization may become possible.
The cost of this path is just as obvious. The same comment points out that once the model becomes obsolete, everything starts over from scratch.
antirez himself admits the problem: ds4.c is currently betting on DeepSeek V4 Flash, but the model may change.
What stays constant is the constraint that local inference should run reliably on high-end personal machines or a Mac Studio, starting at 128GB of memory.
The README leaves a hint about what comes next.
For now it is Metal-only; CUDA support may come later. But he phrased it very cautiously: maybe, and nothing more. The project deliberately stays small, fast, and focused.
What deserves more attention is an opinion he puts forward in the README: local inference should get three things right together and work out of the box.
An inference engine with an HTTP API; a GGUF built specifically for this engine and this set of assumptions; and a set of tests and verifications for integrating with coding agents.
This is a full-stack view of local inference. It's not about piecing components together; the entire chain is designed as one product.
If this path works out, it may change how local inference is done.
When a model vendor releases a new model, someone in the community steps up to build a dedicated engine, dedicated quantization, and dedicated agent integration for it. Every generation of models gets its own "antirez".
There is also a very candid detail in ds4.c. The README states that the software was developed with the "strong assistance" of GPT 5.5, with humans responsible for ideas, testing, and debugging.
antirez says that if you can't accept code developed with AI assistance, this software is not for you.
Going from forking llama.cpp for adaptation to writing a dedicated engine from scratch, all within two weeks, would hardly have been possible without AI assistance. That fact may itself deserve more attention than ds4.c does.
One more thing
Finally, let's talk about antirez.
His real name is Salvatore Sanfilippo. He was born in Sicily in 1977. He created Redis in 2009 and led the project for eleven years, leaving in 2020.
When he left, he wrote that he writes code to express himself, that code is a work of art rather than just a useful tool, and that he would rather be remembered as a bad artist than as a good programmer.
He returned to Redis at the end of 2024, taking on the role of an evangelist.
Besides Redis, he also wrote Kilo (a text editor in under 1,000 lines of C), dump1090 (an aviation ADS-B signal decoder), and linenoise (a minimal replacement for readline).
He also tinkers with the Flipper Zero, writing an RF protocol-analysis tool for it and porting Asteroids to it. In 2022, he published a science-fiction novel, "WOHPE", about AI, climate change, programmers, and the relationship between humans and technology.
The first line of his personal homepage says, "I spend most of my professional time writing code and novels."
On the birth of Redis, he wrote this on his personal homepage:
My wife said that in the first few years of Redis, I wrote most of my code sitting on the toilet, using an 11-inch MacBook Air. I really wish I could say she was wrong, but she was exactly right.
This style runs through all his projects: small, precise, and self-contained.
ds4.c is cut from the same cloth.
Read the note about the macOS bug in the ds4.c README and you can immediately sense that it's the same person.