
"I may no longer recommend studying computer science," a Turing Award winner lambasted half of the industry and asserted that AI agents will ultimately boil down to database issues.

Geekbang Technology InfoQ | 2026-04-30 19:13
The "pioneer" of databases has poured cold water on AI: Agents can't avoid old problems, and large models are far from qualified to write SQL.

"If I were to start over today, I'm not sure if I'd still recommend that 18 - year - olds study computer science."

The person who said this is Mike Stonebraker, a Turing Award winner in the database field, often rendered in Chinese as "Shi Potian". He is the key creator behind Ingres and Postgres and one of the most important figures the field has produced. In his view, computer science may no longer be a growth industry going forward.

In this interview, Stonebraker criticized almost half of the database industry.

He criticized Oracle, saying outright that Larry Ellison was "lying" back then: selling unimplemented features to customers, presenting the future as the present, and then asking the earliest customers to help with debugging.

He criticized Google, saying it was "stupid" of Google to push MapReduce and eventual consistency back then. Many people blindly assumed that Google must know what it was doing simply because "Google is smart". But in Stonebraker's view, Hadoop was extremely inefficient, and eventual consistency suits only a handful of scenarios. When Spanner came out, Google itself effectively admitted that the old database problems, such as transactions and consistency, cannot be bypassed.

He also criticized AWS: Amazon maintains roughly 15 different database systems at the same time, while he believes only about 3 are really needed. In his view, many graph databases and functionally redundant systems lack the performance and the market case to justify their continued existence.

But what's more interesting is his view on today's wave of AI.

In his view, today's so-called agentic AI is essentially "a large model plus a layer of system packaging", and most of it is still at the "read-only" stage. Once it enters the real "read-write" world, such as making a transfer or updating inventory, the problems immediately become the old database problems: transactions, consistency, and atomicity. This is not an AI problem but a distributed database problem.
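
To make the "read-write" point concrete, here is a minimal sketch (not from the interview; it uses Python's built-in sqlite3 module and a made-up accounts table) of why an agent that moves money needs transactional atomicity rather than two independent writes:

```python
import sqlite3

# A "read-write" agent action like a transfer is only safe if it is atomic:
# either both the debit and the credit happen, or neither does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst inside a single transaction."""
    with conn:  # commits on success, rolls the whole block back on exception
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE name = ? AND balance >= ?", (amount, src, amount))
        if cur.rowcount != 1:
            # abort: without the rollback, the debit could be left half-done
            raise ValueError("insufficient funds or unknown account")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))

transfer(conn, "alice", "bob", 30)
print(dict(conn.execute("SELECT name, balance FROM accounts")))  # alice 70, bob 80
```

If the credit fails after the debit succeeds, the whole block rolls back; an agent issuing the two updates as separate calls has no such guarantee.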

Another point is his judgment on large models writing SQL.

In public benchmarks, models have reached over 80% accuracy, seemingly just one step away from production. But in the tests his team ran against real data warehouses, the result was 0%. Even with RAG, and even with the join conditions fed directly to the model, accuracy topped out at 35%, whereas a skilled human engineer can exceed 90%. So his blunt conclusion is that, at least for the foreseeable future, this technology is not fit for the production environment.

Here is the full interview.

1  Postgres: The Best Starting Point, Not the Endpoint 

Host: I'd like to start with the origin of Postgres. But before that, I'd like to ask from the very beginning: How did you get into the database field?

Mike Stonebraker: After I graduated, I was very lucky to be hired by Berkeley. I was very clear at the time that there was no future in continuing the direction of my doctoral research, not then and not now. The best thing to do was to find a really knowledgeable mentor to guide me.

So Eugene (Gene) Wong took me under his wing and said, let's work on something together. It was 1971, the year after Edgar F. Codd published his groundbreaking paper in CACM (Communications of the ACM).

Gene said, let's study databases. At that time there were mainly two camps. One was the CODASYL proposal, which you may not have heard of: a low-level, "spaghetti-like" network structure in which you traversed pointers to execute queries. The other was IBM's solution, IMS, a hierarchical data structure, essentially a tree.

Actually, IBM also realized at that time that the tree structure was not universal and couldn't solve many problems, so they added some extensions and transformed it into a restricted network structure. But that was obviously a very bad patch.

CODASYL also had many problems: it was very low-level, difficult to debug, and once your schema (it wasn't called that at the time) changed, you basically had to start all over again, because everything was bound to the physical layer.

In contrast, Codd's relational model was very reasonable. So Gene said, let's implement this; this is what we should do next. We started working on Ingres in 1972, when I was still an assistant professor at Berkeley. As you know, assistant professors have about five years to prove themselves: either you get tenure or you're out. Ingres was the key project that got me tenure, which I received in 1976.

That's how it all started. Later there were other opportunities. At the time, many people built prototype systems, basically student-grade code that ran for its authors but that nobody else could use. We first completed the first 90% and got it running; then, for whatever reason, we spent another "90%" really polishing it into a usable system.

The Berkeley version of Ingres was genuinely usable. Over the next few years, about 100 universities started using it, because Unix was becoming popular and here was a free database system that ran on Unix; it was very popular in the academic community. Many people came to visit Berkeley, said this thing was cool, and asked what our biggest deployment was. And we had to admit that there wasn't really a big one.

That problem was fully exposed by a project at Arizona State University. They considered using Ingres to manage the records of 40,000 students. They could live with an unofficial operating system from Bell Labs and our equally "unofficial" database system, but the project ultimately failed because there was no COBOL on Unix, and they were a COBOL shop.

An unsupported operating system, an unsupported database system, and no COBOL: that combination made us completely irrelevant to them.

The only way out was to start a company. So in 1980 we raised venture capital, founded the Ingres company, ported the system to a "real operating system" like VMS, and provided commercial support. That was the beginning of commercialization.

Host: I saw that Ingres was competing with Oracle Corporation at that time. Technically, you were obviously better, but Oracle still won. How did they do it?

Mike Stonebraker: Larry Ellison is a very good salesman. He would make no distinction between "now" and "future", which was essentially lying to customers.

He would sell unimplemented features and then ask the earliest customers to help with the debugging. I think that is an unethical way to do business; lying to customers is unacceptable.

Take a feature called "referential integrity". For instance, if you fire an employee who is the last person in a department, do you delete the department or keep an "empty" department around? It's that kind of logic.

Ingres implemented this feature. Oracle's approach at the time was to spend two pages of the manual explaining what referential integrity is (everyone agreed on the definition) and then note at the bottom of the page: "Not yet implemented."
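
For readers who haven't met the feature, here is a minimal sketch of referential integrity in action. It is not from the interview: it uses Python's sqlite3 module and a made-up employee/department schema, and it shows the more common direction of the rule, where the database itself refuses to leave dangling references behind:

```python
import sqlite3

# Hypothetical emp/dept schema: referential integrity means the schema, not
# the application, decides what happens to references when rows disappear.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only on request
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE emp (
        id      INTEGER PRIMARY KEY,
        name    TEXT,
        dept_id INTEGER REFERENCES dept(id) ON DELETE RESTRICT
    )""")
conn.execute("INSERT INTO dept VALUES (1, 'shipping')")
conn.execute("INSERT INTO emp VALUES (10, 'ryan', 1)")

# Deleting a department that an employee still references is refused.
try:
    conn.execute("DELETE FROM dept WHERE id = 1")
except sqlite3.IntegrityError as e:
    print("blocked:", e)   # FOREIGN KEY constraint failed

# Declaring ON DELETE CASCADE or ON DELETE SET NULL instead would delete the
# employees or orphan them; either way the rule lives in the schema.
```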

Host: I've interviewed people from Sun Microsystems, and their assessment of Ellison is similar. It's also often said that when Oracle acquired MySQL, people turned to Postgres, which made Postgres the mainstream open-source database. So, what was the biggest change from Ingres to Postgres?

Mike Stonebraker: The core change actually came from an early requirement. Back then, we wanted to support GIS (geographic information systems), which meant handling data types such as points, lines, and polygons. But Ingres only supported standard types like integers, floating-point numbers, and strings; it couldn't support GIS efficiently and completely failed in that direction.

This matter always stuck in my mind.

There was another example. Around 1985, relational databases introduced a date/time standard, and Ingres implemented Gregorian-calendar time exactly according to the standard. Then a customer called and said we had implemented it wrong.

I said, how could that be? We implemented it exactly according to the Gregorian calendar, and the date arithmetic was completely correct. He said that wasn't what he wanted. He was in the bond business, and in his world the monthly interest was fixed regardless of whether the month had 28 or 31 days. In other words, his "date subtraction" rule differed from the real calendar: subtracting February 15th from March 15th should give 30 days. But in Ingres this logic was hard-coded. He had to pull the data out, compute at the application layer, and write it back, which made things 2 to 3 times slower.

He asked me why he couldn't customize the subtraction. That was exactly the problem. This is a scenario where you need "bond time", just as you need points, lines, and polygons. So Postgres was designed around an extensible type system: you can define any data type, and it runs efficiently. That is the core idea of Postgres.
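
The interview doesn't name the customer's exact convention; the sketch below assumes the common 30/360 bond convention, in simplified form, just to show the arithmetic the customer wanted to plug in. In Postgres this would live in a user-defined type or function rather than in application code:

```python
from datetime import date

def days_30_360(d1: date, d2: date) -> int:
    """Simplified 30/360 day count: every month is treated as 30 days,
    so 15 Feb -> 15 Mar is exactly 30 days however long February is."""
    day1 = min(d1.day, 30)
    day2 = min(d2.day, 30) if day1 == 30 else d2.day
    return (d2.year - d1.year) * 360 + (d2.month - d1.month) * 30 + (day2 - day1)

print((date(1985, 3, 15) - date(1985, 2, 15)).days)       # 28, the calendar answer
print(days_30_360(date(1985, 2, 15), date(1985, 3, 15)))  # 30, the "bond time" answer
```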

Of course, for most business scenarios, standard types are enough. But as databases gradually expand into more fields, such as abstract data types and stored procedures, these all require extensibility.

In addition, Postgres also supported inheritance (which AI researchers wanted at the time) and "time travel" (querying historical data), though the time-travel implementation was poor and was later removed. Overall, though, it contains a large number of very interesting features.

Host: You mentioned that you're very good at recruiting excellent engineers. How do you identify these "very talented people"?

Mike Stonebraker: Usually, you can tell at a glance. I have a sense of "difficulty": if a student completes three times the amount of work I think is reasonable, that student is exceptional.

Host: You also said something interesting: "I can't stand people who aren't smart enough. It's hard to communicate with them." So how do you judge if a person isn't smart enough?

Mike Stonebraker: It's very simple. Just talk to them for a while. Ask about technical details: what their master's thesis was about, how it was actually implemented, how error handling was done, how many processes were used, why threads weren't used. Ask those in-depth questions and you'll find out very quickly.

Host: You previously put forward a view called "one size fits none", that is, "a one-size-fits-all database is not the optimal solution, and in fact it doesn't suit anyone".

Mike Stonebraker: Yes, a general-purpose database system is not the optimal solution. The so-called one-size-fits-all in fact often fits no one. What you really need is a database solution tailored to specific needs.

Host: Then among the database products you see now, which ones still belong to this "one-size-fits-all" type?

Mike Stonebraker: When I wrote that paper in 2004, we happened to have an academic project in hand that later became StreamBase. A stream-processing engine and a relational database look completely different. At the same time, we also had the general idea of using column-based storage for data warehouses, later popularized by Vertica, and column-based storage and row-based storage also look like completely different kinds of systems.

So at that time, there were already three very different implementations in front of us, with almost no similarities to each other, but in their respective scenarios, their performance was an order of magnitude higher than traditional solutions. This already says a lot. As long as the database system is not designed for your scenario, you will directly lose an order of magnitude in performance.

I think it's still the same today. ClickHouse, for example, uses column-based storage. Pinecone handles vector search over text faster than forcing it through user-defined types. So the argument still holds. I also don't think it's hard to put a unified parser on top of multiple different implementations; it's just that Postgres still hasn't done this. It hasn't really implemented column-based storage, so it isn't competitive in large-scale data warehouse scenarios, and it isn't multi-node, which is by now the most basic requirement for a large warehouse. So I think this holds today just as it did back then.
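
As a toy illustration of why the storage layout matters (this is not how Vertica or ClickHouse actually store data, just the shape of the argument), compare scanning one column when the table is laid out by rows versus by columns:

```python
# Row layout: an analytic scan drags every row's unrelated fields along.
rows = [{"id": i, "name": f"user{i}", "region": "EU", "revenue": i * 1.5}
        for i in range(100_000)]
total_from_rows = sum(r["revenue"] for r in rows)

# Column layout: the same scan touches a single packed array and nothing else.
# A real column store keeps this contiguous and compressed, which is where
# the order-of-magnitude gap on warehouse queries comes from.
columns = {
    "id":      [r["id"] for r in rows],
    "revenue": [r["revenue"] for r in rows],
}
total_from_columns = sum(columns["revenue"])

assert total_from_rows == total_from_columns
```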

However, another thing that also holds true is that if you just want to get things started and have a database problem at hand, the answer is usually still to choose Postgres. It has a huge developer community, various data type implementations, is free, and it's easy to recruit people who understand Postgres, so you can start quickly.

So I think it's a very good option for meeting baseline, general-purpose requirements. As long as you're not aiming for a million transactions per second or a petabyte-scale data warehouse, it's completely sufficient. That is to say, at the low end the "general solution" is Postgres, no problem at all; at the high end, that conclusion doesn't hold.

2  Once Indexes Appear, GPUs Can Hardly Play a Role 

Host: Will GPUs bring some new opportunities for database optimization?

Mike Stonebraker: Maybe. But I think the biggest challenge is that GPUs are essentially SIMD, that is, single instruction, multiple data. And this conflicts with indexes.

As long as indexes are the right answer, GPUs are probably not a good answer.

Also, you have to architect the entire system well to ensure that the bandwidth from storage to compute doesn't become a bottleneck. If the GPU is just attached to the CPU as an add-on, the bus between the CPU and the GPU often becomes the bottleneck.

Host: Can you explain why indexes work less well in a SIMD model?

Mike Stonebraker: Say I want to look up Ryan's salary and I have a B-tree. You first access the root node of the B-tree, find the separator key that tells you which side Ryan falls on, and follow that pointer down. That's one dependent memory access. Then you do it again, and again, usually three or four times.

This process is very difficult to parallelize. So the answer is that indexes are not suitable for parallelization.
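
A toy sketch of that lookup (the names and the two-level tree are made up) makes the dependency visible: each probe needs the previous probe's result before it can even compute an address, which is exactly what SIMD lanes cannot help with:

```python
def lookup(node, key):
    """Walk a toy B-tree: each step compares against separator keys and then
    follows one child pointer. Step N's address depends on step N-1's result,
    so the three or four probes are inherently sequential."""
    while not node["leaf"]:
        i = 0
        while i < len(node["keys"]) and key >= node["keys"][i]:
            i += 1
        node = node["children"][i]   # dependent pointer dereference
    return node["values"].get(key)

leaf1 = {"leaf": True, "values": {"alice": 90_000, "bob": 80_000}}
leaf2 = {"leaf": True, "values": {"ryan": 120_000, "zoe": 95_000}}
root  = {"leaf": False, "keys": ["m"], "children": [leaf1, leaf2]}

print(lookup(root, "ryan"))   # 120000, reached after two dependent node visits
```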

Host: You just mentioned the B-tree. When you implemented the first version of Ingres, was all of this written from scratch? I guess there weren't any ready-made B-tree libraries available at the time, right?

Mike Stonebraker: Yes, the earliest version of Ingres was all written from scratch.

Host: Then what was the most difficult part to implement?

Mike Stonebraker: The query optimizer.

Host: Why is it so difficult?

Mike Stonebraker: Because it's really difficult. It's very complex at the algorithm level. Even today, if you ask any senior database programmer what the most difficult part of the system is, they'll probably still say the optimizer.

3  Google Chose the Wrong Direction, and Amazon Chose Too Many Directions 

Host: After MapReduce emerged in the early 2000s, it almost swept across the entire data field at once. Many people were very shocked and thought that Google really knew what it was doing and that this was the most advanced thing. But judging from your papers and views back then, you seemed to strongly disagree. Why did you strongly oppose MapReduce?

Mike Stonebraker: Because a lot of people didn't really understand it; they took it for granted that Google was smart and must know what it was doing, so they just followed suit. Everyone started working on Hadoop or moving toward the Hadoop model.

But Hadoop was extremely inefficient.

People like Dave DeWitt and the others involved in our 2011 paper all understood distributed databases and knew that a distributed database system could beat Hadoop badly. That is basically what our 2011 paper said, and the facts later bore it out.

And that wasn't the only foolish thing Google did in this area.

They also believed at the time that eventual consistency was the right approach to concurrency control. This, too, was something Google pushed from the top down during that period. But it was completely wrong. Everyone in the database field was saying they were crazy, because it solves only a very specific problem, and that problem is actually very rare in the real world.

Host: Then why did they pursue eventual consistency?

Mike Stonebraker: Imagine that you have a database on the East Coast and one on the West Coast, and they are replicas of each other. You want them to be consistent.

If I want to perform a transaction to reduce the inventory of