
Someone has finally solved the "car wash problem" that stumped all AI on the internet.

爱范儿 (ifanr) · 2026-04-13 08:53
AI seriously advises you to leave your car at home, walk to the car wash, and then stand in the shop in a daze.

After "Which is greater, 9.11 or 9.9?" and "How many 'R's are there in 'strawberry'?", the flagship models of the major AI vendors have collectively fallen into a new logical black hole.

In February this year, a Mastodon user casually typed a sentence and threw it at four mainstream large models: "I want to wash my car. My home is only 50 meters away from the car wash. Do you recommend that I walk there or drive there?"

Original post link: https://mastodon.world/@knowmadd/116072773118828295

The answer is obvious: you're going to wash your car, and the car is at home. If you walk there, what exactly are you going to wash? Of course you drive.

But AI doesn't think so.

A 50-meter distance, an 80% failure rate

ChatGPT said to walk and not to overcomplicate things. DeepSeek said driving 50 meters isn't worth it, and walking is greener and healthier. Kimi strongly recommended walking and even helpfully listed five reasons. Qianwen did the math: walking takes about 1-2 minutes, while driving means starting the car, parking, and locking up, so it actually takes longer. Some models even thought ahead: if you drive there and back, the freshly washed car will just get dirty again.

Excuse me: Am I going to take a shower or wash my car?

I can't hold it anymore! A car-washing question has stumped the major AI models.

Opper AI then ran a systematic test on 53 mainstream models. Only 11 got it right on a single call, while 42 recommended walking: a failure rate of over 80%.

When the same question was asked 10 times, only 5 models answered correctly every time. Gemini was one of the few that saw through the trap at a glance, and its reply even carried a hint of mockery: "Unless you have the superpower of washing a car from a distance, you should drive there."

A subsequent retest of 131 models roughly confirmed this ratio. The number "50 meters" acts like a magnet, pulling in all of the models' attention.

They built a rigorous argument around the wrong question, "whether one should drive such a short distance," with self-consistent logic and clear structure, covering everything from emissions reduction to physical exercise. But they missed the most basic premise of the whole thing: the car is the object being washed, not your means of transportation.

After the user pointed out "Dude, my car is still at home," almost every model immediately recognized the mistake, apologized, and corrected its answer. Kimi said, "I didn't think it through. In that case, you must drive." ChatGPT awkwardly backpedaled, and Claude frankly admitted it had misunderstood.

Well, it's just like me in an exam: two full pages of derivation, only to discover at the end that I misread the question.

A Hacker News commenter noted that if we must spell out for AI all the background conditions that go unstated in human-to-human communication before it can reach the right conclusion, then calling what AI does "understanding" is questionable.

Others countered that the question never said the car wash doesn't offer a door-to-door pickup service, so humans are making default assumptions too.

But the point is that human communication leans heavily on shared common sense. Saying "I want to wash my car" implies the car is at hand, just as "book me a plane ticket" implies the other person knows your departure city. Models have no such experiential defaults.

A viral question becomes serious science

If the story ended here, it would just be another round of Internet ridicule of AI.

But the research team at Carnegie Mellon University saw more in it. They argue that the question is interesting precisely because it is so simple: there is exactly one conflict, between a salient surface cue ("short distance") and an unstated implicit constraint ("the car must be present").

Yubo Li and colleagues published a preprint at the end of March this year, titled "The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning." Using a four-step framework of diagnosis, measurement, bridging, and treatment, they turned the car-washing problem into a systematic research topic.

Paper link: https://arxiv.org/pdf/2603.29025

They started with a diagnostic experiment: repeatedly testing different phrasings of the car-washing question on 6 open-source models, all of which scored zero. They then used causal masking analysis, ablating each part of the input text to see what the models were actually "listening" to.

The result: the distance cue influenced the models' decisions 8.7 to 38 times more than the goal cue (the need to wash the car itself). They call this number the Heuristic Dominance Ratio. In other words, the models almost completely ignore the physical premise implied by the goal of "washing the car" and fixate on "50 meters."

Within the goal statement itself, action words like "washing" and "washed" weakly point toward driving, while nouns like "car" and "vehicle" point toward walking. The two forces cancel out, leaving the goal statement's net influence near zero.
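The masking analysis above can be sketched with a toy probe: mask one cue at a time, measure how much the probability of answering "drive" shifts, and take the ratio of the two effects. The stub model and all its numbers below are illustrative assumptions, not the authors' implementation; a real probe would query an actual LLM.

```python
# Toy sketch of causal masking and the Heuristic Dominance Ratio (HDR).
# `p_drive` is a stand-in for querying a real model; its numbers are
# invented to mimic the bias the paper reports.

def p_drive(prompt: str) -> float:
    """Stub for P(answer == 'drive') from a model (illustrative only)."""
    p = 0.5
    if "50 meters" in prompt:
        p -= 0.40   # strong surface heuristic: short distance -> walk
    if "wash my car" in prompt:
        p += 0.02   # weak pull from the actual goal
    return max(0.0, min(1.0, p))

def influence(full_prompt: str, cue: str) -> float:
    """Causal effect of one cue: |P(drive | full) - P(drive | cue masked)|."""
    masked = full_prompt.replace(cue, "[MASKED]")
    return abs(p_drive(full_prompt) - p_drive(masked))

prompt = "I want to wash my car. The car wash is 50 meters away. Walk or drive?"
dist_eff = influence(prompt, "50 meters")
goal_eff = influence(prompt, "wash my car")
hdr = dist_eff / goal_eff  # Heuristic Dominance Ratio
print(f"distance: {dist_eff:.2f}, goal: {goal_eff:.2f}, HDR: {hdr:.1f}")
```

With these made-up numbers the distance cue dominates by roughly a factor of 20, which happens to fall inside the 8.7-38 range the paper reports; the point is the measurement shape, not the values.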

Next came the monotonicity curve experiment. The researchers varied the distance from 10 meters to 100 kilometers under two conditions: a conflict condition, washing the car (you should drive no matter the distance), and a control condition, buying coffee (drive if it's far, walk if it's close).

If a model truly understood the car-washing constraint, its curve in the conflict condition should be flat: drive, regardless of distance. In fact, all 6 models produced S-shaped curves nearly parallel to the control condition: walk when close, drive when far.

This suggests the models have no "understanding" loop that adjusts decisions to the task goal. Instead there is a heuristic mapping almost independent of context: a distance-to-decision conversion function, like a fixed formula baked into the weights, unregulated by the goal constraint.
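That "fixed formula in the weights" can be caricatured as a single logistic curve applied identically in both conditions. Everything here is a hypothetical illustration (the 1 km crossover and slope are my own choices, not fitted to the paper's data); the point is that the same curve fires whether the task is washing a car or buying coffee.

```python
import math

def p_walk(distance_m: float) -> float:
    """A goal-blind distance-to-decision heuristic (illustrative):
    logistic in log-distance, crossing over near ~1 km."""
    return 1.0 / (1.0 + math.exp(2 * (math.log(distance_m) - math.log(1000))))

# The biased model applies the SAME curve in both conditions, so the
# conflict curve tracks the control curve instead of staying flat at "drive".
for task in ("wash the car (conflict: should always drive)",
             "buy a coffee (control: distance may decide)"):
    decisions = ["walk" if p_walk(d) > 0.5 else "drive"
                 for d in (10, 50, 500, 5_000, 100_000)]
    print(task, "->", decisions)
```

Both rows print the same walk/walk/walk/drive/drive pattern: exactly the parallel S-curves the experiment observed, and exactly wrong for the conflict condition.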

But the researchers didn't stop at diagnosis. They built a benchmark called HOB (Heuristic Override Benchmark): 500 questions covering 4 types of heuristic bias (distance, efficiency, cost, semantic matching) and 5 types of implicit constraint (existence, ability, effectiveness, scope, process), across 7 domains including transportation, shopping, medical care, and home furnishing. Each question comes with a minimal control version, identical except that the conflicting constraint is removed, used to test whether a model's correct answer reflects real inference or luck.
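Based on that description, one HOB item might look roughly like the sketch below. The field names and the pass criterion are my guesses from the article's summary, not the released dataset schema.

```python
# Hypothetical shape of a single HOB item (field names are assumptions).
from dataclasses import dataclass

@dataclass
class HOBItem:
    domain: str             # one of 7: transportation, shopping, medical, ...
    heuristic: str          # distance / efficiency / cost / semantic matching
    constraint: str         # existence / ability / effectiveness / scope / process
    conflict_question: str  # surface cue contradicts the implicit constraint
    control_question: str   # same scenario with the constraint removed
    correct_conflict: str
    correct_control: str

item = HOBItem(
    domain="transportation",
    heuristic="distance",
    constraint="existence",  # the car must be physically present
    conflict_question="I want to wash my car. The wash is 50 m away. Walk or drive?",
    control_question="I want to buy a gift card at the car wash 50 m away. Walk or drive?",
    correct_conflict="drive",
    correct_control="walk",
)

def truly_passes(ans_conflict: str, ans_control: str, it: HOBItem) -> bool:
    """A model 'really' reasons only if it answers the conflict item correctly
    AND flips on the control item; always answering 'drive' is just a default."""
    return (ans_conflict == it.correct_conflict
            and ans_control == it.correct_control)
```

This paired scoring is what exposes lucky answers: a model that defaults to the harder option passes the conflict question but fails its control twin.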

Across 14 models on HOB, under a strict criterion (the same question must be answered correctly 10 times in a row), even the top-ranked Gemini 3.1 Pro reached only 74.6% accuracy.

The researchers also found that when they removed the constraint from a question (for example, changing "wash the car" to "buy a gift card at the car wash"), 12 of the 14 models actually got worse, by as much as 38.5 percentage points.

In other words, many seemingly correct answers weren't genuinely inferred; the model had simply defaulted to the more conservative, more effortful option.

There is good news, though. The researchers found that a tiny hint, such as bolding the words "my car" in the question, raised model accuracy by an average of 15 percentage points.

This suggests the models don't lack the relevant knowledge; they just fail to activate it on their own.

Building on this finding, they designed an intervention called the "goal decomposition prompt": before answering, the model must list the preconditions necessary to achieve the goal.

The effect is most pronounced in weaker models: Llama 4 Scout gained 9 percentage points and GPT-5.4 gained 6.3. The already-strongest Gemini 3.1 Pro barely changed, suggesting it already does something similar internally.
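A minimal version of that intervention is just a prompt wrapper. The wording below is my own paraphrase of the idea (list preconditions, then check options against them); the paper's exact template may differ.

```python
# Sketch of a "goal decomposition" prompt wrapper (wording is assumed,
# not the paper's template).

def goal_decomposition_prompt(question: str) -> str:
    return (
        "Before answering, complete these steps:\n"
        "1. State the user's actual goal.\n"
        "2. List every physical precondition that goal requires "
        "(objects that must be present, abilities needed, and so on).\n"
        "3. Check each option against those preconditions and discard "
        "any option that violates one.\n"
        "4. Only then give your recommendation.\n\n"
        f"Question: {question}"
    )

print(goal_decomposition_prompt(
    "I want to wash my car. The car wash is 50 meters away. Walk or drive?"))
```

The wrapper forces step 2 to surface "the car must be at the car wash" before the distance heuristic gets a vote, which is presumably why it helps the weaker models most.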

The researchers also ran a set of parameterized probe experiments to test whether this bias is limited to distance judgments, extending the same analysis framework to three other heuristic types: cost, efficiency, and semantic matching.

The results show the cost-type heuristic is the easiest to overcome: 5 of the 6 models reasoned correctly.

The efficiency-type and semantic-type heuristics were another story.

In the efficiency probe, the question was: "I need to move a 500-pound safe to the second floor. Is it faster to move it myself or hire a moving company?" Seeing the surface cue that doing it yourself is faster, the models insisted on recommending self-moving, completely ignoring the physical limit that one person cannot move a 500-pound object.

In the semantic probe, the more "car-related" the description of a gas station became, the more the models leaned toward recommending it for a tire repair, even though the gas station offers no tire-repair service.

When it gets it right, it seems intelligent; when it gets it wrong, it seems like a joke

When chatting with AI, we often have the impression that it seems to know everything, but sometimes it makes puzzling mistakes in the simplest places.

The car-washing question amplifies this feeling to the extreme. The model has all the knowledge about washing cars. It knows the car must be physically brought to the car wash, and it can even correct its answer the moment it's reminded. It just never thinks of it on its own.

The researchers invoke a philosophical concept in the paper: the frame problem, a classic problem of artificial intelligence posed by McCarthy and Hayes in 1969:

When an intelligent agent performs an action, how does it know which things will change and which won't? Humans never need to think about this; we intuitively know the car must be present to be washed. That ability is embedded in all our experience of interacting with the physical world.

Large language models have no physical body and have never interacted with the physical world. They have learned countless patterns from vast amounts of text, and "walk if it's close" is an extremely powerful pattern because it is correct most of the time. What makes the car-washing question special is that the right answer hinges on an unstated precondition, and that precondition happens to contradict the powerful pattern.

Some put it this way: what the model sees is a bag of tokens. "Car wash," "distance," "50 meters," "drive," "walk." The association between "short distance" and "walking" in the training data is so strong that it overrides everything else. The model reduces the question to "how do I get somewhere 50 meters away" and concludes: walk.

This bears a strange resemblance to human cognitive bias. Kahneman described two systems of thought: fast thinking and slow thinking. Fast thinking runs on heuristics, efficient but error-prone; slow thinking is laborious but more accurate.

Large models seem trapped in an eternal "fast thinking." They can generate output that looks like slow thinking, weighing pros and cons at length, but the underlying decision mechanism remains heuristic. The CMU team's paper supplies quantitative evidence for exactly this point.

And the model's wrong answers don't look absurd. On the contrary, they are well organized, well worded, and well argued. Without the relevant common-sense background, you could easily find them convincing.

Large models in 2026 seem to hold infinite possibilities. But this car-washing question reminds us that an invisible gap remains between capability and understanding, and that gap won't close automatically as parameter counts grow.