Pen and Screen: Why Has AI Hardware Split into Two Paths?
Over the past year, the main battleground for AI hardware has no longer been just "putting large models into phones." The competition has shifted to something more fundamental: the form through which people interact with AI.
In this process, numerous AI hardware products have emerged, including AI headphones, AI glasses, AI phones, AI voice recorders, and recording cards.
On one side are screenless, wearable "new species"; on the other are upgrades to traditional terminals that keep the screen at the center and build AI into the system foundation.
We refer to the former route collectively as the "pen." The term does not describe a specific shape; it broadly means lightweight, portable, low-profile AI hardware that takes audio and environmental perception as its main input and usually has no screen.
The most radical version of the "pen" was attempted as early as 2024, when Humane launched the AI Pin: a "wearable computer" that clipped onto clothing and aimed to free users from the smartphone.
However, the product drew heavy criticism for high latency, a projection that was hard to see, and the absence of an application ecosystem. It ultimately received only about 10,000 orders, and Humane was later acquired by HP.
In contrast, more practical "pens" are entering from narrow scenarios. AI voice-recording hardware such as Plaud's recorders, the DingTalk A1, and the Feishu Recording Bean targets specific settings like meetings and interviews: the device captures audio on the go, then relies on large models for transcription and summarization.
OpenAI has now magnified the ambitions of this direction even further.
OpenAI has confirmed a hardware collaboration with Jony Ive, Apple's former design chief. The project targets a new type of AI device distinct from phones and PCs, emphasizing more natural interaction and a lower profile; what little is known of its form has also been described as "like a pen."
Running parallel to the "pen" is another route that keeps the screen at the center. Microsoft defines Copilot+ PC as a new generation of PC and sets an explicit NPU compute threshold for it, while Meta and the major phone makers keep deepening the integration of AI with display and system software on terminals such as glasses and phones.
Put these developments together and a clear divergence emerges: new-generation AI companies are more willing to bet on the "pen" (screenless, worn close to the body, input-first), while established Internet and hardware companies prefer to bet on the "screen," extending display, system, and ecosystem capabilities on existing product forms.
In other words, within the AI hardware category, the industry has placed different bets on interaction cost, technological maturity, and paths to commercialization.
1
In a public conversation in 2025, OpenAI CEO Sam Altman described today's digital life as "walking in Times Square."
Information, push notifications, and screens are constantly vying for attention, he said, while the AI hardware they are exploring has the opposite goal: being "more calm and less distracting."
The concept is not new, but it has been revisited over the past two years: rather than stuffing AI into phones or PCs, go back to something more fundamental and let the device first perceive and connect with the world itself. We group such devices under "pen-type AI": lightweight, worn close to the body, and low-profile, with a product logic in which perception takes precedence over operation.
From an industry perspective, "pen-type AI" is not meant to replace phones or PCs as a new hardware entry point. It competes for first-hand input from individuals and organizations: voice, environment, and point of view. By staying relatively invisible, it lets AI receive and process information continuously.
Yet in the past few years, no product built on this concept has managed to open up the market.
Back in 2024, Humane's AI Pin and Rabbit's R1 both tried to become "AI terminals independent of the phone," using voice or environmental perception to respond instantly and even perform cross-application tasks. In the end, both were received tepidly and failed to open up the market.
The most immediate reason was poor experience. Technology reviewer Marques Brownlee put it bluntly in his review of the AI Pin: "This is one of the worst products I've ever reviewed, not because of the idea, but because it simply doesn't work yet."
Another reason is that on-device compute at the time could not support complex inference, so most screenless devices suffered frequent latency and interruptions. Joanna Stern, a technology columnist at The Wall Street Journal, wrote after comparing several screenless AI devices that the Humane AI Pin and Rabbit R1 "are more like science projects than finished products," and in a video test she recorded the Rabbit taking four minutes to complete a "real-time translation."
One yardstick the industry uses for on-device compute is NPU (Neural Processing Unit) performance. Research firms such as IDC hold that it needs to exceed 30 TOPS to meet the basic inference needs of large language models, yet as of early 2024 only a handful of SoCs, such as Qualcomm's Snapdragon 8 Gen 3 and Apple's A17 Pro, reached that threshold.
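To see why the bar lands in the tens of TOPS rather than the single digits, a rough back-of-envelope calculation helps. The sketch below uses the common estimate of roughly two operations per model parameter per token, plus illustrative assumptions of our own (a 3-billion-parameter on-device model, a 1,000-token prompt, a one-second wait before the reply starts); the figures are for illustration only and are not IDC's methodology.

```python
# Back-of-envelope: why on-device LLM inference pushes NPUs into the tens of TOPS.
# Assumptions (illustrative, not from the article): ~2 ops per parameter per token,
# a 3B-parameter model, a 1,000-token prompt, and a 1-second prefill budget.

params = 3e9                  # on-device model size (parameters)
ops_per_token = 2 * params    # rough compute per token processed or generated

prompt_tokens = 1000          # tokens to "prefill" before the reply can start
prefill_budget_s = 1.0        # how long the user will wait before output begins

prefill_ops = ops_per_token * prompt_tokens             # total ops to ingest the prompt
required_tops = prefill_ops / prefill_budget_s / 1e12   # sustained throughput needed

print(f"Prefill compute: {prefill_ops / 1e12:.1f} trillion ops")
print(f"Required sustained throughput: {required_tops:.1f} TOPS")
# -> about 6 TOPS for this one step alone; add token-by-token decoding, vision and
#    audio models, and real-world utilization well below peak, and the practical
#    requirement climbs into the tens of TOPS.
```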
In more "vertical" scenarios, the positioning of "pen-type AI" has narrowed quickly, and a batch of recording-first AI products has launched in succession. The shift is especially visible in the Chinese market, with products such as the DingTalk A1 recording card and the Feishu AI Recording Bean co-developed with Anker.
Products with a similar orientation have appeared overseas as well, such as the Plaud Note Pro. These devices focus on turning fragmented speech, meeting content, and ambient audio into searchable, structured corpora rather than on instant dialogue or task execution.
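In product terms, the loop these recorders describe is fairly uniform: capture audio, transcribe it, summarize it, and file the result so it can be searched later. The sketch below is a minimal illustration of that loop, not any vendor's actual implementation; transcribe() and summarize() are placeholder stand-ins for whatever speech-to-text and large-model services a given device calls.

```python
from dataclasses import dataclass, field

@dataclass
class Note:
    """One captured recording turned into a structured, searchable record."""
    source: str                 # e.g. "meeting", "interview"
    transcript: str             # full speech-to-text output
    summary: str                # model-written key points
    keywords: set[str] = field(default_factory=set)

def transcribe(audio_chunk: bytes) -> str:
    """Placeholder for the speech-to-text step (on-device or cloud)."""
    return audio_chunk.decode("utf-8")   # stand-in: pretend the bytes are already text

def summarize(transcript: str) -> str:
    """Placeholder for the large-model summarization step."""
    return transcript.split(".")[0]      # stand-in: first sentence as the "summary"

def ingest(audio_chunk: bytes, source: str) -> Note:
    """Turn one recording into a structured note and extract naive keywords."""
    transcript = transcribe(audio_chunk)
    return Note(
        source=source,
        transcript=transcript,
        summary=summarize(transcript),
        keywords={w.lower().strip(".,") for w in transcript.split() if len(w) > 4},
    )

def search(notes: list[Note], query: str) -> list[Note]:
    """Naive keyword search over the accumulated corpus."""
    q = query.lower()
    return [n for n in notes if q in n.keywords or q in n.transcript.lower()]

corpus = [ingest(b"Budget review moved to Friday. Alice owns the launch checklist.", "meeting")]
print(search(corpus, "budget")[0].summary)
```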
The a16z-backed AI hardware company Limitless (formerly Rewind) has a similar orientation. Its founder Dan Siroker said in an interview that the company "is not making a second device but building the infrastructure for recording conversations."
As the scope of the "pen" expands, however, the tensions escalate. In recent years some teams have begun adding cameras to earbuds and head-mounted devices in pursuit of more natural first-person input. That means reallocating battery and compute, and it also raises social concerns about privacy boundaries.
Discussing the trend toward AI wearables, technology analyst Avi Greengart noted that consumers' privacy expectations "have not disappeared, but they are indeed shifting": people will trade some boundaries for convenience, yet remain wary of "being continuously recorded."
Against this backdrop, the hardware project from OpenAI and Jony Ive's team is seen as the biggest variable for the "pen." Chris Lehane, OpenAI's head of global affairs, has publicly confirmed that the company plans to show its first hardware device in the second half of 2026, with a focus not on display but on more natural, restrained environmental perception.
Ive led the design of the iPhone 4, a device that helped define the previous era of smartphones. Yet the product most often cited from his career is the simpler, more restrained iPod, and perhaps its shadow will still be visible in OpenAI's future hardware.
Ive's "obsessions" aside, "pen-type AI" is best understood as the sensor layer of the AI era: it does not make decisions for people but perceives first and drives interaction from there. In scenarios where interaction comes first, however, the "screen" still looks irreplaceable.
2
Unlike "pen - type AI" which tries to reduce its presence, the hardware camp with screens - whether it's traditional PCs/mobile phones or new - generation products - has actively embraced AI in the past two years.
The watershed for this route came when AI shifted from an "application capability" to a "system capability."
In May 2024, Microsoft launched Copilot+ PC, billed as a rebuild of AI at the operating-system level. Official documentation requires new-generation Windows devices to carry an NPU delivering 40+ TOPS, and some AI capabilities run "system-natively" rather than being invoked through standalone applications.
This design effectively bakes AI into the system layer of the hardware. Yusuf Mehdi, who leads Microsoft's consumer business, said plainly at the launch that the move was meant to "redefine Windows computers."
The same logic appears on phones. Apple, Samsung, and Chinese manufacturers such as Xiaomi and vivo have all integrated AI capabilities directly into the system layer in recent years.
All of this shows that in the AI era the screen remains the center for displaying information, confirming transactions, and granting permissions, and that AI-equipped smartphones have the leverage to compete with super apps for the "first entry point."
Take the Doubao Mobile Assistant jointly launched by ByteDance and ZTE. It is positioned not as a standalone application but as something deeply embedded in system interaction, participating in search, writing, schedules, and notifications, and redistributing the relationship among users, information, and services through system-level entry points.
When AI enters the system layer, Internet-era incumbents are the first to treat it as an intruder. As early as 2024, Microsoft introduced the Recall feature, which periodically captures screen content so users can retrieve information later; it stirred intense controversy from the moment it was announced.
The messaging app Signal was the first to publicly push back, arguing that Recall's design gave applications too little control and that a system-level screenshot mechanism could capture encrypted communication interfaces. A number of other tools subsequently announced they would block Recall by default.
A year later, a similar scene played out around the Doubao Mobile Assistant, with leading social, e-commerce, and finance apps moving to block it.
Reliability has also become a major factor in the pace of AI on smartphones. Apple announced that it would postpone the upgraded Siri AI features it had previewed, with the launch now not expected until 2026. Leaving aside the lag in Apple's self-developed models, the company's official explanation was that the features had not reached its reliability standards.
In a later interview, Greg Joswiak, Apple's head of global marketing, said the company did not want users exposed to "unstable system-level capabilities." According to the latest reports, Apple plans to have the new Siri run on a foundation model powered by Google Gemini to improve its semantic understanding and dialogue.
Whether the new Siri can push the boundaries of phone-side AI further will not be clear until it officially ships.
Turn to AI glasses and the traits of the screen route are magnified further: glasses extend the screen of terminals such as the phone, moving information display into the user's field of vision.
The AI glasses that Meta developed with Ray-Ban were the first to open up the market: voice, translation, and basic visual understanding were made to work in daily use first, and waveguide display technology was introduced in the second-generation product. Before Meta, companies such as Rokid, Thunderbird, and Alibaba's Quark had also launched AI glasses with waveguide displays.
Compared with the previous generation of display-free AI glasses, however, the engineering challenges facing display-equipped AI glasses have not eased; they have been exposed all the more sharply.
Andrew Bosworth, Meta's chief technology officer, said bluntly of the Orion prototype that the yield on its display components was "incredibly bad." In other words, plenty of engineering problems remain before AI glasses can become the AI terminal of a new era.
On the current technological path, waveguides are almost the only way to overlay information on the real world while keeping the field of vision clear. Yet to this day, mass-producing waveguides at scale remains an unsolved engineering problem.
The heavy homogenization of the technology route has also shaped some manufacturers' attitudes toward AI glasses. Recent industry rumors point to ByteDance releasing AI glasses soon, but according to speculation from the studio XR Vision, the product may never reach the market, with ByteDance moving straight to developing the next generation.
XR Vision's speculation stems from the dilemma common across today's AI glasses market: hardware homogenization and converging features. As industry observers said of vivo's decision to suspend its AI glasses project, the core reason large manufacturers stay cautious here is that "it is hard to differentiate."
With technological iteration demanding more time, screen-type AI looks more like a long-distance race: whether AI phones or AI glasses, they require continuous integration of operating system, hardware baseline, and ecosystem.
3
Break down the usage logic and you find that the "pen" and the "screen" are not on the same path. Picture a very common scenario: during a meeting you simply want everything people say to be remembered; after the meeting you need to turn the key points into minutes and assign tasks. At the level of AI interaction, the former calls for "sensing and collecting," the latter for "operating and correcting."
Whether AI is allowed to take part in decision-making is one of the main reasons AI hardware has diverged. Some scenarios involve no decisions at all, and perception comes first; once the decision-making process begins, the interaction logic changes completely.
Why must "decision - making" be more cautious? Melanie Mitchell, a complex systems researcher, said in an interview that generative models are prone to show fragility and self - contradiction in different situations, so humans must maintain the role of supervision and editing.
Put more plainly: the model can help push a process forward, but people must be able to monitor that process and intervene at any time. This is what makes screen-based AI hardware irreplaceable.
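As a minimal illustration of that requirement, the sketch below gates every model-proposed action behind an explicit on-screen confirmation. The propose_actions() helper and the sample actions are hypothetical placeholders, not any assistant's real API; the point is simply that execution waits for a human decision.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    """Something the model wants to do on the user's behalf."""
    description: str   # shown to the user before anything happens

    def execute(self) -> None:
        print(f"Executed: {self.description}")

def propose_actions(meeting_summary: str) -> list[ProposedAction]:
    """Placeholder for a model turning a summary into follow-up actions."""
    return [
        ProposedAction("Create task: send budget figures to Alice"),
        ProposedAction("Schedule follow-up meeting for Friday 10:00"),
    ]

def review_and_execute(actions: list[ProposedAction]) -> None:
    """The 'screen' role: every action is confirmable or rejectable before it runs."""
    for action in actions:
        answer = input(f"{action.description} -> run? [y/N] ").strip().lower()
        if answer == "y":
            action.execute()
        else:
            print(f"Skipped: {action.description}")

review_and_execute(propose_actions("Budget review moved to Friday."))
```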
Academia has a term for this risk: "automation bias," the tendency to trust a system's suggestions even when something vaguely feels wrong.
As early as 2000, a U.S. experimental study, "Accountability and Automation Bias," found that when participants were made accountable for overall performance or decision accuracy, the incidence of automation bias fell. In the AI era, the implication is that wherever outcomes must be answered for, people need an interface that can be confirmed and verified, which is exactly what the "screen" does best.
The contest between the "pen" and the "screen," then, is essentially the result of AI hardware splitting along scenarios: the body-worn "pen" acts as a "perception and memory peripheral" that continuously collects information, while the screen terminal acts as an "editing and execution center" that emphasizes sustained interaction and traceable tasks.
Given how the underlying technology keeps evolving, though, the value of the "pen" may only show up over a longer cycle.
All-day perception and understanding let the "pen" supply AI with long-term corpora and context. In other words, built on continuous perception, the future "pen" may come to understand users better than the "screen" does: the "pen" maps to long-term memory in the AI era, while the "screen" is more like a workbench.
The Stanford Institute for Human-Centered AI has argued that putting human-AI interaction design at the core of the algorithm is key to making systems usable and trustworthy, stressing that "the algorithm should not only output results but also consider how humans understand and use those results."
In essence, the "pen" and the "screen" simply represent two ways of setting priorities within this logic: "long-term memory" first, or "instant feedback" first.
New-generation AI companies hope to break out of the old interaction logic and create new scenarios and demand through new product forms, which is why, even after the AI Pin's failure, the industry still looks forward to what OpenAI's screenless device will deliver.
On the B2B side, more and more companies have stopped trying to make one device "do everything." Plaud, DingTalk, and Feishu position AI hardware as a voice entry point, which has validated the feasibility of "continuous perception" in vertical scenarios.
In fact, the "pen" and the "screen" are never contradictory. They are more like the two ends of the "impossible triangle" in product design: invisibility and portability, visibility