
In the era of AI, humans are panicked, and screens look awkward.

峰小瑞 2026-03-13 09:37
You have an 8K screen, so why do your eyes still feel so tired?

We are living in an era inundated with AI trends.

Since OpenAI kick-started the era of large language models, we've witnessed incredibly realistic AI-generated video. The hype around Sora had barely faded when Seedance 2.0 shook the industry, to say nothing of the recent sensation, Open Claw. AI technology seems to evolve on a weekly basis. We're afraid of missing any trend, and even more afraid that missing one will leave us behind in this fast-paced era.

However, if you peel back the layers of these flashy technologies, you'll find that they all boil down to one battle: the fight for human attention.

As the core terminal for information, the screen offers a vivid microcosm of this battle through the evolution of its shape. The industrial era pursued horizontal expansion; early smartphones grew narrower and longer for a better grip, producing various "remote-control-like" and "ultra-wide" screens. Now screens are converging on a stable form: the square.

For example, Apple's rumored first foldable iPhone might adopt a "wide-fold" design to address the "narrow inner screen" problem of traditional foldable phones. When unfolded, its 7.8-inch inner screen won't be a long, thin rectangle but a wider display with an aspect ratio of about 1.4:1.

This isn't limited to phones. IMAX screens have an aspect ratio of 1.43:1, and the per-eye display of Apple Vision Pro 2 has an aspect ratio of about 1.21:1.

This trend toward the square isn't a mere revival. It's a necessary technological choice that the audio-visual industry, after a hundred years of "horizontal expansion", has made in the face of information overload.

In fact, the square may be more in tune with human physiological instincts than any widescreen.

Goldmann's visual field diagram. The white part in the middle corresponds to the "visual field seen by both eyes together", and the textured part corresponds to the "peripheral area seen by one eye but not the other". Image source: 1964 NASA report, Bioastronautics Data Book.

As the diagram shows, the visual field seen by both eyes together is naturally closer to a square, approaching a circle. Now that AIGC has brought the cost of content generation close to zero and content supply is growing exponentially, square screens can capture users' shrinking attention better than widescreens, because they project information more precisely onto the focal area of the human eye, where it is taken in more efficiently.

In this industry research report, "Technology-Driven Sensory Revolution", we start from human physiological limitations and the evolutionary history of the audio-visual industry. We analyze why the square might be the endpoint of screen evolution, and how future technology development and startup directions will change as a result.

We'll answer the following questions in this article:

  • In today's world where the attention threshold has shrunk to seconds, how can we combat information overload?
  • Was the popularization of the 16:9 ratio a result of demand or compromise?
  • Why do The Legend of Zhen Huan and Neon Genesis Evangelion share the same underlying logic?
  • How does AIGC use the "Save the Cat Beat" to bring content production costs down to zero?
  • Why is the widescreen becoming a hindrance in the era of short dramas?
  • From IMAX to triple-fold screens, why are screens returning to the "square" shape?

/ 01 / The Human "Token Retention" Mechanism

Before exploring the evolutionary logic of audio-visual carriers over the past century, we must first face a physiological fact: while display technology has advanced at an astonishing, Moore's-Law-like pace, the human eye, which receives the information, hasn't undergone any substantial upgrade in tens of thousands of years of evolution.

1. Vision is the most important channel for humans to receive external information, without a doubt.

Anatomical data shows that the human optic nerve contains about 1.2 million nerve fibers, while the cochlear nerve responsible for hearing only has about 30,000.

If we assume the total amount of human sensory information reception to be 100%, the visual channel accounts for about 70%, the auditory channel about 20%, and the remaining 10% is shared by smell, taste, touch, and the sense of balance.

This means that most of our understanding of the world is based on visual signals.

2. The brain needs to prevent overstimulation.

Although visual bandwidth is extremely wide, the human brain's processing capacity is not infinite. Our pursuit of novelty and exploration rests on continuous stimulation of the nervous system, and this stimulation has a physiological threshold. If external information floods in unchecked, the brain can easily enter an overstressed state.

To protect ourselves, we've developed a highly refined mechanism for filtering and retaining information: faced with a vast amount of audio-visual input, what we ultimately retain is often not a specific, pixel-by-pixel image but a core "feeling".

For example, if you got a perfect score on an exam in school, years later you might not remember the specific questions on the paper or even which subject it was, but you can clearly recall the sense of achievement at that moment. Borrowing a concept from the large-model era, this is essentially "distilling a large amount of redundant information into a small number of tokens".

We're so good at simplifying complex information, yet we've entered an era where the supply of information has exploded. This means that if content can't capture our attention in a short time, the brain will directly label it as "junk information".

/ 02 / A Century-long "War" for Attention

The history of audio-visual media is essentially the story of creators' continuous efforts to gain control of human sensory channels. If we look back to the early days, before the technological boom, we find that all carriers were "square" and "focused".

1. From static to dynamic: murals to drama.

The earliest image records can be traced back to ancient murals. Whether using a brush or carving, humans have always been looking for ways to record beauty.

When murals "came to life", they became dramas. The visual focus of a drama is fixed on the square-shaped central stage. At this stage, there were no camera zooms or montages; creators guided the audience's attention through the actors' movements in the center of the frame.

2. From silent to sound: Taking over action and atmosphere.

In the late 19th century, movies were born. Early filmmakers were essentially making "recordings of dramas". Even in a later work such as Modern Times (1936), with its largely silent, black-and-white 4:3 frame, the lack of color and spoken dialogue forced creators to rely on extremely exaggerated movements and expressions to hold the audience's attention. In this nearly square composition, every subtle movement was magnified.

Poster of Modern Times. Image source: WIKI.

As soundtracks were added, creators began to explore the ability of sound to create atmosphere.

The Good, the Bad and the Ugly (1966) demonstrated how sound can take over when vision fails. In one scene, the camera moves extremely slowly and the narrative barely advances. Judged by the picture alone, modern audiences' visual attention would start to wander after the first 7 seconds because of the "low information density".

However, Ennio Morricone's iconic score took over in time. The tense ambient sounds and the long, menacing melody created a strong sense of immersion through the auditory channel while the visuals were uneventful. Years later, you might not remember the characters' faces, but that soundtrack will instantly transport you back to the Wild West.

3. From black-and-white to color, from live-action to animation: Control of imagination

By 1968, color had largely displaced black-and-white on screen.

In 2001: A Space Odyssey, Stanley Kubrick used color to greatly expand the boundaries of human imagination. Color images are not only a visual upgrade but also a significant improvement in information input efficiency. Different color combinations represent different emotional tendencies and environmental information, making information input more dense.

Animation pushed this control further. Compared with the uncontrollable natural environment of live-action movies, animation represents humans' in-depth control over the audio-visual medium. Every frame, light, and composition is precisely determined.

For example, The Little Monster from Langlang Mountain is very touching even without expensive special effects. This is because its color and delicate narrative rhythm have captured the public's emotions.

This shift from "recording reality" to "artificial generation" has essentially laid the logical foundation for content creation in the AIGC era.

/ 03 / If "square" is the answer, why have screens become wider over the past century?

If the "square" is the optimal physiological solution, why have screens grown ever wider and more elongated over the past fifty years? Behind this is a tug-of-war between industry interests and physical realities.

In the late 19th century, Thomas Edison's laboratory established the 4:3 (i.e., 1.33:1) aspect ratio when developing 35mm film. This ratio wasn't chosen randomly; it's the closest to the focal range that the human eye can cover in a natural state without large-scale sweeps.

The standard specification for 35mm silent films established in the late 19th century. Note the 4 perforations corresponding to each frame in the picture. This engineering limitation led to the precise 4:3 aspect ratio. Image source: Film Atlas.

In the 1950s, as televisions entered every household, cinemas faced an existential crisis. To attract audiences back from their living-room TVs, the film industry made a change: since TVs used the same 4:3 ratio as movies, movies had to become wider.

In 1953, 20th Century Fox introduced CinemaScope technology, stretching the aspect ratio to 2.35:1. The core logic was: Since the visual center was saturated, use an extremely wide frame to fill the audience's "peripheral vision" and create an illusion of immersion.

In the 1980s, 16:9 was proposed as the ratio for high-definition TVs and gradually became the standard. In reality, however, its popularization owed more to physical space constraints: in modern buildings, ceiling height makes it difficult for manufacturers to enlarge a screen by adding height. Expanding horizontally (making it wider) is more cost-effective and easier to achieve. The 16:9 and subsequent 21:9 ratios are in effect a forced reshaping of visual habits.

This has led to an awkward situation: the human physiological field of vision tends to be elliptical or even closer to a square, approaching a circle, yet we're placed in an extremely elongated visual environment. On a 16:9 widescreen, our eyes have to make more frequent horizontal movements (sweeps) to capture information at the edges, even as the rhythm of content production accelerates.
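To put rough numbers on this mismatch, here is a minimal sketch that computes how much of each screen a centered square would cover. Treating the largest centered square as a stand-in for the focal area is an illustrative assumption, not a physiological measurement, and the function name `central_square_share` is hypothetical. The aspect ratios are the ones quoted in this article.

```python
from fractions import Fraction

# Aspect ratios (width / height) quoted in the article.
RATIOS = {
    "Vision Pro per-eye (about 1.21:1)": Fraction(121, 100),
    "4:3": Fraction(4, 3),
    "IMAX (1.43:1)": Fraction(143, 100),
    "16:9": Fraction(16, 9),
    "21:9": Fraction(21, 9),
    "CinemaScope (2.35:1)": Fraction(235, 100),
}


def central_square_share(ratio):
    """Share of the screen area covered by the largest centered square.

    Assumption for illustration only: the square stands in for the focal
    area. For a screen of height h and width r*h, the square has area h*h
    out of r*h*h, so its share is 1/r.
    """
    if ratio < 1:  # portrait orientation: use the inverse ratio
        ratio = 1 / ratio
    return 1 / ratio


for name, ratio in RATIOS.items():
    share = central_square_share(ratio)
    print(f"{name:<36} square: {float(share):.1%}  side bands: {float(1 - share):.1%}")
```

Under this assumption, the central square covers 3/4 of a 4:3 frame but only 9/16 of a 16:9 frame, so nearly half of a widescreen picture sits in the side bands that the eye has to sweep toward.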

/ 04 / Why is The Legend of Zhen Huan = Neon Genesis Evangelion = Hamlet?

While the screen ratio has been expanding horizontally, the content structure carried by audio-visual media has already completed a highly standardized evolution.

In the film, television, and content industries, there's a well-known narrative template called "Save the Cat".

It specifies very clear time points for when the protagonist appears, when they fall into crisis, and when they are redeemed. If we strip away the surface artistic packaging, we can even say that The Legend of Zhen Huan, Neon Genesis Evangelion, and Hamlet share the same underlying story beats.

For the content industry, this "metronome" is a guarantee of production efficiency. It shows that most successful narratives aren't random creations but precise hits on human psychological stimulation points.

It's precisely because of this "metronome" that the emergence of AIGC has enabled an exponential increase in content creation speed.

Since Sora's stunning debut, followed by video models such as KeLing and Runway and then the sudden arrival of Seedance 2.0 at the beginning of this year, generating video with AI has become a must-learn skill for almost all professional content creators.

Hollywood director Charles Curran publicly stated after a test that he completed a movie-quality trailer in just 20 minutes for only a few dozen dollars. Seedance 2.0 has the potential to disrupt the traditional film and television industry and may reshape the Hollywood creative process and cost structure.

As AI video model technology continues to improve, the threshold for implementing the "metronome" will only get lower. Short dramas are the ultimate manifestation of using this metronome.

In short dramas, the traditional minute-level beats are further compressed to seconds. To keep the audience engaged within a very short attention span, creators are forced to pack the highest-intensity stimulation into a very small visual space. Each episode is only one or two minutes long, and every twist has to land precisely on its beat.
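The compression from minutes to seconds can be illustrated with a small sketch. It assumes the commonly cited "Save the Cat" beat sheet, where beats are given as page numbers in a 110-page screenplay at roughly one page per minute, and it assumes a short drama keeps the same relative positions. Both are illustrative assumptions rather than anything stated in the report; the selection of beats and the 90-second runtime are likewise hypothetical.

```python
# Selected beats as page numbers in a 110-page screenplay, following the
# commonly cited "Save the Cat" beat sheet (an assumption; the article
# itself does not list the beats).
BEATS = [
    ("Opening Image", 1),
    ("Catalyst", 12),
    ("Break into Two", 25),
    ("Midpoint", 55),
    ("All Is Lost", 75),
    ("Break into Three", 85),
    ("Final Image", 110),
]
SCRIPT_PAGES = 110


def scale_beats(runtime_seconds):
    """Place each beat at the same relative position within a given runtime.

    Proportional scaling is an illustrative assumption, not a rule that
    short-drama writers are known to follow.
    """
    return [
        (name, round(page / SCRIPT_PAGES * runtime_seconds))
        for name, page in BEATS
    ]


def fmt(seconds):
    """Format a number of seconds as m:ss."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


for label, runtime in [("110-minute feature", 110 * 60), ("90-second episode", 90)]:
    print(label)
    for name, t in scale_beats(runtime):
        print(f"  {fmt(t):>7}  {name}")
```

With these assumptions, a midpoint that arrives 55 minutes into a feature arrives 45 seconds into a 90-second episode, which is why every twist has to sit within a second or two of its slot.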

This highly concentrated form, built on high-frequency reversals, is making the "wide visual background" lose its original purpose.

In a 16:9 or even wider frame, the two sides of the picture usually hold environmental details, architectural compositions, or distant scenery. In traditional slow-paced narratives, these details create a "sense of atmosphere". However, once the narrative adopts a "short-drama-style beat", the audience's visual focus is firmly locked on the conflict point in the center of the picture.

For the audience, the wide background that exists to fill the peripheral vision becomes redundant. The old model that relied on "atmosphere building", as in The Good, the Bad and the Ugly, is being replaced by a more direct and energy-efficient visual focus.

This also indicates that audio-visual carriers are about to complete a century-long cycle: returning from wide physical occupation to the "square" that best suits the focal field of vision.

/ 05 / Four Ways to Reshape Sensory Experience

The return of audio-visual logic to the "square" is not just a change in ratio but the beginning of a trend to take over all senses. Based on this, we can imagine the future: How will the technological roadmap and the form of consumer hardware develop? More importantly, where are the startup and investment opportunities?

1. Establish "direct trust" in emotion

The end goal of immersion isn't just piling up flat pixels but establishing "direct trust" at the physiological level. Although vision accounts for about 70% of the bandwidth, the brain can easily recognize when what it sees is fake. In contrast, the "presence" brought by spatial audio can greatly enhance the user experience.

Take the spatial audio function popularized in the latest generations of iPhones as an example. It allows users to freely choose whether the sound follows the camera, the environment, or is off-screen during