
Apple Reinvents Image Compression with AI: Same Quality, One-Third the File Size

机器之心 (MachineHeart), 2026-05-30 10:39
Satisfy the human eye

How small can an image be compressed?

In February 2025, the Joint Photographic Experts Group (JPEG) announced something the industry quietly celebrated: JPEG AI. After many years of development, this long-awaited standard, the first end-to-end learning-based international standard for image coding, was officially released.

When the news spread, many researchers reposted it on social media with comments like "AI has finally entered the standard."

The JPEG standard was born in 1992 and has been a fundamental language for human digital images for more than three decades. Now, artificial intelligence is starting to rewrite the grammar of this language.

However, behind the celebration lies a subtle reality: even JPEG AI is still quite far from true "perceptual compression."

Engineers know that the traditional metric for compression quality, Peak Signal-to-Noise Ratio (PSNR), has little to do with what the human eye perceives as "good-looking." An image with a high PSNR score may look mediocre, while another with a lower PSNR may appear rich in detail and realistic in texture. Optimizing mathematical metrics and optimizing human perception are two different things.
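To make the gap concrete, here is a minimal Python sketch of how PSNR is derived from mean squared error. The pixel rows and both distortions are made-up values for illustration, not data from the paper. The two distortions carry the same total squared error, so PSNR rates them identically even though one is a uniform brightness shift and the other damages an edge.

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 8-pixel row with a sharp edge in the middle.
original = [50, 50, 50, 50, 200, 200, 200, 200]
# Every pixel is 5 levels brighter: squared error 8 * 25 = 200.
brightness_shift = [v + 5 for v in original]
# Only the two edge pixels change, by 10 each: squared error 100 + 100 = 200.
edge_damage = [50, 50, 50, 60, 190, 200, 200, 200]

print(psnr(original, brightness_shift))
print(psnr(original, edge_damage))
```

Both calls print the same value, because PSNR only sees the total squared error and not where it falls in the image.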

For decades, from JPEG to VVC and then to JPEG AI, the design logic of almost all codecs has stayed within the framework of mathematical metrics. Perceptual compression (optimizing directly for what the human eye experiences) has remained a long-term goal in academic papers rather than an engineering reality that can ship on mobile phones.

At this moment, an engineering team at Apple quietly published a paper presenting their answer, codenamed PICO.

Paper title: What Matters in Practical Learned Image Compression

Paper address: https://arxiv.org/pdf/2605.05148

Why is "looking better" much more difficult than "having higher numbers"?

Before understanding PICO, we need to understand what image compression is actually doing.

Storing a photo as a file is essentially a trade-off between "what to forget and what to remember." Since storage space is limited, some information has to be discarded in a way the viewer hardly notices. Different codecs follow different "discarding methods."

Traditional codecs such as JPEG, AV1, and VVC are rule-based systems designed by hand. They divide the image into blocks, then apply a transform, quantization, and entropy coding. Each step reflects decades of engineering experience. Such systems can score extremely well on mathematical metrics like PSNR, but their design is essentially aimed at "reducing pixel errors" rather than "reducing human eye discomfort."
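The transform and quantization steps can be illustrated with a toy one-dimensional example. The sketch below applies an orthonormal DCT-II to a single made-up row of eight pixels and uses one hypothetical uniform quantization step. Real codecs use two-dimensional transforms, per-frequency quantization, and an entropy coder, all of which are omitted here.

```python
import math

def dct(samples):
    """Orthonormal DCT-II of a list of samples."""
    n = len(samples)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(samples))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(norm * total)
    return coeffs

def idct(coeffs):
    """Inverse of dct(), i.e. an orthonormal DCT-III."""
    n = len(coeffs)
    samples = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += norm * c * math.cos(math.pi * (i + 0.5) * k / n)
        samples.append(total)
    return samples

STEP = 10.0  # hypothetical uniform quantization step
row = [100, 102, 104, 106, 108, 110, 112, 114]  # made-up smooth pixel row

# Quantization is the lossy step: small coefficients round to zero.
levels = [round(c / STEP) for c in dct(row)]
restored = idct([level * STEP for level in levels])

print(levels)
print([round(v, 1) for v in restored])
```

For a smooth row like this one, most quantized coefficients are zero, which is what makes the later entropy coding stage effective. The price is that `restored` no longer matches `row` exactly.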

The problem is that the human eye is not a pixel error meter. Its sensitivity to textures, text, and details is much more complex than any mathematical formula. When you compress a street-view photo to a very small size, the PSNR may still look decent, but you will see blurred building edges and distorted road-sign text, and these are exactly what the human eye notices first.

Learning-based codecs open a new door in theory: neural networks can be trained end-to-end directly for human perception rather than for mathematical formulas. Before PICO, however, existing perceptual learning-based codecs were too slow to be practical, lacked cross-device compatibility, or could not flexibly control the bit rate, and so could not be integrated into a consumer-grade product.

Three core problems, three solutions

The full name of PICO is Perceptual Image Codec. This name directly indicates its goal: to satisfy the human eye.

The research team systematically explored millions of model configurations and introduced several key technological innovations.

First problem: What to do about slow entropy encoding?

There is a difficult problem in image compression: to compress an image to a smaller size, the codec needs to use an "entropy model" to accurately estimate the information content of each pixel. The most accurate method is autoregressive encoding: when compressing each pixel, it is necessary to first look at the surrounding compressed pixels and make predictions in sequence. This is like a chef having to look back at the state of the pot after adding each ingredient before deciding the next step. It is accurate but extremely slow.

PICO's solution is the One-shot Context Model: it separates out the most critical quantity in entropy coding, the "scale parameter," and computes it in a single forward pass, with no sequential waiting; the remaining parameters can be computed in parallel. This retains the accuracy of autoregressive coding while bypassing its speed bottleneck. According to the paper, removing this module causes a 10.28% drop in model performance, while adding it has little impact on speed.
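Why the scale parameter matters so much can be seen in a small sketch. It assumes a discretized Gaussian entropy model, a common choice in learned compression, and is not confirmed to be PICO's exact formulation. Each quantized value costs roughly -log2 of the probability the model assigns to it, so a scale that matches the data yields fewer bits. The values and scales below are invented for illustration.

```python
import math

def gaussian_cdf(x, mean, scale):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))

def estimated_bits(value, mean, scale):
    """Approximate bits to code an integer `value` under a discretized Gaussian."""
    upper = gaussian_cdf(value + 0.5, mean, scale)
    lower = gaussian_cdf(value - 0.5, mean, scale)
    probability = max(upper - lower, 1e-12)  # guard against log(0)
    return -math.log2(probability)

# Hypothetical quantized values clustered tightly around zero.
values = [0, 1, -1, 0, 2, 0, -1, 0]

# The same data costed under a well-matched scale and an overly wide one.
well_matched = sum(estimated_bits(v, 0.0, 1.0) for v in values)
too_wide = sum(estimated_bits(v, 0.0, 8.0) for v in values)

print(round(well_matched, 2), round(too_wide, 2))
```

The well-matched scale gives the smaller total. An autoregressive model would predict such parameters one element at a time, whereas the approach described above obtains the scale for all elements in one pass.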

Second problem: What to do about hallucinations in perceptual training?

Images from models trained with GANs (Generative Adversarial Networks) often "look real," but the realism may be fabricated: hair strands turn into patterns that never existed, and false textures appear on smooth surfaces. More troublesome still, the human eye is extremely sensitive to text. Even a slight deformation of a single letter is noticed immediately.

The team designed TextFidelityLoss specifically for text: it uses an existing text detector to automatically find the text areas in the image, imposes strict pixel fidelity constraints in these areas, and limits the GAN's freedom to invent detail there. Experiments show that adding this loss function halves the absolute error in the text areas.
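The idea of constraining only the text regions can be sketched as a masked pixel loss. This is an illustration under assumptions, not the paper's TextFidelityLoss: the function name, the box format, and the use of plain mean absolute error are all hypothetical, and the boxes stand in for the output of a text detector.

```python
def text_region_l1(original, reconstructed, text_boxes):
    """Mean absolute error over pixels that fall inside any text box.

    Images are 2-D lists of grayscale values. Each box is a hypothetical
    (top, left, bottom, right) tuple with exclusive bottom/right edges.
    """
    # Collect pixel coordinates in a set so overlapping boxes count once.
    pixels = {
        (y, x)
        for top, left, bottom, right in text_boxes
        for y in range(top, bottom)
        for x in range(left, right)
    }
    if not pixels:
        return 0.0
    error = sum(abs(original[y][x] - reconstructed[y][x]) for y, x in pixels)
    return error / len(pixels)

# Made-up 4x4 image: flat gray, with two reconstruction errors.
original = [[10] * 4 for _ in range(4)]
reconstructed = [row[:] for row in original]
reconstructed[0][0] = 14   # error of 4 inside the text box
reconstructed[3][3] = 18   # error of 8 outside the text box

boxes = [(0, 0, 2, 2)]     # pretend a detector found text in the top-left corner
print(text_region_l1(original, reconstructed, boxes))
```

Only the error inside the box contributes to the result. During training, a term like this would be added to the other losses so that the generator is penalized for altering pixels where text was detected.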

Third problem: What to do about the color seams left by tile-based processing?

To run quickly on mobile phone chips, PICO divides the image into tiles of 504×504 pixels, processes them separately, and then stitches them back together. However, GAN training tends to neglect low-frequency color, which produces visible color differences between adjacent tiles, similar to the visible seams left by poor stitching when retouching an image. The research team introduced TilingArtifactLoss, a multi-resolution L1 loss, to force the model to keep color consistent across multiple spatial frequencies. This measure reduces the error at tile boundaries by more than half.
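A multi-resolution L1 loss can be sketched as follows. This is a generic illustration, not the paper's TilingArtifactLoss: the 2x2 average pooling, the number of levels, and the equal weighting of levels are assumptions. The made-up example compares a low-frequency color offset on one half of the image, as a mismatched tile might produce, with fine checkerboard noise that has the same full-resolution error.

```python
def downsample2(image):
    """Halve resolution by averaging each 2x2 block (even dimensions required)."""
    return [
        [(image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
         for x in range(0, len(image[0]), 2)]
        for y in range(0, len(image), 2)
    ]

def l1(a, b):
    """Mean absolute difference between two equally sized 2-D lists."""
    count = len(a) * len(a[0])
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / count

def multi_resolution_l1(original, reconstructed, levels=3):
    """Sum of L1 errors at successively halved resolutions."""
    loss = 0.0
    for _ in range(levels):
        loss += l1(original, reconstructed)
        if len(original) % 2 or len(original[0]) % 2:
            break  # cannot halve any further
        original, reconstructed = downsample2(original), downsample2(reconstructed)
    return loss

original = [[100] * 4 for _ in range(4)]
# Left half is 4 levels too bright: a low-frequency shift.
tile_shift = [[104, 104, 100, 100] for _ in range(4)]
# Alternating +2/-2 noise: high-frequency, same full-resolution L1 of 2.
checkerboard = [[102 if (x + y) % 2 == 0 else 98 for x in range(4)] for y in range(4)]

print(multi_resolution_l1(original, tile_shift))
print(multi_resolution_l1(original, checkerboard))
```

The checkerboard noise averages away at coarser levels, while the color shift survives at every level and is penalized three times as heavily. That is the property that makes a multi-resolution loss suited to suppressing tile-to-tile color mismatches.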

Experimental results

The Apple team didn't rely on benchmark metrics alone. They commissioned a third-party platform, Mabyduck, to run a large-scale human subjective evaluation.

The evaluation used blind pairwise comparison: 610 screened evaluators (each had to pass a color-blindness test and a compression-artifact discrimination test) compared reconstructions of the same image from different codecs in pairs, and the results were aggregated into Bayesian Elo scores. A total of 74,925 pairwise comparisons were collected.
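For intuition on how pairwise votes turn into a ranking, here is the classic sequential Elo update. It is a simpler relative of the Bayesian Elo fitting used in the study, not the same procedure, and the codec names, vote counts, and K-factor are invented for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def update(ratings, winner, loser, k=16.0):
    """Shift ratings after one comparison; surprising wins move ratings more."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

# Hypothetical codecs starting from equal ratings.
ratings = {"codec_a": 1500.0, "codec_b": 1500.0}

# Made-up votes: codec_a preferred in 7 of 10 blind comparisons.
votes = [("codec_a", "codec_b")] * 7 + [("codec_b", "codec_a")] * 3
for winner, loser in votes:
    update(ratings, winner, loser)

print(ratings)
```

After the ten votes, `codec_a` ends up with the higher rating. Sequential Elo depends on the order of the votes, which is one reason large studies prefer a Bayesian fit over all comparisons at once.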

The final numbers speak for themselves: at the same visual quality, a PICO file is only about one-third to one-half the size of one produced by AV1, AV2, VVC, ECM, or JPEG AI. In other words, to store the same image, it needs only 30%-43% of the bits those standards require. Compared with the strongest current learning-based perceptual codecs (such as HiFiC and MRIC), PICO also saves 20%-40% of the file size.
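The 30%-43% figure can be restated as bit savings and as a size ratio with a few lines of arithmetic. The 3.0 MB baseline file below is a hypothetical example, not a number from the paper.

```python
def savings(fraction_of_bits):
    """Share of bits saved when a file needs only `fraction_of_bits` of the baseline."""
    return 1.0 - fraction_of_bits

def size_ratio(fraction_of_bits):
    """How many times larger the baseline file is."""
    return 1.0 / fraction_of_bits

baseline_mb = 3.0  # hypothetical file size from a traditional codec

for fraction in (0.30, 0.43):
    print(
        f"{fraction:.0%} of the bits: "
        f"{savings(fraction):.0%} saved, "
        f"baseline is {size_ratio(fraction):.1f}x larger, "
        f"{baseline_mb * fraction:.2f} MB instead of {baseline_mb:.1f} MB"
    )
```

In other words, needing 30%-43% of the bits corresponds to saving roughly 57%-70% of them.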

In terms of speed, on the iPhone 17 Pro Max, PICO takes only 230 milliseconds to encode a 12MP photo and 150 milliseconds to decode it. Most top-tier ML codecs run slower than this even on NVIDIA V100 server GPUs.
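These timings translate into throughput with simple arithmetic. The tile count at the end rests on two assumptions not stated in the article: that the 12MP photo is 4032×3024 pixels, and that the 504×504 tiles do not overlap.

```python
def megapixels_per_second(megapixels, milliseconds):
    """Throughput implied by processing `megapixels` in `milliseconds`."""
    return megapixels / (milliseconds / 1000.0)

def tile_count(width, height, tile=504):
    """Non-overlapping tiles needed to cover the image (integer ceiling division)."""
    return -(-width // tile) * -(-height // tile)

encode_rate = megapixels_per_second(12, 230)  # about 52 MP/s
decode_rate = megapixels_per_second(12, 150)  # about 80 MP/s

# Assumed resolution for a 12MP photo; 504 divides both dimensions evenly.
tiles = tile_count(4032, 3024)                # 8 columns * 6 rows

print(round(encode_rate), round(decode_rate), tiles)
```

Under those assumptions, a 12MP photo splits into 48 tiles with no partial tiles at the edges.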

It is worth noting that the paper also records a "counter-example": on the traditional PSNR metric, PICO's performance is mediocre, even worse than DCVC-RT and VVC. This confirms the team's basic judgment: optimizing perceptual quality and optimizing mathematical metrics are fundamentally different directions, and you can't have both.

A milestone, not an endpoint

PICO certainly has limitations. The paper acknowledges that for highly regular synthetic images such as cartoons and schematics, PICO's compression efficiency is lower than that of traditional codecs, because this type of content is naturally suited to rule-driven autoregressive modeling rather than perceptual generation.

However, these limitations do not overshadow the significance of this work.

Over the past three decades, nearly all technological progress in image compression has happened on the track of "making the numbers look better." From JPEG to HEVC and then to VVC, generations of engineers have optimized metrics such as PSNR and SSIM. Human perception has always been the "difficult problem" to be worked around.

PICO is the first systematic attempt to tackle this difficult problem head-on, from architecture search and loss function design to large-scale human subjective evaluation, all integrated into a codec that can run in real time on mobile phones.

The next time you share a photo on an Apple device, you may not feel any difference. But perhaps in that quiet compression process, an algorithm tailored for human eye perception is deciding which information is worth keeping and which can be quietly forgotten.

The team: From WaveOne to Apple

The corresponding author of this paper is Oren Rippel, an Apple researcher and a well-known figure in the compression field.

His name first drew wide attention in 2017. At that time, he was at the startup WaveOne and published a paper titled "Real-Time Adaptive Image Compression," which used neural networks to beat all mainstream codecs of the time while maintaining real-time speed. That paper caused quite a stir in the academic community and established Rippel's position in the field of learning-based compression.

After that, the same core members continued at WaveOne and launched ELF-VC for video compression, which achieved 44% bit-rate savings over H.264 on the UVG video test set and ran more than five times faster than comparable ML codecs.

The WaveOne team later joined Apple as a group. PICO is their first systematic answer in perceptual image compression, drawing on Apple's computing power and platform resources.

This article is from the WeChat official account “MachineHeart” (ID: almosthuman2014), author: Compression is Intelligence. It is published by 36Kr with permission.