A Review of He Kaiming's NeurIPS 2025 Speech: Thirty Years of Visual Object Detection
Not long ago, NeurIPS 2025 was held. As one of the top conferences in artificial intelligence, it featured a wealth of notable work and talks from leading researchers.
One particular honor stood out as especially significant and well-deserved—the classic paper "Faster R-CNN", co-authored by Ren Shaoqing, He Kaiming, Ross Girshick, and Sun Jian, won the "Test of Time Award".
Anyone who has been exposed to computer vision is no stranger to this paper. Since its publication in 2015, "Faster R-CNN" has become one of the landmark works of the field. It not only established the core paradigm of modern object detection frameworks but also served as a lighthouse, profoundly influencing and guiding the direction of visual models for the following decade.
Paper link: https://arxiv.org/pdf/1506.01497
To mark this historic moment, He Kaiming delivered a speech at the conference titled "A Brief History of Visual Object Detection".
The slides from He Kaiming's talk have been made public at the following link:
https://people.csail.mit.edu/kaiming/neurips2025talk/neurips2025_fasterrcnn_kaiming.pdf
Judging from its content, the speech is less a technical report than an epic about how computers learned to "see the world," summarizing the development of visual object detection over the past 30 years. Each work introduced in the speech has won a Test of Time Award at a top conference and played a decisive role in the development of visual intelligence.
Are you curious about why today's AI can instantly recognize cats, dogs, cars, and even their positions in a photo, while this was considered an almost impossible task just a decade ago?
Let's follow the perspective of the master and travel back to that "primitive" era to see how this journey unfolded.
Primitive: Hand-crafted "Magnifying Glasses"
Before the explosion of deep learning, computer vision scientists were more like "craftsmen."
Early attempts at face detection: As early as the 1990s, scientists began to use neural networks and statistical methods to detect faces:
In 1996: Rowley et al. published "Neural Network-Based Face Detection", which was the first CV paper He Kaiming read. It used early neural networks to search for faces in image pyramids.
In 1997: Osuna et al. introduced support vector machines and published "SVM for Face Detection", attempting to draw a perfect classification line in the data.
In 2001: The famous Viola-Jones Framework emerged. It achieved extremely fast face detection through simple feature combinations. Even today, the autofocus function of many old cameras owes its existence to this framework.
The golden age of feature engineering: Since it was difficult to detect the "whole face," scientists started looking for "key points" and "textures." In the following years, feature descriptors took center stage:
In 1999: Lowe proposed SIFT (Scale-Invariant Feature Transform). This method could recognize objects even under rotation and scaling, making it the absolute king at that time.
In 2003: Sivic and Zisserman borrowed the concept from text search and proposed the "Bag of Visual Words" model, treating images as a collection of "visual words."
In 2005: Dalal and Triggs invented HOG (Histogram of Oriented Gradients) to describe the contours of pedestrians. In the same year, Grauman and Darrell proposed the "Pyramid Match Kernel" to compare the similarity between two sets of features.
In 2006: Lazebnik et al. further proposed "Spatial Pyramid Matching", which solved the problem of the loss of spatial location information in the bag-of-words model.
In 2008: DPM (Deformable Part Model), the culmination of feature engineering, made its debut. It regarded objects as deformable parts (such as a person's head, hands, and feet) connected like springs. This was the peak of traditional methods.
What were the pain points? The features were hand-crafted, and classifiers (such as SVMs) could only work with this limited information. The approach was not only slow but also struggled to adapt to complex scenes.
Dawn: The "Brute Force Aesthetics" of AlexNet and R-CNN
In 2012, AlexNet burst onto the scene, showing that learned features could far outstrip hand-crafted ones. But how could that power be brought to object detection?
The thunderbolt of deep learning: In 2012, AlexNet (Krizhevsky et al.) won the ImageNet competition by a landslide. It proved that deep convolutional neural networks (CNNs) had a much stronger ability to extract features than hand-crafted methods.
R-CNN: from classification to detection. But how could CNNs be used for object detection, i.e., to draw bounding boxes around objects? In 2014, Girshick et al. proposed the groundbreaking R-CNN (Region-based CNN). Its idea was straightforward (a rough sketch follows the two steps below):
First, use a traditional algorithm (Selective Search) to extract about 2,000 "region proposals" from the image.
Then, feed each region into the CNN to extract features and use an SVM for classification.
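Here is a minimal sketch of that two-step flow, assuming a dummy image, hard-coded boxes in place of Selective Search, and a plain linear layer standing in for the per-class SVMs; it is illustrative, not the authors' implementation.

```python
# Minimal R-CNN-style inference sketch (illustrative only).
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
from torchvision.models import resnet18

backbone = resnet18(weights=None)          # feature extractor (untrained here)
backbone.fc = nn.Identity()                # keep the 512-d pooled features
backbone.eval()
classifier = nn.Linear(512, 21)            # e.g. 20 object classes + background

image = torch.rand(3, 480, 640)            # dummy image standing in for a real photo
proposals = [(30, 40, 200, 180), (250, 100, 400, 300)]  # (x1, y1, x2, y2) boxes

scores = []
with torch.no_grad():
    for x1, y1, x2, y2 in proposals:       # one full CNN pass per proposal
        crop = image[:, y1:y2, x1:x2]      # cut the region out of the image
        crop = TF.resize(crop, [224, 224]) # warp every region to a fixed size
        feat = backbone(crop.unsqueeze(0)) # (1, 512) feature vector
        scores.append(classifier(feat))    # class scores for this region

print(torch.cat(scores).shape)             # torch.Size([2, 21])
```

With roughly 2,000 proposals per image, this loop implies about 2,000 full CNN forward passes, which is exactly the cost the next generation of methods set out to remove.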
Peak: The "Speed Evolution" of Faster R-CNN
R-CNN required passing each candidate bounding box through the CNN, resulting in a huge computational load. Scientists began to think about how to reuse computations.
In 2014: He Kaiming's team proposed SPP-Net (Spatial Pyramid Pooling). Its spatial pyramid pooling layer let the network accept images of arbitrary size and, crucially, compute the convolutional features for the whole image only once, pooling a fixed-length feature for each region from that shared map. This significantly accelerated detection.
In 2015: Girshick, inspired by SPP-Net, introduced Fast R-CNN. Its RoI Pooling layer pools a fixed-size feature for every region from the shared feature map, integrating feature extraction, classification, and box regression into a single network. It was not only fast but could also be trained end to end (see the sketch below).
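A minimal sketch of what "computing the full-image features once" buys, assuming a made-up feature map and two boxes; torchvision's roi_pool stands in for Fast R-CNN's RoI Pooling layer, and SPP-Net's pyramid pooling is the multi-scale version of the same trick.

```python
# Pooling arbitrary-sized regions from one shared feature map (illustrative only).
import torch
from torchvision.ops import roi_pool

feature_map = torch.rand(1, 256, 60, 80)    # one CNN pass over the whole image
# Boxes in image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0,  30.,  40., 200., 180.],
                     [0, 250., 100., 400., 300.]])

# Each region, whatever its size, is pooled to the same 7x7 grid, so a single
# fully connected head can classify and regress every proposal.
region_feats = roi_pool(feature_map, rois, output_size=(7, 7),
                        spatial_scale=60 / 480)   # feature-map stride vs. image size
print(region_feats.shape)                   # torch.Size([2, 256, 7, 7])
```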
The final bottleneck: Even so, the "region proposals" still relied on the cumbersome traditional algorithm (Selective Search), which became the speed bottleneck of the system.
In 2015, the birth of Faster R-CNN: He Kaiming's team proposed the RPN (Region Proposal Network). Inspired by the "Space Displacement Net" proposed by LeCun et al. in 1991, they let the network "slide" over the feature map and predict likely object locations relative to pre-defined anchors (sketched below).
At this point, every stage of object detection (proposal generation, feature extraction, classification, and regression) was handled by neural networks, making the pipeline truly "end-to-end." With a leap in both speed and accuracy, computer vision finally entered the era of real-time detection.
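For concreteness, here is a rough sketch of how RPN-style anchors tile a feature map; the stride, scales, and aspect ratios below are placeholder values, not the paper's exact configuration.

```python
# Generating a dense grid of anchor boxes over a feature map (illustrative only).
import itertools
import torch

stride = 16                      # one feature-map cell covers 16x16 image pixels
scales = (128, 256, 512)         # anchor sizes in pixels
ratios = (0.5, 1.0, 2.0)         # height/width aspect ratios
feat_h, feat_w = 30, 40          # feature-map size for a 480x640 image

anchors = []
for y, x in itertools.product(range(feat_h), range(feat_w)):
    cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # cell center in image coords
    for s, r in itertools.product(scales, ratios):
        h, w = s * (r ** 0.5), s / (r ** 0.5)         # keep the area close to s*s
        anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

anchors = torch.tensor(anchors)  # (30 * 40 * 9, 4) candidate boxes
print(anchors.shape)             # torch.Size([10800, 4])
# The RPN head then predicts an objectness score and a box offset for every
# anchor, replacing Selective Search with a learned, GPU-friendly step.
```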
The New World Beyond the Mist: Transformer and Everything
Faster R-CNN opened a new era, but the exploration never stopped. In the second half of the speech, He Kaiming showed how the technological wave continued to surge:
If speed was the goal, could the "region proposal" step be eliminated?
In 2016: YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector) were introduced. Like humans taking in the whole scene at a glance, they directly output the locations and categories of all objects, achieving extremely high speed.
In 2017: To close the accuracy gap of single-stage detectors, caused largely by the extreme imbalance between positive and negative samples, He Kaiming's team proposed Focal Loss, embodied in the RetinaNet detector (a short sketch follows below).
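For reference, a minimal sketch of the focal loss for binary classification, with made-up logits; alpha = 0.25 and gamma = 2 are the commonly cited defaults.

```python
# Focal loss: easy examples are down-weighted by (1 - p_t) ** gamma (illustrative only).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([4.0, -3.0, 0.2])   # easy positive, easy negative, hard positive
targets = torch.tensor([1.0, 0.0, 1.0])   # 1 = object, 0 = background
print(focal_loss(logits, targets))        # the easy examples contribute almost nothing
```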
In 2017: Mask R-CNN made a stunning appearance. It added a mask branch to Faster R-CNN, enabling not only bounding boxes but also pixel-level instance segmentation, and introduced RoI Align to fix the misalignment caused by RoI Pooling's coordinate quantization.
In 2020: DETR (Detection Transformer) brought the Transformer architecture to object detection. It abandoned anchors and hand-designed post-processing (NMS) entirely, reformulating detection as direct set prediction with a global attention mechanism (a toy sketch of the matching idea follows below).
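Here is a toy sketch of that set-prediction matching, using SciPy's Hungarian solver on made-up boxes; the real DETR cost also includes generalized IoU and class probabilities from a softmax head.

```python
# One-to-one matching between predictions and ground truth, DETR-style (toy example).
import numpy as np
from scipy.optimize import linear_sum_assignment

pred_boxes = np.array([[0.2, 0.2, 0.4, 0.4],    # 3 predicted boxes (cx, cy, w, h)
                       [0.7, 0.6, 0.2, 0.3],
                       [0.5, 0.5, 0.9, 0.9]])
pred_scores = np.array([0.9, 0.8, 0.1])          # confidence for the target class
gt_boxes = np.array([[0.22, 0.18, 0.4, 0.4],     # 2 ground-truth boxes
                     [0.68, 0.62, 0.2, 0.3]])

# Cost: prefer confident predictions whose boxes are close to each ground truth.
l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (3, 2)
cost = l1 - pred_scores[:, None]

rows, cols = linear_sum_assignment(cost)         # one prediction per ground truth
print([(int(r), int(c)) for r, c in zip(rows, cols)])   # [(0, 0), (1, 1)]
# Every unmatched prediction is trained toward the "no object" class, so duplicates
# are suppressed by the loss itself and NMS is no longer needed at inference time.
```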
In 2023: SAM (Segment Anything Model) emerged. Trained on massive amounts of data, it learned to "segment everything," no longer limited to the categories seen in training, showcasing the prototype of a large vision model.
What Have We Learned in This "Age of Exploration"?
What have we learned in the past few decades?
He Kaiming said, "Write object detection papers and win Test of Time Awards :)"
At the end of the speech, he concluded with a very symbolic image generated by Nano-Banana: a ship sailing into the misty sea.
He said, scientific exploration is like sailing into the mist.
There is no pre-drawn map.
We don't even know if the destination exists.
From hand-crafted features to CNNs and then to Transformers, each leap was like explorers sighting a new continent through the mist. Faster R-CNN is not just an algorithm; it teaches us that when old components become bottlenecks, we should replace them with more powerful learnable models.
What will be the "Holy Grail" of computer vision in the next decade?
This article is from the WeChat official account "Machine Intelligence", published by 36Kr with permission.