
New breakthrough in image geolocation: LocDiff, a framework from the University of Maine, Google, OpenAI, and others, enables precise global-scale positioning without grids or reference libraries

HyperAI (超神经) · 2025-11-19 18:11
Selected for NeurIPS 2025

A joint team from the University of Maine, Google, Harvard University, and other institutions proposed the Spherical Harmonics Dirac Delta (SHDD) representation and the integrated framework LocDiff. By constructing an encoding method and a diffusion architecture adapted to spherical geometry, they achieved accurate positioning without relying on preset grids or external image libraries, providing a breakthrough technical path for this field.

Location decoding infers geographical locations from contextual information and is widely used in tasks such as trajectory synthesis, building contour segmentation, and image geolocation. Among these, image geolocation, which associates visual content with geographical coordinates, has become a research focus. It predicts latitude and longitude by analyzing image features and applies to data such as wildlife-monitoring imagery and urban street views.

However, unlike the mature task of image classification, image geolocation involves a complex nonlinear mapping that is difficult to model accurately. Early studies used regression models to map image features directly to latitude and longitude, but at global scale these models were unstable, and prediction errors often reached hundreds of kilometers. To overcome this problem, researchers proposed discretization-based methods, which recast positioning as a classification problem over predefined geographic cells or as a retrieval problem over a reference image library. However, these methods remain limited in spatial resolution and geographical coverage.

In recent years, generative techniques represented by diffusion models have opened a new path for geolocation research, thanks to their ability to model continuous data distributions. Building on this, a joint team from the University of Maine, the University of Texas, the University of Georgia, the University of Maryland, Google, OpenAI, and Harvard University proposed a new method. They found that conventional generative methods fail because geographical coordinates differ from ordinary data in two ways. First, coordinates lie on an embedded Riemannian manifold rather than in Euclidean space, so adding noise to them directly causes projection distortion. Second, raw coordinates carry no multi-scale spatial information, which makes it hard to model complex distributions. To address these two problems, the team proposed the Spherical Harmonics Dirac Delta (SHDD) representation and the integrated framework LocDiff, which pair an encoding adapted to spherical geometry with a matching diffusion architecture.

The paper, titled "LocDiff: Identifying Locations on Earth by Diffusing in the Hilbert Space", has been accepted to NeurIPS 2025.

  • Paper link: https://openreview.net/forum?id=ghybX0Qlls

Datasets: three typical global-scale image geolocation benchmarks, following the GeoCLIP settings

To ensure that the results are comparable and reliable, the researchers followed the benchmark settings of GeoCLIP, a model widely used in image geolocation. Training used the MP16 dataset (MediaEval Placing Tasks 2016), which contains 4.72 million images with accurate geographical annotations. Testing used three typical global-scale image geolocation datasets: Im2GPS3k, YFCC26k, and GWS15k.

Note that the data distributions of the test sets Im2GPS3k and YFCC26k are relatively similar to that of the training set MP16, and some images may overlap. This gives retrieval-based methods such as GeoCLIP an advantage during matching and helps their retrieval accuracy. At inference time, the researchers followed the strategy used by models such as GeoCLIP and SimCLR: they generated 16 augmented versions of each test image and took the geographical center of the resulting predictions as the final predicted location. This strategy significantly improves performance. For example, in a comparative experiment, removing both image augmentation and result averaging reduced GeoCLIP's accuracy at the 1-kilometer scale on Im2GPS3k from 14% to below 10%.
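The averaging step can be sketched as follows. The article does not say how the "geographical center" is computed, so this sketch assumes the normalized mean of the predictions' unit vectors on the sphere; the function name `geographic_center` is hypothetical.

```python
import math

def geographic_center(points):
    """Return (lat, lon) in degrees of the normalized mean of unit vectors.

    `points` is an iterable of (lat, lon) pairs in degrees. Averaging on the
    sphere avoids the wrap-around error of averaging longitudes directly.
    """
    x = y = z = 0.0
    for lat_deg, lon_deg in points:
        lat, lon = math.radians(lat_deg), math.radians(lon_deg)
        x += math.cos(lat) * math.cos(lon)
        y += math.cos(lat) * math.sin(lon)
        z += math.sin(lat)
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("center is undefined: the points cancel out")
    return math.degrees(math.asin(z / norm)), math.degrees(math.atan2(y, x))
```

For two predictions at longitudes 179° and -179° on the equator, this returns a point on the antimeridian, whereas a plain arithmetic mean of the longitudes would land on the opposite side of the globe at 0°.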

To evaluate the model's positioning ability at different spatial scales, the study set five evaluation levels: street level (1 kilometer), city level (25 kilometers), regional level (200 kilometers), national level (750 kilometers), and continental level (2,500 kilometers). Performance is measured as the proportion of samples whose predictions fall within the given distance of the true location.
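This metric can be sketched in a few lines. The sketch assumes great-circle (haversine) distance on a sphere with a mean Earth radius of 6,371 km; the names `haversine_km` and `accuracy_at_scales` are illustrative, not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)
THRESHOLDS_KM = {"street": 1, "city": 25, "region": 200,
                 "country": 750, "continent": 2500}

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def accuracy_at_scales(preds, truths):
    """Fraction of predictions within each distance threshold of the truth."""
    dists = [haversine_km(p, t) for p, t in zip(preds, truths)]
    return {name: sum(d <= km for d in dists) / len(dists)
            for name, km in THRESHOLDS_KM.items()}
```

Because the thresholds are nested, the accuracies are non-decreasing from street level to continental level.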

LocDiff: A latent diffusion model for spherical location generation

The goal of LocDiff is to build a latent diffusion framework suited to generating locations on a sphere. Its central idea is a location encoding space that avoids the problems of sparsity and nonlinearity. This is achieved by combining three components: the Spherical Harmonics Dirac Delta (SHDD) encoding-decoding framework, the conditional Siren-UNet (CS-UNet) architecture, and an efficient computing strategy.

To clarify the technical direction, the study first defines, in mathematical terms, the properties an ideal position encoding space should have. Let the coordinate space C be the unit sphere embedded in three-dimensional Euclidean space, parameterized by angular coordinates (θ, φ). The ideal position encoder PE is an injective function from C to a high-dimensional space ℝ^d, which ensures that encodings are unique, and the decoder PD is a surjective function from ℝ^d back to C, which ensures that every location can be decoded. More importantly, the encoding space must be densely filled under a continuous difference metric ℰ, and the decoder must be stable in the sense that a small perturbation in the encoding space causes only a small change in the spherical coordinates. These two properties are the key to overcoming existing technical bottlenecks.
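One possible way to write these requirements down is given below. This is a reader's formalization, not the paper's own statement: it assumes that ℰ compares two points of the encoding space, and it introduces d_C for the great-circle distance on C and η for the perturbation bound, neither of which appears in the article.

```latex
\begin{align*}
\mathrm{PE} &: \mathcal{C} \to \mathbb{R}^d \ \text{injective}, \qquad
\mathrm{PD} : \mathbb{R}^d \to \mathcal{C} \ \text{surjective}, \\
\forall \varepsilon > 0 \ \exists \eta > 0 &: \quad
\mathcal{E}(e, e') < \eta \;\Longrightarrow\;
d_{\mathcal{C}}\bigl(\mathrm{PD}(e), \mathrm{PD}(e')\bigr) < \varepsilon .
\end{align*}
```

The second line is the usual continuity condition, which is what "a small perturbation only causes a small change" amounts to under these assumptions.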

However, existing methods face a dilemma in meeting these goals. If the position encoding space is sparse, the diffusion model struggles to run a stable diffusion process in it, which leads to poor training convergence and low decoding accuracy. If a dense location embedding space is used instead, the diffusion process runs smoothly, but the highly nonlinear mapping between the embedding and the coordinate space makes it very hard to infer the correct geographical coordinates from the embedding: minimizing distance in the embedding space often does not minimize distance in geographical space.

To resolve this dilemma, the researchers proposed the SHDD encoding scheme. It first represents the spherical point (θ₀, φ₀) as a Dirac delta function on the sphere, δ_(θ₀, φ₀), and then encodes this function as a vector of spherical harmonic coefficients, which forms the SHDD representation. In practice, setting a maximum degree L for the spherical harmonics truncates the theoretically infinite-dimensional coefficient vector to a compact (L + 1)²-dimensional representation. The larger L is, the finer the spatial detail the representation captures, which gives flexible support for multi-scale positioning.
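The encoding can be illustrated with a short pure-Python sketch. It relies on a standard fact: in an orthonormal spherical harmonic basis, the coefficient of a Dirac delta at (θ₀, φ₀) for basis function Y_lm is simply Y_lm(θ₀, φ₀). The sketch assumes real spherical harmonics, θ as the polar angle and φ as the azimuth; the paper's exact basis, normalization, and sign conventions may differ, and the function names are hypothetical.

```python
import math

def assoc_legendre(l, m, x):
    """Associated Legendre function P_l^m(x), 0 <= m <= l, by upward recurrence."""
    pmm = 1.0
    if m > 0:
        somx2 = math.sqrt(max(0.0, 1.0 - x * x))
        fact = 1.0
        for _ in range(m):
            pmm *= -fact * somx2
            fact += 2.0
    if l == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm
    for ll in range(m + 2, l + 1):
        pll = ((2 * ll - 1) * x * pmmp1 - (ll + m - 1) * pmm) / (ll - m)
        pmm, pmmp1 = pmmp1, pll
    return pmmp1

def real_sh(l, m, theta, phi):
    """Orthonormal real spherical harmonic Y_lm(theta, phi)."""
    am = abs(m)
    k = math.sqrt((2 * l + 1) / (4 * math.pi)
                  * math.factorial(l - am) / math.factorial(l + am))
    p = assoc_legendre(l, am, math.cos(theta))
    if m == 0:
        return k * p
    trig = math.cos(m * phi) if m > 0 else math.sin(am * phi)
    return math.sqrt(2.0) * k * trig * p

def shdd_encode(theta0, phi0, max_degree):
    """Truncated coefficient vector of the Dirac delta at (theta0, phi0)."""
    return [real_sh(l, m, theta0, phi0)
            for l in range(max_degree + 1) for m in range(-l, l + 1)]
```

With `max_degree = 4` the vector has (4 + 1)² = 25 entries, and its first entry is the constant 1/√(4π) contributed by the degree-0 harmonic.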

The SHDD encoding space is naturally dense: each point e in it corresponds to a unique spherical function Fₑ. The difference between this function and the Dirac delta function δ_(θ₀, φ₀) of the true position is quantified by the reverse KL divergence, and this difference metric ℰ is the continuous metric the study requires. More importantly, there is an explicit constraint relating the SHDD KL divergence to the Wasserstein-2 distance. This guarantees mathematically that differences in the encoding space are consistent with differences between probability distributions on the sphere, and it lays the foundation for stable decoding. SHDD encoding also addresses the nonlinearity problem of traditional methods. Heat-map comparisons show that, compared with traditional embedding methods, spherical distance varies more smoothly under SHDD. This smoothness greatly reduces the risk of error propagation during decoding and supports accurate positioning.

Multi-scale latent diffusion for image geolocation

Based on the characteristics of the SHDD representation, the researchers designed a mode-seeking ("modal search") decoder for efficient decoding. It exploits the mode-seeking property of the reverse KL divergence and infers the coordinates by finding the region where the probability mass of the spherical function is most concentrated. The hyperparameter ρ balances decoding resolution against stability: when ρ is large, the result is insensitive to local peaks but coarser; when ρ is small, the result is more precise but easily affected by local noise. Apart from ρ, the decoder has no parameters to learn, which brings two advantages: it introduces no additional loss at the decoding stage, and it removes any dependence on preset spherical partitions or external reference image libraries, lifting the application limits of traditional methods.
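The role of ρ can be illustrated with a small discrete sketch. It assumes the spherical function has already been evaluated on a finite set of anchor points, treats ρ as an angular radius in radians, and clips negative values to zero before summing; all three choices, and the function names, are assumptions made for illustration rather than details taken from the paper.

```python
import math

def angular_distance(p, q):
    """Great-circle angle in radians between two (theta, phi) points.

    theta is the polar angle and phi the azimuth, both in radians.
    """
    (t1, p1), (t2, p2) = p, q
    c = (math.cos(t1) * math.cos(t2)
         + math.sin(t1) * math.sin(t2) * math.cos(p1 - p2))
    return math.acos(max(-1.0, min(1.0, c)))

def mode_seeking_decode(anchors, values, rho):
    """Return the anchor whose rho-neighbourhood holds the most mass.

    `values[i]` is the function value at `anchors[i]`. Negative values are
    clipped to zero (an assumption). Runs in O(n^2) over the anchors.
    """
    mass = [max(v, 0.0) for v in values]
    best, best_score = None, -1.0
    for a in anchors:
        score = sum(w for b, w in zip(anchors, mass)
                    if angular_distance(a, b) <= rho)
        if score > best_score:
            best, best_score = a, score
    return best
```

With three neighbouring anchors of value 0.3 and one isolated anchor of value 0.5, a small ρ returns the isolated spike, while a larger ρ returns a point in the cluster, where more total mass is concentrated.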

The conditional generation backbone of LocDiff is the CS-UNet architecture, shown in the figure below, which is built on SirenNet modules. This choice reflects the fact that spherical harmonic coefficients are essentially superpositions of sine and cosine functions, and SirenNet's sine activation preserves gradient flow well, which suits the propagation of spherical harmonic features. The core unit of CS-UNet, C-Siren, performs conditional denoising through a feature fusion mechanism. It takes the latent vector x, the image condition embedding e_I, and the diffusion step t as input. First, x and e_I are projected into hidden vectors. Next, the discrete diffusion time step t is converted into scale and offset vectors to complete unconditional denoising. Finally, the image condition is fused with the denoised features, and the adjusted features are passed to the next module, forming a complete conditional guidance chain.

Architectures of C-Siren and CS-UNet
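The data flow through a single C-Siren unit, as described above, might look roughly like the following pure-Python sketch. It is a loose illustration only: the layer sizes, the random weights, the FiLM-style use of the scale and offset vectors, and fusion by addition are all assumptions, and the class name `CSirenSketch` is hypothetical. The actual architecture is the one given in the paper.

```python
import math
import random

def linear(weights, vec):
    """Matrix-vector product with `weights` stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    """Small random weights, for illustration only (not a trained model)."""
    return [[rng.uniform(-1.0, 1.0) / cols for _ in range(cols)]
            for _ in range(rows)]

class CSirenSketch:
    def __init__(self, x_dim, cond_dim, hidden_dim, num_steps, seed=0):
        rng = random.Random(seed)
        self.w_x = random_matrix(hidden_dim, x_dim, rng)
        self.w_c = random_matrix(hidden_dim, cond_dim, rng)
        # One (scale, offset) pair per discrete diffusion step.
        self.scale = random_matrix(num_steps, hidden_dim, rng)
        self.offset = random_matrix(num_steps, hidden_dim, rng)

    def __call__(self, x, e_img, t):
        # 1. Project latent and image condition, with sine activations.
        h_x = [math.sin(v) for v in linear(self.w_x, x)]
        h_c = [math.sin(v) for v in linear(self.w_c, e_img)]
        # 2. Modulate the latent features with the step-dependent scale/offset.
        h = [v * (1.0 + s) + b
             for v, s, b in zip(h_x, self.scale[t], self.offset[t])]
        # 3. Fuse with the image condition (addition is an assumed choice).
        return [a + c for a, c in zip(h, h_c)]
```

Calling `CSirenSketch(4, 3, 8, 10)([0.1] * 4, [0.2] * 3, 5)` yields an 8-dimensional feature vector that depends on the latent, the image embedding, and the time step.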

Training follows the standard DDPM framework, with pairs of an image and a spherical position as training samples. First, a frozen CLIP encoder converts the image into a fixed-dimension embedding e_I, and the corresponding spherical position (θ, φ) is encoded into an SHDD representation and stored for later use. In the forward diffusion process, noise is gradually added to the SHDD representation until it becomes a pure Gaussian noise vector. In the reverse process, CS-UNet, guided by the image embedding e_I, gradually restores the original SHDD representation from the noise vector. The training loss is the SHDD KL divergence. Compared with the traditional spherical MSE loss, it is more numerically stable and better preserves multi-scale spatial information, which helps the model learn both global and local features.
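The forward noising step of a standard DDPM can be sketched as below, using the closed form x_t = √(ᾱ_t)·x₀ + √(1 − ᾱ_t)·ε. The linear β schedule and its endpoint values are the common DDPM defaults, assumed here because the article does not state LocDiff's schedule; the vector `x0` stands in for an SHDD coefficient vector.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(1, num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def forward_noise(x0, t, alpha_bars, rng=random):
    """Sample x_t from q(x_t | x_0) and return it with the noise used."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps
```

With 1,000 steps the final ᾱ is close to zero, so the last x_t is almost pure Gaussian noise, matching the description of the forward process above.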

At inference time, the model starts from random Gaussian noise. Guided by the embedding of the input image, CS-UNet gradually generates an SHDD coefficient vector, which the mode-seeking decoder then converts into spherical coordinates (θ, φ). In the actual implementation, both the SHDD KL divergence and the integral in the mode-seeking step are approximated by summing over a discrete set of spherical anchor points. During training, the anchor points are sampled randomly over the whole globe to avoid overfitting.

Focusing on 3 dimensions, LocDiff performs excellently in most test scenarios

To evaluate LocDiff systematically, the study ran experiments along three dimensions: positioning accuracy, generalization ability, and computational efficiency. All experiments followed the standard settings of the field to ensure a fair comparison.

As the following table shows, LocDiff performs excellently in most test scenarios. To further improve fine-grained performance, the researchers designed a hybrid model, LocDiff-H, which restricts GeoCLIP's retrieval to a 200-kilometer radius around the position generated by LocDiff and thereby combines the strengths of both types of method. LocDiff-H performs very well on Im2GPS3k and YFCC26k but is worse than the original LocDiff on GWS15k, especially at fine-grained scales. The main reason is that GWS15k differs significantly in distribution from the training set, so GeoCLIP's inductive bias has a negative effect.

Main calculation results using GeoCLIP
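The LocDiff-H combination described above can be sketched as a filter followed by a retrieval step. The sketch assumes each gallery location already has a similarity score for the query image, and it falls back to the generated location when no gallery entry lies within range; that fallback, like the function names, is an assumption rather than a detail from the paper.

```python
import math

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

def hybrid_predict(generated, gallery, scores, radius_km=200.0):
    """Best-scoring gallery location within radius_km of the generated point."""
    in_range = [(s, loc) for loc, s in zip(gallery, scores)
                if haversine_km(generated, loc) <= radius_km]
    if not in_range:
        return generated  # assumed fallback: keep the generated location
    return max(in_range, key=lambda item: item[0])[1]
```

If the generated point is in Paris, a gallery entry in London is ignored even when it has the highest similarity score, because it lies more than 200 km away.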

As shown in the following table, in comparison with similar generative models, LocDiff outperforms models such as DiffR³ and FMR³ on the OSM-5M and YFCC-4k datasets, verifying the advantages of the multi-scale latent diffusion method.

Comparison of LocDiff with existing generative methods

The analysis of generalization ability shows the particular value of generative methods. The retrieval-based GeoCLIP relies heavily on the spatial coverage of its image library. When the distribution of the test set does not match that of the training set, its performance drops significantly. Even with millions of uniform grid points as candidate locations, its performance at scales of 200 kilometers and above remains far below what it achieves with the original image library. This reflects the method's limited ability to adapt to unseen locations.

In contrast, LocDiff generalizes robustly. As the following table shows, its performance remains stable whether the anchor points are the locations in the MP16 image library or uniform grid points, and whether the number of anchor points is 21,000 or 1 million.

Generalization experiment results

LocDiff is also computationally efficient. SHDD encoding and decoding are deterministic closed-form operations with near-constant time complexity and linear space complexity. During training, the SHDD encodings can be precomputed as an embedding lookup table, and decoding reduces to efficient matrix multiplication and argmax operations. In particular, the multi-scale SHDD representation significantly speeds up the convergence of the diffusion process: LocDiff needs only about 2 million steps to converge on the YFCC dataset, while the best comparable model needs 10 million.