
Generative recommendation fills in a crucial missing link: semantic IDs achieve differentiable joint optimization for the first time.

新智元 · 2026-04-30 15:27
[Introduction] Generative recommendation is becoming an important direction for recommendation systems. Unlike traditional recommendation models, which treat recommendation as classification over a fixed item set, generative recommendation first encodes each item into a sequence of discrete tokens and then has the model generate a "representation of the next item" the user may be interested in. In this line of work, semantic IDs have become a key piece of infrastructure.

The problem is that in most existing methods, semantic IDs are often learned first and then frozen: they usually serve content reconstruction but are not directly optimized for the recommendation goal. As a result, although the recommendation model is learning "what to recommend", it cannot in turn influence "how items should be represented".

Researchers from the University of Glasgow, Shandong University, Leiden University, and other institutions introduce a differentiable semantic indexing mechanism into the generative recommendation framework for the first time in the paper DIGER, allowing the recommendation loss to directly participate in learning the semantic IDs and achieving consistent improvements on multiple public datasets.

Paper link: https://arxiv.org/abs/2601.19711

Code link: https://github.com/junchen-fu/DIGER

This paper has the following three highlights:

1. It brings the joint optimization of differentiable semantic IDs into generative recommendation for the first time, allowing the recommendation loss to directly participate in the learning of semantic IDs;

2. It improves training stability and alleviates codebook collapse through Gumbel noise and an uncertainty-decay mechanism;

3. It consistently outperforms the traditional two-stage generative recommendation pipeline on public datasets.

This paper has been accepted as a long paper by SIGIR 2026. SIGIR is a top international conference in the field of information retrieval and a CCF-A class conference, mainly accepting cutting-edge research results in directions such as search and recommendation.

Why does generative recommendation rely on semantic IDs?

In traditional recommendation systems, a product often corresponds to a fixed ID, and the model learns "whether the user will click on this ID".

However, in generative recommendation, researchers increasingly hope that the model can utilize richer item content, such as titles, descriptions, categories, and even longer texts. Thus, a natural idea emerged: first compress the product content into a set of discrete semantic representations, and then transform the recommendation problem into a sequence generation problem.

This set of discrete representations is what the paper refers to as semantic IDs.

It can be understood as a short code automatically learned by the model: not the original product number in the database, but a string of discrete tokens derived from the product content.

For the model, such a representation is more compact and better suited to having a generative model predict the items a user may be interested in next.

That's why semantic IDs have become a very important technical component in generative recommendation. Many works first train a tokenizer to map product content to semantic IDs, and then train a generative recommendation model to predict these IDs.
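To make the idea concrete, here is a minimal sketch of how residual quantization (the mechanism behind RQ-VAE-style tokenizers) can turn an item's content embedding into a short sequence of discrete tokens. The dimensions, codebook sizes, and variable names are illustrative, not taken from the paper.

```python
import torch

def residual_quantize(content_emb, codebooks):
    """Map one content embedding to a short sequence of discrete tokens.

    content_emb: (dim,) item content embedding (e.g. from a text encoder)
    codebooks:   list of (num_codes, dim) tensors, one per quantization level
    Returns the chosen code indices -- the item's "semantic ID".
    """
    residual = content_emb
    semantic_id = []
    for codebook in codebooks:
        # pick the nearest code at this level
        dists = torch.cdist(residual.unsqueeze(0), codebook)   # (1, num_codes)
        idx = dists.argmin(dim=-1).item()
        semantic_id.append(idx)
        # subtract the chosen code and pass the residual to the next level
        residual = residual - codebook[idx]
    return semantic_id

# toy usage: 3 levels, 256 codes each, 64-dimensional embeddings
torch.manual_seed(0)
codebooks = [torch.randn(256, 64) for _ in range(3)]
item_embedding = torch.randn(64)
print(residual_quantize(item_embedding, codebooks))            # e.g. [17, 203, 42]
```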

Rajput, Shashank, et al. "Recommender systems with generative retrieval." Advances in Neural Information Processing Systems 36 (2023): 10299-10315.

Problem: Well-learned semantic IDs are not necessarily better suited for recommendation

In existing methods, there has long been a default premise: semantic IDs are learned separately first and then fixed for use.

This process seems logical:

Step 1: Use models such as RQ-VAE to learn semantic IDs;

Step 2: Hand these fixed IDs to the generative recommendation model for training.

However, there is a key problem here: the learning goal of semantic IDs is usually content reconstruction, while the optimization goal of the recommendation model is to predict the user's next behavior. The former cares about "whether the content is accurately represented", while the latter cares about "whether the recommendation is accurate enough". The two are not completely consistent.
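In code terms, the mismatch can be pictured roughly as follows. This is only a toy sketch with made-up modules and shapes, but it shows where the gradient path is cut: the recommender only ever receives frozen integer IDs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stage 1 (sketch): choose the nearest code for each item's content embedding
# and train the codebook with a reconstruction objective only.
content = torch.randn(8, 16)                     # toy item content embeddings
codebook = nn.Parameter(torch.randn(32, 16))
ids = torch.cdist(content, codebook).argmin(-1)  # discrete semantic IDs (integers)
recon_loss = F.mse_loss(codebook[ids], content)  # "is the content represented well?"

# Stage 2 (sketch): the recommender only sees the frozen integer IDs, so its
# loss ("is the recommendation accurate?") can never reach the codebook above.
frozen_ids = ids.detach()
rec_embed = nn.Embedding(32, 16)                       # the recommender's own table
logits = rec_embed(frozen_ids) @ rec_embed.weight.T    # toy next-token scores
rec_loss = F.cross_entropy(logits, frozen_ids)         # updates the recommender only
```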

This means that although the recommendation model is responsible for the final effect, it cannot reverse-update the learning process of semantic IDs. In other words, the semantic IDs ultimately used by the model may not be the ones most beneficial for recommendation. This is exactly what DIGER focuses on.

In traditional generative recommendation, semantic IDs are often learned first and then frozen, and the recommendation loss cannot be backpropagated to the index learning.

DIGER, on the other hand, opens up this path, allowing semantic IDs to be optimized together with the recommendation goal.

DIGER: Bringing semantic IDs into the recommendation optimization loop

DIGER aims to solve a key problem in generative recommendation: Although semantic IDs determine how items are represented, in the past, they were often just fixed inputs learned in advance and could not be updated together with the recommendation goal.

This has a direct impact. The recommendation model is constantly optimizing "what to recommend", but the learning of semantic IDs still serves content reconstruction, and the two goals are not completely consistent. As a result, the set of semantic representations ultimately used by the model may not be the most suitable for the recommendation task itself.

A natural idea follows: since semantic IDs usually come from discrete representation models such as RQ-VAE, and the straight-through estimator (STE) is already a commonly used gradient-approximation trick in RQ-VAE, why not simply follow that approach and bring semantic IDs into recommendation training for joint optimization?

The paper found that things are not that simple.

Why doesn't directly using STE work?

The problem is not that STE cannot backpropagate gradients, but that doing so directly is likely to lead to training collapse.
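For readers unfamiliar with it, the straight-through estimator is essentially a one-line trick: use the hard-selected code in the forward pass, but let gradients flow back as if no quantization had happened. A generic sketch (not DIGER's implementation, shapes are illustrative):

```python
import torch

def ste_quantize(z, codebook):
    """Straight-through estimator: hard selection forward, identity gradient backward."""
    idx = torch.cdist(z, codebook).argmin(-1)     # non-differentiable hard choice
    q = codebook[idx]                             # selected code vectors
    # forward value is q; the backward gradient is copied straight onto z
    return z + (q - z).detach(), idx

torch.manual_seed(0)
codebook = torch.randn(256, 64, requires_grad=True)
z = torch.randn(4, 64, requires_grad=True)
q_st, idx = ste_quantize(z, codebook)
q_st.sum().backward()                             # a downstream loss would go here
print(z.grad.shape)                               # gradients reach the encoder side
```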

From the experimental results, although naive STE formally opens up the gradient path, problems soon appear during training: the improvement in recommendation performance is limited, and training stops early.

Meanwhile, code balance is markedly low, indicating that a large number of codes are under-used and that the semantic space has visibly collapsed.

This also shows that joint optimization of semantic IDs cannot rely solely on "connecting" the gradients. Without enough exploration early in training, the model quickly converges to a few codes, and the room left for subsequent recommendation optimization shrinks accordingly.

When naive STE is used directly for joint optimization, training stops early and both recommendation performance and code balance lag significantly; DIGER, in contrast, trains more stably and keeps improving.

What does DIGER do?

The core framework of DIGER. Compared with the hard update of directly using STE, DIGER introduces Gumbel noise and uncertainty decay, making the learning of semantic IDs both differentiable and more exploratory.

Based on this problem, the design of DIGER can be summarized in two steps.

Step 1: Enable semantic IDs to participate in the optimization of the recommendation goal.

The paper proposes DRIL (Differentiable Semantic ID with Exploratory Learning), which introduces Gumbel noise into the learning of semantic IDs and replaces the overly rigid hard selection with a differentiable, exploration-friendly selection.

In this way, the recommendation loss can be more effectively backpropagated to the semantic ID learning module, and the two-stage process of "encoding first, then recommending" is truly connected for the first time.
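The paper's exact formulation of DRIL is not reproduced here, but the Gumbel-Softmax recipe it builds on looks roughly like this: perturb the code scores with Gumbel noise, keep the selection one-hot in the forward pass, and use the soft probabilities for gradients, so both the encoder side and the codebook receive gradient from the recommendation loss. All shapes and names below are illustrative.

```python
import torch
import torch.nn.functional as F

def gumbel_code_select(z, codebook, tau=1.0):
    """Differentiable, exploratory code selection (a generic Gumbel-Softmax sketch).

    z:        (batch, dim) residuals to be quantized
    codebook: (num_codes, dim) code vectors
    tau:      temperature controlling how soft / exploratory the selection is
    """
    logits = -torch.cdist(z, codebook)            # closer codes get higher scores
    # hard=True: one-hot in the forward pass, soft gradients in the backward pass
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    q = one_hot @ codebook                        # selected code, still differentiable
    ids = one_hot.argmax(-1)                      # discrete semantic ID tokens
    return q, ids

torch.manual_seed(0)
codebook = torch.randn(256, 64, requires_grad=True)
z = torch.randn(4, 64, requires_grad=True)
q, ids = gumbel_code_select(z, codebook, tau=2.0)
q.sum().backward()                                # a recommendation loss would go here
print(codebook.grad is not None, z.grad is not None)   # both sides receive gradients
```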

Step 2: To resolve the tension in joint training between "needing sufficient exploration early on and stable convergence later", DIGER further designs two uncertainty-decay strategies.

The first is SDUD (Standard Deviation-based Uncertainty Decay), that is, uncertainty decay based on the standard deviation.

The idea is to gradually reduce the randomness introduced by Gumbel noise as training progresses: in the early stage the model keeps stronger exploration ability and visits as many candidate codes as possible;

in the later stage the uncertainty is reduced, so the allocation of semantic IDs becomes more stable and the model can converge to a more reliable representation.

The second is FrqUD (Frequency-based Uncertainty Decay), that is, uncertainty decay based on frequency.

Different from SDUD, which mainly starts from the overall noise intensity, FrqUD pays more attention to the actual usage of different codes:

If some codes are selected too often, too early in training, the model correspondingly increases the exploration pressure on those positions, so that a few popular codes do not dominate the entire semantic space prematurely; for under-used codes, more opportunities for exploration and activation are reserved.

The two uncertainty decay strategies of DIGER. SDUD gradually reduces the training randomness to help the model transition from exploration to convergence; FrqUD adjusts the exploration intensity according to the code usage frequency to avoid a few popular codes dominating the semantic space too early.
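The paper's exact schedules are not spelled out here; the sketch below only expresses the two ideas in their simplest possible form, with the decay shape, constants, and function names chosen purely for illustration.

```python
import torch

def sdud_noise_scale(step, total_steps, start=1.0, end=0.05):
    """SDUD-style schedule (illustrative): shrink the Gumbel-noise scale as training
    progresses, so code selection is exploratory early and nearly deterministic late."""
    progress = min(step / total_steps, 1.0)
    return start * (end / start) ** progress            # smooth decay from start to end

def frqud_adjusted_logits(logits, usage_counts, strength=1.0):
    """FrqUD-style adjustment (illustrative): push down the scores of codes that have
    already been chosen very often, leaving room for under-used codes to be activated."""
    freq = usage_counts / usage_counts.sum().clamp(min=1)
    return logits - strength * freq                      # popular codes become less attractive

# toy usage
print(sdud_noise_scale(0, 10_000), sdud_noise_scale(10_000, 10_000))   # 1.0 ... 0.05
logits = torch.zeros(1, 4)
usage = torch.tensor([90.0, 5.0, 3.0, 2.0])              # code 0 has dominated so far
print(frqud_adjusted_logits(logits, usage))
```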

Experimental results

The paper conducted experiments on three public datasets: Amazon Beauty, Amazon Instrument, and Yelp.

Compared with the traditional Two-Stage generative recommendation process, DIGER achieved stable improvements on all three datasets:

  • On Amazon Beauty, R@10 increased from 0.0610 to 0.0657–0.0696, and N@10 increased from 0.0331 to 0.0361–0.0376
  • On Amazon Instrument, R@10 increased from 0.1058 to 0.1124–0.1138, and N@10 increased from 0.0797 to 0.0823–0.0844
  • On Yelp, R@10 increased from 0.0407 to 0.0432–0.0439, and N@10 increased to 0.0227

These results indicate that after incorporating semantic IDs into the joint optimization of the recommendation goal, the benefits are not limited to a single scenario but show consistency on multiple datasets.

When further compared with representative methods such as ETEGRec and LETTER, DIGER also demonstrated strong competitiveness: it was close to LETTER on Yelp and achieved better results on the other datasets.

DIGER "opens up" the semantic space

If the previous results show that DIGER can improve performance, this figure answers another question: do the semantic IDs learned by DIGER actually make full use of the codebook?

The figure shows the code usage distribution of four methods on three quantization layers at the best checkpoint. Each 16×16 grid corresponds to 256 codebook entries. The darker the color, the higher the probability that the code is used. It can be seen that DIGER with uncertainty decay, especially the SDUD and FrqUD versions, has a more balanced code usage on different layers, and the overall distribution is smoother; in contrast, the distribution of STE is significantly more concentrated, with some positions being darker and a large number of positions being lighter, indicating that the model is more likely to rely on a few codes, and the codebook is not fully utilized.

This phenomenon is particularly obvious in deeper quantization layers. As the number of layers increases, the code usage of STE becomes more and more unbalanced, while DIGER can still maintain better coverage. In other words, DIGER solves not only "whether semantic IDs can be jointly optimized" but also whether the semantic space can be more stably and fully utilized after joint optimization. This is also where it is more meaningful than directly using STE.

The code usage distribution of different methods at the best checkpoint. DIGER, especially the versions with SDUD and FrqUD, has a more balanced code usage on each quantization layer; STE is more likely to concentrate on a few codes, resulting in obvious codebook collapse.
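"Code balance" can be quantified in several ways; the sketch below uses the normalized entropy of the empirical code-usage distribution as one reasonable proxy (not necessarily the exact metric reported in the paper): 1.0 means all codes on a layer are used equally often, while values near 0 mean a handful of codes dominate.

```python
import torch

def code_balance(code_ids, num_codes=256):
    """Normalized entropy of code usage: 1.0 = perfectly uniform, near 0 = collapsed."""
    counts = torch.bincount(code_ids.flatten(), minlength=num_codes).float()
    probs = counts / counts.sum()
    entropy = -(probs * (probs + 1e-12).log()).sum()
    return (entropy / torch.log(torch.tensor(float(num_codes)))).item()

# toy comparison on one quantization layer: balanced vs. collapsed usage
balanced = torch.randint(0, 256, (10_000,))
collapsed = torch.randint(0, 8, (10_000,))        # only 8 of 256 codes ever chosen
print(code_balance(balanced), code_balance(collapsed))
```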

Summary

Although this paper belongs to the field of recommendation systems, the problem behind it is not limited to recommendation.