
After five years of waiting, Transformers v5 has finally arrived.

机器之心 | 2025-12-02 16:12
Fully embracing PyTorch.

Just now, the first release candidate (RC) of Transformers v5, v5.0.0rc0, was released.

GitHub: https://github.com/huggingface/transformers/releases/tag/v5.0.0rc0

This release marks the completion of the five-year technology cycle from v4 to v5 for the world's most popular AI infrastructure library.

As Hugging Face's flagship open-source project, Transformers has grown from roughly 20,000 daily downloads at the release of v4 in November 2020 to more than 3 million today, and cumulative installations have passed 1.2 billion.

It has shaped how the industry uses models: the supported architectures have grown from the original 40 to more than 400, and community-contributed model weights now exceed 750,000, spanning text, vision, audio, and multimodal domains.

The official announcement states that in artificial intelligence, innovation is the key to long-term success. As the leading model-definition library in the ecosystem, Transformers must keep evolving and adapting its shape to stay relevant.

Version v5 establishes PyTorch as the sole core backend and focuses on four axes of evolution: maximum simplicity, expanding from fine-tuning to pre-training, interoperability with high-performance inference engines, and quantization as a core capability.

Simplicity

The first focus of the team is simplicity.

Hugging Face wants model integrations to be clean and clear: simplicity enables broader standardization, greater generality, and more comprehensive ecosystem support.

New Models

Essentially, Transformers has always been a toolbox of model architectures. Hugging Face aims to include every new model architecture and to be the single trusted source for model definitions.

Over the past five years, an average of one to three new models has been added each week.

Modular Approach

Over the past 12 months, Hugging Face has pushed hard toward a modular design. This approach makes maintenance easier, speeds up integrations, and encourages community collaboration.

Although Hugging Face continues to follow its "one model, one file" philosophy, it keeps introducing abstraction layers to simplify the management of shared utility functions. The most prominent example is AttentionInterface, which provides a central abstraction layer for attention mechanisms: the eager implementation stays in the modeling files, while other backends such as FlashAttention (FA1/2/3), FlexAttention, and SDPA move behind this interface.
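As a rough illustration of how this interface is used in recent Transformers releases (a minimal sketch; exact signatures may differ between versions, and the checkpoint name is only an example), a custom attention backend can be registered under a name and then selected like any built-in implementation:

```python
# Minimal sketch: registering a custom attention backend via AttentionInterface.
# Based on the documented AttentionInterface pattern; signatures may vary slightly
# across releases, and the checkpoint name is only an example.
from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward

def logged_sdpa(*args, **kwargs):
    # Delegate the actual computation to the built-in SDPA backend,
    # adding a trivial side effect to show the hook is being called.
    print("custom attention backend invoked")
    return sdpa_attention_forward(*args, **kwargs)

# Register under a name, then select it like any other backend ("eager", "sdpa", ...).
AttentionInterface.register("logged_sdpa", logged_sdpa)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", attn_implementation="logged_sdpa"
)
```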

Model Conversion Tools

Hugging Face is also investing heavily in tooling that identifies how similar a new model is to existing architectures, using machine learning to detect code similarity across model files.

More specifically, Hugging Face wants to automate the model conversion process: when a new model needs to be integrated into Transformers, the system automatically opens a draft PR (Pull Request) that ports it to the Transformers format.

This process will reduce a lot of manual work and ensure overall consistency.

Code Simplification

Simplification of Modeling and Tokenization/Processing Files

Hugging Face has also comprehensively restructured the modeling files and the tokenization/processing files.

Thanks to the modular approach described above and to standardization across models, the modeling files have improved significantly: standardization has factored out many utilities that don't really belong to the model itself, so the modeling code now retains only the core forward/backward logic.

At the same time, v5 simplifies the tokenization and processing files: going forward, the library focuses solely on tokenizer backends, and the distinction between fast and slow tokenizers is removed.

Image processors will likewise keep only their fast versions, which use torchvision as the backend.
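For instance, loading a torchvision-backed fast image processor looks roughly like the sketch below (the checkpoint is only an example, and the use_fast flag may become redundant once fast processors are the only option):

```python
# Sketch: loading a fast, torchvision-backed image processor.
# In recent v4 releases the fast variant is selected with use_fast=True; per the
# v5 plan above, only the fast versions remain. The checkpoint is an example.
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained(
    "google/vit-base-patch16-224", use_fast=True
)
print(type(processor).__name__)  # expected to be a *Fast image processor class
```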

Finally, v5 will gradually drop support for Flax/TensorFlow in favor of PyTorch as the only backend; at the same time, the team is collaborating with partners in the JAX ecosystem to keep the models compatible with that ecosystem.

Matt White, Executive Director of the PyTorch Foundation and General Manager of AI at the Linux Foundation, said that with the release of v5, Transformers is going all in on PyTorch.

Training

Training remains a key focus of the team's work in v5. Previously, Hugging Face concentrated more on fine-tuning than on pre-training and full training; with v5, support for the latter is being strengthened.

Pre-Training

To support pre-training, Hugging Face has redesigned model initialization and added support for optimized forward- and backward-pass operators. v5 is already broadly compatible with tools such as torchtitan, megatron, and nanotron.
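At the model-definition level, training from scratch mostly means building a model from a config with freshly initialized weights rather than loading pretrained ones. A minimal sketch (the config source is just an example of reusing an existing architecture, not a v5-specific API):

```python
# Sketch: building a model with randomly initialized weights for pre-training,
# instead of loading pretrained weights with from_pretrained.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B")  # reuse an existing architecture config
model = AutoModelForCausalLM.from_config(config)          # fresh, random initialization

# The freshly initialized model can then be handed to a pre-training framework
# (e.g. torchtitan or nanotron, as mentioned above).
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")
```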

Fine-Tuning and Post-Training

Hugging Face says it will continue to work closely with all fine-tuning tools in the Python ecosystem. It also maintains compatibility with tools such as MaxText in the JAX ecosystem, so that those frameworks interoperate well with Transformers.

All fine-tuning and post-training tools can now rely on Transformers as the source of model definitions; this also enables more agentic scenarios through OpenEnv or the Prime Environment Hub.

Inference

Inference is another key focus of the v5 optimizations. Hugging Face brings several major updates, including dedicated kernels, cleaner defaults, new APIs, and improved support for inference engines.

In addition, as with training, Hugging Face has invested heavily in packaging optimized inference kernels.

On top of this work, two inference-specific APIs have been added:

Continuous batching with a paged attention mechanism. These features are already in use internally, and a guide will be published later.

The new transformers serve system, which provides an OpenAI-API-compatible server for serving Transformers models (see the sketch below).
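Because the server speaks the OpenAI API, any OpenAI-compatible client can talk to it. A hedged sketch (the host, port, and model identifier are assumptions; consult the serve documentation for the exact flags and defaults):

```python
# Sketch: querying a `transformers serve` instance with an OpenAI-compatible client.
# Start the server separately (e.g. by running `transformers serve`); the host,
# port, and model identifier below are assumptions, adjust them to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # example model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```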

With the v5 updates, Hugging Face has significantly strengthened support for inference scenarios, especially tasks such as model evaluation where many requests must be processed simultaneously.

It should be noted that Transformers v5 is not positioned to replace dedicated inference engines such as vLLM, SGLang, or TensorRT-LLM; on the contrary, the goal is to be compatible with these engines.

Production Environment

Local Deployment

The team works closely with the most popular inference engines so that Transformers can serve as their modeling backend. This brings a significant advantage: as soon as a model is added to Transformers, it becomes immediately available in these engines while still benefiting from each engine's strengths, such as inference optimizations, dedicated kernels, and dynamic batching.
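As a sketch of what this looks like from the engine side, vLLM documents a switch for falling back to Transformers modeling code (the model_impl argument and the checkpoint name below are assumptions and may differ across versions):

```python
# Sketch: using Transformers as the modeling backend inside vLLM.
# vLLM documents a model_impl switch for this; the argument name may vary across
# versions, and the checkpoint is only an example.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```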

In addition, Hugging Face works closely with ONNX Runtime, llama.cpp, and MLX to ensure that Transformers interoperates well with these modeling libraries. For example, thanks to substantial community effort, it is now very easy to load GGUF files into Transformers for further optimization; conversely, Transformers models can easily be converted into GGUF files for use in llama.cpp.
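A minimal sketch of the GGUF loading path, following the documented gguf_file pattern (the repo id and file name are hypothetical examples):

```python
# Sketch: loading a GGUF checkpoint into Transformers (weights are dequantized
# to torch tensors on load). Repo id and file name are hypothetical examples.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"   # hypothetical repo
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"   # hypothetical file name

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)
```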

The same goes for MLX: Transformers' safetensors files are directly compatible with MLX models.

Finally, Hugging Face is also pushing the boundaries of local inference, working closely with the ExecuTorch team so that Transformers models can run directly on device. Support for multimodal models (vision, audio) is also expanding rapidly.

Quantization

Quantization is quickly becoming standard in the development of modern top-tier models. Many SOTA models are now published in low-precision formats such as 8-bit and 4-bit (e.g., gpt-oss, Kimi-K2, DeepSeek-R1).

To stay at the technological forefront, v5 makes quantization a core capability of Transformers, ensures it is fully compatible with the library's main features, and provides a reliable quantization framework for both training and inference.
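In practice, quantization in Transformers is a loading-time option. A hedged sketch using bitsandbytes 4-bit, which is just one of several supported backends (the checkpoint name is only an example):

```python
# Sketch: loading a model in 4-bit with bitsandbytes via quantization_config.
# Other quantization backends are configured similarly; checkpoint is an example.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", quantization_config=bnb_config
)
```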

Reference link: https://huggingface.co/blog/transformers-v5

This article is from the WeChat account 机器之心 (Machine Intelligence); editors: +0, Chen Chen. Republished by 36Kr with permission.