Key Advances, Applications, Datasets, and Methods of Multimodal Large Language Models (LLMs) and Video-Language Pre-Training
With the growth of video applications, a huge number of videos have been uploaded to the Internet, so representation learning from videos and their corresponding weak (noisy) captions has become a hot research topic. This article reviews the latest progress, downstream applications, basic datasets, and techniques of large-scale video-language pre-training.
1. Introduction
The first part of this series reviews the progress, applications, datasets, and techniques of large-scale video-language pre-training, a task that uses weakly aligned captions and videos for representation learning. Pre-training followed by fine-tuning is a standard learning paradigm in deep learning: a model is pre-trained on a large dataset and then fine-tuned on smaller, task-specific datasets. This avoids training a new model from scratch for every task and reduces computational cost.
Pre-training is usually performed on large datasets such as ImageNet with self-supervised learning, and unsupervised learning also performs well in natural language processing (NLP) and computer vision (CV). The weights of the pre-trained model are then fine-tuned on smaller datasets to meet the learning objectives of specific tasks.
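As a concrete illustration, the sketch below shows the pre-train/fine-tune paradigm in PyTorch: it loads ImageNet-pretrained weights via torchvision, replaces the classification head, and fine-tunes only the new head. The model choice, class count, and hyperparameters are placeholders, not part of any specific method discussed here.

```python
# Minimal pre-train/fine-tune sketch (illustrative; assumes PyTorch and
# torchvision >= 0.13 for the weight-enum API).
import torch
import torch.nn as nn
from torchvision import models

# 1. Start from weights pre-trained on a large dataset (ImageNet).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# 2. Swap the head for the downstream task (e.g., 10 target classes - placeholder).
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# 3. Fine-tune: freeze the pre-trained layers and train only the new head
#    with a small learning rate.
for name, p in backbone.named_parameters():
    p.requires_grad = name.startswith("fc")  # head-only fine-tuning

optimizer = torch.optim.AdamW(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-4
)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images, labels):
    """One fine-tuning step on a batch from the smaller downstream dataset."""
    logits = backbone(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```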
Video-language pre-training uses large-scale video-text data for self-supervised/unsupervised learning to obtain generalizable representations. The main proxy tasks include Masked Language Modeling (MLM), Masked Frame Modeling (MFM), Language Reconstruction (LR), Video-Language Matching (VLM), Sentence Ordering Modeling (SOM), and Frame Ordering Modeling (FOM). These tasks focus respectively on language prediction, frame prediction, sentence generation, video-language alignment, sentence ordering, and frame ordering, and together they aim to learn co-occurrence associations, semantic constraints, caption generation, cross-modal alignment, and temporal relationships.
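To make the proxy-task idea concrete, here is a minimal, generic sketch of the MLM objective applied to caption tokens, assuming PyTorch; the mask-token id, masking ratio, and the `model` interface are illustrative placeholders rather than the setup of any specific paper.

```python
# Illustrative MLM proxy task on caption tokens: randomly mask ~15% of the
# tokens and train the model to recover them from the joint video-text context.
import torch
import torch.nn.functional as F

MASK_ID = 103          # placeholder mask-token id
IGNORE_INDEX = -100    # positions that do not contribute to the loss

def mask_caption_tokens(token_ids, mask_prob=0.15):
    """Return (masked_input, labels) for a batch of caption token ids."""
    labels = token_ids.clone()
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    labels[~mask] = IGNORE_INDEX              # only the masked slots are predicted
    masked_input = token_ids.clone()
    masked_input[mask] = MASK_ID
    return masked_input, labels

def mlm_loss(model, video_feats, token_ids):
    masked_input, labels = mask_caption_tokens(token_ids)
    # `model` stands for any video-language encoder that fuses both modalities
    # and returns per-token vocabulary logits: (batch, seq_len, vocab_size).
    logits = model(video_feats, masked_input)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```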
2. Latest Progress and Applications
The latest progress of pre-trained models highlights the importance of dataset size for representation learning. Therefore, researchers are using large-scale, weakly labeled cross-modal data from the Internet, such as image-caption pairs and video-caption data. This has led to a surge in research on cross-modal tasks, especially visual-language tasks and video-language tasks.
An important advance in visual-language pre-training is Contrastive Language-Image Pre-training (CLIP), which uses a contrastive loss to learn multi-modal representations from weakly supervised data. The model is trained on a dataset of 400 million image-text pairs and performs well on zero-shot visual recognition tasks such as image classification.
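A minimal sketch of the CLIP-style symmetric contrastive objective is shown below, assuming PyTorch; the temperature value and embedding shapes are placeholders. Each image (or video clip) in a batch is trained to match its own caption and to reject every other caption in the batch, and vice versa.

```python
# CLIP-style symmetric contrastive loss over a batch of paired embeddings.
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) outputs of the two encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text; all other pairs are negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```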
Progress has also been made in processing video data, which is inherently multi-modal and contains elements such as titles, audio, and narration. Large video datasets such as HowTo100M, which contains 136 million video clips paired with narration text, have been proposed. This has promoted the development of video-language pre-training and opened up new directions for video understanding tasks.
The Transformer model, originally proposed for machine translation, also performs well in computer vision. It computes similarities between elements and aggregates their long-range dependencies, which makes training on larger datasets practical.
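For reference, the core operation is scaled dot-product attention, sketched below in PyTorch: similarities between all pairs of elements are computed, turned into weights, and used to aggregate information across the whole sequence in a single layer.

```python
# Scaled dot-product attention, the building block of the Transformer.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, dim) projections of the input sequence."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # pairwise similarity
    weights = F.softmax(scores, dim=-1)                       # attention weights
    return weights @ v                                        # weighted aggregation
```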
Video-language pre-training aims to transfer knowledge from large datasets to downstream tasks that take text and video as input. Downstream tasks include video-text retrieval, action recognition, video question answering, and video captioning. Each task requires a different way of transferring information from pre-training, which highlights the importance of compatibility between the pre-training objectives and the downstream tasks.
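As a simple example of such transfer, the sketch below performs text-to-video retrieval by ranking candidate videos against a query caption with cosine similarity over pre-trained embeddings; the encoders that produce these embeddings are assumed, not shown.

```python
# Zero-shot text-to-video retrieval over pre-computed embeddings (illustrative).
import torch
import torch.nn.functional as F

def rank_videos(text_query_emb, video_embs, top_k=5):
    """text_query_emb: (dim,) query embedding; video_embs: (num_videos, dim)."""
    sims = F.cosine_similarity(text_query_emb.unsqueeze(0), video_embs, dim=-1)
    top = torch.topk(sims, k=min(top_k, video_embs.size(0)))
    return top.indices.tolist(), top.values.tolist()  # best-matching clip indices
```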
3. Open Video-Language Pre-Training Datasets
The scale and quality of pre-training datasets are crucial for learning robust visual representations, especially for Transformer-based models. The key datasets for video-language pre-training can be divided into two categories: label-based datasets and caption-based datasets.
Label-based Video Datasets:
1. Kinetics: A family of large-scale action recognition datasets containing up to 650,000 video clips covering 400/600/700 human action categories (Kinetics-400/600/700).
2. AVA: Densely annotates 80 atomic visual actions in 15-minute movie clips, resulting in 1.62M action labels, with each person often carrying multiple labels.
Caption-based Video Datasets:
1. ActivityNet Captions: Contains 20k videos totaling 849 video hours, with 100k descriptions, each with its own start and end time.
2. YouCook2: One of the largest task-oriented instructional video datasets, containing 2,000 long, untrimmed videos covering 89 cooking recipes.
3. HowTo100M: A large-scale narrated video dataset containing more than 136 million video clips whose captions come from 1.2 million YouTube videos.
4. WebVid: A dataset containing more than two million weakly captioned videos, all crawled from the Internet. Currently, there are two versions: WebVid-2M and WebVid-10M.
5. HD-VILA: The first high-resolution dataset, containing 100 million video clip-sentence pairs from 3.3 million videos, totaling 371.5K hours of 720p footage.
These datasets have played an important role in the progress of video-language pre-training methods, providing diverse and large-scale data for training robust models.
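For orientation, a caption-based pre-training sample typically pairs a clip with its caption and timestamps; the illustrative data structure below (field names are assumptions, not an official schema of any of the datasets above) captures this common form.

```python
# Illustrative structure of a clip-caption sample, roughly the form shared by
# ActivityNet Captions, YouCook2, HowTo100M, and WebVid annotations.
from dataclasses import dataclass

@dataclass
class ClipCaptionSample:
    video_id: str      # source video identifier (e.g., a YouTube id)
    caption: str       # narration / caption text for the clip
    start_sec: float   # clip start time within the video, in seconds
    end_sec: float     # clip end time within the video, in seconds

sample = ClipCaptionSample(
    video_id="example_video",   # placeholder id
    caption="add the onions to the pan and stir",
    start_sec=12.0,
    end_sec=18.5,
)
```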
4. Video-Language Pre-Training Methods
Recent video-language pre-training methods mainly use Transformer as a feature extractor to learn from large-scale multi-modal data. These methods can be divided into two categories: Single-Stream and Two-Stream.
Single-Stream Methods:
1. VideoBERT: The first model to explore video-language representations using a Transformer-based pre-training method.
2. HERO: A single-stream video-language pre-training framework that encodes multi-modal inputs in a hierarchical structure.
3. ClipBERT: Proposes a generic framework for cost-effective end-to-end learning of video-and-language tasks by adopting sparse sampling.
4. DeCEMBERT: This technology is developed to solve the problem that the automatically generated subtitles in pre-training datasets are noisy and occasionally inconsistent with the video materials.
5. VATT: A method for learning multi-modal representations from unlabeled data using a convolution-free Transformer structure.
6. VIOLET: Proposes a fully end-to-end video-language Transformer framework that models video temporal dynamics.
7. ALPRO: A single-stream framework for video-language pre-training that proposes video-text contrast to promote multi-modal interaction.
Two-Stream Methods:
1. CBT: Adopts Noise Contrastive Estimation (NCE) as the loss objective for video-language learning.
2. UniVL: Proposes a model for multi-modal understanding and generation.
3. Frozen in Time (FiT): Aims to learn joint multi-modal embeddings for effective text-to-video retrieval.
4. CLIP-ViP: Suggests pre-training the CLIP model on video-language data to further extend visual-language alignment to the video level.
These methods have shown good results in a variety of applications, including action recognition, video captioning, action prediction, and video segmentation. The choice between single-stream and two-stream methods depends on the requirements of the task: single-stream methods can usually capture more fine-grained relationships between text and video, while two-stream methods offer greater flexibility by extracting features from each modality separately.
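The schematic sketch below contrasts the two families in PyTorch: a single-stream model fuses concatenated video and text tokens in one joint Transformer, while a two-stream model encodes each modality separately and aligns the pooled features, typically with a contrastive loss. Dimensions, depths, and pooling choices are placeholders, not any published configuration.

```python
# Schematic single-stream vs. two-stream video-language models (illustrative).
import torch
import torch.nn as nn

class SingleStreamModel(nn.Module):
    """Concatenate video and text tokens and fuse them in one joint Transformer."""
    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.joint_encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, video_tokens, text_tokens):
        # (batch, v_len + t_len, dim): cross-modal attention happens at every layer.
        return self.joint_encoder(torch.cat([video_tokens, text_tokens], dim=1))

class TwoStreamModel(nn.Module):
    """Encode each modality separately, then align the pooled features."""
    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        v_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        t_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_encoder = nn.TransformerEncoder(v_layer, depth)
        self.text_encoder = nn.TransformerEncoder(t_layer, depth)

    def forward(self, video_tokens, text_tokens):
        v = self.video_encoder(video_tokens).mean(dim=1)  # pooled video feature
        t = self.text_encoder(text_tokens).mean(dim=1)    # pooled text feature
        return v, t  # typically aligned with a contrastive loss, as in CLIP/FiT
```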
This article is from the WeChat official account "Data-Driven Intelligence" (ID: Data_0101), author: Xiaoxiao. It is published by 36Kr with authorization.