
Generate free PPT presentations with one click: narration audio and video are produced together, with results approaching those of a real presenter.

新智元 | 2025-07-17 10:39
PresentAgent automatically converts documents into presentation videos with narration and slides, achieving results close to human-made ones.

PresentAgent can transform long documents such as papers and reports into presentation videos with human-like voices and synchronized slides in one click. The process mirrors how a person would work: outline the content, create the PPT, record the audio, and then synthesize everything together. In an experiment, 30 documents were used for a comparative test against human-made videos; PresentAgent came close to human performance in content accuracy, visual clarity, and audience comprehension, which can save teachers and business professionals a great deal of time otherwise spent building slides and recording audio.

Presentations are a widely used and effective way of conveying information. By combining visual elements, structured explanations, and spoken narration, they present information step by step, making it easier for diverse audiences to understand.

Although it is highly effective, creating high-quality presentation videos from long documents (such as business reports, technical manuals, policy briefs, or academic papers) usually requires a significant amount of manual effort.

This process involves content selection, slide design, script writing, voice recording, and integrating all the content into a coherent multimodal output.

Although AI has made progress in areas such as document-to-slide and text-to-video conversion in recent years, a key issue remains: these methods either generate only static visual summaries or output unstructured, generic video clips, making it difficult to handle presentation tasks that require structured narration.

To fill this gap, researchers from the Australian Artificial Intelligence Institute and the University of Liverpool in the UK proposed a new task: Document-to-Presentation Video Generation, which aims to automatically transform structured or unstructured documents into video presentations with voice narration and synchronized slides.

Paper: https://arxiv.org/pdf/2507.04036

Code: https://github.com/AIGeeksGroup/PresentAgent

The challenges of this task far exceed those of traditional summarization or text-to-speech systems, as it requires selective content abstraction, layout-based visual planning, and precise multimodal alignment of vision and speech.

Figure 1: Overview of PresentAgent.

Figure 2: Document diversity in the evaluation benchmark

Unlike previous methods that focus only on static slide/image generation or voice-only summarization, the researchers aim to build a fully integrated video experience that mimics how human speakers convey information in practice.

Figure 3: Overview of the method framework

As shown on the left of the figure above, given diverse input documents (such as papers, websites, blogs, slides, or PDFs), PresentAgent generates narrated presentation videos, with synchronized slides and audio as the output.

On the right side, the researchers design PresentEval, a dual-path evaluation framework (sketched in code after the two items below):

(1) Objective quiz evaluation (top): test factual understanding through Qwen-VL;

(2) Subjective scoring evaluation (bottom): score the video along dimensions such as content quality, visual design, and speech comprehensibility with the help of a vision-language model.
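As a rough illustration of how the two paths fit together, here is a minimal Python sketch. The `query_vlm` helper is a hypothetical placeholder for a call to a vision-language model such as Qwen-VL; all function names and prompt wording are assumptions, not the authors' code.

```python
# Minimal sketch of PresentEval's dual-path evaluation. Every name here is
# a hypothetical stand-in, not taken from the PresentEval codebase.

def query_vlm(frames: list, transcript: str, prompt: str) -> str:
    """Placeholder: send video frames, the narration transcript, and a
    text prompt to a vision-language model and return its text reply."""
    raise NotImplementedError

def objective_quiz(frames, transcript, questions) -> int:
    """Path 1: fixed multiple-choice questions; returns how many of the
    five questions were answered correctly (0-5)."""
    correct = 0
    for q in questions:
        options = "\n".join(f"{k}. {v}" for k, v in q["options"].items())
        prompt = (f"{q['question']}\n{options}\n"
                  "Answer with a single option letter.")
        answer = query_vlm(frames, transcript, prompt).strip().upper()
        correct += answer.startswith(q["answer"])
    return correct

def subjective_scores(frames, transcript) -> dict:
    """Path 2: 1-5 ratings for content quality, visual design, and
    comprehension difficulty."""
    dims = ["content quality", "visual design", "comprehension difficulty"]
    return {
        dim: int(query_vlm(
            frames, transcript,
            f"Rate the presentation's {dim} on a scale of 1-5. "
            "Reply with a single digit."))
        for dim in dims
    }
```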

To address the above challenges, the researchers proposed a modular generation framework, PresentAgent, as shown in Figure 1.

Its process includes the following steps (a minimal code sketch follows the list):

  1. Semantically segment the input document (via outline planning);
  2. Generate layout-guided slide visuals for each semantic block;
  3. Rewrite the key information into a spoken narration script;
  4. Synthesize the narration into speech, synchronize it in time with the slides, and finally produce a well-structured, clearly narrated video presentation.
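The steps above map naturally onto a linear pipeline. Below is a minimal sketch under the assumption that each stage is a black box; every helper name is a hypothetical stand-in for the corresponding LLM, layout, TTS, or video-muxing component, not PresentAgent's actual API.

```python
# Minimal sketch of the document-to-video pipeline described above.
# All helpers are hypothetical stand-ins, not PresentAgent's real code.

def outline_document(document: str) -> list[str]:
    """Split the document into semantic sections (outline planning)."""
    return [s for s in document.split("\n\n") if s.strip()]  # naive stand-in

def render_slide(section: str) -> bytes:
    """Render one layout-guided slide image for a semantic block."""
    raise NotImplementedError  # e.g., template retrieval + VLM-filled layout

def write_narration(section: str) -> str:
    """Rewrite the section's key points as a spoken-style script."""
    raise NotImplementedError  # e.g., an LLM rewriting prompt

def synthesize_speech(script: str) -> bytes:
    """Convert narration text to audio with a TTS engine."""
    raise NotImplementedError

def compose_video(slides, audio_clips, output_path: str) -> None:
    """Time-align each slide with its narration and mux into one video."""
    raise NotImplementedError  # e.g., concatenate per-slide segments

def document_to_video(document: str, output_path: str) -> None:
    sections = outline_document(document)
    slides = [render_slide(s) for s in sections]
    scripts = [write_narration(s) for s in sections]
    audio = [synthesize_speech(t) for t in scripts]
    compose_video(slides, audio, output_path)
```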

Notably, the entire process is controllable and domain-adaptable, making it suitable for various document types and presentation styles.

To effectively evaluate such complex multimodal systems, the researchers compiled a test set of 30 manually created document-presentation video pairs covering multiple fields such as education, finance, policy, and scientific research.

Meanwhile, the researchers designed a dual-path evaluation strategy:

  • On the one hand, use fixed multiple-choice questions to test content understanding;
  • On the other hand, score through a vision-language model to evaluate the video's content quality, visual presentation, and audience comprehension.

The experimental results show that the videos generated by this method are smooth, well-structured, and informative, coming close to human performance in content delivery and audience comprehension.

This indicates that combining language models, visual layout generation, and multimodal synthesis can achieve an interpretable and scalable automatic presentation generation system.

The main contributions are as follows:

  • A new task: the first formulation of "document-to-presentation video generation", which aims to automatically turn various long texts into structured slide videos with voice narration.
  • The PresentAgent system: a modular generation framework covering document parsing, layout-aware slide construction, script generation, and audio-visual synchronization, yielding a controllable and interpretable video generation process.
  • The PresentEval evaluation framework: a multi-dimensional, vision-language-model-driven evaluation mechanism that scores videos via prompts along content, visual, and comprehension dimensions.
  • A high-quality evaluation dataset: 30 pairs of real documents and corresponding presentation videos. Experiments and ablation studies show that PresentAgent not only approaches human performance but also significantly outperforms existing solutions.

Presentation video evaluation benchmark

This benchmark evaluates not only the fluency and information accuracy of the video but also audience comprehension.

Following the approach of Paper2Poster, the researchers designed a quiz-style evaluation: a vision-language model answers content questions based only on the generated video (slides + narration), simulating an audience's level of understanding.

The researchers also introduced manually created videos as a reference standard, used both for score calibration and as a performance ceiling.

As shown in Figure 2, the benchmark covers four representative document types (academic papers, web pages, technical blogs, and slides), each paired with a real human-made video explanation, spanning real-world fields such as education, scientific research, and business reporting.

Example: Objective Quiz Evaluation

Sample prompts for the objective quiz evaluation. Each set of multiple-choice questions is manually written from the source document's actual content, focusing on theme recognition, structure understanding, and core-idea extraction; it is used to test whether the generated video effectively conveys the original information.
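For illustration only, a quiz item of the kind described might be represented like this (the wording is invented, not taken from the benchmark):

```python
# Illustrative quiz item; question text and options are invented examples.
quiz_item = {
    "question": "What is the main contribution of the document?",
    "options": {
        "A": "A new text-to-speech voice",
        "B": "A framework that turns documents into narrated presentation videos",
        "C": "A library of slide templates",
        "D": "A video compression codec",
    },
    "answer": "B",  # graded against the model's chosen letter
}
```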

Example: Subjective Scoring Prompts

Examples of subjective scoring prompts. Each prompt focuses on one dimension and guides the vision-language model to score the video from a "human perspective". Abbreviations: Narr. Coh. = narration coherence; Comp. Diff. = comprehension difficulty.

The researchers adopted a unified, model-driven evaluation framework to score the generated presentation videos. All evaluations use a vision-language model, guided by prompts designed for different dimensions.

This evaluation framework consists of two parts:

  1. Objective quiz evaluation: measure the accuracy of the information conveyed by the video through multiple-choice questions;
  2. Subjective scoring evaluation: score the video on a scale of 1-5 along dimensions such as content quality, visual/audio design, and clarity of understanding.

These two types of indicators together form a comprehensive quality evaluation system for the generated video.

Introduction to the Doc2Present Dataset

To support the evaluation of document-to-presentation video generation, the researchers built a multi-domain, multi-style real-world comparative dataset, the Doc2Present Benchmark, in which each data pair consists of a document and its corresponding presentation video.

Unlike previous benchmarks that focus only on summaries or slides, this dataset includes business reports, product manuals, policy briefs, tutorial-style documents, and more, each accompanied by a manually created video explanation.

Data sources

The researchers collected 30 high-quality presentation video samples from public platforms, educational resource libraries, and professional presentation archives. Each video has a clear structure, combining slide visuals with synchronized voice narration.

The researchers manually aligned each video with its source document and ensured that the video structure was consistent with the document content, the visual information on the slides was compact and structured, and the narration was well-synchronized in time with the slides.
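One plausible way to represent such an aligned pair is sketched below; the field names are assumptions for illustration, not the benchmark's published schema.

```python
from dataclasses import dataclass

# Illustrative record for one Doc2Present pair; field names are assumed.
@dataclass
class DocVideoPair:
    domain: str          # e.g., "education", "finance", "policy", "research"
    document_path: str   # source document, roughly 3,000-8,000 words
    video_path: str      # human-made reference video, 1-2 minutes
    slide_count: int     # typically 5-10 slides
    quiz: list[dict]     # five multiple-choice questions for this document
```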

Dataset statistics

  • Document length: approximately 3,000-8,000 words
  • Video length: 1-2 minutes
  • Number of slides: 5-10

This setting emphasizes the core challenge of the task: how to transform dense, domain-specific document content into concise and understandable multimodal presentation content.

PresentEval

To evaluate the quality of the generated presentation videos, the researchers adopted two complementary evaluation strategies: Objective Quiz Evaluation and Subjective Scoring, as shown in Figure 3.

For each video, the slide images and the complete narration text are provided as a unified input to the vision-language model to simulate the audience's real viewing experience.

In the objective evaluation, the model needs to answer a set of fixed factual questions to determine whether the video accurately conveys the key information in the original document.

In the subjective scoring, the model rates the video on three dimensions: narration coherence, the clarity and aesthetics of the visual design, and overall ease of understanding. No evaluation relies on ground-truth references; all depend entirely on the model's understanding of the presented content.

Objective Quiz Evaluation

To evaluate whether the generated video effectively conveys the core content of the original document, a fixed-question comprehension evaluation protocol is adopted.

The researchers manually designed five multiple-choice questions for each document, focusing on aspects such as theme recognition, structure understanding, and argument extraction.

As shown in Table 1, during the evaluation, the vision - language model receives the complete video containing the slides and audio transcripts and answers the five questions.

Each question has four options, only one of which is correct. The correct answers are based on the annotations of the manually created videos. The final understanding score (0-5) reflects how many questions the model answered correctly, measuring the video's ability to convey the original information.
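Given the model's answers and the annotated keys, grading reduces to counting matches; a minimal sketch:

```python
# Compute the 0-5 understanding score for one video. Answers and keys are
# option letters such as "A"-"D"; five questions per document are assumed.

def comprehension_score(answers: list[str], keys: list[str]) -> int:
    assert len(answers) == len(keys) == 5
    return sum(a.strip().upper() == k.strip().upper()
               for a, k in zip(answers, keys))
```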

Subjective Scoring

To evaluate the quality of the generated videos, the researchers adopted a prompt-based vision-language model evaluation method. Unlike methods that rely on human references or fixed metrics, the model is asked to score from the audience's perspective, using its own reasoning and preferences.

The scoring focuses on three aspects: narration coherence, the visual quality of the slides, and the overall difficulty of understanding.

After watching the video and audio content, the model scores each dimension on a scale of 1-5 and gives a brief explanation. The specific scoring prompts are shown in Table 2; different prompts are designed for different modalities and tasks to enable accurate evaluation.
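The paper's exact prompt wording lives in its Table 2; the templates below are only an approximation in that style, covering the three dimensions named above.

```python
# Illustrative subjective-scoring prompts; the wording approximates the style
# described in the text and is not the paper's actual Table 2 prompts.
SUBJECTIVE_PROMPTS = {
    "narration_coherence": (
        "You are watching a presentation video. Rate the coherence of the "
        "narration on a scale of 1-5 and briefly explain your score."
    ),
    "visual_design": (
        "Rate the clarity and aesthetics of the slides' visual design on a "
        "scale of 1-5 and briefly explain your score."
    ),
    "comprehension_difficulty": (
        "From an audience's perspective, rate how easy the video is to "
        "understand on a scale of 1-5 and briefly explain your score."
    ),
}
```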

PresentAgent

Figure 4: Overview of the PresentAgent framework

This system takes various types of documents (such as papers, web pages, PDFs, etc.) as input and follows a modular generation process:

  1. First, perform outline generation;
  2. Then, retrieve the most suitable slide template;
  3. Next, generate slides and narration scripts with the help of a vision-language model;
  4. Convert the explanation script into audio through TTS and synthesize it into a complete presentation video;
  5. To