MIT and IBM Propose ChartNet: World's Largest Synthetic Chart Dataset with 1.5M Diverse Chart Samples


Experts from the Massachusetts Institute of Technology (MIT), the MIT-IBM Watson AI Lab, and IBM Research have proposed ChartNet, a high-quality multimodal dataset with over one million samples for chart understanding. It aims to advance the development of chart understanding and reasoning capabilities.

Over the past two years, multimodal large models have advanced faster than expected. From recognizing image content and understanding complex documents to parsing video, vision-language models (VLMs) have steadily expanded what they can do. Yet one seemingly simple visual object still causes many advanced models to "stumble": charts.

For humans, a bar chart, line chart, or scatter plot can often quickly convey trends, comparison relationships, and key conclusions. But for AI, charts are far more than just images. A model not only needs to identify visual elements but also understand the relationships between axes, data points, legends, and labels, and further perform tasks such as numerical extraction, trend analysis, and even causal reasoning. In other words, chart understanding is essentially a complex task that spans three cognitive abilities: vision, numerical analysis, and language. Current VLMs can only partially achieve this ability.

In recent years, some datasets have promoted the development of related research, but they generally have three problems: small scale, limited chart types, and lack of complete multimodal information. Many datasets only focus on a single task (such as question-answering or chart description) or lack key modalities. Therefore, open-source models still lag behind proprietary systems in complex chart reasoning tasks.

To fill this gap, the team built ChartNet, a high-quality multimodal dataset with over one million samples, designed to advance chart understanding and reasoning.

This is the largest synthetic chart dataset to date. It uses a novel code-guided synthesis process to generate 1.5 million diverse chart samples, covering 24 chart types and 6 plotting libraries. Extensive experiments confirm its usefulness: the best model fine-tuned on ChartNet outperforms much larger models, as well as GPT-4o, on all tasks.

Use the dataset online: https://go.hyper.ai/lGPsc

The related research results, titled "ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding," will be presented at the IEEE Conference on Computer Vision and Pattern Recognition.

Research Highlights:

* ChartNet is based on a code-guided synthesis process that can generate chart samples on a large scale while capturing visual, structural, numerical, and textual information for chart understanding.

* ChartNet integrates real-world data and manually annotated data, and provides dedicated subsets for visual grounding and safety analysis, which broadens the dataset's value for model training and evaluation.

* Fine-tuning on this dataset consistently improves the performance of vision-language models on chart reconstruction, data extraction, and chart summarization tasks.


Dataset: Composed of 1.5 million multimodally aligned synthetic samples

The core dataset of ChartNet consists of 1.5 million multimodally aligned synthetic samples. Each sample includes: a chart image, plotting code, tabular data, natural language description, and question-answer pairs with chain-of-thought (CoT) reasoning. A complete overview of data attributes, chart types, and plotting libraries used is shown in the following figure:

Data attributes, chart types, and plotting libraries included in the ChartNet dataset
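To make the five aligned modalities concrete, the following is a minimal sketch of how one sample could be held in memory. The class and field names (`ChartSample`, `ChartQA`, `table_csv`, and so on) are illustrative assumptions and are not taken from the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChartQA:
    """One question-answer pair with its chain-of-thought rationale."""
    question: str
    reasoning: str  # chain-of-thought (CoT) text
    answer: str


@dataclass
class ChartSample:
    """Hypothetical container for one multimodally aligned sample."""
    image_path: str     # rendered chart image
    plotting_code: str  # script that produced the image
    table_csv: str      # underlying data as CSV text
    description: str    # natural language description
    qa_pairs: List[ChartQA] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True when every modality listed in the article is present."""
        return all([self.image_path, self.plotting_code, self.table_csv,
                    self.description, self.qa_pairs])


# Toy example; the values are made up for illustration only.
sample = ChartSample(
    image_path="chart_000001.png",
    plotting_code="import matplotlib.pyplot as plt\nplt.bar(['A', 'B'], [3, 5])",
    table_csv="category,value\nA,3\nB,5\n",
    description="A bar chart in which B (5) exceeds A (3).",
    qa_pairs=[ChartQA("Which category is larger?",
                      "A is 3 and B is 5, and 5 > 3.", "B")],
)
```

Keeping all modalities on one record is what lets a single sample serve several tasks: image to code, image to table, image to description, and image plus question to answer.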

To cover the full spectrum of chart understanding capabilities, ChartNet also includes several dedicated subsets: manually annotated data, real-world charts, grounding data, and safety data.

Manually annotated synthetic chart data: It contains 96,643 aligned synthetic chart images, descriptions, and tabular data. All samples have been rigorously verified and annotated by humans.

High-quality real-world chart data: To supplement the synthetic chart corpus, the researchers compiled and annotated 30,000 real-world charts from authoritative international media and data publishers, such as the World Bank, Bain Insights, the Pew Research Center, and Our World in Data. The collection covers a wide range of contemporary topics, including economics, technology, geopolitics, environmental science, and social trends, which keeps the data diverse and relevant to real-world use. Charts with low information content or substandard quality are explicitly excluded to ensure interpretability.

Grounding QA pair data: Modern VLMs still struggle to identify the chart regions and structural elements relevant to a specific question. To improve this ability, the researchers constructed grounding QA pairs. First, geometry-aware annotations were extracted from plotting code elements (axes, ticks, gridlines, legends, graphic blocks) to produce dense chart grounding annotations, and an entropy-based method was used to further filter the bounding boxes. From these annotations, a set of templated QA pairs was created for each chart to capture the correspondence between the expected spatial layout of visual elements and the actual content of the chart.

The expected positions are encoded into the answer strings as serialized bounding boxes. The templates cover both unique and recurring visual elements, and generate referring expressions by combining indices, in-chart text labels, and visual attributes (such as element colors). The generator supports short and long answers and can optionally include grounding information. The final dataset contains one QA pair per chart, sampled uniformly across all template types and output modalities. In addition, gpt-oss-120b was used to generate reasoning-based grounding QA pairs.
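The templated grounding QA described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the researchers' generator: the `<box>…</box>` serialization, the `Element` fields, the single "where is" template, and the function names are all hypothetical, and the entropy-based filtering step is not shown.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Element:
    kind: str   # e.g. "legend", "bar", "x-axis"
    label: str  # in-chart text label, may be empty
    color: str  # visual attribute, may be empty
    bbox: BBox


def serialize_bbox(bbox: BBox) -> str:
    """Encode a box as text so it can sit inside an answer string (assumed format)."""
    return "<box>" + ",".join(str(v) for v in bbox) + "</box>"


def describe(el: Element, index: int) -> str:
    """Pick out one element by text label, then color, then position in the list."""
    if el.label:
        return f'the {el.kind} labeled "{el.label}"'
    if el.color:
        return f"the {el.color} {el.kind}"
    return f"{el.kind} number {index + 1}"


def make_grounding_qa(elements: List[Element], rng: random.Random,
                      long_answer: bool = False) -> Dict[str, str]:
    """Sample one element uniformly and fill a 'where is X' template."""
    index = rng.randrange(len(elements))
    el = elements[index]
    phrase = describe(el, index)
    box = serialize_bbox(el.bbox)
    if long_answer:
        answer = f"{phrase[0].upper()}{phrase[1:]} is located at {box}."
    else:
        answer = box
    return {"question": f"Where is {phrase} in the chart?", "answer": answer}


# Toy annotations; in the article these come from the plotting code.
elements = [
    Element("legend", "", "", (400, 20, 480, 60)),
    Element("bar", "Q1", "blue", (60, 120, 100, 300)),
    Element("bar", "", "orange", (120, 90, 160, 300)),
]
qa = make_grounding_qa(elements, random.Random(0))
```

Because the boxes come from the code that drew the chart, the answer string can be checked exactly against the known layout, with no manual labeling involved.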

Safety data: To address safety concerns, the researchers extended the data generation process to produce chart-related, safety-aligned data that reduces the risk of harmful content and "jailbreaking" in model outputs.

Core idea of ChartNet: Automatic generation of code-guided synthetic charts

The core idea of ChartNet's data generation is that charts can be generated programmatically, with executable plotting code serving as a structured intermediate representation for data visualization. The researchers proposed a large-scale automatic generation process for code-guided synthetic charts (shown in the following figure). This process starts with a limited amount of chart image data ("seeds") and uses a vision-language model (VLM) to output code that can roughly reconstruct these charts.

Code-Guided Chart Augmentation process

Specifically, the data generation process includes the following stages:

① Chart-to-Code Reconstruction: A VLM is used to generate Python plotting code to roughly reconstruct a given set of chart images. In this stage, 150,000 unique chart images were selected from the TinyChart dataset as seeds, but the process has no specific dependence on seed selection.

② Code-Guided Chart Augmentation: Taking the generated plotting code as input, a large language model (LLM) is used to iteratively rewrite the code. While maintaining relevance to the previous iteration, the underlying data values and labels are modified to better match the desired chart type. The following figure shows the process of iterative code augmentation and chart rendering. This stage is the main part of dataset scale expansion, and each seed image can generate an arbitrary number of variants.

Examples of synthetic charts generated from a single seed chart using the ChartNet process

③ Chart Rendering: All generated plotting code is executed to generate chart images, and successfully executed scripts are paired with the generated images.

④ Quality Filtering: A VLM is used to evaluate each chart image, detecting various potential rendering defect categories (such as text overlap, label cropping, chart element occlusion, etc.). Images with visual problems and their corresponding plotting code are removed.

⑤ Code-Guided Attribute Generation: Finally, a VLM is used to generate supplementary semantic attributes for each chart image-code pair. With the code as context, data values and labels are extracted from the chart and a tabular data representation is generated. In addition, a grounded chart description is generated by combining the visual information, code, and tabular data.
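Stage ③ above, in which every generated script is executed and only the successful ones are paired with their images, can be sketched as follows. This is a minimal illustration rather than the researchers' implementation: the convention that each script writes its figure to the path passed as its first argument, the timeout, and the function names are all assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


def render_script(code: str, out_dir: Path, name: str,
                  timeout_s: float = 30.0) -> Optional[Path]:
    """Run one plotting script in a subprocess; return the image path on success.

    Assumption: the script saves its figure to the path given as sys.argv[1].
    """
    image_path = out_dir / f"{name}.png"
    script_path = out_dir / f"{name}.py"
    script_path.write_text(code, encoding="utf-8")
    try:
        result = subprocess.run(
            [sys.executable, str(script_path), str(image_path)],
            capture_output=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    # A script counts as successful only if it exits cleanly AND wrote a file.
    if result.returncode != 0 or not image_path.exists():
        return None
    return image_path


def render_all(scripts: List[str], out_dir: Path) -> List[Tuple[str, Path]]:
    """Keep only (code, image) pairs whose script ran and produced an image."""
    pairs = []
    for i, code in enumerate(scripts):
        image = render_script(code, out_dir, f"chart_{i:06d}")
        if image is not None:
            pairs.append((code, image))
    return pairs


# Usage with stand-in scripts: the first writes placeholder bytes instead of
# a real plot, and the second raises, so only the first pair is kept.
good = "import sys\nopen(sys.argv[1], 'wb').write(b'placeholder')\n"
bad = "raise RuntimeError('broken plot')\n"
with tempfile.TemporaryDirectory() as tmp:
    pairs = render_all([good, bad], Path(tmp))
    kept = len(pairs)
```

Running each script in its own process means a crash or hang in one generated script cannot stop the rest of the batch, which matters when the code is written by a model and executed at scale.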

Significant and consistent improvements in all chart understanding tasks

To verify the effectiveness of ChartNet in improving the chart understanding ability of models, the researchers trained vision-language models of different scales on the ChartNet dataset, including ultra-compact models (Ultra-Compact, ≤1B parameters), small models (Small, ≤4B parameters), and medium models (Medium, ≤7B parameters).

Overall, fine-tuning on the ChartNet dataset brings significant and consistent improvements on all chart understanding tasks (as shown in the following table). The gains are similar in uniformity and magnitude across model scales. This indicates that existing VLMs have lacked exposure to high-quality multimodal chart supervision, a gap that ChartNet effectively fills.

Comparison between the base model and the fine-tuned model on the ChartNet evaluation set (performance improvements are marked in blue)

① Chart Reconstruction

Models trained on the Chart-to-Code subset achieve significant improvements in code execution rate, data consistency, structure/code similarity, and image similarity. Ultra-compact models (SmolVLM-256M, Granite-Docling-258M), which originally could not reconstruct charts at all, can now do so. Small models (such as Granite-Vision-2B) come close to perfect reconstruction, with multiple metrics exceeding 90%. LLaVA-7B shows the largest gain, +42.4 points on the data consistency metric. This scale-independent trend indicates that the multimodal alignment between images and code in ChartNet provides the structured supervision that was missing from previous datasets.

② Chart Data Extraction

ChartNet has significantly improved the ability of all models to directly recover numerical tables from charts. The best-performing Granite-Vision-2B reaches 70.3%. The performance of the fine-tuned LLaVA-7B has improved by +41.8 points, surpassing all open-source baselines and even GPT-4o (only 46.7%). This reflects the value of the close coupling between code-generated charts and CSV data in ChartNet, allowing models to simultaneously access visual geometry and underlying numerical structures.

③ Chart Summarization

Summarization quality improves significantly across all model families, with gains ranging from +9.5 (Qwen2.5-VL-3B) to +31.4 (Granite-Docling-2B). The fine-tuned Granite-Vision-2B reaches 83.9%, surpassing GPT-4o and all open-source baselines in Table 3, including models with an order of magnitude more parameters. This shows that the synthetic summaries in ChartNet (constructed from code and rendered charts) provide a structured and semantically complete supervision signal for descriptive chart understanding.

④ QA with CoT Reasoning

On complex multi-step reasoning tasks, every model shows a stable improvement in accuracy. LLaVA-7B has the largest gain (+15.17), reaching 70.3% and surpassing the dedicated chart reasoning model ChartGemma, all comparable or larger open-source models, and GPT-4o.

⑤ Comparison with Off-the-Shelf Models

The following table shows that models fine-tuned on ChartNet outperform off-the-shelf models with more parameters on almost all metrics. After fine-tuning, 2B and 7B parameter models consistently outperform models in the 20B to 72B range. In chart reconstruction and data extraction in particular, ChartNet fine-tuned models far exceed GPT-4o.

Performance of off-the-shelf models on the ChartNet evaluation set

This indicates that in fields where vision, numerical analysis, and language are closely coupled, such as chart understanding, providing high-quality, code-aligned multimodal supervision is more effective than simply increasing the model scale.

⑥ Generalization to Public Benchmarks

As shown in the following table, after fine-tuning on the core ChartNet dataset, all models achieve significant improvements on public benchmarks. Granite-Vision-2B rises from 1.6 to 12.4 BLEU on ChartCap and from 30.8 to 58.4 on ChartMimic-v2. Even ultra-compact models (SmolVLM-256M) show measurable gains. The improvement holds for both chart summarization and chart-to-code generation, indicating that ChartNet's multimodal alignment supervision transfers to real-world benchmarks and is not limited to the synthetic training distribution.

Generalization ability of ChartNet synthetic data on two real public benchmarks

Conclusion

ChartNet aims to address the core bottleneck in the field of chart understanding: the lack of large-scale, high-fidelity, supervised signals that align images, plotting code, numerical data, text descriptions, and reasoning trajectories. It provides a scalable and open foundation platform for research in numerical reasoning, visualization understanding, document intelligence, and code-aligned multimodal modeling, pushing VLMs from "describing charts" to "understanding the structured information encoded in charts."

Jovana Kondic, a graduate student in the Department of Electrical Engineering and Computer Science (EECS) at MIT and the first author of the ChartNet-related paper, said, "Many previous training datasets only focused on answering simple questions about charts. We tried to go beyond this with ChartNet and generate data that can support a comprehensive and in-depth understanding of charts."

In the future, the researchers plan to continue expanding ChartNet by incorporating more complex data to create practical value for more industries.

References: https://arxiv.org/abs

