Using Gemini to process news from over 150 countries, Google has open-sourced Groundsource, a flood dataset covering more than 2.6 million historical records.
Groundsource, the open-source flood dataset released by Google Research, extracts verified ground-truth information from unstructured data, mapping the footprints of historical disasters with unprecedented accuracy. Researchers automatically processed over 5 million news reports from more than 150 countries and compiled over 2.6 million records of historical flood events, giving global flood research data of unprecedented scale and coverage.
Among natural disasters worldwide, floods are one of the few disaster types that combine high frequency with great destructive power, which is why they have long been a core concern in hydrology, climate science, and disaster management. From improving hydrological forecasting models and analyzing how climate change shapes flood evolution to assessing future flood risks and strengthening disaster prevention and mitigation systems, almost all relevant research depends on the same basic condition: high-quality historical flood data. Such data are not only the key reference for testing the reliability of models but also an important basis for risk assessment and policy-making.
Traditional hydrological and meteorological observation stations are sparsely distributed, and their data quality varies, making it difficult to support large-scale, high-precision flood information collection. Truly well-structured flood datasets remain rare: the Storm Events Database maintained by the U.S. National Centers for Environmental Information is a typical example, but globally such systematic records are the minority, and many countries have never established long-term flood event databases. As a result, existing global flood datasets generally fall short in coverage and record completeness.
Notably, a large amount of flood-event information has long been scattered across unstructured texts such as news reports and government documents. Earlier research attempted to extract data from these sources, but low text standardization and high manual processing costs made it difficult to scale. In recent years, the development of generative artificial intelligence has opened a new path through this bottleneck.
Recently, Google Research open-sourced Groundsource, a flood dataset built by extracting verified ground-truth information from exactly such unstructured data: researchers automatically processed over 5 million news reports from more than 150 countries and compiled over 2.6 million records of historical flood events.
Currently, "The Groundsource Global Flood Event Dataset" is available in the dataset section of the HyperAI official website (hyper.ai) and supports online use:
https://go.hyper.ai/KO3dB
Paper address: https://eartharxiv.org/repository/view/12083/
Based on 5 million news articles,
over 2.6 million flood records were screened
The construction of the Groundsource dataset follows a standardized automated pipeline. In the global data-collection and entity-recognition stages, the research team relied on some of Google's internal infrastructure, such as the WebRef named-entity recognition system and the Read Aloud tool. However, the data-extraction logic, the large-language-model prompt framework, and the spatio-temporal aggregation rules are all publicly documented, so the pipeline can still be replicated in other technical environments by substituting open-source algorithms or other language models.
Data construction starts with news collection. The research team used web crawlers to gather public news reports published since 2000 and computed a flood-topic relevance score for each article through WebRef. With the threshold set at 0.6, researchers initially screened out about 9.5 million web pages; manual spot checks, however, showed that only about half of them actually reported flood events, while the rest mentioned floods only in passing.
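The screening step described above amounts to a threshold filter over per-article relevance scores. A minimal sketch, assuming each crawled article carries a precomputed score (here a plain float standing in for the WebRef output; the field names are illustrative):

```python
# Sketch of the relevance-screening step: keep only articles whose
# flood-topic relevance score clears the 0.6 threshold used by the team.
FLOOD_RELEVANCE_THRESHOLD = 0.6

def screen_articles(articles, threshold=FLOOD_RELEVANCE_THRESHOLD):
    """Keep only articles whose relevance score is at or above the threshold."""
    return [a for a in articles if a["flood_score"] >= threshold]

crawled = [
    {"url": "a", "flood_score": 0.91},   # clearly about a flood
    {"url": "b", "flood_score": 0.35},   # flood mentioned only in passing
    {"url": "c", "flood_score": 0.60},   # right at the cutoff, still kept
]
candidates = screen_articles(crawled)
```

As the spot checks showed, such a score threshold trades precision for recall, which is why the later Gemini stage re-verifies whether each surviving article describes a real event.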
The text-extraction stage follows. The system automatically strips advertising and navigation elements from each page, retains only the article text and publication date, and filters out languages that cannot be parsed and websites that cannot be accessed, yielding about 7.5 million usable articles. All non-English texts are translated into English, and geographic place names are extracted through entity recognition to form a candidate location library.
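Stripping boilerplate from a page can be sketched with Python's standard-library HTML parser. This is a toy stand-in for the actual extraction tooling: the set of skipped container tags is an assumption, and the real pipeline additionally records the publication date and handles unreachable sites.

```python
from html.parser import HTMLParser

# Minimal sketch of boilerplate stripping: ignore text inside
# navigation/ad-like containers, keep the remaining article text.
SKIP = {"script", "style", "nav", "aside", "footer"}

class ArticleText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0        # >0 while inside a skipped container
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = ArticleText()
    parser.feed(html)
    return " ".join(parser.chunks)

page = "<nav>Home</nav><p>Floods hit the valley on Monday.</p><footer>Ads</footer>"
```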
Identifying specific flood events in news texts is the most complex part of the pipeline. Reports often contain multiple locations and vague time expressions such as "yesterday" or "last week". The research team therefore designed a structured prompt framework for the Gemini large language model and tuned it on 250 manually annotated articles. They used Google Read Aloud to extract original text in 80 languages and standardized it into English through the Cloud Translation API. The model must complete four tasks in sequence: determine whether the article describes a real flood event, extract and normalize the event time, identify the specific locations affected by the flood, and match the place names to standard geographic identifiers.
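The four-task structure lends itself to a single prompt that requests a fixed JSON schema. The wording and field names below are illustrative assumptions, not the team's actual prompt, and the model call is mocked so the structure can be tested offline:

```python
import json

# Illustrative sketch of a four-task extraction prompt. Note that the
# publication date is passed in so the model can resolve relative
# expressions like "yesterday" against it.
PROMPT_TEMPLATE = """You will read a news article (published {pub_date}).
Answer in JSON with these fields, in order:
1. "is_real_flood": does the article describe a real flood event? (true/false)
2. "event_date": the flood date in ISO format, resolving phrases like
   "yesterday" against the publication date.
3. "locations": the specific places affected by the flood.
4. "location_ids": a standard geographic identifier per place, or null.

Article:
{article_text}
"""

def build_prompt(article_text, pub_date):
    return PROMPT_TEMPLATE.format(article_text=article_text, pub_date=pub_date)

def parse_reply(reply_text):
    """Validate a model reply against the expected four fields."""
    data = json.loads(reply_text)
    required = ["is_real_flood", "event_date", "locations", "location_ids"]
    if any(k not in data for k in required):
        raise ValueError("model reply missing required fields")
    return data

# A mocked reply standing in for an actual Gemini response.
mock_reply = ('{"is_real_flood": true, "event_date": "2024-07-14",'
              ' "locations": ["Hanoi"], "location_ids": ["geo:hanoi"]}')
event = parse_reply(mock_reply)
```

Validating the reply against a fixed schema is what makes output from millions of articles mergeable downstream.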
Under this process, about 5 million of the 7.5 million articles were identified as describing real flood events. Against the manually annotated samples, event recognition reaches a precision of about 75% and a recall of about 90%. Date and location extraction are slightly less accurate but still provide useful spatio-temporal clues.
To place these events on the map, the system also geocodes the locations: if a place name matches an existing geographic entity, its stored spatial boundary is used directly; if not, the name is converted into coordinates through a geocoding service, with a small buffer zone generated when necessary for spatial analysis.
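The match-or-buffer fallback can be sketched as follows. The entity table, the stand-in geocoder, and the 5 km buffer radius are all assumptions for illustration; the approximate degree conversion is only reasonable at small radii away from the poles.

```python
import math

# Sketch of the geocoding fallback: known entities reuse a stored
# boundary; unmatched names are geocoded to a point and wrapped in a
# small buffer so spatial analysis still has an area to work with.
KNOWN_ENTITIES = {"springfield_county": {"bbox": (-93.5, 37.0, -93.0, 37.4)}}

def geocode_point(name):
    # Stand-in for an external geocoding service.
    return {"river town": (-92.8, 37.2)}.get(name)

def buffer_bbox(lon, lat, radius_km=5.0):
    """Approximate bounding box around a point (fine at small radii)."""
    dlat = radius_km / 111.0                        # ~111 km per degree latitude
    dlon = radius_km / (111.0 * math.cos(math.radians(lat)))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)

def resolve_location(name):
    entity = KNOWN_ENTITIES.get(name)
    if entity:                                      # matched: reuse boundary
        return entity["bbox"]
    point = geocode_point(name)
    if point:                                       # unmatched: buffer a point
        return buffer_bbox(*point)
    return None                                     # could not be located
```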
Finally, the research team merged continuously reported records into single flood events according to geographic identifiers and time information, and applied quality control to remove records with implausibly large extents or abnormal timing. After this processing, more than 2.64 million independent records remained, each corresponding to a flood observation captured by news reports at a specific time and place.
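The aggregation step can be sketched as grouping records that share a geographic identifier and fall within a short time window. The 7-day window and the area cutoff below are assumptions chosen for illustration, not the paper's actual parameters:

```python
from datetime import date, timedelta

# Sketch of the aggregation step: reports sharing a geo identifier and
# falling within a short window are merged into one flood event, after
# a quality filter drops records with implausible spatial extent.
MERGE_WINDOW = timedelta(days=7)
MAX_AREA_KM2 = 100_000            # drop records with abnormally large extent

def merge_events(records):
    """records: list of dicts with 'geo_id', 'date' (date), 'area_km2'."""
    clean = [r for r in records if r["area_km2"] <= MAX_AREA_KM2]
    clean.sort(key=lambda r: (r["geo_id"], r["date"]))
    events = []
    for r in clean:
        last = events[-1] if events else None
        if (last and last["geo_id"] == r["geo_id"]
                and r["date"] - last["end"] <= MERGE_WINDOW):
            last["end"] = r["date"]               # extend the ongoing event
            last["n_reports"] += 1
        else:
            events.append({"geo_id": r["geo_id"], "start": r["date"],
                           "end": r["date"], "n_reports": 1})
    return events

reports = [
    {"geo_id": "IN-AS", "date": date(2024, 6, 1), "area_km2": 40},
    {"geo_id": "IN-AS", "date": date(2024, 6, 4), "area_km2": 55},      # same event
    {"geo_id": "IN-AS", "date": date(2024, 7, 20), "area_km2": 30},     # new event
    {"geo_id": "IN-AS", "date": date(2024, 6, 2), "area_km2": 500_000}, # dropped
]
events = merge_events(reports)
```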
Dataset evaluation:
82% of events hold analytical value;
block-level accuracy fills the gap in small-scale disaster records
To evaluate the reliability of the Groundsource dataset, the research analyzed it along three dimensions: precision, spatio-temporal distribution, and consistency with external databases, comparing it against two major databases, the Global Disaster Alert and Coordination System (GDACS) and the Dartmouth Flood Observatory (DFO).
In the precision evaluation, researchers randomly sampled 400 records and traced them back to the original news sources to check time and location information. Strictly "accurate" records accounted for 60% (95% confidence interval ±5%); including records with slight deviations that still hold analytical value, about 82% of events remain usable for subsequent analysis. The remaining roughly 18% of errors mainly stem from spatial-positioning deviations caused by place-name ambiguity and from misread relative time expressions such as "yesterday" and "last week".
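The quoted ±5% margin is consistent with a standard normal-approximation confidence interval for a proportion of 0.60 over 400 samples, which can be verified in a couple of lines:

```python
import math

# 95% CI half-width for a sample proportion (normal approximation):
# 1.96 * sqrt(p * (1 - p) / n)
def margin_95(p, n):
    return 1.96 * math.sqrt(p * (1 - p) / n)

margin = margin_95(0.60, 400)   # ≈ 0.048, i.e. roughly ±5 percentage points
```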
In its spatio-temporal distribution, the dataset shows a clear recency bias. As the figure below shows, about 64% of records fall between 2020 and 2025, with 2025 alone accounting for 15%. This trend more likely reflects the rapid growth of digital news media than an increase in flood events themselves.
Temporal distribution of the Groundsource dataset
Spatial distribution is likewise shaped by the media ecosystem: areas with dense news coverage yield more event records, while regions with scarce digital news or insufficient language support are underrepresented. Even so, the data clearly delineate flood-prone regions such as Europe, South Asia, and Southeast Asia, and the spatial distribution aligns closely with the locations of major floods recorded by GDACS.
Global spatial distribution of the extracted flood events
Despite the reporting bias, Groundsource performs strongly in spatial resolution. On average, extracted events cover 142 square kilometers, and 82% of records cover less than 50 square kilometers. Many events can be resolved to the block or neighborhood scale, capturing localized floods that traditional global disaster databases often miss.
Geographical area distribution of the extracted flood events
In the completeness evaluation, the research compared Groundsource with the Global Disaster Alert and Coordination System (GDACS) and the Dartmouth Flood Observatory (DFO) through spatio-temporal matching. Since 2020, the recall of GDACS events has reached 85% to 100%. In regions with well-developed media infrastructure such as the United States, matching rates reach 96% (GDACS) and 91% (DFO) respectively. Recall also correlates strongly with disaster impact: for major flood events it approaches or exceeds 90%.
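Spatio-temporal matching of this kind counts a reference event as recalled if some extracted record falls within a distance and time tolerance. A minimal sketch, where the ±3-day window and the 1-degree coordinate tolerance are illustrative assumptions rather than the paper's actual matching criteria:

```python
from datetime import date, timedelta

# Sketch of the recall evaluation: a reference event (e.g. a GDACS entry)
# is matched if any extracted record is close enough in time and space.
TIME_TOL = timedelta(days=3)
DIST_TOL = 1.0   # degrees; a crude stand-in for a proper distance test

def is_match(ref, ext):
    close_in_time = abs(ref["date"] - ext["date"]) <= TIME_TOL
    close_in_space = (abs(ref["lat"] - ext["lat"]) <= DIST_TOL
                      and abs(ref["lon"] - ext["lon"]) <= DIST_TOL)
    return close_in_time and close_in_space

def recall(reference_events, extracted_events):
    matched = sum(any(is_match(ref, ext) for ext in extracted_events)
                  for ref in reference_events)
    return matched / len(reference_events)

gdacs = [{"date": date(2024, 5, 1), "lat": 10.0, "lon": 105.0},
         {"date": date(2024, 8, 9), "lat": 48.0, "lon": 2.0}]
extracted = [{"date": date(2024, 5, 2), "lat": 10.3, "lon": 104.8}]
score = recall(gdacs, extracted)
```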
Comparison between Groundsource and GDACS and DFO
Overall, although Groundsource cannot offer perfectly balanced global coverage, its more than 2.6 million records and high spatial resolution fill the gap left by traditional disaster databases for small-scale, localized flood events, providing a new data source for global flood research.
AI-driven flood data research
Extracting standardized flood-event information from unstructured texts with large language models is gradually becoming an important method in flood research.
Many research teams in academia are exploring this direction. Researchers at MIT proposed improved prompt strategies and context-association methods for the common problems of time ambiguity and place-name ambiguity when large language models extract flood events. By fine-tuning the model with historical hydrological observation data, the team raised the accuracy of flood-event date extraction above 80% and developed a multilingual adaptation module that lets the model process news texts in different languages more stably, building a flood-event dataset covering multiple regions.
Paper title: Generating Physically-Consistent Satellite Imagery for Climate Visualizations
Paper link:
https://ieeexplore.ieee.org/document/10758300
The research team at the National University of Singapore pushed the application boundary further, combining AI-extracted historical flood events from news with urban drainage-network data and high-precision terrain information to build an urban-scale flood-risk assessment model. By analyzing the relationships among flood frequency, affected area, and urban infrastructure across regions, researchers can identify potential risk areas more clearly and provide more targeted references for urban flood-prevention planning. They also attempt to evaluate how flood risk will evolve under extreme climate conditions.
Paper title: Forecasting fierce floods with transferable AI in data - scarce regions
Paper link:
https://www.cell.com/the-innovation/fulltext/S2666-6758(24)00090-0
Progress has also begun to reach industry. Microsoft Research collaborated with NASA to develop Hydrology Copilot, an AI-driven flood-risk prediction platform. The system integrates flood-event data extracted from news, satellite remote-sensing information, and real-time hydrological monitoring data, and predicts flood probability and potential affected areas through a machine-learning model. The platform is being piloted in the United States and several other countries to help local emergency-management departments improve flood early-warning and response processes.
Overall, automatically extracting flood-event information from news texts is becoming an important supplement to traditional observational data. As model capability and data scale continue to grow, such methods promise a richer, higher-resolution data foundation for global flood-risk research.
Reference link: 1. https://www.geekwire.com/2025/microsoft-nasa-ai-hydrology-copilot-floods
This article is from the WeChat official account "HyperAI Super Neuro". Author: Tian Xiaoyao, Editor: Li Baozhu. Republished by 36Kr with authorization.