Interdisciplinary Team Led by the Carnegie Institution for Science Captures 3.3-Billion-Year-Old Life Remains with Random Forest Model on 406 Samples

Capturing the remains of ancient life in chaotic molecular fragments

A cross-disciplinary team drawn from the Carnegie Institution for Science in the United States and multiple universities around the world has developed a "technology integration" approach that combines pyrolysis gas chromatography-mass spectrometry with supervised machine learning. The approach can detect traces of ancient life in chaotic molecular fragments.

Decoding the organic molecules buried deep in ancient rock layers is crucial to understanding the Earth's history and the evolution of life. These potential witnesses of biological activity can help solve the mystery of the origin of life on Earth, in particular by clarifying the relationship between the origin of photosynthesis and the oxygenation of the Earth's atmosphere. They can also fill gaps in the timeline of evolution and provide core clues to how the early Earth's ecosystems formed. However, unlike large organisms that can form visible fossils, these "witnesses" have been degraded almost beyond recognition over geological time. Identifying traces of life in highly degraded organic remains has therefore become a major challenge in paleontology and earth science.

For a long time, scientists have relied mainly on methods such as fossil morphology and isotope analysis to explore early life. However, these methods are often limited by how well the samples are preserved. For example, clear records of complex molecules such as lipids and porphyrins can only be traced back to about 1.6 billion years ago, far more recent than the origin of life indicated by other evidence. The origin of organic molecules in Archean rocks is ambiguous, and it is difficult to draw the boundary between biogenic and abiogenic sources. As a result, many key findings have remained at the stage of speculation.

To break this deadlock, a cross-disciplinary team led by the Department of Terrestrial Magnetism of the Carnegie Institution for Science, together with multiple universities and research institutions around the world, proposed a "technology integration" approach. They first analyzed samples with pyrolysis–gas chromatography–mass spectrometry (py-GC-MS) and then classified the resulting data with supervised machine-learning methods to detect traces of ancient life in chaotic molecular fragments.

Experiments show that the combined approach performs better than expected. It distinguishes modern organic matter from meteorite or fossil organic matter with 100% accuracy, and fossil plant tissues from meteorite organic matter with up to 97% accuracy. More importantly, when the team applied it to unknown samples, the model identified evidence of biogenic molecular assemblages in Paleoarchean and Neoarchean rocks about 3.33 billion and 2.52 billion years old, respectively. This offers new methodological support for the search for older and more poorly preserved traces of life.

The relevant research, titled "Organic geochemical evidence for life in Archean rocks identified by pyrolysis–GC–MS and supervised machine learning", was published in the Proceedings of the National Academy of Sciences (PNAS).

Research highlights:

* The technology-integration method proposed in the research breaks through traditional limitations. By combining pyrolysis gas chromatography-mass spectrometry with machine learning, it tackles the core problem that biogenic and abiogenic organic matter become hard to tell apart once the molecules have degraded.

* The research samples cover a wide range, from modern life to rocks billions of years old and from Earth organisms to extraterrestrial meteorites, providing a full-dimensional reference for model training.

* Experiments show that the method is both sound and forward-looking. It not only supports the existence of life traces in Archean rocks but also provides a new way to search for other unknown traces of life.

Paper link: https://www.pnas.org/doi/10.1073/pnas.2514534122

Dataset: 406 samples cover a wide range, providing a full-dimensional reference for the model

The research team analyzed a total of 406 natural and synthetic samples containing a range of organic molecules, covering ancient and modern, biological and non-biological sources. The time span ranges from about 3.8 billion years ago (Archean) to 10 million years ago (Neogene). The sample types include sedimentary rocks (141), fossils (65), modern organisms (123), meteorites (42, of which 39 are carbonaceous chondrites), and laboratory-synthesized organic molecular assemblages (35), providing a rich and diverse data foundation for machine-learning analysis.

Among these 406 samples, 272 samples were clearly divided into 9 categories according to phylogenetic relationships and physiological characteristics, which were used for the training (75%) and testing (25%) of supervised machine learning. Specifically (as shown in the figure below):

Three-dimensional py-GC-MS data of 9 categories of samples

* Modern animals: From a variety of recently deceased invertebrates and vertebrates, representing the organic molecular characteristics of modern non-photosynthetic heterotrophic organisms. The number of samples is 21.

* Modern plants (non-photosynthetic tissues): Including non-photosynthetic tissues and secretions such as roots, seeds, flowers, fruits, and sap, representing the molecular differences between functional tissues of plants. The number of samples is 40.

* Modern plants (photosynthetic tissues): Mainly leaves and other photosynthetic tissues, serving as a modern reference for the molecular characteristics of photosynthetic organisms. The number of samples is 36.

* Sedimentary rocks containing photosynthetic cyanobacteria/algae fossils: Organic residues concentrated from shale or chert by dissolution in hydrochloric acid (HCl) and hydrofluoric acid (HF). The host rocks contain reliable morphological evidence of cyanobacterial or algal fossils, so these residues serve as molecular records of ancient photosynthetic microorganisms. The number of samples is 24.

* Fossil woods, coals, and oil shales: Mainly samples from the Phanerozoic (< 541 million years), also including hydrocarbon-rich sediments with complex origins in Proterozoic rocks, such as shungite and anthraxolite, representing the molecular preservation characteristics of ancient higher plants and hydrocarbon substances. The number of samples is 49.

* Animal fossils: All are Phanerozoic samples, including carbonized residues of fish and trilobite fossils, as well as shell-binding proteins extracted from the shells of Miocene gastropods, representing the organic molecular residues of ancient animals. The number of samples is 9.

* Modern fungi: Including a variety of wood-decaying fungi and yeasts, adding molecular data for eukaryotes that are neither plants nor animals. The number of samples is 16.

* Meteorites: Mainly carbonaceous chondrites (39), with organic molecular assemblages concentrated by chemical dissolution, serving as a clear reference for non-biological organic sources. The total number of samples is 42.

* Laboratory-synthesized samples: Organic molecular assemblages obtained through laboratory synthesis processes such as the Maillard reaction and the formose reaction, simulating the molecular characteristics of non-biogenic organic substances. The number of samples is 35.

In addition, the research team set up two auxiliary categories, totaling 3 samples, for the specific machine-learning model that distinguishes photosynthetic from non-photosynthetic organisms. Two modern cyanobacteria samples supplement the data on photosynthetic prokaryotes, and one modern halophilic microorganism (Halobacter) sample supplements the data on non-photosynthetic archaea.

Finally, the remaining 131 samples are mainly acid-concentrated residues of organic-rich Archean or Proterozoic sedimentary rocks. The origin and physiological characteristics of the organic molecules in these samples are unknown or controversial, so they serve as a new classification test bed for applying the machine-learning analysis.
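The sample bookkeeping in this section can be cross-checked with a few lines of Python. This is only an arithmetic sketch: the counts are copied from the text above, while the dictionary keys and variable names are illustrative labels, not names used by the authors.

```python
# Counts per labelled category, as listed in the article.
training_categories = {
    "modern animals": 21,
    "modern plants (non-photosynthetic tissues)": 40,
    "modern plants (photosynthetic tissues)": 36,
    "rocks with cyanobacteria/algae fossils": 24,
    "fossil woods, coals, oil shales": 49,
    "animal fossils": 9,
    "modern fungi": 16,
    "meteorites": 42,
    "laboratory-synthesized samples": 35,
}
auxiliary_samples = 3    # two cyanobacteria + one halophile
unknown_samples = 131    # rocks of unknown or disputed origin

labelled = sum(training_categories.values())
total = labelled + auxiliary_samples + unknown_samples
print(labelled, total)  # 272 406
```

The nine labelled categories add up to the 272 samples used for training and testing, and adding the auxiliary and unknown samples recovers the full set of 406.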

Research method and model: In-depth integration of py-GC-MS and machine learning

The experiment can be summarized in four steps:

* Step 1: Collect 406 different carbon-containing samples from various modern and ancient, biological and non-biological sources.

* Step 2: Extract carbonaceous macromolecular substances from meteorites and ancient sedimentary rocks.

* Step 3: Analyze each sample using pyrolysis gas chromatography coupled to electron impact ionization mass spectrometry.

* Step 4: Train a supervised random forest model (the machine-learning method) on the analysis data from a subset of the samples.

The most important part of the method is the "technology integration" of the py-GC-MS analysis technique with the machine-learning method.

First, the analysis technology. The instrument configuration consisted of a CDS 6150 thermal probe, an Agilent 8860 series gas chromatograph, and an Agilent 5999 quadrupole mass spectrometer. An Agilent 30 m 5% phenyl PDMS column was used for chromatographic separation. The pyrolysis products were immediately swept by helium onto the gas chromatographic column for analysis. The specific operations are as follows:

* Pyrolysis: The researchers loaded the samples (10-100 μg) into pre-baked (burned in air at 550°C for 3 h) quartz tubes, and then inserted them into the thermal probe coil for flash pyrolysis. The temperature was raised to 610°C at a rate of 500°C/s and held for 10 s.

* Chromatography: The initial temperature was 50°C, held for 1 min, then raised to 300°C at a rate of 5°C/min and held for 15 min. Ultra-high-purity helium (UHP 5.5 grade) was used as the carrier gas.

* Mass spectrometry: The instrument operated in electron ionization (EI) mode with an ionization energy of 70 eV at 250°C, a scanning range of m/z 45-700, a scanning rate of 0.80 s/decade, and a delay of 0.20 s between scans.
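As a quick worked example, the length of one GC run follows directly from the oven program in the chromatography step above. The sketch below only does that arithmetic; the function name and parameter names are hypothetical, and instrument overheads such as cool-down are not modeled.

```python
def oven_program_minutes(start_c, initial_hold_min, ramp_c_per_min,
                         final_c, final_hold_min):
    """Total duration of a single-ramp GC oven program, in minutes."""
    ramp_min = (final_c - start_c) / ramp_c_per_min
    return initial_hold_min + ramp_min + final_hold_min


# Values from the article: 50 °C for 1 min, 5 °C/min to 300 °C, hold 15 min.
total_min = oven_program_minutes(50, 1, 5, 300, 15)
print(total_min)  # 66.0
```

The ramp alone takes 50 minutes, so each sample occupies the column for about 66 minutes.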

To avoid interference from small-molecule volatiles (such as CO₂ and H₂O), MS data were not collected during the first two minutes of each run. Signals in the elution regions of common contaminants (such as palmitic acid and stearic acid) were also excluded from the chromatogram. Each sample was converted into a two-dimensional matrix (3,240 elution time steps × 151 m/z values), recording the signal intensities of 489,240 elements as a function of mass and retention time. After standardization and smoothing, 8,149 effective features were retained.
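The preprocessing described above can be pictured as masking unwanted retention-time windows and then flattening the time-by-mass matrix into one feature vector per sample. The following is a minimal sketch under stated assumptions: the article does not specify the exact standardization, so dividing by the total signal is an assumption, smoothing and feature selection are omitted, and the function name, parameter names, and toy numbers are all hypothetical.

```python
def to_feature_vector(matrix, scan_times_min, skip_before_min=2.0,
                      excluded_windows=()):
    """Flatten a (scan x m/z) intensity matrix into one normalized vector.

    matrix           -- list of scans, each a list of intensities per m/z bin
    scan_times_min   -- retention time of each scan, in minutes
    excluded_windows -- (start, end) retention-time windows to drop,
                        e.g. where known contaminants elute
    """
    kept = []
    for t, scan in zip(scan_times_min, matrix):
        if t < skip_before_min:
            continue  # early volatiles such as CO2 and H2O
        if any(lo <= t <= hi for lo, hi in excluded_windows):
            continue  # contaminant elution region
        kept.append(scan)
    total = sum(sum(scan) for scan in kept)
    # Assumed normalization: scale so that all retained intensities sum to 1.
    return [v / total if total else 0.0 for scan in kept for v in scan]


# Toy example: 4 scans x 3 m/z bins, one early scan and one excluded window.
toy = [[9, 9, 9], [1, 2, 3], [5, 5, 5], [2, 1, 1]]
times = [1.0, 2.5, 3.0, 4.0]
features = to_feature_vector(toy, times, excluded_windows=[(2.9, 3.1)])
print(len(features))  # 6
```

Each sample ends up as one row of a feature table, which is the shape a supervised classifier expects.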

Second, the model selection. The team used the random forest method, an ensemble classification method that offers high accuracy, low computational cost, and interpretability. It builds many de-correlated decision trees and combines their votes, which reduces the risk of overfitting. The model is the random forest described by Leo Breiman in "Random Forests".
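To make the idea of de-correlated trees concrete, here is a deliberately simplified sketch in plain Python. It is not the authors' implementation: each "tree" is reduced to a single-split decision stump, the data are synthetic, and all function names are hypothetical. What it does keep are the two sources of randomness that de-correlate the trees (a bootstrap sample of the rows and a random subset of the features per tree) and the majority vote.

```python
import math
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, feature_ids):
    """Pick the single-threshold split with the fewest training errors."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        m = majority(y)
        return (feature_ids[0], 0.0, m, m)
    return best[1:]


def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    k = max(1, int(math.sqrt(n_features)))  # features tried per tree
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(n_features), k)      # random feature subset
        forest.append(fit_stump([X[i] for i in rows],
                                [y[i] for i in rows], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [l_lab if row[f] <= t else r_lab for f, t, l_lab, r_lab in forest]
    return majority(votes)


# Fully synthetic toy data: four features, two well-separated classes.
X = [[0.1 * i] * 4 for i in range(10)] + [[10 + 0.1 * i] * 4 for i in range(10)]
y = ["abiotic"] * 10 + ["biotic"] * 10
forest = fit_forest(X, y)
print(predict(forest, [0.5] * 4), predict(forest, [10.5] * 4))
```

A real random forest grows full trees and re-draws the feature subset at every split, but the voting logic is the same.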

The researchers validated the trained machine-learning model in two ways. First, they used a stratified random split into a 75% training set and a 25% test set, so that the proportions of the sample classes were the same in both groups. Second, they evaluated the generalization ability of the model through 10-fold cross-validation repeated 10 times and averaged the accuracy to reduce random error.

The experiment tested 4 models, which distinguish modern biological sources (animals and plants) from non-biological sources (meteorites + synthetic samples), ancient biological sources (sedimentary rocks of known biological origin) from non-biological sources, ancient biological sources (excluding fossil woods and coals) from non-biological sources, and photosynthetic from non-photosynthetic samples.
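The two validation strategies can be sketched with the standard library alone. The code below only generates the index splits; fitting and scoring a classifier on each split is left out. The function names are hypothetical, the round-robin fold assignment is one simple way to keep folds stratified rather than the authors' procedure, and the class sizes (97 biological, 77 non-biological) are the ones the article reports for Model #1.

```python
import random
from collections import defaultdict


def group_by_class(labels):
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return groups


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    train, test = [], []
    for idxs in group_by_class(labels).values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


def repeated_stratified_kfold(labels, n_splits=10, n_repeats=10, seed=0):
    """Yield (train, test) index lists for repeated stratified k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        offset = 0
        for idxs in group_by_class(labels).values():
            rng.shuffle(idxs)
            for i, idx in enumerate(idxs):
                # Deal each class out round-robin so folds stay balanced.
                folds[(offset + i) % n_splits].append(idx)
            offset += len(idxs)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k
                           for i in fold)
            yield train, test


labels = ["biological"] * 97 + ["non-biological"] * 77
train_idx, test_idx = stratified_split(labels)
splits = list(repeated_stratified_kfold(labels))
print(len(train_idx), len(test_idx), len(splits))  # 131 43 100
```

Averaging a model's accuracy over the 100 resulting test folds gives the repeated cross-validation estimate described in the text.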

Experimental results: Multiple models and multiple dimensions verify the feasibility of technology integration

In the preliminary test, the researchers used the random forest model to classify 36 pairwise combinations of 9 types of samples with known attributes. When the sample sizes were relatively balanced, in 25 out of 36 tests, the accuracy rates of both the training set and the test set were ≥ 90%, and in 19 of them, the accuracy rate was ≥ 95%. All the results are shown in the following table:

To further illustrate this method, the paper presented several case results, showing the differences in the efficiency of this method in different cases. For example, in the cases of Group 3 and Group 8, that is, modern plants (photosynthetic tissues) and meteorites, this method distinguished plants from meteorites with 100% accuracy. The category probabilities of all samples were > 0.6 or < 0.4, indicating significant differences in molecular characteristics. As shown in Figure A below:

Probability histograms of each sample in the training set belonging to one of the two categories
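The 0.6 and 0.4 cut-offs mentioned above effectively sort each sample's class probability into three bands. A minimal sketch of that reading, with a hypothetical function name and made-up example probabilities:

```python
def confidence_band(p_class_a, high=0.6, low=0.4):
    """Map a class-A probability to a confidence band."""
    if p_class_a > high:
        return "class A (confident)"
    if p_class_a < low:
        return "class B (confident)"
    return "ambiguous"


# Illustrative probabilities only, not values from the paper.
for p in (0.93, 0.55, 0.08):
    print(p, confidence_band(p))
```

In the plant-versus-meteorite case described in the article, no sample would fall into the ambiguous band.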

Distinguishing samples of biogenic origin from those of abiogenic origin is a key goal in paleontology and astrobiology. To address it, the research team built and compared 3 different random forest models, testing how well biological and non-biological sources can be told apart for different sample combinations.

Specifically, in Model #1, the research team tested the ability to distinguish modern animals and plants (Groups 1, 2, and 3) from non-biological sources (meteorites and synthetic samples, Groups 8 and 9), with 97 and 77 samples respectively. The overall accuracy reached 98%. The AUC was 0.977 for the training set and 1.000 for the test set, and the 10-fold cross-validation accuracy was 98.3%.
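For readers unfamiliar with the AUC figures quoted here: the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. The function name is hypothetical and the scores are invented for illustration; they are not data from the paper.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up "biogenic" probabilities for three biological and three abiotic samples.
biological = [0.9, 0.8, 0.4]
abiotic = [0.7, 0.3, 0.2]
print(round(roc_auc(biological, abiotic), 3))  # 0.889
```

An AUC of 1.000, as reported for the Model #1 test set, means every biological sample scored higher than every non-biological one.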

Model #2 was mainly used to test the ability to distinguish ancient biological samples from organic-rich non-biological samples. The two classes were drawn from Groups 4 and 5 and from Groups 8 and 9, containing 87 and 77 samples respectively. Of the 87 ancient organic samples of biogenic origin, 83 were correctly classified, an accuracy of 95%. In addition, 70 (80%) of these samples had a high-confidence biogenic classification probability of > 0.6. Of the non-biological samples, 69 were correctly classified, an accuracy of 90%. The AUC was 0.924 for the training set and 0.926 for the test set, and the 10-fold cross-validation accuracy was 92.7%.

When Model #2 was applied to 109 ancient sedimentary rocks of unknown biological origin, 68 samples (61%) had a biogenic classification probability of > 0.50, and 32 samples had a biogenic classification probability of > 0.60.
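The per-class accuracies quoted for Model #2 can be reproduced from the raw counts. This is plain arithmetic on the numbers given in the text; the variable names, and the choice to call the per-class rates "sensitivity" and "specificity" with biogenic as the positive class, are labels added here for illustration.

```python
# Counts reported in the article for Model #2.
biogenic_correct, biogenic_total = 83, 87
abiotic_correct, abiotic_total = 69, 77

sensitivity = biogenic_correct / biogenic_total   # biogenic samples found
specificity = abiotic_correct / abiotic_total     # abiotic samples found
overall = (biogenic_correct + abiotic_correct) / (biogenic_total + abiotic_total)

print(f"{sensitivity:.1%} {specificity:.1%} {overall:.1%}")  # 95.4% 89.6% 92.7%
```

Rounded to whole percentages, the first two values match the 95% and 90% quoted above, and pooling both classes gives 152 correct out of 164.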

The results also revealed a trend: the proportion of samples classified as biogenic decreases with geological age. Among 82 Phanerozoic samples, 76 (93%) were classified as biogenic, compared with 43 (73%) of the Proterozoic samples and only 21 (47%) of the 45 Archean samples. This shows that the percentage of biogenic
