More than a hundred universities and research institutions have conducted the world's largest multi-cohort proteogenomics study, identifying disease-causing genes and new uses for old drugs from data on nearly 80,000 subjects.
A team from more than a hundred universities and research institutions, including Queen Mary University of London and the University of Cambridge, has announced the largest multi-cohort proteogenomic study to date. In a large-scale meta-analysis of 38 independent cohorts totaling 78,664 subjects, the team systematically identified 24,738 protein quantitative trait loci (pQTLs) associated with 1,116 circulating proteins, revealing extensive proximal and distal genetic regulation at the protein level.
The human genome is like a complete instruction manual for life, recording genetic information such as appearance, height, physical constitution, and disease risk. Reading this manual is not straightforward, however, and it contains various "small accidents", including disease-causing mutations that make people more susceptible to certain diseases. More problematically, most disease-causing mutations lie in non-coding regions of the genome, which do not directly encode proteins. Which gene is involved, and through what mechanism it causes disease, therefore remains a black box that severely limits the ability to infer causal genes and mechanisms. As the direct executors of gene function, the thousands of proteins circulating in human blood are key to opening this black box and connecting non-coding variants with disease mechanisms.
Proteogenomic research has made important progress in clarifying clinical pathogenesis and identifying potential drug targets, but its systematic, large-scale application to human biology remains limited. First, past research has focused almost entirely on proximal cis-acting variants (cis-protein quantitative trait loci, or cis-pQTLs). Yet non-coding variants may lie in regulatory regions that directly affect several neighboring coding genes and indirectly regulate, at a distance, proteins encoded elsewhere in the genome. Second, research on the polygenic architecture of protein biomarkers that inform disease diagnosis and prognosis remains insufficient. Finally, identifying pQTLs robustly and generalizably requires replication across different populations, yet very few such replication studies have been carried out in broad-spectrum proteomics.
In response, the team carried out its multi-cohort meta-analysis: across 38 independent cohorts and 78,664 subjects, it identified 24,738 protein quantitative trait loci (pQTLs) associated with 1,116 circulating proteins, covering both proximal and distal genetic regulation.
Machine learning was then used to analyze the key pathways, cell types, and tissue sources that regulate the abundance of circulating proteins, and to clarify the core role of N-glycosylation in the protein regulatory network. In addition, distinguishing cis from trans genetic regulation of proteins helps explain the mechanisms underlying different biological phenotypes, which provides evidence for screening potential protein drug targets for certain diseases. Finally, trans-locus triangular association analysis was used to explore the basis for "repurposing old drugs" in more depth.
The results were published in Cell under the title "Multi-cohort proteogenomic analyses reveal genetic effects across the proteome and diseasome".
Research highlights:
* The largest multi-cohort proteogenomic study to date, covering 38 independent research cohorts with a total of 78,664 subjects
* 24,738 protein quantitative trait loci were identified and associated with 1,116 circulating proteins, revealing extensive proximal and distal genetic regulation at the protein level
* The regulatory rules of circulating proteins were systematically described at the genetic level, providing a theoretical basis and data resources for analyzing the molecular mechanisms of human diseases, exploring new therapeutic targets, and conducting drug repositioning research
Paper link: https://www.cell.com/cell/fulltext/S0092-8674(26)00385-5
Core data at the largest scale: 38 international cohorts and nearly 80,000 subjects
This study is the largest multi-cohort proteogenomic meta-analysis to date. It integrated 38 international cohorts covering 78,664 subjects of European ancestry. Using Olink high-throughput proteomic technology, 1,161 blood protein targets were measured and combined. In total, 24,738 fine-mapped pQTLs were identified (5,040 cis-pQTLs and 19,698 trans-pQTLs), yielding genetic regulatory data for 1,116 of these proteins.
Research overview
SCALLOP meta-analyses: genome-wide summary statistics from 37 cohorts for 1,194 blood protein targets. Most of these subjects are of European ancestry. Antibody-based proteomic measurement was performed using at least one of the 13 Target-96 panels provided by Olink. Each panel detects 92 protein targets, spanning the cardiovascular, immune, inflammatory, neurological, and metabolic systems.
UK Biobank (UKBB): 48,017 subjects of European ancestry. Proteomic measurements for this dataset were generated on the Olink Explore 1536 platform, which is also antibody-based and measured 1,463 protein targets.
Phased machine-learning classifier
The machine-learning model in this study served to assign "effector genes" systematically, automatically, accurately, and at scale to all trans-pQTLs located outside the major histocompatibility complex (MHC) region. This addresses the long-standing difficulty of mapping distal pQTLs for blood protein levels to their effector genes. Inspired by the ProGeM architecture, the researchers constructed a phased machine-learning classifier for this purpose.
First, as sources of features and annotations, the researchers integrated multi-dimensional biological and genomic annotations for each genetic variant or its surrogate variant (r² > 0.6). Variant-level annotations included the distance between the variant and the gene body within a 1 Mb window, and the potential functional impact inferred with the Variant Effect Predictor (VEP) tool.
Gene-level annotations were compiled for each gene within the 1 Mb window. These included evidence from colocalization between protein abundance and GTEx v8 gene expression QTLs; rare-variant burden associations; literature-curated evidence, retrieved with the OmnipathR 3.10.1 package, on whether the product of the candidate trans-gene forms a ligand-receptor pair or protein complex with the cis-protein; and whether the genes participate in the same biological pathway according to KEGG/REACTOME annotations.
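The distance annotation described above can be illustrated with a minimal Python sketch. It assumes that "within a 1 Mb window" means a gap of at most 1,000,000 bp between the variant and the gene body, which the text does not spell out; the `Gene` class, the function names, and the gene coordinates are hypothetical.

```python
from dataclasses import dataclass

# Assumption: the 1 Mb window is read as a maximum gap of 1,000,000 bp.
WINDOW_BP = 1_000_000


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int  # gene body start (bp)
    end: int    # gene body end (bp), with start <= end


def distance_to_gene_body(chrom, pos, gene):
    """Return 0 if the variant lies inside the gene body, otherwise the gap
    in bp to the nearest gene boundary; None if on another chromosome."""
    if chrom != gene.chrom:
        return None
    if gene.start <= pos <= gene.end:
        return 0
    return gene.start - pos if pos < gene.start else pos - gene.end


def candidate_genes(chrom, pos, genes, window=WINDOW_BP):
    """List (gene name, distance) for genes within `window` bp, nearest first."""
    hits = []
    for gene in genes:
        d = distance_to_gene_body(chrom, pos, gene)
        if d is not None and d <= window:
            hits.append((gene.name, d))
    return sorted(hits, key=lambda item: item[1])


# Hypothetical genes, for illustration only.
GENES = [
    Gene("GENE_A", "1", 100, 200),
    Gene("GENE_B", "1", 2_000_000, 2_100_000),
    Gene("GENE_C", "2", 100, 200),
]
```

Each candidate returned this way would then receive the gene-level annotations listed above before being scored.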
Next, the researchers built the training set for the machine-learning model. Because no widely accepted gold standard exists for variant-to-gene assignment, they drew on prior biological and genomic knowledge to obtain three partially independent sets of "putative true positives" (PTPs): trans-genes that encode ligand-receptor pairs or form high-confidence protein complexes with cis-proteins (n = 540), sentinel trans-pQTLs mapped to functional variants (n = 1,747), and trans-genes with significant rare-variant burden (n = 1,049). Within each PTP set, only one cis-protein was retained at each locus to avoid bias, and the other genes within the 1 Mb window were used as negative samples. The data were then split into training and test sets at a ratio of 7:3 by genomic region, and the split was repeated 10 times to ensure stability.
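A region-wise 7:3 split repeated 10 times might look like the following sketch. It assumes that the ratio applies to the number of genomic regions rather than the number of samples, and that each sample carries a region label; both are assumptions, and `split_by_region` is a hypothetical helper rather than the authors' code.

```python
import random


def split_by_region(samples, train_frac=0.7, n_repeats=10, seed=0):
    """Yield (train, test) lists so that no genomic region is shared.

    samples: list of dicts, each with a 'region' key (assumed layout).
    """
    regions = sorted({s["region"] for s in samples})
    rng = random.Random(seed)
    for _ in range(n_repeats):
        shuffled = regions[:]
        rng.shuffle(shuffled)
        n_train = round(train_frac * len(shuffled))
        train_regions = set(shuffled[:n_train])
        train = [s for s in samples if s["region"] in train_regions]
        test = [s for s in samples if s["region"] not in train_regions]
        yield train, test
```

Splitting by region rather than by individual gene keeps all candidate genes from one locus on the same side of the split, so correlated neighbors cannot leak from the training data into the test data.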
For the model architecture and training, the study used a random forest classifier. For each of the 10 training sets, repeated 3-fold cross-validation was combined with a subsampling strategy to handle class imbalance during training. Training was implemented with the R package caret v6.0.94, and the best-performing random forest model for each training set was selected and evaluated by its Kappa score.
The 10 random forest classifiers trained for each putative true-positive set were then used to score the candidate effector genes of every trans-pQTL. For each candidate, the median score of the 10 classifiers within a PTP set was taken first, and the three resulting set-level scores were then summed. When building the classifiers for a given PTP set, the features used to define that set's true positives were removed.
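The aggregation step can be sketched in a few lines of Python. The dictionary layout, the PTP set names, and the function name are hypothetical, and only three scores per set are shown for brevity where the study would have ten.

```python
from statistics import median


def aggregate_scores(scores_by_set):
    """Combine classifier outputs for one candidate gene.

    scores_by_set maps each PTP set name to the list of scores produced by
    that set's classifiers. The result is the sum, over PTP sets, of the
    median classifier score within each set.
    """
    return sum(median(scores) for scores in scores_by_set.values())


# Hypothetical scores for a single candidate gene.
example = {
    "complex_or_ligand_receptor": [0.1, 0.9, 0.5],
    "functional_variant": [0.2, 0.2, 0.8],
    "rare_variant_burden": [0.0, 1.0, 0.4],
}
combined = aggregate_scores(example)  # medians 0.5, 0.2, 0.4 are summed
```

Taking the median within a set dampens the effect of any single unlucky train/test split, while summing across sets lets independent lines of evidence reinforce each other.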
All three sets of classification models performed stably, with median Kappa coefficients ranging from 0.54 to 0.57.
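For readers unfamiliar with the metric, Cohen's Kappa compares observed agreement between predictions and labels with the agreement expected by chance. The sketch below is a generic pure-Python implementation, not the caret routine used in the study, and the function name is hypothetical.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement; p_e is the chance agreement implied by
    the marginal label frequencies of the two label lists.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_e = sum((y_true.count(lab) / n) * (y_pred.count(lab) / n) for lab in labels)
    if p_e == 1.0:
        return float("nan")  # undefined when only one label ever occurs
    return (p_o - p_e) / (1 - p_e)
```

Because Kappa discounts agreement that arises by chance, it is more informative than raw accuracy on imbalanced data such as this, where negatives far outnumber putative true positives.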
Deciphering the pathogenic mechanism and providing genetic evidence for drug R&D and drug repositioning
This study was based on 38 international cohorts covering 78,664 subjects. A multi-cohort proteogenomic meta-analysis was carried out on 1,161 blood protein targets, systematically explaining the genetic regulatory patterns of circulating protein levels and their associations with diseases.
pQTL identification and characteristics
The study identified a total of 14,690 regional sentinel variants. Bayesian fine-mapping yielded 24,738 independent credible variant sets, comprising 5,040 cis-pQTLs and 19,698 trans-pQTLs across 1,116 protein targets. Of these proteins, 87.1% had cis-pQTLs and 94.1% had trans-pQTLs. High-confidence loci accounted for 82.3% of the cis-pQTLs and 83.3% of the trans-pQTLs, including 278 newly discovered cis-pQTLs and 4,013 newly discovered trans-pQTLs. In non-European ancestry cohorts, the effect sizes of the identified loci were moderately correlated with those in the European cohort (r = 0.6), supporting the cross-population robustness of the results.
Fine-mapped protein quantitative trait loci in the SCALLOP and UKBB meta-analyses
The proportion of variation in blood protein levels explained by genetic loci also differed markedly. Cis-pQTLs explained an average of 8.4% of protein variation, significantly more than trans-pQTLs. However, proteins such as ICAM2 and FUCA1 were mainly regulated by trans-pQTLs, which explained 52.7% and 68.4% of their variation respectively, while cis-pQTLs explained only 0.3% and 6.3%.
Further analysis of 261 protein targets showed no significant linear association between the variance explained by pQTLs and polygenic heritability, indicating that pQTL identification for these proteins in this study may be approaching saturation.
Characteristics of protein targets under gene regulation
Protein characteristics related to the presence and number of pQTLs based on the zero-inflated Poisson regression model
Proteins containing disulfide bonds and transmembrane domains had significantly more pQTLs, suggesting that such proteins are more susceptible to genetic regulation. The strength of functional constraint on protein-coding genes was significantly negatively correlated with the number of cis-pQTLs.
Proteins with many trans-pQTLs were significantly enriched for features of secreted proteins, such as glycosylation and sulfation, and depleted for features of intracellular proteins, such as zinc-finger structures and DNA-binding domains. This indicates that the long-distance genetic regulation of circulating proteins is closely tied to the secretion pathway.
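For reference, the standard zero-inflated Poisson model mentioned above has the general form shown below. Here Y_i is the number of pQTLs for protein i; the covariate vectors x_i and z_i (protein characteristics) and the coefficients β and γ are placeholders, since the exact specification used in the study is not given here.

```latex
\begin{aligned}
P(Y_i = 0) &= \pi_i + (1 - \pi_i)\, e^{-\lambda_i}, \\
P(Y_i = k) &= (1 - \pi_i)\, \frac{\lambda_i^{k} e^{-\lambda_i}}{k!}, \qquad k = 1, 2, \dots \\
\operatorname{logit}(\pi_i) &= z_i^{\top} \gamma, \qquad \log \lambda_i = x_i^{\top} \beta
\end{aligned}
```

The zero-inflation part (π_i) models whether a protein has any pQTL at all, and the Poisson part (λ_i) models how many it has, which matches the "presence and number" wording of the analysis.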
Analysis of trans-pQTL effector genes and regulatory pathways
By integrating prior biological knowledge into the machine-learning framework, at least one medium-confidence effector gene was identified for more than half of the trans-pQTLs (n = 11,261), of which 1,534 were assigned with high confidence. For two-thirds of the loci (n = 13,881), the distribution of candidate scores among genes pointed to a single most likely causal gene.
Analysis of effector genes of trans-pQTLs
Functional enrichment analysis showed that trans-effector genes were significantly enriched in pathways including "asparagine N-glycosylation" (involving 143 protein targets) and platelet activation (involving 41 protein targets). Of these, N-glycosylation was the most common and central regulatory pathway.
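The text does not say which statistical test underlies the enrichment analysis. One common choice for pathway over-representation is the one-sided hypergeometric test, sketched below purely as an illustration of the idea; the function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, in_pathway, drawn, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    population: number of background genes
    in_pathway: background genes annotated to the pathway
    drawn:      number of effector genes tested
    overlap:    effector genes that fall in the pathway
    """
    upper = min(in_pathway, drawn)
    total = comb(population, drawn)
    tail = sum(
        comb(in_pathway, i) * comb(population - in_pathway, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 5 in the pathway, 5 effector genes,
# all 5 of which fall in the pathway.
p_value = hypergeom_enrichment_p(10, 5, 5, 5)
```

A small p-value means the observed overlap would rarely arise if effector genes were drawn at random from the background; in practice such p-values are also corrected for the number of pathways tested.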
Cell and tissue enrichment results showed that trans-effector genes were highly expressed mainly in hepatocytes, natural killer cells, endothelial cells, and type II alveolar cells, revealing that the liver and immune cells are key sites for the long-distance regulation of circulating proteins. A total of 44 protein-tissue pairs and 76 protein-cell type pairs involved non-classical secretion sources, supporting an important role for cross-organ communication in the regulation of protein homeostasis.
Pleiotropy at the molecular and phenome levels
Among all the identified independent pQTLs, 43.4% showed pleiotropy, and trans-pQTLs were significantly more pleiotropic than cis-pQTLs. The pleiotropic genetic variants were then divided into three categories: "molecular pleiotropy", "phenotypic pleiotropy", and "non-specific pleiotropy". More than half (332 out of 533) showed phenotypic pleiotropy. Notably, their expression in hepatocytes was enriched 2-fold, and they preferentially regulated target proteins through protein complexes, ligand-receptor interactions, and pathway synergy.