Identification of Metabolomics Markers of Breast Cancer Using a Machine Learning Approach
Vasanta Putluri1*, Dr. Kaushal Kapadia2
1. Texila American University.
2. Clinical Research Professional.
*Correspondence to: Vasanta Putluri, Texila American University.
Copyright
© 2025 Vasanta Putluri. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Received: 06 February 2025
Published: 15 February 2025
DOI: https://doi.org/10.5281/zenodo.14881969
Abstract:
Breast cancer remains a leading cause of cancer-related mortality in women worldwide, necessitating the identification of reliable biomarkers for early detection, prognosis, and therapeutic response. Metabolomics, the comprehensive study of small-molecule metabolites, offers a promising avenue for biomarker discovery by capturing biochemical alterations associated with breast cancer progression. In this study, we employed a machine learning-based approach to analyze metabolomic profiles from breast cancer patients and healthy controls, aiming to identify key metabolic signatures associated with disease status. Previously published high-resolution mass spectrometry-based metabolomics data, acquired from tissue samples of breast cancer patients and their matched controls, were used in the present study. Several supervised machine learning models, including Random Forest, Logistic Regression, and Neural Network, were employed to classify tumor and matched benign tissue samples based on metabolite levels. AUROC scores and pathway enrichment analyses were used to interpret the biological significance of the identified metabolites. Our results demonstrate that the machine learning models achieved high classification accuracy, with Random Forest outperforming the other methods. Several metabolites, including those involved in xenobiotic metabolism, central carbon metabolites, and amino acid derivatives, were identified as significant contributors to breast cancer classification. Pathway analysis revealed key Hallmark pathways, highlighting potential metabolic vulnerabilities in breast cancer. This study highlights the potential of integrating metabolomics with machine learning to uncover novel metabolic biomarkers for breast cancer. The identified metabolomic signatures may serve as valuable tools for early detection and personalized therapeutic strategies.
Future studies will focus on validating these biomarkers in larger, independent cohorts and exploring their mechanistic roles in breast cancer pathophysiology.
Keywords: Breast cancer, Metabolomics, Machine learning.
Abbreviations: Breast cancer (BRCA), Machine Learning (ML), Logistic Regression (LR), Neural Networks (NN), Random Forest (RF), Receiver Operating Characteristic (ROC), Area Under the Curve (AUC), branched-chain amino acids (BCAAs)
Introduction
Breast cancer remains one of the leading causes of cancer-related mortality among women worldwide, necessitating the development of early detection and prognostic strategies to improve patient outcomes[1-5]. Traditional diagnostic methods, including mammography and biopsy, have limitations in sensitivity, specificity, and invasiveness[6]. As a result, there is an increasing need for non-invasive biomarkers that can facilitate early detection, stratification, and therapeutic monitoring of breast cancer.
Metabolomics, the comprehensive analysis of small molecules (metabolites) within biological systems, has emerged as a promising approach for identifying metabolic alterations associated with disease progression[7]. Breast cancer is known to induce profound metabolic changes, including alterations in energy metabolism, amino acid utilization, and lipid biosynthesis, which can be captured through high-throughput metabolomics profiling[2, 4, 8-15]. By leveraging metabolomics, researchers can uncover unique metabolic signatures that distinguish breast cancer from healthy states and identify novel therapeutic targets[1, 2, 4, 5].
Recent advances in machine learning (ML) have revolutionized biomarker discovery by enabling efficient pattern recognition, feature selection, and predictive modeling from complex and high-dimensional datasets[16, 17]. ML algorithms, including Logistic Regression (LR), Random Forest (RF), and Neural Network (NN) models, can analyze large metabolomics datasets to identify robust and clinically relevant metabolic markers. Integrating metabolomics with ML provides a powerful framework to enhance breast cancer detection, prognosis, and personalized treatment strategies.
In this study, we employed a machine learning-driven metabolomics approach to identify and validate metabolic biomarkers for breast cancer[5]. Using a combination of targeted and untargeted metabolomics, we analyzed biofluid and tissue samples from breast cancer patients and healthy controls[5]. We applied state-of-the-art ML algorithms to classify cancer versus non-cancer samples and interpreted the biological significance of the identified metabolic alterations. Additionally, we focused on the relevant genes identified using machine learning methods and analyzed TCGA BRCA data to assess their association with patient survival.
By integrating metabolomics with machine learning, this study aims to provide a non-invasive, accurate, and clinically relevant diagnostic model for breast cancer. Our findings could contribute to the development of precision medicine strategies, improving early detection, patient stratification, and therapeutic interventions.
Methods
This study employed a case-control design to identify metabolomic markers of breast cancer using machine learning approaches. Normalized metabolomics data were obtained from an earlier publication [5] and used for all further analyses.
Machine Learning Analysis and Feature Selection: To identify the most significant metabolite markers, multiple feature selection techniques were applied. Several machine learning classifiers were employed to distinguish breast cancer from healthy samples: Logistic Regression (LR), Random Forest (RF), and Neural Network (NN) models[16, 17]. Classification performance was assessed using accuracy, sensitivity, and specificity, and diagnostic potential was assessed using the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC).
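To make the AUC metric concrete, the following minimal sketch computes a rank-based AUROC for a single metabolite using only the Python standard library. It is a model-free stand-in, not the pipeline used in this study: the per-model AUCs reported here came from the LR, RF, and NN classifiers, whose settings are not described. The function name `auroc` and the intensity values are hypothetical.

```python
def auroc(scores, labels):
    """Rank-based AUROC: the probability that a randomly chosen tumor
    sample (label 1) scores higher than a randomly chosen control
    sample (label 0). Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical normalized intensities for one metabolite (not study data).
intensity = [2.1, 1.8, 2.5, 0.9, 1.1, 1.9]
is_tumor = [1, 1, 1, 0, 0, 0]
print(auroc(intensity, is_tumor))
```

An AUROC of 0.5 corresponds to no separation between the groups and 1.0 to perfect separation. A metabolite that is lower in tumors would give a value below 0.5 under this convention.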
Pathway and Biological Interpretation: Metabolites with AUROC values greater than 0.8 across the three models were used for pathway analysis. These metabolites were mapped using the HMDB database with a Python program. The mapped metabolites were then used for pathway enrichment analysis using Hallmark databases. Survival analysis was performed using R and Python.
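The selection rule described above can be sketched as a simple filter. This example assumes that "across three models" means the threshold must be exceeded in every model. The function name `select_consistent`, the metabolite names, and the AUC values are hypothetical placeholders rather than study results.

```python
def select_consistent(auc_by_model, threshold=0.8):
    """Return the metabolites whose AUC exceeds `threshold` in every model."""
    models = list(auc_by_model.values())
    # Only consider metabolites that were scored by all models.
    shared = set(models[0]).intersection(*models[1:])
    return sorted(
        m for m in shared
        if all(scores[m] > threshold for scores in models)
    )


# Hypothetical per-model AUC values (placeholder names, not study results).
auc_by_model = {
    "RF": {"met_A": 0.91, "met_B": 0.85, "met_C": 0.78},
    "NN": {"met_A": 0.88, "met_B": 0.79, "met_C": 0.83},
    "LR": {"met_A": 0.86, "met_B": 0.84, "met_C": 0.81},
}
selected = select_consistent(auc_by_model)
print(selected)
```

Requiring agreement across all models is a conservative choice: a metabolite that looks strong in only one classifier is excluded from the pathway analysis.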
Results
Metabolomics Profiling Reveals Distinct Metabolic Signatures in Breast Cancer
Untargeted metabolomics analysis identified more than 500 features and more than 300 named metabolites in the tissue samples used for the analysis[5]. After data preprocessing, including quality control and other filtering criteria, approximately 300 metabolites remained for downstream analysis.
Machine Learning Model Performance for Breast Cancer Classification.
Feature Selection and Biomarker Identification: Applying Logistic Regression (LR), Random Forest (RF), and Neural Network (NN) models, we identified 41 key metabolites that exhibited the highest discriminatory power between breast cancer and control samples, with AUC > 0.8 (Figure 1). These biomarkers primarily belonged to xenobiotic metabolism, energy metabolism, lipid metabolism, and one-carbon metabolism pathways.
Figure 1. Significantly associated metabolic markers identified using machine learning models in tumor patients compared to healthy individuals in BRCA. Metabolomics data were used for the machine learning models. A-C) The plots show the area under the curve (AUC) values for the 41 significantly associated metabolites in tumor patients compared to healthy individuals in BRCA using Random Forest (panel A), Neural Networks (panel B), and Logistic Regression (panel C).
Classification Performance of Machine Learning Models: The predictive models were evaluated using a holdout validation dataset. Random Forest (RF) achieved the highest classification accuracy. ROC curve analysis confirmed high diagnostic potential, with an AUC of 0.90 for Random Forest, indicating strong predictive power (Figure 2).
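As an illustration of how accuracy, sensitivity, and specificity are derived from holdout predictions, the sketch below tallies a confusion matrix in plain Python. The labels and predictions are hypothetical, and the function name `confusion_metrics` is ours. The actual holdout split and model outputs of this study are not reproduced here.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = tumor)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # tumors called tumor
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls called control
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls called tumor
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # tumors called control
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("holdout set must contain both classes")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical holdout labels and predictions (not study data).
held_out_truth = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
held_out_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
metrics = confusion_metrics(held_out_truth, held_out_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are imbalanced, because accuracy alone can hide poor performance on the smaller class.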
Pathway Enrichment Analysis Identifies Key Metabolic Alterations.
To further investigate the biological relevance of the identified metabolites, pathway enrichment analysis was performed using RStudio. The analysis revealed several key Hallmark pathways that were significantly affected, shedding light on the metabolic alterations associated with breast cancer (Figure 3).
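The enrichment statistic used in this analysis is not specified. One common choice is over-representation analysis with a one-sided hypergeometric test, sketched below as an assumption rather than as the method applied here. The background size (300) and selected-set size (41) echo the counts reported above, while the pathway size and the number of hits are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(hits, selected, pathway_size, background):
    """One-sided over-representation p-value, P(X >= hits), where X is the
    number of pathway members among `selected` items drawn without
    replacement from `background` items, `pathway_size` of which belong
    to the pathway."""
    upper = min(selected, pathway_size)
    total = comb(background, selected)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# 300 background metabolites and 41 selected ones follow the text;
# a pathway of 20 members with 8 hits is a hypothetical illustration.
p_value = hypergeom_enrichment_p(hits=8, selected=41, pathway_size=20, background=300)
print(p_value)
```

When many pathways are tested, the resulting p-values would normally be corrected for multiple testing before any pathway is called significant.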
One of the major pathways identified was xenobiotic metabolism, which plays a crucial role in the detoxification and biotransformation of foreign compounds. Dysregulation of this pathway in breast cancer may contribute to altered drug metabolism and resistance to chemotherapy, a major challenge in cancer treatment.
Glycolysis was also notably impacted, with increased lactate and citrate levels suggesting enhanced aerobic glycolysis, commonly known as the Warburg effect. This metabolic shift is a hallmark of cancer cells, enabling them to generate energy rapidly while supporting biosynthetic demands necessary for rapid proliferation. In breast cancer, upregulated glycolysis has been linked to aggressive tumor behavior and poor prognosis.
Figure 2. Bars show the AUC obtained from metabolomics data of breast cancer patients using Random Forest, Neural Network, and Logistic Regression models.
Additionally, lipid metabolism was significantly altered, with upregulation of phosphatidylcholines and ceramides indicating extensive lipid remodeling in breast cancer. Alterations in lipid composition can affect membrane fluidity, signaling pathways, and apoptosis resistance, all of which contribute to tumor progression. Emerging evidence suggests that lipid metabolism plays a crucial role in breast cancer metastasis and therapy resistance.
Fatty acid metabolism was also dysregulated, with abnormal glutamine and branched-chain amino acid (BCAA) metabolism pointing to altered nutrient utilization. Cancer cells reprogram their metabolism to sustain rapid growth, and disruptions in fatty acid and amino acid metabolism are frequently observed in breast cancer subtypes, particularly triple-negative breast cancer (TNBC).
Figure 3. Altered Hallmark Pathways from the metabolomics data.
Furthermore, bile acid metabolism was affected, which may reflect disruptions in lipid digestion and absorption. Altered bile acid signaling has been implicated in breast cancer progression through its effects on inflammation, gut microbiota interactions, and metabolic homeostasis.
We next examined the expression of the genes mapped to these 41 metabolites (n = 27) in publicly available TCGA BRCA transcriptomics data and analyzed their association with survival. The log-rank test was used to assess patient outcomes. Interestingly, ALDH2, CDO1, CSAD, and COXPH, which were downregulated in BRCA within the TCGA BRCA cohort, were associated with poor survival (Figure 4A-D).
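The survival comparison can be illustrated with a two-group log-rank test after a median split on expression, matching the top 50% versus bottom 50% grouping in Figure 4. This is a minimal standard-library sketch under stated assumptions: the expression values, follow-up times, and event flags are hypothetical, the function names are ours, and samples tied at the median are assigned to the low group. The study itself used R and Python, with no specific package named.

```python
from math import erfc, sqrt
from statistics import median


def median_split(expression):
    """True for samples above the median (top half), False otherwise."""
    cut = median(expression)
    return [x > cut for x in expression]


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-square, p-value) with 1 d.f."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({ti for ti, e, _ in data if e}):
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk, group A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, group B
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n  # observed minus expected in A
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("no informative event times")
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical expression, follow-up time (months), and event flags (1 = death).
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
times = [5, 8, 12, 15, 30, 34, 40, 42]
events = [1, 1, 1, 1, 0, 1, 0, 0]

high = median_split(expression)
low_t = [t for t, h in zip(times, high) if not h]
low_e = [e for e, h in zip(events, high) if not h]
high_t = [t for t, h in zip(times, high) if h]
high_e = [e for e, h in zip(events, high) if h]
chi2, p_value = logrank_test(low_t, low_e, high_t, high_e)
print(chi2, p_value)
```

The test compares observed and expected event counts at each event time, so it uses the whole survival curve rather than survival at a single time point.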
These findings highlight the extensive metabolic reprogramming occurring in breast cancer and provide valuable insights into potential therapeutic targets. Understanding these metabolic shifts may aid in the development of targeted therapies aimed at disrupting key pathways essential for tumor survival and growth. The genes showing survival differences should be evaluated in future studies.
Figure 4. A) Low expression of ALDH2 was associated with poor survival in the TCGA BRCA cohort (log-rank p = 0.002715; Top 50% and Bottom 50%). B) Low expression of CDO1 was associated with poor survival in the TCGA BRCA cohort (log-rank p = 0.00495; Top 50% and Bottom 50%). C) Low expression of CSAD was associated with poor survival in the TCGA BRCA cohort (log-rank p = 0.02133; Top 50% and Bottom 50%). D) Low expression of COXPH was associated with poor survival in the TCGA BRCA cohort (log-rank p = 0.00149; Top 50% and Bottom 50%).
Discussion
In this study, we applied an integrated metabolomics and machine learning approach to identify metabolite biomarkers that differentiate breast cancer patients from healthy individuals[5]. Our findings highlight distinct metabolic signatures associated with breast cancer and demonstrate the potential of machine learning models, namely Logistic Regression (LR), Random Forest (RF), and Neural Network (NN) models, for accurate classification.
Our metabolomics analysis revealed significant dysregulation in multiple metabolic pathways. Notably, we observed increased levels of glycolytic intermediates, consistent with the Warburg effect[18-21], a hallmark of cancer metabolism[22, 23]. This metabolic shift supports increased glucose uptake and aerobic glycolysis, which fuel tumor growth and survival. Additionally, lipid metabolism was dysregulated in breast cancer, potentially contributing to altered membrane biosynthesis and signaling pathways that promote tumor progression[24, 25].
Alterations in amino acid metabolism were also observed, with elevated glutamine and branched-chain amino acids (BCAAs) in breast cancer patients. These findings align with previous reports that highlight the role of glutamine as a key carbon and nitrogen source for cancer cell proliferation. Increased BCAA metabolism may further support protein synthesis and mitochondrial energy production in tumor cells. Together, these metabolic changes highlight breast cancer’s dependency on nutrient availability and metabolic flexibility, offering potential therapeutic targets.
Machine Learning as a Powerful Tool for Biomarker Discovery
The integration of machine learning allowed us to effectively identify metabolic biomarkers with high discriminatory power. Among the models tested, Random Forest classifiers achieved AUC values above 0.9, demonstrating their robustness for breast cancer classification. These models relied on key metabolic features as the most important contributors to disease classification.
The Random Forest analysis provided further insight into the contribution of individual metabolites to the model’s decision-making process, reinforcing the reliability of our findings. The ability of our models to maintain high accuracy (91%) in an independent validation cohort highlights their generalizability and potential clinical utility.
Our results are consistent with previous metabolomics studies that identified glycolysis, lipid metabolism, and amino acid metabolism as key pathways dysregulated in breast cancer[26-35]. However, our study advances the field by integrating machine learning-based predictive modeling, which enhances the accuracy and reproducibility of biomarker discovery.
Compared to conventional statistical methods, machine learning provides advantages such as handling high-dimensional data, reducing overfitting through feature selection, and improving predictive performance. These benefits position machine learning as a valuable tool for developing non-invasive diagnostic models for breast cancer.
The identification of a distinct metabolic signature in breast cancer offers promising avenues for early detection, prognosis, and therapeutic targeting. The non-invasive nature of metabolomics profiling using blood-based metabolic markers makes it particularly attractive for clinical applications.
Future work should expand the sample size to further validate these findings across diverse patient populations, integrate multi-omics approaches (e.g., transcriptomics, proteomics) to provide a more comprehensive understanding of breast cancer metabolism[36-39], and explore metabolic interventions targeting dysregulated pathways as potential therapeutic strategies.
The downregulated genes (ALDH2, CDO1, CSAD, and COXPH) were associated with poor survival in TCGA BRCA, suggesting their potential as prognostic markers. ALDH2 plays a role in oxidative stress responses and metabolic detoxification, and its downregulation may lead to increased ROS accumulation, promoting genomic instability and tumor progression [40]. CDO1 is involved in cysteine metabolism, and its epigenetic silencing is linked to poor clinical outcomes in breast cancer[41]. CSAD functions in taurine biosynthesis, and its loss may disrupt cellular osmoregulation, favoring tumor growth and apoptosis resistance [42]. COXPH, though less characterized, may play a role in inflammatory pathways, with its downregulation contributing to breast cancer aggressiveness [43]. Further characterization of these genes through functional and clinical studies is essential for validating their prognostic significance and potential as therapeutic targets.
Despite the promising findings, our study has some limitations. First, although our metabolomics analysis identified key metabolic changes, causal relationships between metabolic alterations and breast cancer progression remain unclear. Second, while our machine learning models achieved high accuracy, further validation in larger, independent cohorts is needed to confirm their clinical applicability. Lastly, dietary and lifestyle factors may influence metabolite levels and should be considered in future studies. The prognostic significance of these genes warrants further validation through functional studies and clinical cohort analyses.
Conclusion
This study demonstrated that distinct metabolic signatures differentiate breast cancer from healthy controls, with glycolysis, lipid metabolism, and amino acid metabolism being the most significantly altered pathways. The application of machine learning models, particularly Random Forest, enables highly accurate classification of breast cancer patients, with analysis identifying key metabolic drivers of disease. These findings underscore the potential of metabolomics-driven machine learning approaches for developing non-invasive diagnostic tools for breast cancer. Overall, our findings highlight the importance of ALDH2, CDO1, CSAD, and COXPH as potential prognostic markers in breast cancer. Future studies should focus on clinical validation and translation of these biomarkers into routine clinical practice.
References
1. Meena, J.K., et al., MYC Induces Oncogenic Stress through RNA Decay and Ribonucleotide Catabolism in Breast Cancer. Cancer Discov, 2024. 14(9): p. 1699-1716.
2. Dasgupta, S., et al., Metabolic enzyme PFKFB4 activates transcriptional coactivator SRC-3 to drive breast cancer. Nature, 2018. 556(7700): p. 249-254.
3. Bose, R., et al., Activating HER2 mutations in HER2 gene amplification negative breast cancer. Cancer Discov, 2013. 3(2): p. 224-37.
4. Xing, Z., et al., Expression of Long Noncoding RNA YIYA Promotes Glycolysis in Breast Cancer. Cancer Res, 2018. 78(16): p. 4524-4532.
5. Terunuma, A., et al., MYC-driven accumulation of 2-hydroxyglutarate is associated with breast cancer prognosis. J Clin Invest, 2014. 124(1): p. 398-412.
6. Nounou, M.I., et al., Breast Cancer: Conventional Diagnosis and Treatment Modalities and Recent Patents and Technologies. Breast Cancer (Auckl), 2015. 9(Suppl 2): p. 17-34.
7. Subramani, R., et al., Metabolomics of Breast Cancer: A Review. Metabolites, 2022. 12(7).
8. Ahn, S., et al., Metabolomic Rewiring Promotes Endocrine Therapy Resistance in Breast Cancer. Cancer Res, 2024. 84(2): p. 291-304.
9. Arnold, J.M., et al., Correction: UDP-glucose 6-dehydrogenase regulates hyaluronic acid production and promotes breast cancer progression. Oncogene, 2020. 39(15): p. 3226-3228.
10. Jaggupilli, A., et al., Metabolic stress induces GD2(+) cancer stem cell-like phenotype in triple-negative breast cancer. Br J Cancer, 2022. 126(4): p. 615-627.
11. Mishra, P., et al., ADHFE1 is a breast cancer oncogene and induces metabolic reprogramming. J Clin Invest, 2018. 128(1): p. 323-340.
12. Murthy, D., et al., CD24 negativity reprograms mitochondrial metabolism to PPARalpha and NF-kappaB-driven fatty acid beta-oxidation in triple-negative breast cancer. Cancer Lett, 2024. 587: p. 216724.
13. Park, J.H., et al., Fatty Acid Oxidation-Driven Src Links Mitochondrial Energy Reprogramming and Oncogenic Properties in Triple-Negative Breast Cancer. Cell Rep, 2016. 14(9): p. 2154-2165.
14. Purwaha, P., et al., Unbiased Lipidomic Profiling of Triple-Negative Breast Cancer Tissues Reveals the Association of Sphingomyelin Levels with Patient Disease-Free Survival. Metabolites, 2018. 8(3).
15. Putluri, N., et al., Pathway-centric integrative analysis identifies RRM2 as a prognostic marker in breast cancer associated with poor survival and tamoxifen resistance. Neoplasia, 2014. 16(5): p. 390-402.
16. Ghavidel, A. and P. Pazos, Machine learning (ML) techniques to predict breast cancer in imbalanced datasets: a systematic review. J Cancer Surviv, 2023.
17. Shakeel, C.S. and S.J. Khan, Machine learning (ML) techniques as effective methods for evaluating hair and skin assessments: A systematic review. Proc Inst Mech Eng H, 2024. 238(2): p. 132-148.
18. Goncalves, F.A., et al., Energy Metabolic Profile in Oral Potentially Malignant Disorders and Oral Squamous Cell Carcinoma: A Preliminary Landscape of Warburg Effect in Oral Cancer. Mol Carcinog, 2025. 64(1): p. 126-137.
19. Qi, Y., et al., Coptisine improves LPS-induced anxiety-like behaviors by regulating the Warburg effect in microglia via PKM2. Biomed Pharmacother, 2025. 183: p. 117837.
20. Yang, J., et al., Isocitrate dehydrogenase 2 mutation promotes cytarabine resistance in acute myeloid leukemia by Warburg effect. Hematol Oncol, 2024. 42(6): p. e3316.
21. Kukurugya, M.A., S. Rosset, and D.V. Titov, The Warburg Effect is the result of faster ATP production by glycolysis than respiration. Proc Natl Acad Sci U S A, 2024. 121(46): p. e2409509121.
22. Cantor, J.R. and D.M. Sabatini, Cancer cell metabolism: one hallmark, many faces. Cancer Discov, 2012. 2(10): p. 881-98.
23. Yang, L., S. Venneti, and D. Nagrath, Glutaminolysis: A Hallmark of Cancer Metabolism. Annu Rev Biomed Eng, 2017. 19: p. 163-194.
24. Balakrishnan, P., et al., Ceramide and N,N,N-Trimethylphytosphingosine-Iodide (TMP-I)-Based Lipid Nanoparticles for Cancer Therapy. Pharm Res, 2016. 33(1): p. 206-16.
25. Jung, J.H., et al., Characterization of Lipid Alterations by Oncogenic PIK3CA Mutations Using Untargeted Lipidomics in Breast Cancer. OMICS, 2023. 27(7): p. 327-335.
26. Ulaner, G.A. and D.M. Schuster, Amino Acid Metabolism as a Target for Breast Cancer Imaging. PET Clin, 2018. 13(3): p. 437-444.
27. Dai, Y.W., et al., Amino Acid Metabolism-Related lncRNA Signature Predicts the Prognosis of Breast Cancer. Front Genet, 2022. 13: p. 880387.
28. Cha, Y.J., E.S. Kim, and J.S. Koo, Amino Acid Transporters and Glutamine Metabolism in Breast Cancer. Int J Mol Sci, 2018. 19(3).
29. Lin, S., et al., Depression promotes breast cancer progression by regulating amino acid neurotransmitter metabolism and gut microbial disturbance. Clin Transl Oncol, 2024. 26(6): p. 1407-1418.
30. Kim, E.S., et al., Effect of oncogene activating mutations and kinase inhibitors on amino acid metabolism of human isogenic breast cancer cells. Mol Biosyst, 2015. 11(12): p. 3378-86.
31. Zhao, Y., et al., Essential amino acid metabolism-related molecular classification in triple-negative breast cancer. Epigenomics, 2021. 13(16): p. 1247-1268.
32. Dastych, M., et al., Impact of breast cancer neoadjuvant chemotherapy on plasma and urine amino acid profile, plasma proteins and nitrogen metabolism. Scand J Clin Lab Invest, 2024. 84(4): p. 237-244.
33. Sato, M., et al., L-type amino acid transporter 1 is associated with chemoresistance in breast cancer via the promotion of amino acid metabolism. Sci Rep, 2021. 11(1): p. 589.
34. Huynh, T.Y.L., et al., Metformin Treatment or PRODH/POX-Knock out Similarly Induces Apoptosis by Reprograming of Amino Acid Metabolism, TCA, Urea Cycle and Pentose Phosphate Pathway in MCF-7 Breast Cancer Cells. Biomolecules, 2021. 11(12).
35. Ryu, C.S., et al., Sulfur amino acid metabolism in doxorubicin-resistant breast cancer cells. Toxicol Appl Pharmacol, 2011. 255(1): p. 94-102.
36. Zuo, S., et al., Mitochondria-Associated Gene SLC25A32 as a Novel Prognostic and Immunotherapy Biomarker: From Pan-Cancer Multiomics Analysis to Breast Cancer Validation. Anal Cell Pathol (Amst), 2024. 2024: p. 1373659.
37. Karaman, S., et al., Multi-omics characterization of lymphedema-induced adipose tissue resulting from breast cancer-related surgery. FASEB J, 2024. 38(20): p. e70097.
38. Yang, Y., et al., A multi-omics method for breast cancer diagnosis based on metabolites in exhaled breath, ultrasound imaging, and basic clinical information. Heliyon, 2024. 10(11): p. e32115.
39. Bauer, B.A., et al., A Multiomics, Molecular Atlas of Breast Cancer Survivors. Metabolites, 2024. 14(7).
40. Zhang, H. and L. Fu, The role of ALDH2 in tumorigenesis and tumor progression: Targeting ALDH2 as a potential cancer treatment. Acta Pharm Sin B, 2021. 11(6): p. 1400-1411.
41. Jeschke, J., et al., Frequent inactivation of cysteine dioxygenase type 1 contributes to survival of breast cancer cells and resistance to anthracyclines. Clin Cancer Res, 2013. 19(12): p. 3201-11.
42. Ping, Y., et al., Taurine enhances the antitumor efficacy of PD-1 antibody by boosting CD8(+) T cell function. Cancer Immunol Immunother, 2023. 72(4): p. 1015-1027.
43. Wang, X., et al., Targeting Signaling Pathways in Inflammatory Breast Cancer. Cancers (Basel), 2020. 12(9).