Monte Carlo peaks: Simulated datasets to benchmark machine learning algorithms for clinical spectroscopy

Jaume Béjar-Grimalt, Ángel Sánchez-Illana, Guillermo Quintás, Hugh J. Byrne, David Pérez-Guaita

Research output: Contribution to journal › Article › peer-review

Abstract

Infrared and Raman spectroscopy hold great promise for clinical applications. However, the inherent complexity of the associated spectral data necessitates the use of advanced machine learning techniques which, while powerful in extracting biological information, often operate as black-box models. Combined with the absence of standardized datasets, this hinders model optimization, interpretability, and the systematic benchmarking of the growing number of newly developed machine learning methods. To address this, we propose a simulation-based framework for generating fully synthetic spectral datasets using Monte Carlo approaches for benchmarking. The artificial datasets mimic a wide range of realistic scenarios, including overlapping spectral markers and non-discriminant features, and can be adjusted to simulate the effect of different parameters, such as instrumental noise, number of interferences, and sample size. These spectra are simulated through the generation of Lorentzian bands across the mid-infrared range, without specific reference to experimental data or chemical structures. We used the proposed methodology to compare different spectral marker identification protocols in partial least squares discriminant analysis (PLS-DA), showing that the orthogonal PLS-DA (OPLS-DA) approach, when combined with marker selection based on VIP scores or the regression vector, yielded higher sensitivity, specificity, and interpretability than standard PLS-DA using the same selection criteria. This framework was further used to benchmark the classification capabilities of commonly employed machine learning algorithms, incorporating both linear and non-linear markers reflective of compositional variations across the target classes. Key findings were validated using real infrared spectra from human blood serum and saliva collected as part of a clinical study.
Overall, the proposed approach provides a versatile sandbox environment for the systematic evaluation of data analysis strategies in vibrational spectroscopy, helping experimentalists to better interpret spectral markers and data analysts to benchmark and validate new algorithms.
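
As a rough illustration of the kind of simulation the abstract describes, the following Python sketch builds two-class synthetic spectra as sums of randomly placed Lorentzian bands with added Gaussian noise. It is not the authors' implementation: the wavenumber range, band counts, width and height distributions, the 10% sample-to-sample variation, and the names `lorentzian`, `simulate_dataset`, and `effect` are all assumptions chosen for illustration.

```python
import numpy as np


def lorentzian(x, center, fwhm, height):
    """Lorentzian band whose maximum value `height` occurs at `center`."""
    hwhm = fwhm / 2.0  # half width at half maximum
    return height * hwhm**2 / ((x - center) ** 2 + hwhm**2)


def simulate_dataset(n_per_class=50, n_bands=20, noise_sd=0.01,
                     effect=0.2, seed=0):
    """Return (x, spectra, labels, centers) for a two-class synthetic dataset.

    All numeric ranges below are illustrative assumptions, not values
    taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.arange(900.0, 1800.0, 2.0)  # assumed wavenumber axis, cm^-1

    # Monte Carlo draw of the band parameters shared by all samples
    centers = rng.uniform(x.min(), x.max(), n_bands)
    widths = rng.uniform(10.0, 40.0, n_bands)
    base_heights = rng.uniform(0.1, 1.0, n_bands)

    labels = np.repeat([0, 1], n_per_class)

    # Sample-to-sample variation in band intensity (assumed 10% relative SD)
    heights = base_heights * rng.normal(1.0, 0.1, (labels.size, n_bands))

    # Band 0 acts as the discriminant marker: raised in class 1 only.
    # The remaining bands are non-discriminant and may overlap the marker.
    heights[labels == 1, 0] *= 1.0 + effect

    # Broadcast to shape (samples, bands, points), then sum over bands
    bands = lorentzian(x[None, None, :],
                       centers[None, :, None],
                       widths[None, :, None],
                       heights[:, :, None])
    noise = rng.normal(0.0, noise_sd, (labels.size, x.size))
    spectra = bands.sum(axis=1) + noise
    return x, spectra, labels, centers


if __name__ == "__main__":
    x, spectra, labels, centers = simulate_dataset()
    print(spectra.shape, "marker band centred at", round(centers[0], 1))
```

Because the position of the marker band is known by construction, a marker selection method applied to `spectra` can be scored directly against `centers[0]`, and parameters such as `noise_sd`, `n_bands`, or `n_per_class` can be varied to probe how each one affects the result.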

Original language: English
Article number: 105548
Journal: Chemometrics and Intelligent Laboratory Systems
Volume: 267
Publication status: Published - 15 Dec 2025

Keywords

  • Clinical spectroscopy
  • Machine learning
  • Monte Carlo
  • Synthetic data
  • Vibrational spectroscopy
