TY - JOUR
T1 - Monte Carlo peaks
T2 - Simulated datasets to benchmark machine learning algorithms for clinical spectroscopy
AU - Béjar-Grimalt, Jaume
AU - Sánchez-Illana, Ángel
AU - Quintás, Guillermo
AU - Byrne, Hugh J.
AU - Pérez-Guaita, David
N1 - Publisher Copyright: © 2025 The Authors
PY - 2025/12/15
Y1 - 2025/12/15
N2 - Infrared and Raman spectroscopy hold great promise for clinical applications. However, the inherent complexity of the associated spectral data necessitates the use of advanced machine learning techniques which, while powerful in extracting biological information, often operate as black-box models. Combined with the absence of standardized datasets, this hinders model optimization, interpretability, and the systematic benchmarking of the growing number of newly developed machine learning methods. To address this, we propose a simulation-based framework for generating fully synthetic spectral datasets using Monte Carlo approaches for benchmarking. The artificial datasets mimic a wide range of realistic scenarios, including overlapping spectral markers and non-discriminant features, and can be adjusted to simulate the effect of different parameters, such as instrumental noise, number of interferences, and sample size. These spectra are simulated through the generation of Lorentzian bands across the mid-infrared range, without specific reference to experimental data or chemical structures. We used the proposed methodology to compare different spectral marker identification protocols in partial least squares discriminant analysis (PLS-DA), showing that the orthogonal PLS-DA (OPLS-DA) approach, when combined with marker selection based on VIP scores or the regression vector, yielded higher sensitivity, specificity, and interpretability than standard PLS-DA using the same selection criteria. This framework was further used to benchmark the classification capabilities of commonly employed machine learning algorithms, incorporating both linear and non-linear markers reflective of compositional variations across the target classes. Key findings were validated using real infrared spectra from human blood serum and saliva collected as part of a clinical study. Overall, the proposed approach provides a versatile sandbox environment for the systematic evaluation of data analysis strategies in vibrational spectroscopy, which can help experimentalists to better interpret spectral markers and data analysts to benchmark and validate new algorithms.
AB - Infrared and Raman spectroscopy hold great promise for clinical applications. However, the inherent complexity of the associated spectral data necessitates the use of advanced machine learning techniques which, while powerful in extracting biological information, often operate as black-box models. Combined with the absence of standardized datasets, this hinders model optimization, interpretability, and the systematic benchmarking of the growing number of newly developed machine learning methods. To address this, we propose a simulation-based framework for generating fully synthetic spectral datasets using Monte Carlo approaches for benchmarking. The artificial datasets mimic a wide range of realistic scenarios, including overlapping spectral markers and non-discriminant features, and can be adjusted to simulate the effect of different parameters, such as instrumental noise, number of interferences, and sample size. These spectra are simulated through the generation of Lorentzian bands across the mid-infrared range, without specific reference to experimental data or chemical structures. We used the proposed methodology to compare different spectral marker identification protocols in partial least squares discriminant analysis (PLS-DA), showing that the orthogonal PLS-DA (OPLS-DA) approach, when combined with marker selection based on VIP scores or the regression vector, yielded higher sensitivity, specificity, and interpretability than standard PLS-DA using the same selection criteria. This framework was further used to benchmark the classification capabilities of commonly employed machine learning algorithms, incorporating both linear and non-linear markers reflective of compositional variations across the target classes. Key findings were validated using real infrared spectra from human blood serum and saliva collected as part of a clinical study. Overall, the proposed approach provides a versatile sandbox environment for the systematic evaluation of data analysis strategies in vibrational spectroscopy, which can help experimentalists to better interpret spectral markers and data analysts to benchmark and validate new algorithms.
KW - Clinical spectroscopy
KW - Machine learning
KW - Monte Carlo
KW - Synthetic data
KW - Vibrational spectroscopy
UR - https://www.scopus.com/pages/publications/105018666011
U2 - 10.1016/j.chemolab.2025.105548
DO - 10.1016/j.chemolab.2025.105548
M3 - Article
AN - SCOPUS:105018666011
SN - 0169-7439
VL - 267
JO - Chemometrics and Intelligent Laboratory Systems
JF - Chemometrics and Intelligent Laboratory Systems
M1 - 105548
ER -