https://doi.org/10.25678/0007QS

Data and Workflows for: Machine learning-based hazard-driven prioritization of features in nontarget screening of environmental high-resolution mass spectrometry data

Nontarget high-resolution mass spectrometry screening (NTS HRMS/MS) can detect thousands of organic substances in environmental samples. However, new strategies are needed to focus time-intensive identification efforts on features with the highest potential to cause adverse effects instead of the most abundant ones. To address this challenge, we developed MLinvitroTox, a machine-learning framework that uses molecular fingerprints derived from fragmentation spectra (MS2) for a rapid classification of thousands of unidentified HRMS/MS features as toxic/nontoxic based on nearly 400 target-specific and over 100 cytotoxic endpoints from ToxCast/Tox21. Model development results demonstrated that, with customized molecular fingerprints and models, over a quarter of toxic endpoints and the majority of associated mechanistic targets could be accurately predicted with sensitivities exceeding 0.95. Notably, the combination of SIRIUS molecular fingerprints and XGBoost (Extreme Gradient Boosting) models with SMOTE (Synthetic Minority Over-sampling Technique) for handling data imbalance was a universally successful and robust modeling configuration. Validation of MLinvitroTox on MassBank spectra showed that toxicity could be predicted from molecular fingerprints derived from MS2 with an average balanced accuracy of 0.75. By applying MLinvitroTox to environmental HRMS/MS data, we confirmed the experimental results obtained with targeted analysis and narrowed the analytical focus from tens of thousands of detected signals to 783 features linked to potential toxicity, including 109 spectral matches and 30 compounds with confirmed toxic activity.
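
The description above reports model quality as sensitivity and balanced accuracy. As a reading aid, the following minimal Python sketch shows how these two metrics are conventionally computed for a binary toxic/nontoxic classifier. It assumes the standard textbook definitions (sensitivity = TP / (TP + FN); balanced accuracy = mean of sensitivity and specificity); the function names and the example labels are hypothetical and are not taken from the MLinvitroTox workflows.

```python
# Illustrative sketch only: standard definitions of the metrics named in the
# description. Function names and example labels are hypothetical.

def sensitivity(y_true, y_pred):
    """Fraction of truly toxic items (label 1) that were predicted toxic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("no positive (toxic) labels in y_true")
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    """Fraction of truly nontoxic items (label 0) that were predicted nontoxic."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tn + fp == 0:
        raise ValueError("no negative (nontoxic) labels in y_true")
    return tn / (tn + fp)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; less inflated by class imbalance."""
    return (sensitivity(y_true, y_pred) + specificity(y_true, y_pred)) / 2


# Hypothetical, imbalanced endpoint: 2 toxic and 8 nontoxic compounds.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("sensitivity:", sensitivity(y_true, y_pred))
print("specificity:", specificity(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
print("plain accuracy:", plain_accuracy)
```

In this made-up example the plain accuracy is 0.9 although only half of the toxic compounds are recovered, which is why balanced accuracy and sensitivity are the more informative figures when toxic labels are rare.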

Data and Resources

Citation

This Data Package

Arturi, K., & Hollender, J. (2023). Data and Workflows for: Machine learning-based hazard-driven prioritization of features in nontarget screening of environmental high-resolution mass spectrometry data (Version 1.0). Eawag: Swiss Federal Institute of Aquatic Science and Technology. https://doi.org/10.25678/0007QS

The associated article

Arturi, K., & Hollender, J. (2023). Machine Learning-Based Hazard-Driven Prioritization of Features in Nontarget Screening of Environmental High-Resolution Mass Spectrometry Data. Environmental Science & Technology. https://doi.org/10.1021/acs.est.3c00304

Metadata

Open Data
Long-term data
Author(s)
  • Arturi, Kasia
  • Hollender, Juliane
Keywords data mining, ToxCast, Tox21, toxicity prediction, environmental pollution, supervised classification, extreme gradient boosting, SIRIUS, fingerprints, machine learning, invitroDB, Nontarget screening, HRMSMS
Substances (generic terms)
  • CompTox Dashboard
  • invitroDB
  • Tox21 and ToxCast
Timerange
  • 2020-09-01 TO 2023-02-01
Review Level general
Curator Arturi, Kasia
Contact Hollender, Juliane <Juliane.Hollender@eawag.ch>
DOI 10.25678/0007QS