Computational Toxicology - Risk Assessment for Chemicals


By: Sean Ekins

Wiley, 2018

ISBN: 9781119282587, 432 pages



Chapter 1
Accessible Machine Learning Approaches for Toxicology


Sean Ekins1, Alex M. Clark2, Alexander L. Perryman3, Joel S. Freundlich3,4, Alexandru Korotcov5 and Valery Tkachenko6

1Collaborations Pharmaceuticals, Inc., Raleigh, NC, USA

2Molecular Materials Informatics, Inc., Montreal, Quebec, Canada

3Department of Pharmacology & Physiology, New Jersey Medical School, Rutgers University, Newark, NJ, USA

4Division of Infectious Disease, Department of Medicine and the Ruy V. Lourenço Center for the Study of Emerging and Re-emerging Pathogens, New Jersey Medical School, Rutgers University, Newark, NJ, USA

5Gaithersburg, MD, USA

6Rockville, MD, USA

1.1 Introduction


Computational approaches have in recent years played an increasingly important role in the drug discovery process within large pharmaceutical firms. Virtual screening of compounds using ligand-based and structure-based methods to predict potency enables more efficient use of high-throughput screening (HTS) resources by enriching the set of compounds physically screened with those more likely to yield hits [1–4]. Computation of absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) properties using statistical techniques greatly reduces the number of expensive assays that must be performed, making it practical to consider these factors very early in the discovery process and thereby minimize late-stage failures of potent lead compounds that are not drug-like [5–11]. Large pharma have successfully integrated these in silico methods into operational practice, validated them, and realized their benefits, because these firms have (i) expensive commercial software to build models, (ii) large, diverse proprietary datasets based on consistent experimental protocols to train and test the models, and (iii) staff with extensive computational and medicinal chemistry expertise to run the models and interpret the results. Drug discovery efforts centered in universities, foundations, government laboratories, and small biotechnology companies, however, generally lack these three critical resources and, as a result, have yet to exploit the full benefits of in silico methods. For close to a decade, we have applied machine learning approaches and evaluated how to circumvent these limitations so that others can benefit from current and emerging industry best practices.

The current practice in pharma is to integrate in silico predictions into a combined workflow together with in vitro assays to find “hits” that can then be reconfirmed and optimized [12]. The incremental cost of a virtual screen is minimal, and the savings compared with a physical screen are magnified if the compound would also need to be synthesized rather than purchased from a vendor. If the blind hit rate against a given library is 1%, and an in silico model can pre-filter the library to give an experimental hit rate of 2%, then significant resources are freed up to focus on other promising regions of chemical property space [13]. Our past pharmaceutical collaborations [14, 15] have suggested that computational approaches are critical to making drug discovery more efficient.
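The arithmetic behind that enrichment argument can be made concrete. A minimal sketch: the 1% and 2% hit rates come from the text, while the helper function names are illustrative, not from any published workflow.

```python
def enrichment_factor(filtered_hit_rate: float, blind_hit_rate: float) -> float:
    """Fold improvement in hit rate gained by in silico pre-filtering."""
    return filtered_hit_rate / blind_hit_rate

def compounds_per_hit(hit_rate: float) -> float:
    """Expected number of compounds physically screened per confirmed hit."""
    return 1.0 / hit_rate

blind = 0.01     # 1% blind hit rate against the library
filtered = 0.02  # 2% experimental hit rate after model pre-filtering

print(enrichment_factor(filtered, blind))  # 2.0-fold enrichment
print(compounds_per_hit(blind))            # 100 compounds screened per hit, blind
print(compounds_per_hit(filtered))         # 50 compounds screened per hit, pre-filtered
```

Halving the number of compounds screened per hit is where the savings come from, and the effect compounds further when each screened compound must be synthesized rather than purchased.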

The relatively high cost of in vivo and in vitro screening of the ADME and toxicity properties of molecules has motivated our efforts to develop in silico methods to filter and select a subset of compounds for testing. By relying on very large, internally consistent datasets, large pharma has succeeded in developing highly predictive proprietary models [5–8]. At Pfizer (and probably other companies), for example, many of these models (e.g., those that predict the volume of distribution, aqueous kinetic solubility, acid dissociation constant, and distribution coefficient) [5–8, 16] are believed (according to discussions with scientists) to be so accurate that they have essentially put the corresponding experimental assays out of business. In most other cases, large pharma perform experimental assays for a small fraction of compounds of interest to augment or validate their computational models. Efforts by smaller pharma and academia have not been as successful, largely because they have, by necessity, drawn upon much smaller datasets and, in a few cases, tried to combine them [11, 17–22]. This is changing rapidly, however, and public datasets in PubChem, ChEMBL, Collaborative Drug Discovery (CDD), and elsewhere are becoming available for ADME/Tox properties. For example, the CDD public database has >100 public datasets that can be used to generate community-based models, including extensive neglected infectious disease structure–activity relationship (SAR) datasets (malaria, tuberculosis, Chagas disease, etc.) and ADMEdata.com datasets that are broadly applicable to many projects. Recent efforts with CDD have led to a platform that enables drug discovery projects to benefit from open source machine learning algorithms and descriptors in a secure environment, allowing models to be shared with collaborators or made accessible to the community.

In the area of pharmaceutical research and development, and specifically that of cheminformatics, there are many machine learning methods, such as support vector machines (SVM), k-nearest neighbors, naïve Bayesian classifiers, and decision trees [23], which have seen increasing use as our datasets have grown to become “big data” [24–27]. These methods [23] can be used for binary classification, multiple classes, or continuous data. In recent years, the biological data amassed from HTS and high-content screens have called for different tools that can account for some of the issues with this bigger data [26]. Many of the resulting machine learning models can also be implemented on a mobile phone [28, 29].
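One of the methods named above, k-nearest neighbors, is simple enough to sketch from scratch for the binary classification case. The sketch below operates on binary fingerprints (represented as sets of "on" bit positions) compared with the Tanimoto coefficient; the fingerprints, labels, and k value are toy illustrations, not drawn from any dataset in the text.

```python
from collections import Counter

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def knn_classify(query: set, training: list, k: int = 3):
    """Majority vote over the k training fingerprints most similar to the query.

    training: list of (bit_set, label) pairs; query: a bit set.
    """
    neighbors = sorted(training, key=lambda t: tanimoto(query, t[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy fingerprints labeled active (1) or inactive (0).
train = [
    ({1, 2, 3, 5}, 1),
    ({1, 2, 3, 8}, 1),
    ({7, 9, 11, 13}, 0),
    ({7, 9, 12, 14}, 0),
]
print(knn_classify({1, 2, 3, 6}, train))  # 1: the query resembles the active cluster
```

The same query/training structure generalizes to the other classifiers listed; only the decision rule (hyperplane, probability, or tree split) changes.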

1.2 Bayesian Models


Our machine learning experience over a decade [14, 30–46] has focused on Bayesian approaches (Figure 1.1). Bayesian models classify data as active or inactive on the basis of user-defined thresholds, using a simple probabilistic classification model based on Bayes' theorem. We initially used the Bayesian modeling software within Pipeline Pilot and Discovery Studio (BIOVIA) with many ADME/Tox and drug discovery datasets. Most of these models have used molecular function class fingerprints of maximum diameter 6 and several other simple descriptors [47, 48]. The models were internally validated through the generation of receiver operating characteristic (ROC) plots. We have also compared single- and dual-event Bayesian models utilizing published screening data [49, 50]. As an example, the single-event models use only whole-cell antitubercular activity, either at a single compound concentration or as a dose–response IC50 or IC90 (the amount of compound inhibiting 50% or 90% of growth, respectively), while the dual-event models also use a selectivity index (SI = CC50/IC90, where CC50 is the compound concentration that is cytotoxic to 50% of Vero cells). While single-event models [13, 51, 52] are widely published, dual-event models [53] attempt to predict active compounds with acceptable relative activity against the pathogen (in this case, Mycobacterium tuberculosis, Mtb) versus a model mammalian cell line (e.g., Vero cells). Our models identified 4–10 times more active compounds than random screening did and also achieved relatively high hit rates for Mtb, for example, 14% [54], 71% (Figure 1.1) [53], or intermediate values [55]. Recent machine learning work on Chagas disease has identified in vivo active compounds [56], one of which is an approved antimalarial in Europe.
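The dual-event labeling described above can be expressed directly in code. The SI = CC50/IC90 definition comes from the text; the numeric cutoffs below are hypothetical placeholders for illustration only, since the text does not state the thresholds used.

```python
def selectivity_index(cc50: float, ic90: float) -> float:
    """SI = CC50 / IC90: Vero-cell cytotoxicity over antitubercular potency."""
    return cc50 / ic90

def dual_event_active(ic90: float, cc50: float,
                      ic90_cutoff: float = 10.0, si_cutoff: float = 10.0) -> bool:
    """Dual-event label: potent against the pathogen AND selective vs mammalian cells.

    The IC90 <= 10 and SI >= 10 cutoffs are hypothetical, chosen only to
    illustrate how two events combine into a single training label.
    """
    return ic90 <= ic90_cutoff and selectivity_index(cc50, ic90) >= si_cutoff

print(selectivity_index(cc50=100.0, ic90=2.0))  # 50.0
print(dual_event_active(ic90=2.0, cc50=100.0))  # True: potent and selective
print(dual_event_active(ic90=2.0, cc50=5.0))    # False: potent but cytotoxic
```

The third call shows why a single-event model trained only on IC90 would have labeled this compound active even though its cytotoxicity makes it a poor lead.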
Most recently, we have been actively constructing Bayesian models for ADME properties such as aqueous solubility, mouse liver microsomal stability [57], and Caco-2 cell permeability [30], which complement our earlier ADME/Tox machine learning work [13, 52, 58–64]. We have also summarized the application of these methods to toxicology datasets [58] and transporters [34, 59, 62, 63, 65–67]. This has led to models with generally good to acceptable ROC scores of >0.7 [30]. Open source implementations of the ECFP6/FCFP6 fingerprints [28] and of a Bayesian model building module [25, 30] have also enabled their use in new software (see below). We are keen to explore machine learning algorithms and make them accessible for seeding drug discovery projects, as we have demonstrated.
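The ROC score threshold cited above (>0.7) refers to the area under the ROC curve, which for a small validation set can be computed directly via the rank-sum (Mann–Whitney) formulation: the probability that a randomly chosen active outscores a randomly chosen inactive. The scores and labels below are toy values, not from the models in the text.

```python
def roc_auc(labels, scores):
    """AUC as P(random active outscores random inactive); ties count as half.

    labels: 1 for active, 0 for inactive; scores: model predictions.
    """
    actives = [s for lab, s in zip(labels, scores) if lab == 1]
    inactives = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))

# Toy validation set: actives mostly (but not always) score higher.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(roc_auc(labels, scores))  # 8/9, about 0.89: "good to acceptable" by the >0.7 rule of thumb
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why >0.7 serves as a workable floor for a usable classifier.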

Figure 1.1 Summary of machine learning models generated for Mycobacterium tuberculosis in vitro data. This approach has also been applied to ADME/Tox datasets.

1.2.1 CDD Models

We, with collaborators [30], and others have modeled ADME properties using an array of machine learning algorithms, such as SVMs [68], Bayesian modeling [69], Gaussian processes [70], and other methods [71]. A major challenge remains the ability to share such models. CDD has developed and marketed a robust, innovative commercial software platform that enables scientists to archive, mine, and (optionally) share SAR, ADME/Tox, and other types of preclinical research data [72]. CDD hosts the software and customers' data vaults on its secure servers. CDD collaborated with computational chemists at Pfizer in a proof of...