
Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study (BMC Bioinformatics)

 
 

27 August 2014 09:16:18

 
 


Background: State-of-the-art protein-ligand docking methods are generally limited by the traditionally low accuracy of their scoring functions, which are used to predict binding affinity and thus vital for discriminating between active and inactive compounds. Despite intensive research over the years, classical scoring functions have reached a plateau in their predictive performance. These assume a predetermined additive functional form for some sophisticated numerical features, and use standard multivariate linear regression (MLR) on experimental data to derive the coefficients.

Results: In this study we show that such a simple functional form is detrimental for the prediction performance of a scoring function, and replacing linear regression by machine learning techniques like random forest (RF) can improve prediction performance. We investigate the conditions of applying RF under various contexts and find that given sufficient training samples RF manages to comprehensively capture the non-linearity between structural features and measured binding affinities. Incorporating more structural features and training with more samples can both boost RF performance. In addition, we analyze the importance of structural features to binding affinity prediction using the RF variable importance tool. Lastly, we use Cyscore, a top performing empirical scoring function, as a baseline for comparison study.

Conclusions: Machine-learning scoring functions are fundamentally different from classical scoring functions because the former circumvents the fixed functional form relating structural features with binding affinities. RF, but not MLR, can effectively exploit more structural features and more training samples, leading to higher prediction performance. The future availability of more X-ray crystal structures will further widen the performance gap between RF-based and MLR-based scoring functions. This further stresses the importance of substituting RF for MLR in scoring function development.
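The contrast the abstract draws between a fixed additive form fitted by MLR and a random forest can be illustrated with a small, self-contained sketch. This is not the authors' code and uses neither Cyscore's features nor any binding-affinity data: the two synthetic features, the target `x1 * x2`, the toy forest, and every function name and parameter below (`fit_mlr`, `fit_forest`, `min_leaf`, `max_depth`, `n_trees`) are assumptions chosen purely for illustration.

```python
import random
import statistics


def fit_mlr(X, y):
    """Least-squares fit of y = w0 + w1*x1 + ... via the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend an intercept column
    p = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    w = [M[i][p] / M[i][i] for i in range(p)]
    return lambda x: w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def build_tree(X, y, rng, min_leaf=5, depth=0, max_depth=8):
    """Regression tree; each split looks at a random subset of the features."""
    n = len(y)
    if n < 2 * min_leaf or depth >= max_depth:
        return statistics.fmean(y)  # leaf: mean target value
    n_feat = len(X[0])
    total, total_sq = sum(y), sum(t * t for t in y)
    best = None  # (sum of squared errors, feature index, threshold)
    for f in rng.sample(range(n_feat), max(1, n_feat // 3)):
        order = sorted(range(n), key=lambda i: X[i][f])
        left_sum = left_sq = 0.0
        for k in range(1, n):  # k = number of samples sent left
            t = y[order[k - 1]]
            left_sum += t
            left_sq += t * t
            if k < min_leaf or n - k < min_leaf:
                continue
            if X[order[k - 1]][f] == X[order[k]][f]:
                continue  # cannot split between equal feature values
            sse = (left_sq - left_sum ** 2 / k) + (
                (total_sq - left_sq) - (total - left_sum) ** 2 / (n - k))
            if best is None or sse < best[0]:
                best = (sse, f, X[order[k - 1]][f])
    if best is None:
        return statistics.fmean(y)
    _, f, thr = best
    left = [i for i in range(n) if X[i][f] <= thr]
    right = [i for i in range(n) if X[i][f] > thr]
    return (f, thr,
            build_tree([X[i] for i in left], [y[i] for i in left],
                       rng, min_leaf, depth + 1, max_depth),
            build_tree([X[i] for i in right], [y[i] for i in right],
                       rng, min_leaf, depth + 1, max_depth))


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf's mean value."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if x[f] <= thr else right
    return node


def fit_forest(X, y, n_trees=30, seed=0):
    """Bagged ensemble of randomized trees; predictions are averaged."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in y]  # bootstrap sample
        trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng))
    return lambda x: statistics.fmean([predict_tree(t, x) for t in trees])


def rmse(model, X, y):
    return statistics.fmean([(model(x) - t) ** 2 for x, t in zip(X, y)]) ** 0.5


data_rng = random.Random(42)


def make_data(n):
    # Synthetic stand-in: the target depends on an interaction between two
    # features, which no additive linear form in x1 and x2 can represent.
    X = [(data_rng.uniform(-2, 2), data_rng.uniform(-2, 2)) for _ in range(n)]
    y = [a * b + data_rng.gauss(0, 0.1) for a, b in X]
    return X, y


X_train, y_train = make_data(400)
X_test, y_test = make_data(200)
rmse_mlr = rmse(fit_mlr(X_train, y_train), X_test, y_test)
rmse_rf = rmse(fit_forest(X_train, y_train), X_test, y_test)
print(f"MLR test RMSE: {rmse_mlr:.3f}")
print(f"RF  test RMSE: {rmse_rf:.3f}")
```

On this toy target the linear fit has nothing to latch onto, while the forest can approximate the interaction through successive splits. The sketch only illustrates the functional-form argument. It says nothing about the magnitude of the gains reported for real protein-ligand complexes.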


 

