Quantitative Structure-Activity Relationships: Linear Regression Modelling and Validation Strategies by Example

—Quantitative structure-activity relationships are mathematical models constructed based on the hypothesis that structure of chemical compounds is related to their biological activity. A linear regression model is often used to estimate and/or to predict the nature of the relationship between a measured activity and some measure or calculated descriptors. Linear regression helps to answer main three questions: does the biological activity depend on structure information; if so, the nature of the relationship is linear; and if yes, how good is the model in prediction of the biological activity of new compound(s). This manuscript presents the steps on linear regression analysis moving from theoretical knowledge to an example conducted on sets of endocrine disrupting chemicals.


I. LINEAR REGRESSION ON QSAR ANALYSIS
Quantitative structure-activity relationships (QSARs) are mathematical models linking chemical structure and pharmacological activity/property in a quantitative manner for a series of compounds [1]. The approaches are based on the assumption that the structure of chemical compounds (such as geometric, topologic, steric, electronic properties, etc.) contains features responsible for its physical, chemical and/or biological properties [2]. This assumption could be summarized as "similar compounds have similar properties" [3].
The two main fields where linear regression analysis found its applicability are drug discovery [4], [5] and toxicology prediction [6], [7]. In both of these fields, the linear regression is used mainly to predict not to estimate (the model is used to quickly determine the activity/property of new/un-investigated compounds) [8].
The linear regression is used in QSAR analysis to linearly link the activity/property of chemical compounds (measured or observed value -outcome variable abbreviated as Y) and some values translated from the structure of the compounds and generally called descriptors (assumed error non-affected independent variables abbreviated as X(s)). The multiple linear regression (MLR) expression is presented in Eq(1): whereŶ = estimated activity/property; b 0 = intercept; b i = coefficient of the i th independent variable / descriptor variable (1 ≤ i ≤ k, 5 × k ≤ n [9]), k = number of descriptors (independent/descriptor variables) in the model, n = number of observations in the sample. The regression coefficients b i could be interpreted as the change in Y when X i increased or decreased by 1 unit when all other independent variables are held constant (b 0 and b 1 estimate the population parameters β 0 and β i , [10]). The identified values of b 0 and b i are calculated to minimize the squared error for all n observations.

A. Linear Regression Assumptions
The main assumptions of linear regression (Table I) could be summarized as: 1) Linearity. The relation between Y and each of descriptors X i are linear. 2) Independence of the errors. Both the experimental values (Y ) and experimental/calculated descriptors (X i ) are measured without errors. 3) Homoscedasticity. The variance of the errors is constant. 4) Normality. The dependent variable (Y ) is normal distributed. 5) Absence of multicolinearity. The independent variables (X i ) are linearly independent of each other. Please note that this constrain did not exclude a certain degree of collinearity. Since it has been recognized that "normal law ... is not valid in a great many cases which are both common and important" [11] a series of transformation could be used to reach normal distribution [29] (see Table II).
1) Model Selection and Diagnostic: Selection of the regression model is an important task that researchers must to accomplish. The main criteria useful in this step are: • Determination coefficient (R 2 ) and its adjustment form (R 2 adj -R 2 adjusted with the number of coefficients in the model → the value will not necessary increase with the addition of X's). Generally, the R 2 increase with the number of parameters in the model but R 2 adj penalizes according to the number of parameters (the model with higher number of descriptors does not necessary has the higher value of R 2 adj ). • Standard error of the estimate: the average error predicting the activity/property of interest by the identified model. • Statistics of overall model performances (F -value and associated p-value): assess the overall ability of a model to explain as much as possible from the observed variability in Y . • Models performances in cross-validation by the leave-one-out analysis. It is say that a model with Q 2 (determination coefficient in cross-validation by the leave-one-out analysis) >0.6 and |R 2 − Q 2 | < 0.1 is a desired model in QSAR analysis [30].
However, the value of F -statistics and its associated probability are as important as Q 2 in assessment of internal validation of a QSAR model. • Mallows C p -statistic (C p = SS res /M S res − n + 2 · (k + 1), k = number of descriptor variables in the model) [31], [32], [33]: measures the overall bias or mean square error in the estimated model parameters. This is a useful parameter when models with different X(s) are compared on the same sample of compounds. A low C p value indicates good model prediction or a model with a small positive/negative discrepancy between C p and (k +1) -could be used in evaluating candidate regression models. • Akaikes information criterion and derivative formulas: assess the degree of fit by involving the goodness-of-fit of the model (R 2 ): Akaike information criterion (AIC = n · ln(RSS/n) + 2 · (k + 1) for the model with intercept and AIC = n · ln(RSS/n) + 2 · k for the model without intercept, where n = sample size, RSS = residual sum of squares; k = number of X i ) [34]; AIC based on the determination coefficient (AIC R2 = ln[(1 − R 2 )/n] + 2 · (k + 1)); McQuarrie and Tsai corrected AIC (AIC u = ln[RSS/(n − k + 1)] + (n + k + 1)/(n − k − 1)) [35]; Bayesian Information Criterion (BIC = n · ln[RSS/(n − k + 1)] + (k + 1) · ln(n)) [36]; Amemiya Prediction Criterion (AP C = RSS/n · (n − k + 1)/(n + k + 1)) [37]; Hannan-Quinn Criterion (HQC = n·ln(RSS/n)+ 2 · (k + 1) · ln[ln(n)] [38]. The smallest the AIC, BIC, AP C and HQC values are the better the model is considered. In addition to AIC values, the Akaike weights are also used in models assessment: [39] where ∆ i = AIC i min(AIC), ∆ i = difference between the AIC of the best fitting model and that of the model i th , min(AIC) = minimum AIC value out of all models, j = the number of the models. • Kubinyi function (F IT ) [40], [41]: The highest the F IT value the better the model is considered.
The diagnosis of a regression model when the dependent variable is continuous could be conducted by analyzing of residuals or rescaled residuals: • Look to the largest and/or smallest experimental values ← detect if the values are in the plausible range. Also look to descriptive statistics value: mean, standard deviation, histogram. for the model without intercept, s i = studentized residuals [43]). Several parameters that can found their usefulness in diagnosis of a MLR are presented in Table III. Several parameters presented in Table III are also used by some authors as measures of model predictivity power (see for example MAE [44]).

B. Model Predictive Power
The ability to predict the activity/property of new compounds is of major importance in QSAR/QSPR analysis. Several parameters were proposed and are used to assess model predictivity power and are presented in Table IV.
The diagnosis of a linear regression model could be conducted using a series of statistical parameters calculated on contingency table [58] after transforma- Stabilize the variance (the variance decrease with the mean of Y) Normalized the dependent variable ← negative skewed distribution of the residuals for Y Linearize the regression model ← the original relation with some independent variable is curvilinear downward (such as decrease of slope with the increase of independent variable) 'arcsine' Y' = asin√Y Stabilize the variance Y is a proportion or a percentage tion of the observed and estimated/predicted logRBA as dichotomial variables using criteria for classification of compounds as active or inactive. The total fraction of compounds correctly classified (parameter called concordance / accuracy / non-error rate) is one parameter that could bring useful information in choosing which model to be applied.
The following descriptors were previously calculated on the investigated structures [59] and were used here to illustrate how linear regression analysis works: TIE = Estate topological parameter; TIC1 = Total information content index (neighbourhood symmetry of 1-order); ATS4m = Broto-Moreau autocorrelation of a topological structure -lag 4 / weighted by atomic masses; EEig02d = Eigenvalue 02 from edge adj. matrix weighted by dipole moments; E1s = 1st component accessibility directional WHIM index / weighted by atomic electrotopological states; and Dv = total accessibility index / weighted by atomic van der Waals volumes.
The first set was used to identify the model and comprised 132 compounds (training set; 1 withdrawn, 60 weak binders, 41 moderate binders and 30 strong binders). The second dataset was used to test the performances of the model (test set) and comprised 23 compounds (3 weak binders, 16 moderate binders and 4 strong binders). The third dataset was used as external validation set and consists of 9 compounds (4 weak binders and 5 moderate binders).

A. MLR in Training Sets
The first step in the linear regression analysis was to investigate the distribution of logRBA in training set. One out of three tests rejected the null hypothesis of normality (Chi-Square statistics = 14.862, p-value = 0.03781). No outlier had been identified when the Grubbs test was applied but there was one compound with studentized residuals higher than 3 standard deviations. The experimental data in training test proved not normal distributed according just with the Chi-Square test (see Table V), the normality test that is known to be affected by the presence of outlier(s) [12], even if in this example no outlier has been identified. The normality was not achieved even by withdrawing that compounds but the correlation coefficient increased from 0.810 to 0.837. The studentized residuals, hat matrix and Cook's distance values were plotted against logRBA to identify how data were distributed (Figure 1). Three models obtained on the same datasets were investigated: The smaller the better The smaller the better TSE > (k+1) → bias due to incompletely specified model TSE< (k+1) → the model is over specified (contains too many variables) Average Prediction Mean Squared Error (APMSE) [47] The smaller the better Mean Absolute Error (MAE) -Measures the average magnitude of the errors; could be also used to compare two models The smaller the better n = sample size; k = number of independent variables in the model; y = the mean of estimated/predicted activity/property; i ŷ = predicted value of the i th compound in the sample; y i = observed/measured activity/property of i th compound; SSE = sum of squared errors; MSE = mean of squared errors full-model (the model comprised all compounds assigned to training test), Di-model (the model comprised just the compounds that did not exceeded the imposed Cooks distance threshold), and hi-model (the model comprised just the compounds that did not exceeded the imposed hat matrix threshold).
The Cook's distance and hat matrix approaches were applied to withdrawn compounds of the training sample until two criteria were accomplished: logRBA proved normal distributed and withdrawing the compound(s) did not led to an improvement in determination coefficient. Both models proved smaller RMSE and RMSEP values.
The characteristics of all investigated models are presented in Table V. The analysis of the models (Table V) revealed that none model proved collinearity (the highest correlation coefficient did not exceeded 0.8 and VIF values are less than 10). The Di-model is twice better in terms of internal validity when the |R 2 − Q 2 | difference is evaluated compared to h i -model and three times better compared to the full-model. The Mallows C p -statistic did not found its applicability in our example because the same descriptors are used in all models. The smallest values of information criteria parameter were systemat-

p = TDIST(abs(t),n TS -1,1)
Evaluate if the mean of residual is statistically different by the expected value (0) n = sample size; v = number of independent variables in the model; y = the mean of observed/measured activity/property; ŷ = the mean of estimated/predicted activity/property; i ŷ = predicted value of the i th compound in the sample; y i = observed/measured activity/property of i th compound; res = mean of residuals; stdev = standard deviation; TR = training set; TS = test set; r 2 m = a metric calculated using observed (y-axis) and estimated/predicted (x-axis)values; r′ 2 m = a metric calculated using observed (x-axis) and estimated/predicted (y-axis)values; r 2 0 = determination coefficient calculating by forcing the origin of axis; Δr 2 m = absolute difference between r 2 m and r′ 2 m ; EXT = external set; abs = absolute value ically obtained by D i -model which was follow by h imodel while the full-model systematically obtained the highest values (see Table V).
The concordance correlation coefficient for training sets had values closed to the correlation coefficients and for all models were higher than 0.80 (see Table 5).
Looking to the weights of Akaike's information criteria, which can be interpreted as probability that a certain model is the best model, it could not be identify any model with robust inference (none of the model had the values of weights higher than 0.9 [61]). The D imodel had the weights around 0.37 that is far away from 0.90 but are a little higher than those obtained by the full model where the weights are around 0.30 or by those obtained by the h i -model which are around 0.32. Recall that the D i -model could be considered the preferred model and from the inspection of the Akaike weights in Table V, this model is 1.2 (w i − AIC R2 ) to 1.4 (w i − AIC c ) times more likely in terms of Kullback-Leible discrepancy, a measure of distance between the probability generated by the model and reality [62], compared with h i -model.
Significant differences between models could also been observed if the BIC and HQC parameters are analyzed; the smallest value of BIC was obtained by D imodel while the smallest value of HQC was obtained by h i -model. The plots of residuals versus predicted values for the investigated models are presented in Figure  2. The analyses of residuals allow to identify if the assumptions of the regression appear to have been met or not (specifically linearity and homoscedascity) -the residual plot look like a horizontal band. Thus, according   to the pattern of the residuals [63], the most appropriate model is the D i -model since the distribution indicates a homoscedastic model. Furthermore, both full-model and h i -model showed evidence of heteroscedascity, the error in estimating logRBA increasing as the value of logRBA increase. However, both these models could be accepted because none of them showed the presence of systematic errors or inadequacy [63]. If assumption of linearity and/or of homoscedascity is violated, the residual plots show an increasing and narrow pattern if systematic error exists or depict a Gaussian trend when the model is inadequate [64]. Other proposed plot methods, such as linear residual plots, show to be useful in identification of non-linearity while squared residual plots proved utility in detection of non-constant variances [65].
The normal probability plots (right graphical representations in Figure 2) can be used to verify normality assumption of the residuals. Figure 2 showed that the hi-model fit better a straight line compared to both fullmodel and D i -model.
The results obtained on our data associated to the statistical parameters useful in model diagnosis introduced in Table III are presented in Table VI. The total square error is the single parameter that has the same value for all models and in all cases is equal to 7 (obtained by adding 1 to the number of descriptors in the model 6 in our example), indicating that none of the models were not over-specified or did not contain bias due to incompletely specified model. The classification of our models based on parameters presented in Table  VI led to the classification obtained according to the parameters presented in Table V: D i -model, h i -model, and full model.
Several parameters were used to assess the predictive power of the models and their results are presented in Table VII. The analysis of results presented in Table VII revealed the followings: • External predictive ability parameter (Q 2 F 3 ) [53] systematically took negative values for both external and withdrawn sets. At least for the external set, this result could be explained by the distribution of logRBA values (min=-3.3, max=-0.6) compared to training (min=-4.5, max=2.6) and test (min=-2.51, max=1.41) sets. It could be also of interest to analyze how different are the compounds containing in external and withdrawn data sets compared to the compounds from training set (in terms of similarity of their structure for example).
• D i -model achieve the criterion of exceeding 0.6 [52] in just one of 6 possible case while the h imodel reach this criterion in four out of 6 cases.
The h i -model accomplished more frequently the criteria of having values higher than 0.6 while the full-model did not accomplished at all this criterion. Thus, it seems that the compounds in test and external sets are uniformly distributed over the range of training set at least in h i -model, in view of the fact that otherwise the Q 2 F 1 and the Q 2 F 2 suffer from drawbacks [66]. • The concordance correlation coefficients obtained values higher than 0.70 in test sets. The abilities of prediction the external sets proved smaller than 0.5 for all investigated models but had values higher than 0.50 (D i -model and h i -model) when the withdrawn set is investigated. m is a parameter computed by forcing the regression through origin [54] with certain applicability and limitations (fails to detect the differences between experimental and predicted values when the slopes of the regression line are not near to 1) [67]. The values of these metrics were smaller than the determination coefficient in all investigated models and the highest value was observed in D i -model when training (see Table V) set was investigated but acceptable values were obtained just by the h i -model when the test set was investigated (r 2 m > 0.5 and ∆r 2 m < 0.2). The classification of the models according to results presented Table VII is as follows: h i -model, D i -model, and full-model.
One remark about the parameters used to assess the predictive power, namely Q 2 F 1 , Q 2 F 2 and Q 2 F 3 , can be made. Even the symbols contain "square", these parameters could take both positive and negative values according to their formula (see Table IV). A simulation study of these parameters needs to be done to identify their possible values as well as their proper interpretation.
The best way to see the abilities of a MLR model is to plot the measured values against the estimated / predicted values to visualize how well each model works (see Figure 3). With one exception, represented by h i -model in external set (p-value = 0.0632), all other correlation coefficients proved statistically significant (p < 0.04).
The analysis of models presented in Figure 3 revealed the followings: • The distribution of compounds in training set is narrower in D i -model compared to both full-model and h i -model.
• D i -model obtained higher determination coefficients in training and external sets while the h imodel obtained the higher determination coefficients in training and withdrawn sets. • The h i -model in more stable compared to D i -model if the difference in determination between training and test set is concerned. • Both D i -model and h i -model performed better in training and test sets compared to full-model. Whenever applicable, the accuracy of a model will show its ability in correct classification of compounds. The overall accuracy as well as the accuracy on each class (weak binder, moderate binder and strong binder) were computed and the obtained results are presented in Figure 4.
The analysis of Figure 4 revealed the followings: • The accuracy of all three models was identical for strong binders in test set (75%) and weak binders in external set (25%). Overall, out of 16 possibilities, all models (full-model, D i -model, and h i -model) proved highest accuracy in almost 38% of cases. • Full-model proved highest overall accuracy in both test and external sets, and highest accuracy for moderate binders in test and external sets. • D i -model proved highest overall accuracy in training set, highest accuracy for strong binders in training set, highest accuracy for weak binders in training set, and highest accuracy of moderate binders in training set. • h i -model proved highest overall accuracy, as well as higher accuracy for weak binders, moderate binders and strong binders for withdrawn compounds. • No model proved abilities in correct classification of weak binders in test set or of strong binders in external set. Regarding the accuracy of investigated models it is impossible to classify them since their performances are generally the same (38%). It could be observed that models had abilities to accurately identify the compounds on average of two sets out of three or four. The absence of accurate classification of weak binders in test set and strong binders in externals set could be explained by differences in the chemical structure or measured logRBA of compounds included in these sets.

III. SUMMARY AND FURHER WORK
Choosing a proper linear model is crucial in QSAR analysis because a model able to predict accurately the activity of interest of new chemical compounds is desired under the hypothesis that changes in molecular structure directly reflect in the compound activity/property. Input data and data preparation for regression analysis are of great importance but these subjects were beyond the aim of the present manuscript.
Linear regression analyses identify in QSAR analysis the linearity between compound's activity and calculated descriptors based on chemical structure. Regression analysis answer to the following questions: Does the biological activity depend on structural information? If so, the nature of the relationship is linear? If yes, how good is the model in prediction of the biological activity of new compounds?
In this manuscript, some rules had been presented: 1 test the assumption of linear regression (normality, linearity, independence, homoscedascity, and/or collinearity); 2 construct the model(s) if assumptions are accomplished -analyze the data (choose the best performing model); 3 assess and diagnose the alternative models -analyze the MLR; 4 decide which model fit best to your objectives.
Following these steps in linear regression analysis certainly led to a performing estimation model but the prediction power of the model will always depend on the structure of compounds and their biological activity on which the model is used to predict; in other words, will be dependent by similarity in terms of structure and activity.
Researches on linear regression analysis are of general interest since MLR found its applicability in many research fields. The classical approach implemented in available dedicated software deal with maximization of correlation coefficient. Maximization of the observed probability under assumption of random error affecting all variables in the model is an ongoing research and will be reported somewhere else. It is known that the classical method is exposed to type I errors (to accept a regression model obtained by maximization of determination correlation even if it does not exist) while this new approach does not because it maximize just the observation chance having as hypothesis that the errors between observed value and value obtained by the model is random and depend just by the observed/measured value (therefore being symmetric relative to its arithmetic mean).