- On 16 Sep 2005 at 09:34:00, "Walt Woltosz" (walt.aaa.simulations-plus.com) sent the message

Back to the Top

The following message was posted to: PharmPK

Hans,

Good points about R^2.

There is another paper that should be of interest to this group:

Golbraikh & Tropsha, J Molecular Graphics and Modeling 20 (2002) 269-276.

Q^2 is another abused statistic, especially with respect to predictive models in commercial software. We believe true validation only comes from properly selected external test sets and acceptable values for a combination of statistics, including RMSE, and the slope and intercept between predicted and observed.

I'm the first to admit I'm no statistician, so I would welcome comments on model validation statistics!

Walt Woltosz

Chairman & CEO

Simulations Plus, Inc. (AMEX: SLP)

1220 W. Avenue J

Lancaster, CA 93534-2902

U.S.A.

http://www.simulations-plus.com

Phone: (661) 723-7723

FAX: (661) 723-5524

E-mail: walt.-a-.simulations-plus.com

- On 16 Sep 2005 at 14:26:10, Stanley110.aaa.aol.com sent the message

>I'm the first to admit I'm no statistician, so I would welcome
>comments on model validation statistics!

Hi Walt,

Well, I am a statistician, but I did not think this website was appropriate for that kind of discussion.

Let me just comment that model building or curve fitting is just that: building a model that fits experimental data sufficiently well to make valid predictions in the future. That means that models need not be perfect fits, although perfect fits are most satisfying. Approximate fits are useful provided we know the uncertainty associated with prediction and the uncertainty we can tolerate.

It is not how linear a system is; it is how far it is from linearity that matters for understanding its utility. Very often the deviation from linearity is not sufficient to disqualify the linear model if a better model cannot easily be found. The deviation, which can be quantified, may be acceptably small. The literature is full of linear models that upon review are not truly linear, with the uncertainty unknown to the experimenter. But the misassignment has proven useful because the deviation was small and tolerable.

Least-squares regression over a wide dynamic range of calibrators is fraught with real problems. That is why there are variations on least-squares regression, such as orthogonal (Deming) regression, the Wald line, and others.

I do model building for a living, and I propose to everyone that functional (predictive) models are very acceptable unless a theoretical model is needed to develop underlying mechanisms. I once gave a 2-day seminar on model building to estimate shelf life of drugs that presented these ideas and procedures.

Regards,

Stan Alekman

Stanley L. Alekman, Ph.D.

S.L. Alekman Associates Inc.

Pharmaceutical Consultants

Inverness, Illinois

- On 16 Sep 2005 at 16:33:34, "Walt Woltosz" (walt.-a-.simulations-plus.com) sent the message

Stan,

Thanks for your comments.

I think my post was somewhat misleading. I did not intend it to refer only to linear models, or in fact to linear models at all. In our hands, artificial neural network ensembles and support vector machine ensembles have provided the best results for prediction of a wide variety of properties from molecular structure (the kind of models I had in mind, but did not state). So in general, I was referring to nonlinear models. "Nature is not linear" is one of my favorite expressions, and I think it is generally true for QSPR/QSAR models.

It is in this area that I think there is great inconsistency in model validation statistics. In particular, and IMHO, the use of R (instead of R^2), the use of LOO q^2, the use of MAE instead of RMSE, and so on, all serve to make attractive marketing and sales materials, but can hide the real (lesser) value of the models to which they are applied.

I thought this web site was appropriate because many of its subscribers use structure-property prediction tools as well as PK/PD modeling tools. If there is a better place for this discussion, please advise.

Walt Woltosz

Chairman & CEO

Simulations Plus, Inc. (AMEX: SLP)

1220 W. Avenue J

Lancaster, CA 93534-2902

U.S.A.

http://www.simulations-plus.com

Phone: (661) 723-7723

FAX: (661) 723-5524

E-mail: walt.at.simulations-plus.com

- On 19 Sep 2005 at 10:08:05, "Frederik B. Pruijn" (f.pruijn.aaa.auckland.ac.nz) sent the message

Dear Walt,

I agree with you on external validation, but don't you think that internal cross-validation is better than no validation at all?

Kind regards,

Frederik Pruijn

- On 18 Sep 2005 at 17:47:44, "Walt Woltosz" (walt.-a-.simulations-plus.com) sent the message

Frederik,

>I agree with you on external validation but don't you think that
>internal cross-validation is better than no validation at all?

One would think so - but apparently this is an illusion. To quote from the article by Golbraikh and Tropsha:

". . . As the first example, we consider a well-known group of ligands of corticosteroid binding globulin [16]. This dataset is frequently referred to as a benchmark [17] for the development and testing of novel QSAR methods. In [13], many 3D QSAR models have been built based on the divisions of this dataset into training and test sets and no relationship between high q2 and predictive R2 values was found. In this paper, we employ the k nearest neighbors (kNN) QSAR variable selection method that was recently developed in this laboratory [18]. kNN QSAR uses 2D descriptors of chemical structures such as connectivity indices and atom pairs. We show that the application of this approach to the steroid dataset [16] leads to the same observations as using 3D QSAR: high q2 does not automatically imply a high predictive power of the model. We also consider 2D QSAR models built for two other examples: a set of 78 ecdysteroids [19] and 66 Histamine H1 receptor ligands [20]. In all these cases, we consider training and test sets as they were defined in the original publications. We demonstrate the lack of any relationship between high q2 and predictive R2 in all cases. The lack of this relationship appears to be the common feature of the QSAR methods that must be always taken into account in QSAR studies."

It is this article that triggered my initial post and question. We've been accepting leave-one-out (LOO) q2 as a measure of model quality for some time, yet it appears that it is not at all a reliable indicator. So if not, then what?
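[For readers unfamiliar with the statistic under discussion, here is a minimal sketch (not code from any of the tools mentioned) of how LOO q^2 is computed for a simple least-squares line. The point of Golbraikh and Tropsha's argument is that a q^2 near 1 on the training set need not transfer to an external test set; the function names and toy data below are illustrative only.]

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def loo_q2(xs, ys):
    """Leave-one-out q^2 = 1 - PRESS / SS_total.

    PRESS accumulates the squared error of predicting each point
    from a model refitted without that point."""
    n = len(ys)
    press = 0.0
    for i in range(n):
        xt, yt = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        a, b = fit_line(xt, yt)
        press += (ys[i] - (a * xs[i] + b)) ** 2
    my = sum(ys) / n
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - press / ss_tot

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]
print(loo_q2(xs, ys))  # close to 1 for this nearly linear toy set
```

A high value here says only that the training data predict themselves well under refitting; it says nothing about an external test set, which is the gap the quoted paper documents.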

Walt

Walt Woltosz

Chairman & CEO

Simulations Plus, Inc. (AMEX: SLP)

1220 W. Avenue J

Lancaster, CA 93534-2902

U.S.A.

http://www.simulations-plus.com

Phone: (661) 723-7723

FAX: (661) 723-5524

E-mail: walt.-a-.simulations-plus.com

- On 19 Sep 2005 at 10:43:41, "Hans Proost" (j.h.proost.aaa.rug.nl) sent the message

Dear Walt,

Thank you for pointing to the paper of Golbraikh and Tropsha. I will certainly read the paper carefully. At first glance I saw that this paper mentioned RMSE as 'residual mean square error', whereas Sheiner and Beal used RMSE as 'root mean squared error'.

Best regards,

Hans Proost

Johannes H. Proost

Dept. of Pharmacokinetics and Drug Delivery

University Centre for Pharmacy

Antonius Deusinglaan 1

9713 AV Groningen, The Netherlands

tel. 31-50 363 3292

fax 31-50 363 3247

Email: j.h.proost.-at-.rug.nl

- On 20 Sep 2005 at 12:33:31, "Walt Woltosz" (walt.-at-.simulations-plus.com) sent the message

Hans,

>At first glance I saw that this paper mentioned RMSE as 'residual mean
>square error', whereas Sheiner and Beal used RMSE as 'root mean squared
>error'

I've always used RMSE to refer to root mean square error. After spending some time Googling both terms and not finding much that is definitive, perhaps one of our statisticians would like to explain the difference between these two statistics, and when one or the other is preferred.

Walt

- On 20 Sep 2005 at 18:00:55, (Valeria.Chu.-a-.sanofi-aventis.com) sent the message

I learned a lot from all these discussions, but I'm still not clear what to use to express goodness of fit for a predicted model and observed data. A paper by Jones HM and Houston JB (Drug Metab Dispos 32: 973, 2004) mentioned using the Akaike information criterion (AIC) to check goodness of fit. Any comment?

Thanks,

Valeria

- On 21 Sep 2005 at 10:48:11, (jeroen.elassaiss.-a-.organon.com) sent the message

Valeria,

You wrote:

"I learned a lot from all these discussions but I'm still not clear what to use to express goodness of fit for a predicted model and observed data. A paper by Jones HM and Houston JB (Drug metab dispos 32; 973, 2004) mentioned using Akaike information criterion (AIC) to check goodness of fit. Any comment?"

My $.02 would be not to rely on a single parameter but rather to use the range of statistical tools at hand for your particular problem. See e.g. (for PK or PK/PD models) the chapter on modeling strategies in Gabrielsson & Weiner ('PK/PD Data Analysis: Concepts and Applications', 3rd ed., 2000, Swedish Pharm. Press, Stockholm: Ch. 5). One would at least have to look at the distribution of residuals, evaluate the weighted sum of squares balanced against model complexity in some way, and examine the covariance matrix. Furthermore, one should consider the assumptions generally underlying statistical models, such as independence of samples and error-free estimation of independent variables.

So I am afraid there is no simple answer to your question. The AIC is IMHO just one way of balancing WSS against model complexity*. The Schwarz criterion is a similar parameter. Hierarchical tests, e.g. the F-test, may be preferred in some cases. For all of these, the actual degrees of freedom may or may not be easily defined. And many more methods are available for this kind of evaluation.

How one evaluates a model should therefore be decided with the goal of the model in mind, as has been stated previously in this thread. In the end the performance of a model is determined by how well it accomplishes its task....

Best regards,

Jeroen

Disclaimer: I am also not a statistician

(*) So the AIC assesses one of three major dimensions of model evaluation. Examples of the other two: predicted vs observed plots, and parameter correlation/error.

J. Elassaiss-Schaap

Scientist PK/PD

Organon NV

PO Box 20, 5340 BH Oss, Netherlands

Phone: + 31 412 66 9320

Fax: + 31 412 66 2506

e-mail: jeroen.elassaiss.-at-.organon.com

- On 21 Sep 2005 at 12:46:35, "Hans Proost" (j.h.proost.at.rug.nl) sent the message

Dear Valeria,

I suggest using RMSE (root mean squared error) as a measure of goodness-of-fit. It is easy to calculate (the full name gives the equation!) and easy to interpret; it gives the size of the 'typical' deviation between predicted and true values. In many cases it may be more appropriate to convert the 'error' (difference between predicted and true value) to a relative value by dividing by the true value, e.g. in cases where the values and differences are of different orders of magnitude. After multiplying by 100 you get the RMSE as a percentage of the true value.

Please note that RMSE does not tell the whole story on 'goodness-of-fit'. It does not discriminate between random errors and bias (systematic deviations); it includes both. Therefore it is usually combined with ME (mean error) as a measure of bias. Again, ME is often expressed as a relative value or percentage.
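[As an aside for readers who want to try this: the full names really do give the equations. A minimal sketch (function names are mine, not from any package) of RMSE as overall 'typical' deviation and ME as bias:]

```python
from math import sqrt

def rmse(pred, obs):
    """Root mean squared error: the size of the 'typical' deviation."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def mean_error(pred, obs):
    """Mean error: a measure of bias (systematic deviation)."""
    return sum(p - o for p, o in zip(pred, obs)) / len(obs)

obs = [10.0, 20.0, 40.0, 80.0]
pred = [11.0, 19.0, 42.0, 78.0]
print(rmse(pred, obs))        # ~1.58, in the units of the data
print(mean_error(pred, obs))  # 0.0 here, i.e. no net bias
```

Note how the example data have scatter (nonzero RMSE) but no bias (ME of zero): the two statistics answer different questions, which is why they are usually reported together.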

The Akaike information criterion (AIC) is not really a measure of goodness of fit, but a criterion to discriminate between two models - more specifically, to check whether a more complex model is statistically justified compared to a simpler model.

Best regards,

Hans Proost

Johannes H. Proost

Dept. of Pharmacokinetics and Drug Delivery

University Centre for Pharmacy

Antonius Deusinglaan 1

9713 AV Groningen, The Netherlands

tel. 31-50 363 3292

fax 31-50 363 3247

Email: j.h.proost.-a-.rug.nl

- On 21 Sep 2005 at 09:15:00, "Pereira, Luis" (Luis.Pereira.-at-.bos.mcphs.edu) sent the message

Valeria

Since nobody picked it up, I'll tell you that information criteria, such as the Akaike, or derivations of it such as the Schwarz and Hannan-Quinn criteria, are not goodness-of-fit metrics. They rather provide a way to compare models in terms of information, by balancing explanatory capability against dimensionality. An AIC value estimated for a given fit means nothing if not compared with another from a different model. One may then choose the model with the lowest AIC even though it has a higher sum of squares, for example, since the added goodness-of-fit is probably due to overparameterization.

Luis

--

Luis M. Pereira, Ph.D.

Assistant Professor, Biopharmaceutics and Pharmacokinetics

Massachusetts College of Pharmacy and Health Sciences

179 Longwood Ave, Boston, MA 02115

Phone: (617) 732-2905

Fax: (617) 732-2228

Luis.Pereira.at.bos.mcphs.edu

- On 21 Sep 2005 at 10:14:44, "Wang, Yaning" (WangYA.at.cder.fda.gov) sent the message

I believe RMSE means root mean square error. It is the square root of the mean square error (MSE). MSE is also called the residual mean square. I think 'residual mean square error' is referring to MSE. The relationship between the two is like that of standard deviation and variance. The standard deviation has the same unit as the observed data, so it can be compared with the data directly to evaluate the deviation (e.g. when we calculate CV%). MSE can be used for the likelihood ratio test.

Yaning Wang, Ph.D.

Pharmacometrician

Office of Clinical Pharmacology and Biopharmaceutics

Center of Drug Research and Evaluation

Food and Drug Administration

Office: 301-827-9763

- On 21 Sep 2005 at 12:59:57, "Bonate, Peter" (Peter.Bonate.-a-.genzyme.com) sent the message

I've been reading this GOF discussion with some interest. Assessing the GOF of a model should never, ever be based on a metric like RMSE, AIC, or R**2. RMSE, MSE, AIC, and BIC are not of value in and of themselves because they are just a number. An AIC of 45 or an MSE of 600 is meaningless unless you have a point of reference, which is often a competing model. A high R**2 in and of itself does not say anything, because you can have a very high R**2 in the face of obvious model misspecification.

Model assessment is a holistic approach based on many factors. I would even say that it is a Gestalt process. In the end, the adequacy of a model is quite subjective. One modeler may judge a model adequate but another might not. Use of these statistics is an attempt to quantify a nonquantitative measure.

Peter L. Bonate, PhD, FCP

Director, Pharmacokinetics

Genzyme Corporation

4545 Horizon Hill Blvd

San Antonio, TX 78229

phone: 210-949-8662

fax: 210-949-8219

email: peter.bonate.at.genzyme.com

- On 21 Sep 2005 at 13:35:06, Prah.James.at.epamail.epa.gov sent the message

There is an excellent and succinct discussion of model comparison at this site: http://www.duke.edu/~rnau/compare.htm

James D. Prah, PhD

US EPA

Human Studies Division MD (58B)

Research Triangle Park, NC, 27711

919 966 6244

919 966 6367 FAX

- On 21 Sep 2005 at 14:24:00, RPop.-a-.pharmamedica.com sent the message

Hi All,

Hans Proost wrote:

"I suggest to use RMSE (root mean squared error) as a measure of goodness-of-fit. ..... Akaike information criterion (AIC) is not really a measure of goodness of fit, but a criterion to discriminate between two models. More specifically, to check whether a more complex model is statistically justified compared to a more simple model."

I think this is exactly why one wants to check the "goodness of fit": to discriminate between different models. Including more variables in the model will, most of the time, decrease RMSE. However, the question is whether the observed decrease is statistically significant, and for that RMSE by itself will not suffice. The criterion should be a statistic that can indicate a statistically significant change.

I hope I did not create more confusion!

radu

- On 21 Sep 2005 at 14:23:14, Stanley110.-at-.aol.com sent the message

Would Dr. Bonate care to comment on the adequacy of the Lack of Fit test, which I proposed earlier as a statistical measure for goodness of fit of a regression model? He did not comment on this test. This test is not a Gestalt process or subjective. The Lack of Fit test compares the variability of the residuals of a fitted model with the variability between observations at replicate values of the independent variable. The test assumes that the observations, Y for given X, are independent and normally distributed, and that the distributions of Y have the same variance. A statistically significant Lack of Fit indicates that the fitted model does not adequately fit the data. The test requires repeated observations at X.

One further comment. The adequacy of a fitted model depends on the use of the model. Statistically inadequate models may be very useful.
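[For readers who have not met the test, here is a minimal sketch of the procedure just described, for a straight line fitted to data with replicates at each X. The residual sum of squares is split into pure error (replicate scatter) and lack of fit, and their mean squares are compared as an F ratio against an F(m - p, n - m) table value; function names and data are illustrative, not from any posting.]

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def lack_of_fit_F(xs, ys, pred, p=2):
    """F ratio of lack-of-fit mean square to pure-error mean square.

    p is the number of model parameters (2 for a straight line)."""
    n = len(ys)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    m = len(groups)  # number of distinct X levels
    # Pure-error SS: scatter of replicates around their own level means
    ss_pe = sum(sum((y - sum(g) / len(g)) ** 2 for y in g)
                for g in groups.values())
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, pred))
    ss_lof = ss_res - ss_pe
    return (ss_lof / (m - p)) / (ss_pe / (n - m))

# Duplicate observations at X = 1..4 that actually follow a curve:
xs = [1, 1, 2, 2, 3, 3, 4, 4]
ys = [1.2, 1.0, 4.1, 3.9, 9.2, 8.8, 16.1, 15.9]
a, b = fit_line(xs, ys)
F = lack_of_fit_F(xs, ys, [a * x + b for x in xs])
print(F)  # large F: significant lack of fit for the straight line
```

As the thread notes, the whole construction depends on having replicates at each X; without them the pure-error term (and hence the test) is undefined.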

Regards,

Stan Alekman

S.L. Alekman Associates Inc.

Pharmaceutical Consultants

Inverness, Illinois

- On 21 Sep 2005 at 15:26:13, "Bonate, Peter" (Peter.Bonate.-at-.genzyme.com) sent the message

Dr. Alekman asked me to comment on the Lack of Fit test. There is not a lot of experience in the PK literature using the lack of fit test, simply because it does require replicates. Therefore, I think this test has limited utility to a pharmacokineticist, and I really can't think of a case where I have ever seen it used outside of a textbook.

Peter L. Bonate, PhD, FCP

Director, Pharmacokinetics

Genzyme Corporation

4545 Horizon Hill Blvd

San Antonio, TX 78229

phone: 210-949-8662

email: peter.bonate.-a-.genzyme.com

- On 21 Sep 2005 at 12:36:53, "Walt Woltosz" (walt.-at-.simulations-plus.com) sent the message

Valeria,

As already noted by others, the Akaike Information Criterion is one statistic that can be used to compare models, but it is not an indication of the quality of the models. It is often used to compare, for example, one-, two- and three-compartment PK models to see which one might be favored for a particular set of data. So the intended purpose is to indicate whether an improvement in the error function obtained by adding more parameters could be considered statistically significant.

AIC = (#Pts) * Log(Obj) + 2(#Parameters)

A lower value for AIC is said to indicate a better model.

As you can see, an increase in the log of the objective function (weighted error function) increases AIC, as does the number of fitted parameters in the model. So if you add more parameters, such as going from a one-compartment model to a two-compartment model, and the objective function goes down, but not enough to offset the 2(#Parameters) term, then AIC will increase.

Note that AIC values can only be compared when the objective function values are calculated in the same way (i.e., same observed data points, and same objective function weighting).

The Schwarz Criterion, also mentioned earlier, is a similar function (as with the AIC, lower is better):

Schwarz Criterion (SC) = (#Pts) * Log(Obj) + (#Parameters)*(Log(#Pts))

Slope and intercept of predicted vs observed and observed vs predicted plots are also very useful, as is an examination of residual plots to see if there is any systematic error (indicated by a sinusoidal pattern of residual error above and below zero at the observed data points, rather than a more uniform scattering of + and - errors).
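[The two formulas above are easy to try out. A sketch in code (I assume natural log here; the base is not stated above, and it does not affect which model wins as long as it is used consistently). The numeric values are hypothetical, and 'obj' must be computed identically for every model compared:]

```python
from math import log

def aic(n_pts, obj, n_params):
    """Walt's AIC: (#Pts) * Log(Obj) + 2(#Parameters)."""
    return n_pts * log(obj) + 2 * n_params

def schwarz(n_pts, obj, n_params):
    """Schwarz Criterion: (#Pts) * Log(Obj) + (#Parameters) * Log(#Pts)."""
    return n_pts * log(obj) + n_params * log(n_pts)

# Hypothetical comparison: 12 observations, a 1-compartment fit
# (2 parameters, obj = 8.5) vs a 2-compartment fit (4 parameters, obj = 4.2).
print(aic(12, 8.5, 2), aic(12, 4.2, 4))          # lower AIC favors model 2
print(schwarz(12, 8.5, 2), schwarz(12, 4.2, 4))  # SC penalizes parameters more
```

Here the drop in the objective function is large enough to pay for the two extra parameters under both criteria; had obj only fallen from 8.5 to, say, 7.0, the penalty term would have dominated and AIC would have risen.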

Walt Woltosz

Chairman & CEO

Simulations Plus, Inc. (AMEX: SLP)

1220 W. Avenue J

Lancaster, CA 93534-2902

U.S.A.

http://www.simulations-plus.com

Phone: (661) 723-7723

FAX: (661) 723-5524

E-mail: walt.aaa.simulations-plus.com

- On 21 Sep 2005 at 15:57:26, "Clerk Maxwell" (clerkmaxwell.-a-.hotmail.com) sent the message

Here's some good fun and pithy information on modeling. From A. Bloch, "Murphy's Law: Book Three," Price/Stern/Sloan Publishers Inc., Los Angeles, 1982:

"Golomb's Don'ts of Mathematical Modeling"

(1) Don't believe the 33rd-order consequences of a 1st-order model (catch phrase: cum grano salis). (2) Don't extrapolate beyond the region of fit (catch phrase: don't go off the deep end). (3) Don't apply any model until you understand the simplifying assumptions on which it is based, and can test their applicability (catch phrase: use only as directed). (4) Don't believe that the model is the reality (catch phrase: don't eat the menu). (5) Don't distort reality to fit the model (catch phrase: the 'Procrustes Method'). (6) Don't limit yourself to a single model: more than one may be useful for understanding different aspects of the same phenomenon (catch phrase: legalize polygamy). (7) Don't retain a discredited model (catch phrase: don't beat a dead horse). (8) Don't fall in love with your model (catch phrase: Pygmalion). (9) Don't apply the terminology of subject A to the problems of subject B if it is to the enrichment of neither (catch phrase: new names for old).

- On 21 Sep 2005 at 17:00:54, neeraj-gupta.-at-.uiowa.edu sent the message

Dear Walt,

You said: ".....So the intended purpose (of AIC) is to indicate whether an improvement in the error function by adding more parameters could be considered statistically significant. AIC = (#Pts) * Log(Obj) + 2(#Parameters)..."

You are right that the lower the value of AIC, the better the model, but you cannot say anything about statistical significance, because AIC values don't belong to any distribution. So a difference of 1 unit or of 100 units between the AIC values for two models holds the same ground (a likely disadvantage of using AIC?).

Your comments are welcome.

Neeraj Gupta

PhD Student

University of Iowa

Iowa city,IA 52246

Email: neeraj-gupta.at.uiowa.edu

- On 21 Sep 2005 at 16:18:14, "Walt Woltosz" (walt.at.simulations-plus.com) sent the message

Dear Neeraj,

My understanding (I reiterate that I am not a statistician) is that the Akaike Information Criterion is used to help decide when additional model parameters are justified. In most situations, additional parameters will reduce the error function. But at what point does the reduction in error become overfitting? When I mentioned statistical significance (perhaps I'm abusing the term) I was referring not to the significance of the models, but to the significance of the reduction in the error function produced by the additional model parameters.

The AIC (again, as I understand it) is an approximate indicator of when the additional parameters are justified. When you add model parameters, if the (#Pts)*Log(Obj) term does not decrease by more than the increase in the 2(#Parameters) term, then the AIC increases, and the model would be suspect. It is always a judgment call, but we typically will not use a higher-order model unless there is a significant decrease in AIC.

Often AIC is a negative number. Suppose it is -15.78364 for a one-compartment model, and -16.78551 for a two-compartment model (which should indicate a better model). I would not go to the two-compartment model unless I also saw convincing information from the slope and intercept of predicted vs observed plots, and a change from systematic error to random error in the residuals.

Of course, both models could be poor - AIC tells us nothing about the quality of the models.

Walt Woltosz

Chairman & CEO

Simulations Plus, Inc. (AMEX: SLP)

1220 W. Avenue J

Lancaster, CA 93534-2902

U.S.A.

http://www.simulations-plus.com

Phone: (661) 723-7723

FAX: (661) 723-5524

E-mail: walt.-at-.simulations-plus.com

- On 23 Sep 2005 at 12:09:04, "Hans Proost" (j.h.proost.-at-.rug.nl) sent the message

Dear colleagues,

From the excellent comments by others I conclude that the main problem is in the meaning of 'goodness-of-fit'. This term is used for different purposes, e.g.:

#1: for comparing models, to select the 'best' model, e.g. by comparing AIC.

#2: as a measure of 'how well the model fits', e.g. how large the model misspecification is. I don't know of a good test for this in the case of single measurements (for replicates the lack-of-fit test can be used, as stated by Stan Alekman and Peter Bonate), but visual inspection of residual plots is definitely very useful and appropriate (although not objective).

#3: as a measure of 'how good is the prediction of the model'; for this purpose I recommended RMSE (I am sorry that this was not really clear in my earlier message). Please note that it should not be used for selecting the 'best' model (#1), since it does not include a penalty for overparametrization, and it does not give information about model misspecification (#2). One of the attractive features of RMSE is that its value has a meaning that can be understood by itself, depending on the purpose. E.g. an RMSE of 15% is usually acceptable for a prediction of PK parameters in an individual, but probably not for an analytical assay.

Does anybody have authoritative references that give appropriate definitions for these three situations, so we can end the confusion (and discussion)?

Best regards,

Hans Proost

Johannes H. Proost

Dept. of Pharmacokinetics and Drug Delivery

University Centre for Pharmacy

Antonius Deusinglaan 1

9713 AV Groningen, The Netherlands

tel. 31-50 363 3292

fax 31-50 363 3247

Email: j.h.proost.aaa.rug.nl

[Hans, RMSE as a percent? Is RMSE = sqrt(sum(o-c)*sum(o-c)/n), where o is observed data point, c is calculated data point and n is the number of data points? Where does the percent denominator appear? I ask because I was going to add RMSE to Boomer over the weekend. In the past I've been more concerned with #1 and #2. My major objective has been to get the best fit with the best model, naively leaving the usefulness of the model to someone else. ;-) - db]

- On 23 Sep 2005 at 16:48:42, "Pereira, Luis" (Luis.Pereira.-at-.bos.mcphs.edu) sent the message

Dear All

The RMSE is a kind of generalized standard deviation. It's the spread left over when we have accounted for a given relationship in the data, that is, when we have fitted a model to the data. Hence its other name, residual variation. Unfortunately (IMHO) people do say "the RMSE was 10%" (of the mean of the observed values). But that statistic is called the coefficient of variation, or relative standard deviation, and it is a convenient way to talk across different data sets and models: it puts the standard deviation (the estimate of the random variation or noise in the model) in the context of the magnitude of the mean. I try not to use RMSE in this way, though. However, it may be argued that it is an indicator of reliability (the extent to which the measurement is inherently reproducible, or the degree to which the measurement is influenced by measurement errors). Then what we say is that the standard deviation of the variation not accounted for by the model is ten percent of the average value.

But model building, as I was once told, is as much an art as it is a science. I like to say it's like a jigsaw puzzle. So I certainly look at the objective function being minimized (e.g. sum of squares, weighted or not, regular or orthogonal), R^2, RMSE and so forth. I do compare models for the same data set using information criteria before choosing one. But I also pay a great deal of attention to the variance-covariance and correlation matrices, to the standard deviations (or confidence limits) of the parameter estimates, to the studentized residuals, their serial correlation, skewness and kurtosis, to the optimization algorithm, and most of all to the plots. The goodness-of-fit that matters must be the full picture that all these strokes make in the end.

--

Luis M. Pereira, Ph.D.

Assistant Professor, Biopharmaceutics and Pharmacokinetics

Massachusetts College of Pharmacy and Health Sciences

179 Longwood Ave, Boston, MA 02115

Phone: (617) 732-2905

Fax: (617) 732-2228

Luis.Pereira.at.bos.mcphs.edu

- On 24 Sep 2005 at 15:06:25, "Frederik B. Pruijn" (f.pruijn.aaa.auckland.ac.nz) sent the message

Dear Hans,

For related models (e.g., 1st vs 2nd order polynomial) you could use the F-test to see whether any improvement was worth the sacrifice of a degree of freedom. More often than not the simplest model is the 'best'.

That visual inspection of the residuals is "not objective" is an interesting statement. One could and should test for normality and constant variance; other possible tests include a runs test and tests like Durbin-Watson. How one interprets these tests, and predicted vs measured, is another matter.
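[For readers curious about the runs test just mentioned, here is a minimal sketch (normal approximation; function name and data are mine, purely illustrative): far fewer sign runs in the residuals than expected by chance suggests systematic structure, such as the sinusoidal residual pattern Walt described earlier, rather than random scatter.]

```python
from math import sqrt

def runs_test_z(residuals):
    """Approximate z statistic for the number of sign runs in residuals.

    A strongly negative z means fewer runs than expected under randomness
    (systematic error); |z| > 1.96 is roughly significant at the 5% level."""
    signs = [r > 0 for r in residuals if r != 0]
    n1 = sum(signs)                # positive residuals
    n2 = len(signs) - n1           # negative residuals
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    mu = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    var = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - mu) / sqrt(var)

# Residuals that swing above and below zero in blocks (systematic error):
resid = [1, 2, 2, 1, -1, -2, -2, -1, 1, 2, 2, 1, -1, -2, -2, -1]
print(runs_test_z(resid))  # strongly negative z: too few runs to be random
```

How much weight to give such a number, as against simply looking at the residual plot, is of course exactly the judgment question being debated in this thread.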

FYI, I am not a statistician either.

Frederik Pruijn

- On 24 Sep 2005 at 12:50:25, Stanley110.-a-.aol.com sent the message

f.pruijn.aaa.auckland.ac.nz writes:

>for related models (e.g., 1st vs 2nd order polynomial) you could use
>the F-test to see whether any improvement was worth the sacrifice of
>degree of freedom. More often than not the simplest model is the 'best'.
>To visually inspect the residuals is "not objective" is an
>interesting statement. One could and should test for normality,
>constant variance, and other possible test include a runs test and
>tests like Durbin-Watson. How one interprets these tests, and
>prediction vs measured, is another matter.

These comments are very good guidance. Inspection of residuals is very important. It alone is sufficient to reject a model even when other indicators suggest a fitted model is adequate.

For those interested in reading about this subject, see R.B. D'Agostino and M.A. Stephens, "Goodness of Fit Techniques", Statistics: Textbooks and Monographs vol. 68, Marcel Dekker, 1986. This text is written for statisticians and is not based on the experimental work that is performed by participants in this discussion group. But it can be helpful for those willing to work through it.

Fitting models can fall into two categories: models for the purpose of prediction and models for the purpose of theory. Developing theoretical models, where variables and coefficients describe underlying physical mechanisms, is research. The former are the modeling studies I perform.

Peter Bonate informed me that experimental work done by this discussion group does not collect multiple readings at each level of the independent variable. The Lack of Fit test that I proposed in several postings requires multiple readings. That is unfortunate, because Lack of Fit testing is an appropriate tool for goodness of fit, independent of the type of fitted model: linear, exponential, etc.

Regards,

Stan Alekman

Stanley L. Alekman, PhD
S.L. Alekman Associates Inc.
Pharmaceutical Consultants
Inverness, Illinois - On 26 Sep 2005 at 13:12:30, "Hans Proost" (j.h.proost.-at-.rug.nl) sent the message

Back to the Top

The following message was posted to: PharmPK

Dear David,

Thanks for your comments and questions:

> [Hans, RMSE as a percent? Is RMSE = sqrt(sum(o-c)*sum(o-c)/n), where
> o is observed data point, c is calculated data point and n is the
> number of data points? Where does the percent denominator appear?

I am sorry that I did not express myself very clearly with respect to the conversion to percentage. Actually one can calculate RMSE as such, i.e. the equation you mentioned:

RMSE = sqrt[sum((c-o)*(c-o))/n]

So, RMSE has the same units as c and o. In case of order-of-magnitude differences between the values of o, it is likely that the differences c-o will also have different orders of magnitude, and a relative value (i.e. (c-o)/o) would be more appropriate for evaluation; for convenience a conversion to percentage can be done, resulting in

RMSE% = 100*sqrt[sum((c-o)*(c-o)/(o*o))/n]

Please note that I used o in the denominator here instead of c. It would be illogical to use c in the denominator, since c is a dependent variable.

Both RMSE and RMSE% can be used for any data set, but one is more meaningful than the other in case of different orders of magnitude of (c-o) and o. There is a close resemblance to the problem of weighting, i.e. using RMSE in case of no weighting and RMSE% in case of proportional weighting. For other weighting schemes an analogous concept of RMSE does not seem to be appropriate.
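Hans's two formulas can be put side by side in a short Python sketch; the function names and the toy data are mine, not from the post, and the observed values must be nonzero for RMSE%:

```python
import numpy as np

def rmse(o, c):
    """RMSE = sqrt[sum((c-o)*(c-o))/n], in the units of o and c."""
    o, c = np.asarray(o, float), np.asarray(c, float)
    return np.sqrt(np.mean((c - o) ** 2))

def rmse_pct(o, c):
    """RMSE% = 100*sqrt[sum((c-o)*(c-o)/(o*o))/n]; the observed
    values o must be nonzero."""
    o, c = np.asarray(o, float), np.asarray(c, float)
    return 100.0 * np.sqrt(np.mean(((c - o) / o) ** 2))

# Observations spanning three orders of magnitude, each predicted
# to within about 10%:
obs = np.array([0.1, 1.0, 10.0, 100.0])
calc = np.array([0.11, 0.9, 11.0, 90.0])
abs_err = rmse(obs, calc)      # dominated by the largest observation
rel_err = rmse_pct(obs, calc)  # about 10% at every magnitude
```

This illustrates Hans's point: with order-of-magnitude differences among the observations, the plain RMSE reflects almost only the largest values, while RMSE% treats each magnitude equally.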

> In the past I've been more concerned with #1 and #2. My major
> objective has been to get the best fit with the best model, naively
> leaving the usefulness of the model to someone else. ;-) - db]

I agree! RMSE is a measure of prediction error, and 'predicting' data after fitting a model to the same data is not a real prediction! As I said in my earlier message, RMSE can be used as a measure of 'how good is the fit'. On reflection, however, I think that we should not use RMSE for this purpose, and should leave RMSE for 'true' predictions and related situations, e.g. comparison of fitted parameters (c) to the true parameter values (o) in case of analysis of data generated by Monte Carlo simulation.

To compensate for the number of parameters in a model, it would be better to replace n in the denominator of RMSE and RMSE% by (n-p), where p is the number of parameters. This puts a penalty on the number of parameters, and then 'RMSE' becomes the 'residual standard deviation' and 'RMSE%' the 'residual coefficient of variation', as was pointed out by Luis Pereira. Again, the choice between actual values and percentages should be related to the applied weighting scheme. From Monte Carlo simulations I learned that these values (using (n-p)) are virtually identical to the added data noise. But I agree with Luis Pereira that we should not use the term RMSE for these quantities.
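The (n-p) variants Hans describes can be sketched the same way; again the function names and the toy data are illustrative, not from the post:

```python
import numpy as np

def residual_sd(o, c, n_params):
    """'RMSE' with n replaced by (n - p): the residual standard
    deviation, penalizing the number of fitted parameters."""
    o, c = np.asarray(o, float), np.asarray(c, float)
    return np.sqrt(np.sum((c - o) ** 2) / (len(o) - n_params))

def residual_cv(o, c, n_params):
    """'RMSE%' with n replaced by (n - p): the residual coefficient
    of variation, in percent; observed values o must be nonzero."""
    o, c = np.asarray(o, float), np.asarray(c, float)
    return 100.0 * np.sqrt(np.sum(((c - o) / o) ** 2) / (len(o) - n_params))

# Four observations fitted with a two-parameter model:
o = [1.0, 2.0, 3.0, 4.0]
c = [1.1, 1.9, 3.1, 3.9]
sd = residual_sd(o, c, n_params=2)
cv = residual_cv(o, c, n_params=2)
```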

Best regards,

Hans Proost

Johannes H. Proost
Dept. of Pharmacokinetics and Drug Delivery
University Centre for Pharmacy
Antonius Deusinglaan 1
9713 AV Groningen, The Netherlands
tel. 31-50 363 3292
fax 31-50 363 3247
Email: j.h.proost.-at-.rug.nl

[Hans, thanks for the subtle correction of my equation ;-) I had some missing parentheses. Also, I apologize, you had mentioned RMSE% in an earlier post and I had forgotten. I think I like the division by o (not c). One might argue that 'c' is the true value but 'o' is the real number (in your terms, the independent number). One must remember to avoid dividing by zero. I'm not sure about your argument regarding RMSE and RMSE% versus weighting scheme. I would view both as a measure of goodness of fit, just scaled differently -- although I have little (/no) experience looking at these numbers. I'll leave the 'n' versus 'n-p' to someone else - db] - On 10 Oct 2005 at 18:54:03, Roger Jelliffe (jelliffe.at.usc.edu) sent the message

Back to the Top

Dear All:

In all of this back and forth about goodness of fit, why not consider the likelihood of the results given the data? Most of the criteria discussed so far seem to assume that the parameter distributions are Gaussian or lognormal in their shape, and often they are not. Most methods for population PK/PD analysis, such as NONMEM, compute the likelihood using either the first-order (FO) or the first-order conditional estimation (FOCE) approximations. These approaches do not have the desirable property of statistical consistency, and so there is no guarantee that the more patients you study, the closer the parameter estimates get to the true values. Interesting!!

On the other hand, there are now a number of other methods that compute the likelihood accurately or exactly. There is a good French method, Dave D'Argenio has one, and Bob Leary has PEM, a parametric EM method that uses a Faure low-discrepancy integration method. In addition, the nonparametric approaches such as NPEM, NPAG, and NPOD, from USC, also compute the likelihood exactly. All these methods are consistent. All have visibly greater precision of parameter estimates than NONMEM and similar methods. The NP methods do not make any assumptions at all about the shape of the parameter distributions. They also lead to maximum precision in dosage regimens using the method of multiple model dosage design. There is more material on our web site www.lapk.org, under teaching topics, for example. I would like very much to see comments on this.

Very best regards,

Roger Jelliffe

Roger W. Jelliffe, M.D., Professor of Medicine
Division of Geriatric Medicine
Laboratory of Applied Pharmacokinetics
USC Keck School of Medicine
2250 Alcazar St, Los Angeles CA 90033, USA
Phone (323)442-1300, fax (323)442-1302, email= jelliffe.-at-.usc.edu

Our web site= http://www.lapk.org - On 10 Oct 2005 at 20:05:24, "Walt Woltosz" (walt.aaa.simulations-plus.com) sent the message

Back to the Top

The following message was posted to: PharmPK

Roger,

The original context of this thread was structure-property relationships, and specifically such statistics as q^2, R^2, RMSE, slope & intercept, etc. for such correlations. These are single-measurement data per structure.

Do the methods you cited apply to this kind of data?

Best regards,

Walt

Walt Woltosz
Chairman & CEO
Simulations Plus, Inc. (AMEX: SLP)
1220 W. Avenue J
Lancaster, CA 93534-2902
U.S.A.
http://www.simulations-plus.com
Phone: (661) 723-7723
FAX: (661) 723-5524 - On 11 Oct 2005 at 00:01:40, Stanley110.aaa.aol.com sent the message

Back to the Top

Dr. Jelliffe,

Is it convenient for you to post literature references for these procedures?

Thank you.

Stan Alekman - On 11 Oct 2005 at 18:00:03, Stephen Duffull (steveduffull.at.yahoo.com.au) sent the message

Back to the Top

Roger

Thanks for bringing up this subject. I have often wondered about appropriate statistics for comparisons of nonparametric likelihoods for dichotomous model selection decisions.

It can be shown that the difference of two log(parametric likelihoods) [LL] of the same parametric distribution for nested models is asymptotically chi-squared distributed. If the likelihoods are approximated, e.g. by numerical integration or by linearisation, then you might say that the difference of two LL for nested models is approximately and asymptotically chi-squared distributed. [Note I have left out the "-2*" on the LL here for simplicity.]
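For the parametric, nested case the test Steve describes (with the -2 factor restored) looks like this; a minimal sketch in which the log-likelihood values are hypothetical:

```python
from scipy import stats

def likelihood_ratio_test(ll_reduced, ll_full, extra_params):
    """Likelihood ratio test for two nested models: -2*(LL_reduced -
    LL_full) is asymptotically chi-squared distributed, with df equal
    to the number of extra parameters in the full model."""
    lr = -2.0 * (ll_reduced - ll_full)
    return lr, stats.chi2.sf(lr, extra_params)

# Hypothetical log-likelihoods: the full model gains 3.5 LL units
# for one extra parameter.
lr, p = likelihood_ratio_test(ll_reduced=-105.0, ll_full=-101.5,
                              extra_params=1)
# A small p-value favours the full model over the reduced one.
```

Steve's question is precisely whether this chi-squared reference distribution can be justified when the likelihoods are nonparametric.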

However, can you say this for nonparametric LLs? Nonparametric LLs do not conform to a particular parametric family of distributions, and therefore I am not sure whether the above assumption of an asymptotic chi-squared distribution necessarily applies when you are trying to make model comparisons. I would be interested in any views on this.

If it is not possible to use the likelihood ratio test, then how do you statistically compare two models from knowledge of their nonparametric maximum likelihood values?

Regards,

Steve

Steve Duffull
http://www.uq.edu.au/pharmacy/index.html?page=31309 - On 20 Oct 2005 at 18:43:35, Roger Jelliffe (jelliffe.at.usc.edu) sent the message

Back to the Top

Dear Steve:

Thanks for your note. You are correct about the likelihoods for parameters having Gaussian or lognormal distributions, and you can do the significance tests you described. You are also correct that the distribution of the likelihoods for other distributions is not known, and that because of this, you cannot do the same kind of significance testing that way.

What I mainly wanted to say is that it is important to use likelihood as an index of fitting data, and that methods that compute it with approximations, such as most of the currently available pop modeling methods (NONMEM FO and FOCE, our IT2B, etc.), do not have the guarantee of statistical consistency - that studying more subjects will give results that more closely approach the true results. That is a real scientific problem, and it is interesting that most people seem not to be concerned by it.

On the other hand, methods that compute the likelihood accurately or exactly (our NPEM, NPAG, NPOD, PEM, a good French parametric model program, and others) do the job much better, are consistent, and are also more precise in their parameter estimates. There was a conference in Lyon last fall where this was shown in a blind competition of several different methods, and also at the PAGE meeting last June in Pamplona by Pascal Girard and his group.

I mainly wanted to make the point that comparing various indices of so-called goodness of fit, when the methods giving them are NOT statistically consistent and precise, can be misleading, and that it is important to use methods that are known to be statistically consistent, with precise parameter estimates. If this is not the case, the rest may well be moot.

All the best,

Roger Jelliffe

Roger W. Jelliffe, M.D., Professor of Medicine
Division of Geriatric Medicine
Laboratory of Applied Pharmacokinetics
USC Keck School of Medicine
2250 Alcazar St, Los Angeles CA 90033, USA
Phone (323)442-1300, fax (323)442-1302, email= jelliffe.-a-.usc.edu

Our web site= http://www.lapk.org - On 21 Oct 2005 at 17:58:41, Stephen Duffull (steveduffull.-a-.yahoo.com.au) sent the message

Back to the Top

Roger

Thanks for your email. I think you raise an important point about the approximations to the likelihood that occur in some parametric methods, and that these may influence inference based on them.

Without wanting to prolong the debate on various tools for population modelling, I would like to come back to my question: how do you discriminate between competing models that are developed based on a nonparametric likelihood? What test statistic do you use?

Regards

Steve

Steve Duffull
http://www.uq.edu.au/pharmacy/index.html?page=31309 - On 25 Oct 2005 at 17:21:18, Roger Jelliffe (jelliffe.-at-.usc.edu) sent the message

Back to the Top

Dear Stan:

About references concerning our population modeling approaches - you can go to our web site www.lapk.org and click around under new developments in pop modeling, and under teaching topics. We also now have a paper in press in Clinical Pharmacokinetics.

Roger W. Jelliffe, M.D., Professor of Medicine
Division of Geriatric Medicine
Laboratory of Applied Pharmacokinetics
USC Keck School of Medicine
2250 Alcazar St, Los Angeles CA 90033, USA
Phone (323)442-1300, fax (323)442-1302, email= jelliffe.at.usc.edu

Our web site= http://www.lapk.org

Copyright 1995-2010 David W. A. Bourne (david@boomer.org)