- On 16 Sep 2005 at 09:34:00, "Walt Woltosz" (walt.aaa.simulations-plus.com) sent the message

Back to the Top

The following message was posted to: PharmPK

Hans,

Good points about R^2.

There is another paper that should be of interest to this group:

Golbraikh & Tropsha, J Molecular Graphics and Modeling 20 (2002) 269-276.

Q^2 is another abused statistic, especially with respect to predictive models in commercial software. We believe true validation only comes from properly selected external test sets and acceptable values for a combination of statistics, including RMSE, and the slope and intercept between predicted and observed.

I'm the first to admit I'm no statistician, so I would welcome comments on model validation statistics!

Walt Woltosz

Chairman & CEO

Simulations Plus, Inc. (AMEX: SLP)

1220 W. Avenue J

Lancaster, CA 93534-2902

U.S.A.

http://www.simulations-plus.com

Phone: (661) 723-7723

FAX: (661) 723-5524

E-mail: walt.-a-.simulations-plus.com

- On 16 Sep 2005 at 14:26:10, Stanley110.aaa.aol.com sent the message

>I'm the first to admit I'm no statistician, so I would welcome
>comments on model validation statistics!

Hi Walt,

Well, I am a statistician, but I did not think this website was appropriate for that kind of discussion.

Let me just comment that model building or curve fitting is just that: building a model that fits experimental data sufficiently well to make valid predictions in the future. That means that models need not be perfect fits, although perfect fits are most satisfying. Approximate fits are useful provided we know the uncertainty associated with prediction and the uncertainty we can tolerate.

It is not how linear a system is; it is how far it is from linearity that matters for understanding its utility. Very often the deviation from linearity is not sufficient to disqualify the linear model if a better model cannot easily be found. The deviation, which can be quantified, may be acceptably small. The literature is full of linear models that upon review are not truly linear, with the uncertainty unknown to the experimenter. But the misassignment has proven useful because the deviation was small and tolerable.

Least-squares regression over a wide dynamic range of calibrators is fraught with real problems. That is why there are variations on least-squares regression, such as orthogonal (Deming) regression, the Wald line, and others.

I do model building for a living, and I propose to everyone that functional (predictive) models are very acceptable unless a theoretical model is needed to develop underlying mechanisms. I once gave a 2-day seminar on model building to estimate shelf life of drugs that presented these ideas and procedures.

Regards,

Stan Alekman

Stanley L. Alekman, Ph.D.

S.L. Alekman Associates Inc.

Pharmaceutical Consultants

Inverness, Illinois

- On 16 Sep 2005 at 16:33:34, "Walt Woltosz" (walt.-a-.simulations-plus.com) sent the message

Stan,

Thanks for your comments.

I think my post was somewhat misleading. I did not intend it to refer only to linear models, or in fact to linear models at all. In our hands, artificial neural network ensembles and support vector machine ensembles have provided the best results for prediction of a wide variety of properties from molecular structure (the kind of models I had in mind, but did not state). So in general, I was referring to nonlinear models. "Nature is not linear" is one of my favorite expressions, and I think it is generally true for QSPR/QSAR models.

It is in this area that I think there is great inconsistency in model validation statistics. In particular, and IMHO, the use of R (instead of R^2), the use of LOO q^2, the use of MAE instead of RMSE, and so on, all serve to make attractive marketing and sales materials, but can hide the real (lesser) value of the models to which they are applied.

I thought this web site was appropriate because many of its subscribers use structure-property prediction tools as well as PK/PD modeling tools. If there is a better place for this discussion, please advise.

Walt Woltosz

Chairman & CEO

Simulations Plus, Inc. (AMEX: SLP)

1220 W. Avenue J

Lancaster, CA 93534-2902

U.S.A.

http://www.simulations-plus.com

Phone: (661) 723-7723

FAX: (661) 723-5524

E-mail: walt.at.simulations-plus.com

- On 19 Sep 2005 at 10:08:05, "Frederik B. Pruijn" (f.pruijn.aaa.auckland.ac.nz) sent the message

Dear Walt,

I agree with you on external validation, but don't you think that internal cross-validation is better than no validation at all?

Kind regards,

Frederik Pruijn

- On 18 Sep 2005 at 17:47:44, "Walt Woltosz" (walt.-a-.simulations-plus.com) sent the message

Frederik,

>I agree with you on external validation but don't you think that
>internal cross-validation is better than no validation at all?

One would think so - but apparently this is an illusion. To quote from the article by Golbraikh and Tropsha:

". . . As the first example, we consider a well-known group of ligands of corticosteroid binding globulin [16]. This dataset is frequently referred to as a benchmark [17] for the development and testing of novel QSAR methods. In [13], many 3D QSAR models have been built based on the divisions of this dataset into training and test sets and no relationship between high q2 and predictive R2 values was found. In this paper, we employ the k nearest neighbors (kNN) QSAR variable selection method that was recently developed in this laboratory [18]. kNN QSAR uses 2D descriptors of chemical structures such as connectivity indices and atom pairs. We show that the application of this approach to the steroid dataset [16] leads to the same observations as using 3D QSAR: high q2 does not automatically imply a high predictive power of the model. We also consider 2D QSAR models built for two other examples: a set of 78 ecdysteroids [19] and 66 Histamine H1 receptor ligands [20]. In all these cases, we consider training and test sets as they were defined in the original publications. We demonstrate the lack of any relationship between high q2 and predictive R2 in all cases. The lack of this relationship appears to be the common feature of the QSAR methods that must be always taken into account in QSAR studies."

It is this article that triggered my initial post and question. We've been accepting leave-one-out (LOO) q2 as a measure of model quality for some time, yet it appears that it is not at all a reliable indicator. So if not, then what?
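[For readers unfamiliar with the statistic under discussion, here is a minimal sketch (not code from any of the tools mentioned) of how LOO q^2 is computed for a simple least-squares line. The point of Golbraikh and Tropsha's argument is that a q^2 near 1 on the training set need not transfer to an external test set; the function names and toy data below are illustrative only.]

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def loo_q2(xs, ys):
    """Leave-one-out q^2 = 1 - PRESS / SS_total.

    PRESS accumulates the squared error of predicting each point
    from a model refitted without that point."""
    n = len(ys)
    press = 0.0
    for i in range(n):
        xt, yt = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        a, b = fit_line(xt, yt)
        press += (ys[i] - (a * xs[i] + b)) ** 2
    my = sum(ys) / n
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - press / ss_tot

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]
print(loo_q2(xs, ys))  # close to 1 for this nearly linear toy set
```

A high value here says only that the training data predict themselves well under refitting; it says nothing about an external test set, which is the gap the quoted paper documents.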

Walt

Walt Woltosz

Chairman & CEO

Simulations Plus, Inc. (AMEX: SLP)

1220 W. Avenue J

Lancaster, CA 93534-2902

U.S.A.

http://www.simulations-plus.com

Phone: (661) 723-7723

FAX: (661) 723-5524

E-mail: walt.-a-.simulations-plus.com

- On 19 Sep 2005 at 10:43:41, "Hans Proost" (j.h.proost.aaa.rug.nl) sent the message

Dear Walt,

Thank you for pointing to the paper of Golbraikh and Tropsha. I will certainly read the paper carefully. At first glance I saw that this paper mentioned RMSE as 'residual mean square error', whereas Sheiner and Beal used RMSE as 'root mean squared error'.

Best regards,

Hans Proost

Johannes H. Proost

Dept. of Pharmacokinetics and Drug Delivery

University Centre for Pharmacy

Antonius Deusinglaan 1

9713 AV Groningen, The Netherlands

tel. 31-50 363 3292

fax 31-50 363 3247

Email: j.h.proost.-at-.rug.nl

- On 20 Sep 2005 at 12:33:31, "Walt Woltosz" (walt.-at-.simulations-plus.com) sent the message

Hans,

>At first glance I saw that this paper mentioned RMSE as 'residual mean
>square error', whereas Sheiner and Beal used RMSE as 'root mean squared
>error'

I've always used RMSE to refer to root mean square error. After spending some time Googling both terms and not finding much that is definitive, perhaps one of our statisticians would like to explain the difference between these two statistics, and when one or the other is preferred.

Walt

- On 20 Sep 2005 at 18:00:55, (Valeria.Chu.-a-.sanofi-aventis.com) sent the message

I learned a lot from all these discussions, but I'm still not clear what to use to express goodness of fit for a predicted model and observed data. A paper by Jones HM and Houston JB (Drug Metab Dispos 32: 973, 2004) mentioned using the Akaike information criterion (AIC) to check goodness of fit. Any comment?

Thanks,

Valeria

- On 21 Sep 2005 at 10:48:11, (jeroen.elassaiss.-a-.organon.com) sent the message

Valeria,

You wrote:

"I learned a lot from all these discussions but I'm still not clear what to use to express goodness of fit for a predicted model and observed data. A paper by Jones HM and Houston JB (Drug metab dispos 32; 973, 2004) mentioned using Akaike information criterion (AIC) to check goodness of fit. Any comment?"

My $.02 would be not to rely on a single parameter but rather to use the range of statistical tools at hand for your particular problem. See e.g. (for PK or PK/PD models) the chapter on modeling strategies in Gabrielsson & Weiner ('PK/PD Data Analysis: Concepts and Applications', 3rd ed., 2000, Swedish Pharm. Press, Stockholm: Ch. 5). One would at least have to look at the distribution of residuals, evaluate the weighted sum of squares balanced against model complexity in some way, and examine the covariance matrix. Furthermore, one should consider the assumptions generally underlying statistical models, such as independence of samples and error-free estimation of independent variables.

So I am afraid there is no simple answer to your question. The AIC is IMHO just one way of balancing WSS against model complexity*. The Schwarz criterion is a similar parameter. Hierarchical tests, e.g. the F-test, may be preferred in some cases. For all of these, the actual degrees of freedom may or may not be easily defined. And many more methods are available for this kind of evaluation.

How one evaluates a model should therefore be decided with the goal of the model in mind, as has been stated previously in this thread. In the end the performance of a model is determined by how well it accomplishes its task....

Best regards,

Jeroen

Disclaimer: I am also not a statistician

(*) So the AIC assesses one of three major dimensions of model evaluation. Examples of the other two: predicted vs observed plots, and parameter correlation/error.

J. Elassaiss-Schaap

Scientist PK/PD

Organon NV

PO Box 20, 5340 BH Oss, Netherlands

Phone: + 31 412 66 9320

Fax: + 31 412 66 2506

e-mail: jeroen.elassaiss.-at-.organon.com

- On 21 Sep 2005 at 12:46:35, "Hans Proost" (j.h.proost.at.rug.nl) sent the message

Dear Valeria,

I suggest using RMSE (root mean squared error) as a measure of goodness-of-fit. It is easy to calculate (the full name gives the equation!) and easy to interpret; it gives the size of the 'typical' deviation between predicted and true values. In many cases it may be more appropriate to convert the 'error' (difference between predicted and true value) to a relative value by dividing by the true value, e.g. in cases where the values and differences are of different orders of magnitude. After multiplying by 100 you get the RMSE as a percentage of the true value.

Please note that RMSE does not tell the whole story on 'goodness-of-fit'. It does not discriminate between random errors and bias (systematic deviations); it includes both. Therefore it is usually combined with ME (mean error) as a measure of bias. Again, ME is often expressed as a relative value or percentage.
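[As an aside for readers who want to try this: the full names really do give the equations. A minimal sketch (function names are mine, not from any package) of RMSE as overall 'typical' deviation and ME as bias:]

```python
from math import sqrt

def rmse(pred, obs):
    """Root mean squared error: the size of the 'typical' deviation."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def mean_error(pred, obs):
    """Mean error: a measure of bias (systematic deviation)."""
    return sum(p - o for p, o in zip(pred, obs)) / len(obs)

obs = [10.0, 20.0, 40.0, 80.0]
pred = [11.0, 19.0, 42.0, 78.0]
print(rmse(pred, obs))        # ~1.58, in the units of the data
print(mean_error(pred, obs))  # 0.0 here, i.e. no net bias
```

Note how the example data have scatter (nonzero RMSE) but no bias (ME of zero): the two statistics answer different questions, which is why they are usually reported together.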

The Akaike information criterion (AIC) is not really a measure of goodness of fit, but a criterion to discriminate between two models - more specifically, to check whether a more complex model is statistically justified compared to a simpler model.

Best regards,

Hans Proost

Johannes H. Proost

Dept. of Pharmacokinetics and Drug Delivery

University Centre for Pharmacy

Antonius Deusinglaan 1

9713 AV Groningen, The Netherlands

tel. 31-50 363 3292

fax 31-50 363 3247

Email: j.h.proost.-a-.rug.nl

- On 21 Sep 2005 at 09:15:00, "Pereira, Luis" (Luis.Pereira.-at-.bos.mcphs.edu) sent the message

Valeria

Since nobody picked it up, I'll tell you that information criteria, such as the Akaike, or derivations of it such as the Schwarz and Hannan-Quinn criteria, are not goodness-of-fit metrics. They rather provide a way to compare models in terms of information, by balancing explanatory capability against dimensionality. An AIC value estimated for a given fit means nothing if not compared with another from a different model. One may then choose the model with the lowest AIC even though it has a higher sum of squares, for example, since the added goodness-of-fit is probably due to overparameterization.

Luis

--

Luis M. Pereira, Ph.D.

Assistant Professor, Biopharmaceutics and Pharmacokinetics

Massachusetts College of Pharmacy and Health Sciences

179 Longwood Ave, Boston, MA 02115

Phone: (617) 732-2905

Fax: (617) 732-2228

Luis.Pereira.at.bos.mcphs.edu

- On 21 Sep 2005 at 10:14:44, "Wang, Yaning" (WangYA.at.cder.fda.gov) sent the message

I believe RMSE means root mean square error. It is the square root of the mean square error (MSE). MSE is also called the residual mean square. I think 'residual mean square error' is referring to MSE. The relationship between the two is like that of standard deviation and variance. The standard deviation has the same unit as the observed data, so it can be compared with the data directly to evaluate the deviation (e.g. when we calculate CV%). MSE can be used for the likelihood ratio test.

Yaning Wang, Ph.D.

Pharmacometrician

Office of Clinical Pharmacology and Biopharmaceutics

Center of Drug Research and Evaluation

Food and Drug Administration

Office: 301-827-9763

- On 21 Sep 2005 at 12:59:57, "Bonate, Peter" (Peter.Bonate.-a-.genzyme.com) sent the message

I've been reading this GOF discussion with some interest. Assessing the GOF of a model should never, ever be based on a metric like RMSE, AIC, or R**2. RMSE, MSE, AIC, and BIC are not of value in and of themselves because they are just a number. An AIC of 45 or an MSE of 600 is meaningless unless you have a point of reference, which is often a competing model. A high R**2 in and of itself does not say anything, because you can have a very high R**2 in the face of obvious model misspecification.

Model assessment is a holistic approach based on many factors. I would even say that it is a Gestalt process. In the end, the adequacy of a model is quite subjective. One modeler may judge a model adequate but another might not. Use of these statistics is an attempt to quantify a nonquantitative measure.

Peter L. Bonate, PhD, FCP

Director, Pharmacokinetics

Genzyme Corporation

4545 Horizon Hill Blvd

San Antonio, TX 78229

phone: 210-949-8662

fax: 210-949-8219

email: peter.bonate.at.genzyme.com

- On 21 Sep 2005 at 13:35:06, Prah.James.at.epamail.epa.gov sent the message

There is an excellent and succinct discussion of model comparison at this site: http://www.duke.edu/~rnau/compare.htm

James D. Prah, PhD

US EPA

Human Studies Division MD (58B)

Research Triangle Park, NC, 27711

919 966 6244

919 966 6367 FAX

- On 21 Sep 2005 at 14:24:00, RPop.-a-.pharmamedica.com sent the message

Hi All,

Hans Proost wrote:

"I suggest to use RMSE (root mean squared error) as a measure of goodness-of-fit. ..... Akaike information criterion (AIC) is not really a measure of goodness of fit, but a criterion to discriminate between two models. More specifically, to check whether a more complex model is statistically justified compared to a more simple model."

I think this is exactly why one wants to check the "goodness of fit": to discriminate between different models. Including more variables in the model will, most of the time, decrease RMSE. However, the question is whether the observed decrease is statistically significant, and for that RMSE by itself will not suffice. The criterion should be a statistic that can indicate a statistically significant change.

I hope I did not create more confusion!

radu

- On 21 Sep 2005 at 14:23:14, Stanley110.-at-.aol.com sent the message

Would Dr. Bonate care to comment on the adequacy of the Lack of Fit test, which I proposed earlier as a statistical measure for goodness of fit of a regression model? He did not comment on this test. This test is not a Gestalt process or subjective. The Lack of Fit test compares the variability of the residuals of a fitted model with the variability between observations at replicate values of the independent variable. The test assumes that the observations, Y for given X, are independent and normally distributed, and that the distributions of Y have the same variance. A statistically significant Lack of Fit indicates that the fitted model does not adequately fit the data. The test requires repeated observations at X.

One further comment. The adequacy of a fitted model depends on the use of the model. Statistically inadequate models may be very useful.
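[For readers who have not met the test, here is a minimal sketch of the procedure just described, for a straight line fitted to data with replicates at each X. The residual sum of squares is split into pure error (replicate scatter) and lack of fit, and their mean squares are compared as an F ratio against an F(m - p, n - m) table value; function names and data are illustrative, not from any posting.]

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def lack_of_fit_F(xs, ys, pred, p=2):
    """F ratio of lack-of-fit mean square to pure-error mean square.

    p is the number of model parameters (2 for a straight line)."""
    n = len(ys)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    m = len(groups)  # number of distinct X levels
    # Pure-error SS: scatter of replicates around their own level means
    ss_pe = sum(sum((y - sum(g) / len(g)) ** 2 for y in g)
                for g in groups.values())
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, pred))
    ss_lof = ss_res - ss_pe
    return (ss_lof / (m - p)) / (ss_pe / (n - m))

# Duplicate observations at X = 1..4 that actually follow a curve:
xs = [1, 1, 2, 2, 3, 3, 4, 4]
ys = [1.2, 1.0, 4.1, 3.9, 9.2, 8.8, 16.1, 15.9]
a, b = fit_line(xs, ys)
F = lack_of_fit_F(xs, ys, [a * x + b for x in xs])
print(F)  # large F: significant lack of fit for the straight line
```

As the thread notes, the whole construction depends on having replicates at each X; without them the pure-error term (and hence the test) is undefined.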

Regards,

Stan Alekman

S.L. Alekman Associates Inc.

Pharmaceutical Consultants

Inverness, Illinois

- On 21 Sep 2005 at 15:26:13, "Bonate, Peter" (Peter.Bonate.-at-.genzyme.com) sent the message

Dr. Alekman asked me to comment on the Lack of Fit test. There is not a lot of experience in the PK literature using the lack of fit test, simply because it does require replicates. Therefore, I think this test has limited utility to a pharmacokineticist, and I really can't think of a case where I have ever seen it used outside of a textbook.

Peter L. Bonate, PhD, FCP

Director, Pharmacokinetics

Genzyme Corporation

4545 Horizon Hill Blvd

San Antonio, TX 78229

phone: 210-949-8662

email: peter.bonate.-a-.genzyme.com

- On 21 Sep 2005 at 12:36:53, "Walt Woltosz" (walt.-at-.simulations-plus.com) sent the message

Valeria,

As already noted by others, the Akaike Information Criterion is one statistic that can be used to compare models, but it is not an indication of the quality of the models. It is often used to compare, for example, one-, two- and three-compartment PK models to see which one might be favored for a particular set of data. So the intended purpose is to indicate whether an improvement in the error function obtained by adding more parameters could be considered statistically significant.

AIC = (#Pts) * Log(Obj) + 2(#Parameters)

A lower value for AIC is said to indicate a better model.

As you can see, an increase in the log of the objective function (weighted error function) increases AIC, as does the number of fitted parameters in the model. So if you add more parameters, such as going from a one-compartment model to a two-compartment model, and the objective function goes down, but not enough to offset the 2(#Parameters) term, then AIC will increase.

Note that AIC values can only be compared when the objective function values are calculated in the same way (i.e., same observed data points, and same objective function weighting).

The Schwarz Criterion, also mentioned earlier, is a similar function (as with the AIC, lower is better):

Schwarz Criterion (SC) = (#Pts) * Log(Obj) + (#Parameters)*(Log(#Pts))

Slope and intercept of predicted vs observed and observed vs predicted plots are also very useful, as is an examination of residual plots to see if there is any systematic error (indicated by a sinusoidal pattern of residual error above and below zero at the observed data points, rather than a more uniform scattering of + and - errors).
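[The two formulas above are easy to try out. A sketch in code (I assume natural log here; the base is not stated above, and it does not affect which model wins as long as it is used consistently). The numeric values are hypothetical, and 'obj' must be computed identically for every model compared:]

```python
from math import log

def aic(n_pts, obj, n_params):
    """Walt's AIC: (#Pts) * Log(Obj) + 2(#Parameters)."""
    return n_pts * log(obj) + 2 * n_params

def schwarz(n_pts, obj, n_params):
    """Schwarz Criterion: (#Pts) * Log(Obj) + (#Parameters) * Log(#Pts)."""
    return n_pts * log(obj) + n_params * log(n_pts)

# Hypothetical comparison: 12 observations, a 1-compartment fit
# (2 parameters, obj = 8.5) vs a 2-compartment fit (4 parameters, obj = 4.2).
print(aic(12, 8.5, 2), aic(12, 4.2, 4))          # lower AIC favors model 2
print(schwarz(12, 8.5, 2), schwarz(12, 4.2, 4))  # SC penalizes parameters more
```

Here the drop in the objective function is large enough to pay for the two extra parameters under both criteria; had obj only fallen from 8.5 to, say, 7.0, the penalty term would have dominated and AIC would have risen.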

Walt Woltosz

Chairman & CEO

Simulations Plus, Inc. (AMEX: SLP)

1220 W. Avenue J

Lancaster, CA 93534-2902

U.S.A.

http://www.simulations-plus.com

Phone: (661) 723-7723

FAX: (661) 723-5524

E-mail: walt.aaa.simulations-plus.com

- On 21 Sep 2005 at 15:57:26, "Clerk Maxwell" (clerkmaxwell.-a-.hotmail.com) sent the message

Here's some good fun and pithy information on modeling. From A. Bloch, "Murphy's Law: Book Three," Price/Stern/Sloan Publishers Inc., Los Angeles, 1982:

"Golomb's Don'ts of Mathematical Modeling"

(1) Don't believe the 33rd-order consequences of a 1st-order model (catch phrase: cum grano salis). (2) Don't extrapolate beyond the region of fit (catch phrase: don't go off the deep end). (3) Don't apply any model until you understand the simplifying assumptions on which it is based, and can test their applicability (catch phrase: use only as directed). (4) Don't believe that the model is the reality (catch phrase: don't eat the menu). (5) Don't distort reality to fit the model (catch phrase: the 'Procrustes Method'). (6) Don't limit yourself to a single model: more than one may be useful for understanding different aspects of the same phenomenon (catch phrase: legalize polygamy). (7) Don't retain a discredited model (catch phrase: don't beat a dead horse). (8) Don't fall in love with your model (catch phrase: Pygmalion). (9) Don't apply the terminology of subject A to the problems of subject B if it is to the enrichment of neither (catch phrase: new names for old).

- On 21 Sep 2005 at 17:00:54, neeraj-gupta.-at-.uiowa.edu sent the message

Dear Walt,

You said: ".....So the intended purpose (of AIC) is to indicate whether an improvement in the error function by adding more parameters could be considered statistically significant. AIC = (#Pts) * Log(Obj) + 2(#Parameters)..."

You are right that the lower the value of AIC, the better the model, but you cannot say anything about statistical significance, because AIC values don't belong to any distribution. So a difference of 1 unit or of 100 units between the AIC values for two models holds the same ground (a likely disadvantage of using AIC?).

Your comments are welcome.

Neeraj Gupta

PhD Student

University of Iowa

Iowa city,IA 52246

Email: neeraj-gupta.at.uiowa.edu

- On 21 Sep 2005 at 16:18:14, "Walt Woltosz" (walt.at.simulations-plus.com) sent the message

Dear Neeraj,

My understanding (I reiterate that I am not a statistician) is that the Akaike Information Criterion is used to help decide when additional model parameters are justified. In most situations, additional parameters will reduce the error function. But at what point does the reduction in error become overfitting? When I mentioned statistical significance (perhaps I'm abusing the term) I was referring not to the significance of the models, but to the significance of the reduction in the error function produced by the additional model parameters.

The AIC (again, as I understand it) is an approximate indicator of when the additional parameters are justified. When you add model parameters, if the (#Pts)*Log(Obj) term does not decrease by more than the increase in the 2(#Parameters) term, then the AIC increases, and the model would be suspect. It is always a judgment call, but we typically will not use a higher-order model unless there is a significant decrease in AIC.

Often AIC is a negative number. Suppose it is -15.78364 for a one-compartment model, and -16.78551 for a two-compartment model (which should indicate a better model). I would not go to the two-compartment model unless I also saw convincing information from the slope and intercept of predicted vs observed plots, and a change from systematic error to random error in the residuals.

Of course, both models could be poor - AIC tells us nothing about the quality of the models.

Walt Woltosz

Chairman & CEO

Simulations Plus, Inc. (AMEX: SLP)

1220 W. Avenue J

Lancaster, CA 93534-2902

U.S.A.

http://www.simulations-plus.com

Phone: (661) 723-7723

FAX: (661) 723-5524

E-mail: walt.-at-.simulations-plus.com

- On 23 Sep 2005 at 12:09:04, "Hans Proost" (j.h.proost.-at-.rug.nl) sent the message

Dear colleagues,

From the excellent comments by others I conclude that the main problem is in the meaning of 'goodness-of-fit'. This term is used for different purposes, e.g.:

#1: for comparing models, to select the 'best' model, e.g. by comparing AIC.

#2: as a measure of 'how well the model fits', e.g. how large the model misspecification is. I don't know of a good test for this in the case of single measurements (for replicates the lack-of-fit test can be used, as stated by Stan Alekman and Peter Bonate), but visual inspection of residual plots is definitely very useful and appropriate (although not objective).

#3: as a measure of 'how good is the prediction of the model'; for this purpose I recommended RMSE (I am sorry that this was not really clear in my earlier message). Please note that it should not be used for selecting the 'best' model (#1), since it does not include a penalty for overparametrization, and it does not give information about model misspecification (#2). One of the attractive features of RMSE is that its value has a meaning that can be understood by itself, depending on the purpose. E.g. an RMSE of 15% is usually acceptable for a prediction of PK parameters in an individual, but probably not for an analytical assay.

Does anybody have authoritative references that give appropriate definitions for these three situations, so we can end the confusion (and discussion)?

Best regards,

Hans Proost

Johannes H. Proost

Dept. of Pharmacokinetics and Drug Delivery

University Centre for Pharmacy

Antonius Deusinglaan 1

9713 AV Groningen, The Netherlands

tel. 31-50 363 3292

fax 31-50 363 3247

Email: j.h.proost.aaa.rug.nl

[Hans, RMSE as a percent? Is RMSE = sqrt(sum(o-c)*sum(o-c)/n), where o is observed data point, c is calculated data point and n is the number of data points? Where does the percent denominator appear? I ask because I was going to add RMSE to Boomer over the weekend. In the past I've been more concerned with #1 and #2. My major objective has been to get the best fit with the best model, naively leaving the usefulness of the model to someone else. ;-) - db]

- On 23 Sep 2005 at 16:48:42, "Pereira, Luis" (Luis.Pereira.-at-.bos.mcphs.edu) sent the message

Dear All

The RMSE is a kind of generalized standard deviation. It's the spread left over when we have accounted for a given relationship in the data, that is, when we have fitted a model to the data. Hence its other name, residual variation. Unfortunately (IMHO) people do say "the RMSE was 10%" (of the mean of the observed values). But that statistic is called the coefficient of variation, or relative standard deviation, and it is a convenient way to talk across different data sets and models: it puts the standard deviation (the estimate of the random variation or noise in the model) in the context of the magnitude of the mean. I try not to use RMSE in this way, though. However, it may be argued that it is an indicator of reliability (the extent to which the measurement is inherently reproducible, or the degree to which the measurement is influenced by measurement errors). Then what we say is that the standard deviation of the variation not accounted for by the model is ten percent of the average value.

But model building, as I was once told, is as much an art as it is a science. I like to say it's like a jigsaw puzzle. So I certainly look at the objective function being minimized (e.g. sum of squares, weighted or not, regular or orthogonal), R^2, RMSE and so forth. I do compare models for the same data set using information criteria before choosing one. But I also pay a great deal of attention to the variance-covariance and correlation matrices, to the standard deviations (or confidence limits) of the parameter estimates, to the studentized residuals, their serial correlation, skewness and kurtosis, to the optimization algorithm, and most of all to the plots. The goodness-of-fit that matters must be the full picture that all these strokes make in the end.

--

Luis M. Pereira, Ph.D.

Assistant Professor, Biopharmaceutics and Pharmacokinetics

Massachusetts College of Pharmacy and Health Sciences

179 Longwood Ave, Boston, MA 02115

Phone: (617) 732-2905

Fax: (617) 732-2228

Luis.Pereira.at.bos.mcphs.edu

- On 24 Sep 2005 at 15:06:25, "Frederik B. Pruijn" (f.pruijn.aaa.auckland.ac.nz) sent the message

Dear Hans,

For related models (e.g., 1st vs 2nd order polynomial) you could use the F-test to see whether any improvement was worth the sacrifice of a degree of freedom. More often than not the simplest model is the 'best'.

That visual inspection of the residuals is "not objective" is an interesting statement. One could and should test for normality and constant variance; other possible tests include a runs test and tests like Durbin-Watson. How one interprets these tests, and predicted vs measured, is another matter.
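[For readers curious about the runs test just mentioned, here is a minimal sketch (normal approximation; function name and data are mine, purely illustrative): far fewer sign runs in the residuals than expected by chance suggests systematic structure, such as the sinusoidal residual pattern Walt described earlier, rather than random scatter.]

```python
from math import sqrt

def runs_test_z(residuals):
    """Approximate z statistic for the number of sign runs in residuals.

    A strongly negative z means fewer runs than expected under randomness
    (systematic error); |z| > 1.96 is roughly significant at the 5% level."""
    signs = [r > 0 for r in residuals if r != 0]
    n1 = sum(signs)                # positive residuals
    n2 = len(signs) - n1           # negative residuals
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    mu = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    var = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - mu) / sqrt(var)

# Residuals that swing above and below zero in blocks (systematic error):
resid = [1, 2, 2, 1, -1, -2, -2, -1, 1, 2, 2, 1, -1, -2, -2, -1]
print(runs_test_z(resid))  # strongly negative z: too few runs to be random
```

How much weight to give such a number, as against simply looking at the residual plot, is of course exactly the judgment question being debated in this thread.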

FYI, I am not a statistician either.

Frederik Pruijn

- On 24 Sep 2005 at 12:50:25, Stanley110.-a-.aol.com sent the message

f.pruijn.aaa.auckland.ac.nz writes:

>for related models (e.g., 1st vs 2nd order polynomial) you could use
>the F-test to see whether any improvement was worth the sacrifice of
>degree of freedom. More often than not the simplest model is the 'best'.
>To visually inspect the residuals is "not objective" is an
>interesting statement. One could and should test for normality,
>constant variance, and other possible test include a runs test and
>tests like Durbin-Watson. How one interprets these tests, and
>prediction vs measured, is another matter.

These comments are very good guidance. Inspection of residuals is very important. It alone is sufficient to reject a model even when other indicators suggest a fitted model is adequate.

For those interested in reading about this subject, see R.B. D'Agostino and M.A. Stephens, "Goodness of Fit Techniques", Statistics: Textbooks and Monographs vol. 68, Marcel Dekker, 1986. This text is written for statisticians and is not based on the experimental work that is performed by participants in this discussion group. But it can be helpful for those willing to work through it.

Fitting models can fall into two categories: models for the purpose of prediction and models for the purpose of theory. Developing theoretical models, where variables and coefficients describe underlying physical mechanisms, is research. The former are the modeling studies I perform.

Peter Bonate informed me that experimental work done by this discussion group does not collect multiple readings at each level of the independent variable. The Lack of Fit test that I proposed in several postings requires multiple readings. That is unfortunate, because Lack of Fit testing is an appropriate tool for goodness of fit, independent of the type of fitted model: linear, exponential, etc.

Regards,

Stan Alekman

Stanley L. Alekman, PhD
S.L. Alekman Associates Inc.
Pharmaceutical Consultants
Inverness, Illinois - On 26 Sep 2005 at 13:12:30, "Hans Proost" (j.h.proost.-at-.rug.nl) sent the message

Back to the Top

The following message was posted to: PharmPK

Dear David,

Thanks for your comments and questions:

> [Hans, RMSE as a percent? Is RMSE = sqrt(sum(o-c)*sum(o-c)/n), where
> o is observed data point, c is calculated data point and n is the
> number of data points? Where does the percent denominator appear?

I am sorry that I did not express myself very clearly with respect to the conversion to percentage. Actually one can calculate RMSE as such, i.e. the equation you mentioned:

RMSE = sqrt[sum((c-o)*(c-o))/n]

So, RMSE has the same units as c and o. In case of order-of-magnitude differences between the values of o, it is likely that the differences c-o will also have different orders of magnitude, and a relative value (i.e. (c-o)/o) would be more appropriate for evaluation; for convenience a conversion to percentage can be done, resulting in

RMSE% = 100*sqrt[sum((c-o)*(c-o)/(o*o))/n]

Please note that I used o in the denominator here instead of c. It would be illogical to use c in the denominator, since c is a dependent variable.

Both RMSE and RMSE% can be used for any data set, but one is more meaningful than the other in case of different orders of magnitude of (c-o) and o. There is a close resemblance to the problem of weighting, i.e. using RMSE in case of no weighting and RMSE% in case of proportional weighting. For other weighting schemes an analogous concept of RMSE does not seem to be appropriate.
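Hans's two formulas can be put side by side in a short Python sketch; the function names and the toy data are mine, not from the post, and the observed values must be nonzero for RMSE%:

```python
import numpy as np

def rmse(o, c):
    """RMSE = sqrt[sum((c-o)*(c-o))/n], in the units of o and c."""
    o, c = np.asarray(o, float), np.asarray(c, float)
    return np.sqrt(np.mean((c - o) ** 2))

def rmse_pct(o, c):
    """RMSE% = 100*sqrt[sum((c-o)*(c-o)/(o*o))/n]; the observed
    values o must be nonzero."""
    o, c = np.asarray(o, float), np.asarray(c, float)
    return 100.0 * np.sqrt(np.mean(((c - o) / o) ** 2))

# Observations spanning three orders of magnitude, each predicted
# to within about 10%:
obs = np.array([0.1, 1.0, 10.0, 100.0])
calc = np.array([0.11, 0.9, 11.0, 90.0])
abs_err = rmse(obs, calc)      # dominated by the largest observation
rel_err = rmse_pct(obs, calc)  # about 10% at every magnitude
```

This illustrates Hans's point: with order-of-magnitude differences among the observations, the plain RMSE reflects almost only the largest values, while RMSE% treats each magnitude equally.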

> In the past I've been more concerned with #1 and #2. My major
> objective has been to get the best fit with the best model, naively
> leaving the usefulness of the model to someone else. ;-) - db]

I agree! RMSE is a measure of prediction error, and 'predicting' data after fitting a model to the same data is not a real prediction! As I said in my earlier message, RMSE can be used as a measure of 'how good is the fit'. On reflection, however, I think that we should not use RMSE for this purpose, and should leave RMSE for 'true' predictions and related situations, e.g. comparison of fitted parameters (c) to the true parameter values (o) in case of analysis of data generated by Monte Carlo simulation.

To compensate for the number of parameters in a model, it would be better to replace n in the denominator of RMSE and RMSE% by (n-p), where p is the number of parameters. This puts a penalty on the number of parameters, and then 'RMSE' becomes the 'residual standard deviation' and 'RMSE%' the 'residual coefficient of variation', as was pointed out by Luis Pereira. Again, the choice between actual values and percentages should be related to the applied weighting scheme. From Monte Carlo simulations I learned that these values (using (n-p)) are virtually identical to the added data noise. But I agree with Luis Pereira that we should not use the term RMSE for these quantities.
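The (n-p) variants Hans describes can be sketched the same way; again the function names and the toy data are illustrative, not from the post:

```python
import numpy as np

def residual_sd(o, c, n_params):
    """'RMSE' with n replaced by (n - p): the residual standard
    deviation, penalizing the number of fitted parameters."""
    o, c = np.asarray(o, float), np.asarray(c, float)
    return np.sqrt(np.sum((c - o) ** 2) / (len(o) - n_params))

def residual_cv(o, c, n_params):
    """'RMSE%' with n replaced by (n - p): the residual coefficient
    of variation, in percent; observed values o must be nonzero."""
    o, c = np.asarray(o, float), np.asarray(c, float)
    return 100.0 * np.sqrt(np.sum(((c - o) / o) ** 2) / (len(o) - n_params))

# Four observations fitted with a two-parameter model:
o = [1.0, 2.0, 3.0, 4.0]
c = [1.1, 1.9, 3.1, 3.9]
sd = residual_sd(o, c, n_params=2)
cv = residual_cv(o, c, n_params=2)
```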

Best regards,

Hans Proost

Johannes H. Proost
Dept. of Pharmacokinetics and Drug Delivery
University Centre for Pharmacy
Antonius Deusinglaan 1
9713 AV Groningen, The Netherlands
tel. 31-50 363 3292
fax 31-50 363 3247
Email: j.h.proost.-at-.rug.nl

[Hans, thanks for the subtle correction of my equation ;-) I had some missing parentheses. Also, I apologize, you had mentioned RMSE% in an earlier post and I had forgotten. I think I like the division by o (not c). One might argue that 'c' is the true value but 'o' is the real number (in your terms, the independent number). One must remember to avoid dividing by zero. I'm not sure about your argument regarding RMSE and RMSE% versus weighting scheme. I would view both as a measure of goodness of fit, just scaled differently -- although I have little (/no) experience looking at these numbers. I'll leave the 'n' versus 'n-p' to someone else - db] - On 10 Oct 2005 at 18:54:03, Roger Jelliffe (jelliffe.at.usc.edu) sent the message

Back to the Top

Dear All:

In all of this back and forth about goodness of fit, why not consider the likelihood of the results given the data? Most of the criteria discussed so far seem to assume that the parameter distributions are Gaussian or lognormal in their shape, and often they are not. Most methods for population PK/PD analysis, such as NONMEM, compute the likelihood using either the first-order (FO) or the first-order conditional estimation (FOCE) approximations. These approaches do not have the desirable property of statistical consistency, and so there is no guarantee that the more patients you study, the closer the parameter estimates get to the true values. Interesting!!

On the other hand, there are now a number of other methods that compute the likelihood accurately or exactly. There is a good French method, Dave D'Argenio has one, and Bob Leary has PEM, a parametric EM method that uses a Faure low-discrepancy integration method. In addition, the nonparametric approaches such as NPEM, NPAG, and NPOD, from USC, also compute the likelihood exactly. All these methods are consistent. All have visibly greater precision of parameter estimates than NONMEM and similar methods. The NP methods do not make any assumptions at all about the shape of the parameter distributions. They also lead to maximum precision in dosage regimens using the method of multiple model dosage design. There is more material on our web site www.lapk.org, under teaching topics, for example. I would like very much to see comments on this.

Very best regards,

Roger Jelliffe

Roger W. Jelliffe, M.D., Professor of Medicine
Division of Geriatric Medicine
Laboratory of Applied Pharmacokinetics
USC Keck School of Medicine
2250 Alcazar St, Los Angeles CA 90033, USA
Phone (323)442-1300, fax (323)442-1302, email= jelliffe.-at-.usc.edu

Our web site= http://www.lapk.org - On 10 Oct 2005 at 20:05:24, "Walt Woltosz" (walt.aaa.simulations-plus.com) sent the message

Back to the Top

The following message was posted to: PharmPK

Roger,

The original context of this thread was structure-property relationships, and specifically such statistics as q^2, R^2, RMSE, slope & intercept, etc. for such correlations. These are single-measurement data per structure.

Do the methods you cited apply to this kind of data?

Best regards,

Walt

Walt Woltosz
Chairman & CEO
Simulations Plus, Inc. (AMEX: SLP)
1220 W. Avenue J
Lancaster, CA 93534-2902
U.S.A.
http://www.simulations-plus.com
Phone: (661) 723-7723
FAX: (661) 723-5524 - On 11 Oct 2005 at 00:01:40, Stanley110.aaa.aol.com sent the message

Back to the Top

Dr. Jelliffe,

Is it convenient for you to post literature references for these procedures?

Thank you.

Stan Alekman - On 11 Oct 2005 at 18:00:03, Stephen Duffull (steveduffull.at.yahoo.com.au) sent the message

Back to the Top

Roger

Thanks for bringing up this subject. I have often wondered about appropriate statistics for comparisons of nonparametric likelihoods for dichotomous model selection decisions.

It can be shown that the difference of two log(parametric likelihoods) [LL] of the same parametric distribution for nested models is asymptotically chi-squared distributed. If the likelihoods are approximated, e.g. by numerical integration or by linearisation, then you might say that the difference of two LL for nested models is approximately and asymptotically chi-squared distributed. [Note I have left out the "-2*" on the LL here for simplicity.]
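For the parametric, nested case the test Steve describes (with the -2 factor restored) looks like this; a minimal sketch in which the log-likelihood values are hypothetical:

```python
from scipy import stats

def likelihood_ratio_test(ll_reduced, ll_full, extra_params):
    """Likelihood ratio test for two nested models: -2*(LL_reduced -
    LL_full) is asymptotically chi-squared distributed, with df equal
    to the number of extra parameters in the full model."""
    lr = -2.0 * (ll_reduced - ll_full)
    return lr, stats.chi2.sf(lr, extra_params)

# Hypothetical log-likelihoods: the full model gains 3.5 LL units
# for one extra parameter.
lr, p = likelihood_ratio_test(ll_reduced=-105.0, ll_full=-101.5,
                              extra_params=1)
# A small p-value favours the full model over the reduced one.
```

Steve's question is precisely whether this chi-squared reference distribution can be justified when the likelihoods are nonparametric.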

However, can you say this for nonparametric LLs? Nonparametric LLs do not conform to a particular parametric family of distributions, and therefore I am not sure whether the above assumption of an asymptotic chi-squared distribution necessarily applies when you are trying to make model comparisons. I would be interested in any views on this.

If it is not possible to use the likelihood ratio test, then how do you statistically compare two models from knowledge of their nonparametric maximum likelihood values?

Regards,

Steve

Steve Duffull
http://www.uq.edu.au/pharmacy/index.html?page=31309 - On 20 Oct 2005 at 18:43:35, Roger Jelliffe (jelliffe.at.usc.edu) sent the message

Back to the Top

Dear Steve:

Thanks for your note. You are correct about the likelihoods for parameters having Gaussian or lognormal distributions, and you can do the significance tests you described. You are also correct that the distribution of the likelihoods for other distributions is not known, and that because of this, you cannot do the same kind of significance testing that way.

What I mainly wanted to say is that it is important to use likelihood as an index of fitting data, and that methods that compute it with approximations, such as most of the currently available pop modeling methods (NONMEM FO and FOCE, our IT2B, etc.), do not have the guarantee of statistical consistency - that studying more subjects will give results that more closely approach the true results. That is a real scientific problem, and it is interesting that most people seem not to be concerned by it.

On the other hand, methods that compute the likelihood accurately or exactly (our NPEM, NPAG, NPOD, PEM, a good French parametric model program, and others) do the job much better, are consistent, and are also more precise in their parameter estimates. There was a conference in Lyon last fall where this was shown in a blind competition of several different methods, and also at the PAGE meeting last June in Pamplona by Pascal Girard and his group.

I mainly wanted to make the point that comparing various indices of so-called goodness of fit, when the methods giving them are NOT statistically consistent and precise, can be misleading, and that it is important to use methods that are known to be statistically consistent, with precise parameter estimates. If this is not the case, the rest may well be moot.

All the best,

Roger Jelliffe

Roger W. Jelliffe, M.D., Professor of Medicine
Division of Geriatric Medicine
Laboratory of Applied Pharmacokinetics
USC Keck School of Medicine
2250 Alcazar St, Los Angeles CA 90033, USA
Phone (323)442-1300, fax (323)442-1302, email= jelliffe.-a-.usc.edu

Our web site= http://www.lapk.org - On 21 Oct 2005 at 17:58:41, Stephen Duffull (steveduffull.-a-.yahoo.com.au) sent the message

Back to the Top

Roger

Thanks for your email. I think you raise an important point about the approximations to the likelihood that occur in some parametric methods, and that these may influence inference based on them.

Without wanting to prolong the debate on various tools for population modelling, I would like to come back to my question: how do you discriminate between competing models that are developed based on a nonparametric likelihood? What test statistic do you use?

Regards

Steve

Steve Duffull
http://www.uq.edu.au/pharmacy/index.html?page=31309 - On 25 Oct 2005 at 17:21:18, Roger Jelliffe (jelliffe.-at-.usc.edu) sent the message

Back to the Top

Dear Stan:

About references concerning our population modeling approaches - you can go to our web site www.lapk.org and click around under new developments in pop modeling, and under teaching topics. We also now have a paper in press in Clinical Pharmacokinetics.

Roger W. Jelliffe, M.D., Professor of Medicine
Division of Geriatric Medicine
Laboratory of Applied Pharmacokinetics
USC Keck School of Medicine
2250 Alcazar St, Los Angeles CA 90033, USA
Phone (323)442-1300, fax (323)442-1302, email= jelliffe.at.usc.edu

Our web site= http://www.lapk.org

Copyright 1995-2010 David W. A. Bourne (david@boomer.org)