Question on kstest

Post by Haewoon Na » Sat, 17 Jul 2004 13:12:21


I'm trying to fit my data to one of the well-known distributions
using kstest. What I'd like to be able to say at the end is something
like "The Gamma distribution passed the Kolmogorov-Smirnov test at the
90% confidence level for my data" (the Gamma distribution and the 90%
value are arbitrary examples).

Please take a look at the following code.
Three different distributions were fitted to my data 'xx' using
normfit, gamfit, and weibfit, as shown in the code.
'alpha' is the significance level; how to choose and interpret it is
part of my question.

[norm_para(1) norm_para(2)] = normfit(xx);
[H(1),P(1),KSTAT(1),CV(1)]=kstest(xx', [xx'
gam_para = gamfit(xx);
[H(2),P(2),KSTAT(2),CV(2)]=kstest(xx', [xx'
weib_para = weibfit(xx);
[H(3),P(3),KSTAT(3),CV(3)]=kstest(xx', [xx'
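
The three kstest calls above are cut off in the archived post. The following is only a sketch of what complete calls could look like, not the poster's actual code. It assumes the two-column [x, cdf] matrix form of the CDF argument, passes alpha by name, substitutes wblfit/wblcdf for weibfit/weibcdf, and uses made-up placeholder data for xx.

```matlab
% Placeholder data (hypothetical); substitute your own vector for xx.
xx = gamrnd(2, 3, 1, 100);
x = xx(:);                      % kstest's CDF matrix needs column vectors
alpha = 0.05;                   % significance level

% Normal: fit, then test against the fitted CDF evaluated at the data.
[mu, sigma] = normfit(x);
[H(1), P(1), KSTAT(1), CV(1)] = kstest(x, ...
    'CDF', [x normcdf(x, mu, sigma)], 'Alpha', alpha);

% Gamma: gamfit returns [shape scale].
gam_para = gamfit(x);
[H(2), P(2), KSTAT(2), CV(2)] = kstest(x, ...
    'CDF', [x gamcdf(x, gam_para(1), gam_para(2))], 'Alpha', alpha);

% Weibull: wblfit returns [scale shape].
wbl_para = wblfit(x);
[H(3), P(3), KSTAT(3), CV(3)] = kstest(x, ...
    'CDF', [x wblcdf(x, wbl_para(1), wbl_para(2))], 'Alpha', alpha);
```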

First, alpha = 0.05, and run the code.
The result shows H=[1 1 0], P=[0.0127 0.0121 0.2041].

Second, alpha = 0.01;
then, H=[0 0 0], P=[0.0127 0.0121 0.2041].
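
The two runs differ only in alpha: the p-values do not depend on it, and H simply reports whether each p-value falls below alpha (the confidence level is 1 - alpha). A minimal check using the p-values quoted above; the names H_05 and H_01 are made up for illustration.

```matlab
% p-values reported above for the normal, gamma and Weibull fits
P = [0.0127 0.0121 0.2041];

% H = 1 means "reject the fitted distribution at this significance level"
H_05 = P < 0.05;    % [1 1 0], matches the first run
H_01 = P < 0.01;    % [0 0 0], matches the second run

disp(H_05);
disp(H_01);
```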

What can I conclude from this?
Does it mean that the third distribution (Weibull) fits the best?
(Why? Because it has the biggest p-value?)
Can I simply conclude that the distribution with the biggest p-value
is the best fit?
What is the confidence level in this case? (What is the link between
the p-value, the confidence level, and the significance level?)

Thank you in advance.


Post by Peter Perk » Sat, 17 Jul 2004 22:33:24

Hi Haewoon -

A couple of things:

1) P-values from the one-sample K-S test are only valid if the distribution
you are testing against is _fully known in advance_. They are _not_ valid if
you test against a distribution that you have estimated from data. The
p-values in that case are typically "too large", because you are testing
against a distribution that is "too close" to your data. This will be
discussed in any stats text that describes the K-S test. That being said, it
is not unreasonable to use the p-values as a relative measure of goodness of
fit, as long as you don't interpret them too seriously as an absolute
indicator of fit.

2) The normal and the (logged) Weibull are both location-scale families, so it
is possible to get "correct" p-values. In the case of the normal, you can use
LILLIETEST. For the Weibull, it's not too hard to do something similar with a
Monte-Carlo test.
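
As a rough sketch of point 2, the snippet below runs LILLIETEST for the normal case and a simple Monte Carlo (parametric bootstrap) version of the K-S test for the Weibull case. The placeholder data, the helper ksdist, the simulation count nsim, and the use of the newer wblfit/wblcdf/wblrnd functions are all assumptions for illustration; this is one way to do it, not necessarily the procedure the reply had in mind.

```matlab
% Placeholder data (hypothetical); replace with your own sample.
xx = wblrnd(2, 1.5, 50, 1);
n  = numel(xx);

% Normal case: the Lilliefors test allows for the estimated mean and std.
[h_norm, p_norm] = lillietest(xx);

% Weibull case: build the null distribution of the K-S statistic by
% simulation, refitting the parameters on every simulated sample.
% ksdist expects fitted CDF values evaluated at the SORTED sample.
ksdist = @(F) max(max((1:n)'/n - F), max(F - (0:n-1)'/n));

par   = wblfit(xx);                                  % [scale shape]
d_obs = ksdist(wblcdf(sort(xx), par(1), par(2)));    % observed statistic

nsim  = 500;                                         % number of simulated samples
d_sim = zeros(nsim, 1);
for k = 1:nsim
    xs = wblrnd(par(1), par(2), n, 1);               % sample from the fitted model
    ps = wblfit(xs);                                 % refit, as was done for the data
    d_sim(k) = ksdist(wblcdf(sort(xs), ps(1), ps(2)));
end

% Monte Carlo p-value: share of simulated statistics at least as large
% as the observed one.
p_mc = mean(d_sim >= d_obs);
```

Because the parameters are re-estimated for every simulated sample, p_mc reflects the "too close" effect described in point 1 instead of ignoring it.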

3) When choosing between distributions, and deciding on goodness of fit, it's
always a good idea to make CDF plots of your data against the fitted
distribution. The K-S test gives you a number; a good plot gives you the
whole story. If you have R14, a new GUI in the Stats Toolbox, called
DFITTOOL, is a big help in that direction. Or, use ECDF and STAIRS and
overlay the fitted CDF values on that.
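
The ECDF/STAIRS overlay from point 3 might look something like this. The data and the choice of a Weibull fit are placeholders; swap in your own sample and whichever fitted CDF you want to inspect.

```matlab
% Placeholder data (hypothetical); replace with your own sample.
xx  = wblrnd(2, 1.5, 50, 1);
par = wblfit(xx);                        % [scale shape]

% Empirical CDF as a step function
[f, x] = ecdf(xx);
stairs(x, f);
hold on

% Fitted CDF evaluated on a fine grid over the range of the data
xg = linspace(min(xx), max(xx), 200);
plot(xg, wblcdf(xg, par(1), par(2)));
hold off

xlabel('x');
ylabel('Cumulative probability');
legend('Empirical CDF', 'Fitted Weibull CDF', 'Location', 'southeast');
```

Systematic gaps between the steps and the smooth curve, especially in the tails, show where the fitted distribution departs from the data.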

4) If you have a recent release (R13 with Service Pack 1, or R14), you should
use WBLFIT and WBLCDF instead of WEIBFIT and WEIBCDF. The first two use a
somewhat better parameterization for the distribution.

Hope this helps.

- Peter Perkins
The MathWorks, Inc.