chisq.gof(x, n.classes=ceiling(2 * (length(x)^(2/5))), cut.points=NULL, distribution="normal", n.param.est=0, ...)
NAs and Infs are allowed but will be removed.
If cut.points is supplied, then n.classes is set to length(cut.points) - 1. The default is recommended by Moore (1986).
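For example, for a sample of size 50 the default works out as follows (a quick illustrative check, not part of the function):

  ceiling(2 * 50^(2/5))   # 2 * 4.78... = 9.56..., so 10 classes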
x[i] is allocated to cell j if cut.points[j] < x[i] <= cut.points[j+1]. If x[i] is less than or equal to the first cut point or greater than the last cut point, then x[i] is treated as missing. If the hypothesized distribution is discrete, cut.points must be supplied.
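This allocation rule matches the behavior of cut() with right-closed intervals; the following is a sketch of the rule, not the internal code:

  x <- c(0.5, 1.2, 2.7, 3.9, 5.0)
  cut.points <- c(1, 2, 3, 4)
  cells <- cut(x, breaks=cut.points)   # intervals (1,2], (2,3], (3,4]
  cells
  # 0.5 and 5.0 fall outside (1, 4] and so are NA, i.e., treated as missing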
distribution can be one of: "normal", "beta", "cauchy", "chisquare", "exponential", "f", "gamma", "lognormal", "logistic", "t", "uniform", "weibull", "binomial", "geometric", "hypergeometric", "negbinomial", "poisson", or "wilcoxon". You need only supply the first few characters that uniquely specify the distribution name. For example, "logn" and "logi" uniquely specify the lognormal and logistic distributions.
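For instance, the following calls are equivalent (a usage sketch, assuming x contains positive data):

  x <- rlnorm(50)
  chisq.gof(x, dist="lognormal")
  chisq.gof(x, dist="logn")      # equivalent: "logn" is unambiguous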
A list of class "htest", containing the following components:

statistic: Pearson's chi-square statistic, with names attribute "chisq".

parameters: the degrees of freedom of the reference chi-square distribution; has names attribute "df".

data.name: the name of the data x.
Let G(x) denote a distribution function. The null hypothesis is that G(x) is the true distribution function of x. The alternative hypothesis is that the true distribution function of x is not G(x).
The test statistic is Pearson's chi-square statistic, the same statistic used in the S-PLUS function chisq.test. Asymptotically, this statistic follows a chi-square distribution.
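As an illustrative sketch of how the statistic is formed from observed and expected cell counts (the counts and probabilities below are hypothetical):

  observed <- c(12, 18, 9, 11)                            # hypothetical cell counts
  expected <- sum(observed) * c(0.25, 0.35, 0.20, 0.20)   # n * cell probabilities under G
  sum((observed - expected)^2 / expected)                 # Pearson's chi-square statistic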
If the hypothesized distribution function is completely specified,
the degrees of freedom are m - 1, where m is the number of cells.
If any parameters are estimated, the degrees of freedom depend on
the method of estimation. The usual procedure is to estimate the
parameters from the original (i.e., not grouped) data, and then to
subtract one degree of freedom for each parameter estimated.
In fact, if the parameters are estimated by maximum likelihood, the degrees of freedom are bounded between (m-1) and (m-1-k), where k is the number of parameters estimated. Therefore, especially when the sample size is small, it is important to compare the test statistic to the chi-square distribution with both (m-1) and (m-1-k) degrees of freedom. See Kendall and Stuart (1979) for a more complete discussion.
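This comparison can be carried out directly (a sketch with a hypothetical statistic and cell structure):

  chi.stat <- 11.3                       # hypothetical observed test statistic
  m <- 8                                 # number of cells
  k <- 2                                 # number of estimated parameters
  1 - pchisq(chi.stat, df = m - 1)       # p-value using m-1 degrees of freedom
  1 - pchisq(chi.stat, df = m - 1 - k)   # p-value using m-1-k degrees of freedom
  # if both lead to the same conclusion, the ambiguity in df does not matter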
The chi-square test, introduced by Pearson in 1900, is the oldest and best known goodness-of-fit test. The idea is to reduce the goodness-of-fit problem to a multinomial setting by comparing the observed cell counts with their expected values under the null hypothesis. Grouping the data sacrifices information, especially if the underlying variable is continuous. On the other hand, chi-square tests can be applied to any type of variable: continuous, discrete, or a combination of these.
The distribution theory of chi-square statistics is a large-sample theory. The expected cell counts are assumed to be at least moderately large. As a rule of thumb, each expected cell count should be at least 5. Although some authors have found this rule to be conservative (especially when the class probabilities are not too unequal), the user should regard p-values with caution when expected cell counts are small.
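A simple check of this rule of thumb (the expected counts below are hypothetical):

  expected <- c(4.2, 6.8, 10.5, 6.8, 4.2)   # hypothetical expected cell counts
  any(expected < 5)                          # TRUE here: regard the p-value with caution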
Kendall, M. G., and Stuart, A. (1979). The Advanced Theory of Statistics, Volume 2: Inference and Relationship (4th edition). New York: Oxford University Press. Chapter 30.

Moore, D. S. (1986). Tests of chi-squared type. In Goodness-of-Fit Techniques (R. B. D'Agostino and M. A. Stephens, eds.). New York: Marcel Dekker.

Conover, W. J. (1980). Practical Nonparametric Statistics. New York: John Wiley and Sons. pp. 189-199.
# generate an exponential sample
x <- rexp(50, rate=1.0)
chisq.gof(x)                                 # hypothesize a normal distribution
chisq.gof(x, dist="exponential", rate=1.0)   # hypothesize an exponential distn.

# discrete hypothesized distribution: cut.points must be supplied
x <- rpois(50, lambda=3)
breaks <- quantile(x)
breaks[1] <- breaks[1] - 1                   # want to include the minimum value
z <- chisq.gof(x, cut.points=breaks, dist="poisson", lambda=3)
z$count
z$expected