Chi-Square Goodness-of-Fit Test

DESCRIPTION:

Performs a chi-square goodness-of-fit test.

USAGE:

chisq.gof(x, n.classes=ceiling(2 * (length(x)^(2/5))), 
          cut.points=NULL, distribution="normal", n.param.est=0, ...)  

REQUIRED ARGUMENTS:

x
numeric vector. NAs and Infs are allowed but will be removed.

OPTIONAL ARGUMENTS:

n.classes
the number of cells into which the observations are to be allocated. If the vector cut.points is supplied, then n.classes is set to length(cut.points) - 1. The default is recommended by Moore (1986).
cut.points
vector of cutpoints that define the cells. x[i] is allocated to cell j if cut.points[j] < x[i] <= cut.points[j+1]. If x[i] is less than or equal to the first cutpoint or greater than the last cutpoint, then x[i] is treated as missing. If the hypothesized distribution is discrete, cut.points must be supplied.
distribution
character string that specifies the hypothesized distribution. distribution can be one of: "normal", "beta", "cauchy", "chisquare", "exponential", "f", "gamma", "lognormal", "logistic", "t", "uniform", "weibull", "binomial", "geometric", "hypergeometric", "negbinomial", "poisson", or "wilcoxon". You need only supply the first characters that uniquely specify the distribution name. For example, "logn" and "logi" uniquely specify the lognormal and logistic distributions.
n.param.est
number of parameters estimated from the data.
...
parameters for the S-PLUS function that generates p-values for the hypothesized distribution.

VALUE:

list of class "htest", containing the following components:
statistic:
chi-square statistic, with names attribute "chisq".
parameters:
degrees of freedom of the chi-square distribution associated with the statistic. Component parameters has names attribute "df".
p.value:
p-value for the test.
data.name:
character string (vector of length 1) containing the actual name of the input vector x.
counts:
vector of the number of data points that fall into each cell.
expected:
vector of counts expected under the null hypothesis.

NULL HYPOTHESIS:

Let G(x) denote a distribution function. The null hypothesis is that G(x) is the true distribution function of x. The alternative hypothesis is that the true distribution function of x is not G(x).

TEST STATISTIC:

Pearson's chi-square statistic, the same statistic used in the S-PLUS function chisq.test. Asymptotically, this statistic follows the chi-square distribution. If the hypothesized distribution function is completely specified, the degrees of freedom are m - 1, where m is the number of cells. If any parameters are estimated, the degrees of freedom depend on the method of estimation. The usual procedure is to estimate the parameters from the original (i.e., ungrouped) data, and then to subtract one degree of freedom for each parameter estimated. In fact, if the parameters are estimated by maximum likelihood, the degrees of freedom are bounded between (m-1) and (m-1-k), where k is the number of parameters estimated. Therefore, especially when the sample size is small, it is important to compare the test statistic to the chi-square distribution with both (m-1) and (m-1-k) degrees of freedom. See Kendall and Stuart (1979) for a more complete discussion.
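The bracketing of the reference distribution described above can be sketched in Python (not S-PLUS; numpy and scipy assumed). A normal null with both parameters estimated from the ungrouped data (k = 2) and equiprobable cells is an illustrative choice, not the only one chisq.gof supports:

```python
# Rough sketch of Pearson's chi-square GOF statistic with estimated
# parameters, comparing the statistic to chi-square distributions with
# both (m-1) and (m-1-k) degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)
n, m, k = len(x), 8, 2                    # sample size, cells, params estimated

mu, sd = x.mean(), x.std(ddof=1)          # estimated from the ungrouped data
# cell edges equiprobable under the fitted normal; outer edges are +/- infinity
edges = stats.norm.ppf(np.linspace(0.0, 1.0, m + 1), loc=mu, scale=sd)

# left-open, right-closed cells, matching the cut.points convention
observed = np.array([((x > edges[j]) & (x <= edges[j + 1])).sum()
                     for j in range(m)])
expected = n * np.diff(stats.norm.cdf(edges, loc=mu, scale=sd))

chisq = ((observed - expected) ** 2 / expected).sum()
# the true null distribution lies between these two reference chi-squares
p_upper = stats.chi2.sf(chisq, df=m - 1)      # parameters fully specified
p_lower = stats.chi2.sf(chisq, df=m - 1 - k)  # one df lost per estimate
```

If p_lower and p_upper lead to the same accept/reject decision, the ambiguity about degrees of freedom is immaterial for that sample.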

DETAILS:

The chi-square test, introduced by Pearson in 1900, is the oldest and best known goodness-of-fit test. The idea is to reduce the goodness-of-fit problem to a multinomial setting by comparing the observed cell counts with their expected values under the null hypothesis. Grouping the data sacrifices information, especially if the underlying variable is continuous. On the other hand, chi-square tests can be applied to any type of variable: continuous, discrete, or a combination of these.
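The grouping step can be sketched in Python (numpy assumed; cell_counts is a hypothetical helper, not part of chisq.gof). It mirrors the cut.points rule: x[i] goes to cell j when cut.points[j] < x[i] <= cut.points[j+1], and values at or below the first cutpoint, or above the last, are dropped as missing:

```python
# Sketch of allocating observations to left-open, right-closed cells.
import numpy as np

def cell_counts(x, cut_points):
    x = np.asarray(x, dtype=float)
    cuts = np.asarray(cut_points, dtype=float)
    # keep only values strictly inside (cuts[0], cuts[-1]]
    inside = (x > cuts[0]) & (x <= cuts[-1])
    # side="left" makes each cell left-open and right-closed
    idx = np.searchsorted(cuts, x[inside], side="left") - 1
    return np.bincount(idx, minlength=len(cuts) - 1)
```

For example, with cut points (0, 1, 2), the value 1.0 falls in the first cell (0, 1] and 0.0 is treated as missing.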

NOTE:

The distribution theory of chi-square statistics is a large-sample theory. The expected cell counts are assumed to be at least moderately large; as a rule of thumb, each should be at least 5. Although some authors have found this rule to be conservative (especially when the class probabilities are not too unequal), p-values should be regarded with caution when expected cell counts are small.
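A tiny Python helper (an illustrative sketch, not part of chisq.gof) for flagging cells whose expected counts fall below the rule of thumb:

```python
# Return the indices of cells whose expected counts are below the
# rule-of-thumb minimum of 5, signalling that the p-value may be unreliable.
import numpy as np

def small_expected_cells(expected, minimum=5.0):
    expected = np.asarray(expected, dtype=float)
    return np.flatnonzero(expected < minimum)
```

Applying this to the expected component of the returned list is a quick sanity check before trusting the reported p-value.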

REFERENCES:

Kendall, M. G., and Stuart, A. (1979). The Advanced Theory of Statistics, Volume 2: Inference and Relationship, (4th edition). New York: Oxford University Press. Chapter 30.

Moore, D. S. (1986). Tests of chi-squared type. In Goodness-of-Fit Techniques. R. B. D'Agostino and M. A. Stevens, eds. New York: Marcel Dekker.

Conover, W. J. (1980). Practical Nonparametric Statistics. New York: John Wiley and Sons. pp. 189-199.

SEE ALSO:

chisq.test.

EXAMPLES:

# Generate an exponential sample.
x <- rexp(50, rate=1.0)
chisq.gof(x)  # hypothesize a normal distribution
chisq.gof(x, dist="exponential", rate=1.0)  # hypothesize an exponential distribution

# Discrete hypothesized distribution: cut points must be supplied.
x <- rpois(50, lambda=3)
breaks <- quantile(x)
breaks[1] <- breaks[1] - 1  # want to include the minimum value
z <- chisq.gof(x, cut.points=breaks, dist="poisson", lambda=3)
z$counts
z$expected