chisq.test(x, y=NULL, correct=T)
If x is a data frame, it is immediately coerced to a matrix with the as.matrix function.
If x is a contingency table, it must have at least two rows and two columns, all elements must be non-negative, and NAs or Infs are not allowed.
The elements of the contingency table should be whole numbers, as the test is based on counts; however, since all computations are carried out to double precision accuracy where possible, the storage mode of x will be coerced to "double".
For restrictions on x when it is a factor or a category object, see argument y.
If x is a matrix or data frame, y is ignored.
If x is a factor or category object, y is required and must have the same length as x.
Both factor/category objects must have at least two levels.
NAs in the category index vectors are allowed, but pairs (x[i],y[i]) containing these will be removed.
Each element of the index vectors of x and y should give the membership of that observation in one of the groups present in the levels attributes; an NA in an index vector means that the observation is not in one of the groups listed for that object.
Infs have no meaning as indices, and should not be present.
Conversely, if x or y is not a factor/category object (and x is not a contingency table), it will be coerced to one implicitly.
In this case pairs (x[i],y[i]) containing NAs will be removed, but not pairs with Infs.
Coercion of x and y in this manner is intended for datasets of mode numeric, whose elements are typically small integers.
If correct is TRUE, Yates' continuity correction will be applied, but only for dichotomous categories (2 by 2 tables).
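The effect of the correct argument can be seen by calling the function both ways. The sketch below uses the 2 by 2 table from the example in this help file, plus a made-up 2 by 3 table (hypothetical counts, for illustration only) on which the correction is not expected to apply. Variable names are illustrative.

```r
# 2 by 2 table from the example in this help file (rows A, B; columns No, Yes)
obs <- matrix(c(6, 11, 9, 5), nrow = 2,
              dimnames = list(c("A", "B"), c("No", "Yes")))

with.yates    <- chisq.test(obs, correct = TRUE)
without.yates <- chisq.test(obs, correct = FALSE)

# The corrected statistic is the smaller of the two for this table
with.yates$statistic
without.yates$statistic

# Hypothetical 2 by 3 table (illustrative counts): the table is not 2 by 2,
# so both calls should report the same statistic
big <- matrix(c(12, 8, 15, 10, 9, 14), nrow = 2)
big.on  <- chisq.test(big, correct = TRUE)
big.off <- chisq.test(big, correct = FALSE)
```

Because the correction shrinks each deviation from the expected count, the corrected test is the more conservative of the two on a 2 by 2 table.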
"htest"
, containing the following components:
names
attribute
"
X-squared
".
See section DETAILS for a definition.
statistic
.
parameters
has
names
attribute
"df"
.
x
,
and of
y
if both are factor or category objects.
The expected cell counts are estimated as the products of the observed marginal totals divided by the table total. These expected counts are relevant to several types of null hypothesis: statistical independence of the rows and columns, homogeneity of groups, etc. The appropriateness of the test to a particular null hypothesis and the interpretation of the results depend on the nature of the data at hand, in particular on the sampling scheme. See for example Fleiss (1981).
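A minimal sketch of this computation, using the 2 by 2 table from the example in this help file. The statistic is written in the usual Pearson form, the sum over cells of (observed - expected)^2 / expected, without continuity correction; this is the textbook formula and an assumption about what the function computes internally. All variable names are illustrative.

```r
# Observed counts (rows A, B; columns No, Yes)
obs <- matrix(c(6, 11, 9, 5), nrow = 2,
              dimnames = list(c("A", "B"), c("No", "Yes")))

row.tot <- apply(obs, 1, sum)   # marginal row totals: 15, 16
col.tot <- apply(obs, 2, sum)   # marginal column totals: 17, 14
n       <- sum(obs)             # table total: 31

# Expected count for each cell: row total * column total / table total
expected <- outer(row.tot, col.tot) / n

# Pearson statistic (no continuity correction)
x.squared <- sum((obs - expected)^2 / expected)
```

The expected table has the same margins as the observed one, which is a quick check that the computation was set up correctly.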
The returned p.value should be interpreted carefully.
Its validity depends heavily on the assumption that the expected cell counts are at least moderately large; a minimum size of five is often quoted as a rule of thumb.
Even when cell counts are adequate, the chi-square distribution is only a large-sample approximation to the true distribution of X-squared under the null hypothesis.
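One way to apply the rule of thumb is to compute the expected counts before running the test and look for cells below five. The sketch below does this for a small made-up table; the counts and variable names are hypothetical and serve only as an illustration.

```r
# Hypothetical sparse 2 by 2 table (illustrative counts only)
sparse <- matrix(c(3, 1, 2, 7), nrow = 2)

# Expected counts: row total * column total / table total
expected <- outer(apply(sparse, 1, sum), apply(sparse, 2, sum)) / sum(sparse)

# Flag cells whose expected count falls below the quoted minimum of five
small.cells <- expected < 5
any(small.cells)    # TRUE means the approximation deserves extra caution
sum(small.cells)    # number of cells below the threshold
```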
Indiscriminate use of chisq.test with arbitrary count data is to be discouraged.
The null hypothesis (i.e., probability model), sampling scheme and sizes of the counts all have bearing on the meaningfulness of the test, and some thought should be given to these.
The degrees of freedom (returned component parameters) are given by the product (R-1)*(C-1), where R is the number of rows and C the number of columns of the contingency table.
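As a sketch of how the degrees of freedom enter the result, the p-value can be reproduced from the statistic with pchisq. The example reuses the 2 by 2 table from this help file and assumes the returned object exposes the statistic and p.value components; the name of the degrees-of-freedom component may differ between S dialects, so it is computed directly here.

```r
# 2 by 2 table from the example in this help file
obs <- matrix(c(6, 11, 9, 5), nrow = 2)

# Degrees of freedom: (R - 1) * (C - 1) = (2 - 1) * (2 - 1) = 1
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

fit <- chisq.test(obs, correct = FALSE)

# Upper-tail probability of the chi-square distribution with df degrees of freedom
p <- 1 - pchisq(fit$statistic, df)
```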
Fienberg, S. E. (1983). The Analysis of Cross-Classified Categorical Data, 2nd ed. Cambridge, Mass.: The MIT Press.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions, 2nd ed. New York: Wiley.
Snedecor, G. W. and Cochran, W. G. (1980). Statistical Methods, 7th ed. Ames, Iowa: Iowa State University Press.
x <- factor(c("A","B","A","A","B","B","B","A","B","B","B","B","B",
              "A","B","B","A","B","A","A","A","A","B","A","A","B","A",
              "B","B","A","A"))
y <- factor(c("Yes","No","No","No","No","No","Yes","Yes","Yes","No",
              "No","Yes","No","Yes","No","No","Yes","Yes","Yes","No","Yes",
              "Yes","No","No","No","Yes","No","No","No","Yes","Yes"))
table(x,y)
# Gives:
#     No Yes
#   A  6   9
#   B 11   5
chisq.test(x,y)
chisq.test(table(x,y))   # same thing as chisq.test(x,y)