The crossValidation function is generic (see Methods); method functions can be written to handle specific classes of data. Classes which already have methods for this function include: formula.
crossValidation(x, <<y or data>>, modelFit, K = n, args.modelFit = NULL,
                predFun = <<see below>>, args.predFun = NULL,
                passOldData.predFun = F,
                errFun = <<see below>>, args.errFun = NULL,
                seed = .Random.seed, label,
                trace = resampleOptions()$trace,
                assign.frame1 = F, save.indices = F)
crossValidation.default(x, y, <<modelFit and subsequent arguments>>)
crossValidation.formula(x, data, <<modelFit and subsequent arguments>>)
x: For crossValidation.default, a data frame or matrix containing the explanatory variables. For crossValidation.formula, a formula object that specifies the model, with the response on the left of a ~ operator and the explanatory terms, separated by + operators, on the right.
modelFit: the function used to fit the model. For crossValidation.formula: the function must accept a formula as its first argument, and have a data argument; e.g. modelFit(x, data=data). For crossValidation.default: this function must take arguments x and y, not necessarily in that order. (A minimal wrapper illustrating these calling conventions is sketched after the argument descriptions below.)
args.modelFit: a list of other arguments, if any, to pass to modelFit when fitting the model.
predFun: the function used to calculate predicted values from the fitted model; the default is predict.
args.predFun: a list of other arguments, if any, to pass to predFun when calculating predicted values.
args.errFun: a list of other arguments, if any, to pass to errFun when calculating the prediction error.
seed: the random number seed used when generating the cross-validation groups; passed to set.seed.
assign.frame1: logical flag indicating whether the data should be assigned to frame 1 before fitting the model. Try assign.frame1=T if all estimates are identical (this is slower).
save.indices: logical flag indicating whether to return the cross-validation group assignments; see the indices component below.
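As an illustration of the modelFit and predFun calling conventions described above, here is a minimal sketch for the default method; the names myFit and myPred are hypothetical, not part of the package, and the air data set is the one used in the examples below.

# Hypothetical modelFit for crossValidation.default: ordinary least squares
# via lsfit, which accepts x and y as the default method requires.
myFit <- function(x, y) lsfit(x, y)
# A matching predFun: its first two arguments are the fitted object and the
# new data; it adds an intercept column and multiplies by the coefficients.
myPred <- function(object, newdata) cbind(1, as.matrix(newdata)) %*% object$coef
crossValidation(air$wind, air$ozone, modelFit = myFit, K = 10, predFun = myPred)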
An object of class crossValidation, with the following components:
call: the call to crossValidation, but with all the arguments explicitly named.
seed.start: the value of .Random.seed before the cross-validation groups were generated.
seed.end: the value of .Random.seed after the cross-validation was completed.
indices: (returned if save.indices=T) a vector recording the cross-validation group assigned to each observation; if indices[2] = 4, the second observation was placed in the fourth group for cross-validation.
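For instance, the group assignments can be saved and examined (a minimal sketch using the air data and lm fit from the examples below; cv is an arbitrary name):

cv <- crossValidation(ozone ~ ., air, lm, K = 10, save.indices = T)
table(cv$indices)   # how many observations fell into each of the 10 groups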
Under some circumstances, this function assigns data to frame 1 before calling modelFit. If passOldData.predFun=T, the prediction algorithm assigns the training data to frame 1 using the name oldData. (Note that passOldData.predFun gets set to T automatically when gam is used.) If assign.frame1=T, the data is assigned to frame 1 using the name of the data frame or the name data. You must be sure that these assignments to frame 1 do not overwrite some quantity of interest stored in frame 1.
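For example (a sketch; this simply reruns the lm fit from the examples below with the frame 1 assignment turned on):

# Assign the data to frame 1 before each fold is fitted; slower, but worth
# trying if every fold returns an identical estimate.
crossValidation(ozone ~ ., air, lm, K = 10, assign.frame1 = T)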
Performs cross-validation modeling for a wide scope of expressions. The algorithm samples by leaving out certain rows of a data frame or matrix, so this function is not generally applicable to grouped-data problems that use modeling functions like lme and nlme, unless you use the subject variable.
Normally the first two arguments to predFun are the model object and new data. Most methods for predict (the default predFun) satisfy this. However, predict.censorReg currently has object, p, q, and newdata as its first four arguments. To use it, you could either write your own predFun which calls predict.censorReg with the arguments in a different order, or supply args.predFun = list(p=c(.1,.5,.9), q=NULL); this results in internal calls of the form predict(model object, new data, p=c(.1,.5,.9), q=NULL). Because the named arguments (p and q) take precedence, the new data ends up being used as the fourth argument to predict.censorReg, as desired.
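A sketch of the first alternative (the name censorPred is hypothetical, and the p and q values simply repeat those used above):

# A predFun whose first two arguments are the model object and the new data,
# as crossValidation expects; it reorders them before calling predict.censorReg.
censorPred <- function(object, newdata)
  predict.censorReg(object, p = c(.1, .5, .9), q = NULL, newdata = newdata)
# then supply predFun = censorPred in the call to crossValidation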
Similarly, the first two arguments to errFun are normally the actual and fitted values of the response variable, but these may be displaced to later positions by named arguments in args.errFun.
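For instance (a minimal sketch; absErr is a hypothetical name and the trim value is arbitrary), an extra argument can be passed to a user-written errFun through args.errFun:

# errFun receives the observed and predicted responses as its first two
# arguments; trim is supplied separately via args.errFun.
absErr <- function(y, fitted, trim = 0) mean(abs(y - fitted), trim = trim)
crossValidation(ozone ~ ., air, lm, K = 10,
                errFun = absErr, args.errFun = list(trim = 0.1))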
The combination of predFun and errFun, and their arguments, should be appropriate for your model. For example, in a logistic regression (glm with family=binomial), args.predFun=list(type="response") puts predictions on the probability scale, and errFun could compute a weighted sum of squares. The defaults are appropriate for the usual linear least-squares regression.
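A sketch of that logistic-regression setup (the formula y01 ~ x1 + x2 and data frame dat are placeholders for a 0/1 response and its predictors; the error function is one reasonable weighted-sum-of-squares choice, not a package default):

# Fit with glm/binomial, predict on the probability scale, and weight the
# squared errors by the binomial variance p*(1-p).
crossValidation(y01 ~ x1 + x2, data = dat, modelFit = glm, K = 10,
                args.modelFit = list(family = binomial),
                args.predFun = list(type = "response"),
                errFun = function(y, p) sum((y - p)^2 / (p * (1 - p))))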
Stone, M. (1974), "Cross-validatory choice and assessment of statistical predictions," Journal of the Royal Statistical Society, Ser. B, 36, pp. 111-147.
Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984), Classification and Regression Trees, Belmont, CA: Wadsworth International Group.
Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.
For an annotated list of functions in the package, including other high-level resampling functions, see: .
crossValidation(ozone ~ ., air, lm, K = 10)

crossValidation(skips ~ ., data = solder2, glm, K = 10,
                args.modelFit = list(family = poisson))

# crossValidation.default method
crossValidation(air$wind, air$ozone, smooth.spline, K = 10,
                predFun = function(object, newdata) predict(object, x = newdata)$y)

# model selection with smooth.spline
attach(air)
plot(ozone, temperature)
tempErr <- rep(NA, 11)
for(i in 1:11){
  res <- crossValidation(ozone, temperature, smooth.spline,
                         args.modelFit = list(df = i + 1),
                         predFun = function(object, newdata){
                           predict(object, x = newdata)$y},
                         K = 10)
  tempErr[i] <- res$error
}
argminErr <- which(tempErr == min(tempErr))[1] + 1
lines(smooth.spline(ozone, temperature, df = argminErr))
# note: this simple example ignores the variability
# in the CV estimates, and just picks the
# minimum error as the winner

crossValidation(NOx ~ C * E, data = ethanol, loess, K = 10,
                args.modelFit = list(span = 1/2, degree = 2,
                                     parametric = "C", drop.square = "C",
                                     control = loess.control("direct")))

crossValidation(ozone^(1/3) ~ radiation + s(wind, df = 3),
                data = air, modelFit = gam, K = 10)

# supply the prediction function
crossValidation(ozone ~ ., air, lm, K = 10,
                predFun = function(object, newdata, se.fit)
                  predict.lm(object, newdata, se.fit = T)$fit)

# supply the error function
crossValidation(ozone ~ ., air, lm, K = 10,
                errFun = function(y, fitted) sum((y - fitted)^2))