Cross-validation

DESCRIPTION:

Performs cross-validation; i.e., fits a model with certain observations left out and forms predictions for the observations that were left out, thus allowing a more accurate estimate of prediction error (compared with the sample prediction error of the model fitted to the entire data set). The crossValidation function is generic (see Methods); method functions can be written to handle specific classes of data. Classes which already have methods for this function include:
formula

USAGE:

crossValidation(x, <<y or data>>, 
         modelFit, K = n, args.modelFit = NULL, 
         predFun = <<see below>>, args.predFun = NULL, 
         passOldData.predFun = F, 
         errFun = <<see below>>, args.errFun = NULL, 
         seed = .Random.seed, label, 
         trace = resampleOptions()$trace, assign.frame1 = F, 
         save.indices = F) 
crossValidation.default(x, y, 
         <<modelFit and subsequent arguments>>) 
crossValidation.formula(x, data, 
         <<modelFit and subsequent arguments>>) 

REQUIRED ARGUMENTS:

x
For crossValidation.default, a data frame or matrix containing the explanatory variables. For crossValidation.formula, a formula object that specifies the model, with the response on the left of a ~ operator and the explanatory terms, separated by + operators, on the right.
y
the response variable.
data
data frame used to fit the model.
modelFit
function that fits the model under consideration.

For crossValidation.formula: the function must accept a formula as its first argument, and have a data argument; e.g. modelFit(x, data=data).

For crossValidation.default: this function must take arguments x and y, not necessarily in that order.

OPTIONAL ARGUMENTS:

K
number of groups to be formed. Each group is left out once; the model is fit to the remaining data, and predictions are made for the observations in the left-out group. The default is the number of observations (leave-one-out cross-validation).
args.modelFit
list of arguments to pass to modelFit when fitting the model.
predFun
function that returns predicted values for a given model object and new data values. The first two arguments to this function are the model object and new data (except see Details below). The default is a version of predict.
args.predFun
list of additional arguments to pass to predFun when calculating predicted values.
passOldData.predFun
logical flag indicating if the prediction algorithm refits the original model. If so, the training data must be passed to the prediction function; this is done with an assignment to frame 1.
errFun
function that computes a measure of error, based on actual values of the response variable and fitted values. The first two arguments to this function are the actual and fitted values (except see Details below). The default computes squared error.
args.errFun
list of arguments to pass to the function given in errFun when calculating the prediction error.
seed
seed for randomly choosing group membership. May be a legal random number seed or an integer between 0 and 1023 which is passed to set.seed.
label
character string; if supplied, it is used as a label when printing.
trace
logical flag indicating whether the algorithm should print a message indicating which cross-validation group is currently being processed. The default is determined by resampleOptions()$trace.
assign.frame1
logical flag indicating whether the resampled data should be assigned to frame 1 before fitting the model. Try assign.frame1=T if all estimates are identical (this is slower).
save.indices
logical flag indicating whether to save the indices. See return component indices below.

VALUE:

an object of class crossValidation, with the following components:
call
the call to crossValidation, but with all the arguments explicitly named.
fitted
the values fitted in the cross-validation routine, listed in order of the original observations.
K
the number of groups that were formed.
err
the prediction error, averaged over all cases.
seed.start
the initial value of the random seed for dividing the data into groups, in the same format as .Random.seed.
seed.end
the final value of the random seed, after the groups have all been generated, in the same format as .Random.seed.
indices
optionally, a vector of length equal to the sample size, containing integer values corresponding to group numbers; e.g. if indices[2] = 4, the second observation was placed in the fourth group for cross-validation.
label
optionally, a label to be used for printing.
defaultLabel
a default label constructed from the call, may be used for printing.

SIDE EFFECTS:

To avoid scoping problems, the model fitting function is assigned to frame 1 using the name modelFit. If passOldData.predFun=T, the prediction algorithm assigns the training data to frame 1 using the name oldData. (Note that passOldData.predFun gets set to T automatically when gam is used.) If assign.frame1=T, the data is assigned to frame 1 using the name of the data frame or the name data. You must be sure that these assignments to frame 1 do not overwrite some quantity of interest stored in frame 1.

DETAILS:

Performs cross-validation modeling for a wide scope of expressions. The algorithm samples by leaving out certain rows of a data frame or matrix, so this function is not generally applicable to grouped-data problems that use modeling functions like lme and nlme, unless you use the subject variable.
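
The leave-out-and-predict procedure described above can be sketched in a few lines of S. This is only an illustration, not the package's implementation: kFoldSketch and its arguments are hypothetical names, the model is fixed to lm, the error is fixed to mean squared error, and the near-equal random group assignment is an assumption.

```r
# Illustrative K-fold cross-validation for a linear model.
# kFoldSketch is a hypothetical name, not a function in the package.
kFoldSketch <- function(formula, data, K = nrow(data)) {
  n <- nrow(data)
  # Randomly assign each row to one of K groups of near-equal size (assumed scheme)
  groups <- sample(rep(1:K, length = n))
  # Actual response values: evaluate the left-hand side of the formula in the data
  response <- eval(formula[[2]], data)
  fitted <- rep(NA, n)
  for (k in 1:K) {
    out <- groups == k
    # Fit with group k left out, then predict the rows of group k
    fit <- lm(formula, data = data[!out, , drop = F])
    fitted[out] <- predict(fit, newdata = data[out, , drop = F])
  }
  # Squared prediction error, averaged over all cases
  list(fitted = fitted, K = K, indices = groups,
       err = mean((response - fitted)^2))
}
```

The returned list mimics the fitted, K, indices, and err components listed under VALUE, with fitted kept in the order of the original observations.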

Normally the first two arguments to predFun are the model object and new data. Most methods for predict (the default predFun) satisfy this. However, predict.censorReg currently takes object, p, q, newdata as its first four arguments. To use it, you could either write your own predFun that calls predict.censorReg with the arguments in a different order, or supply args.predFun = list(p=c(.1,.5,.9), q=NULL); this results in internal calls of the form predict(model object, new data, p=c(.1,.5,.9), q=NULL). Because named arguments (p and q) are matched first, the new data ends up as the fourth argument to predict.censorReg, as desired.

Similarly, the first two arguments to errFun are normally the actual and fitted values of the response variable, but these may be displaced to later positions by named arguments in args.errFun.
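
The argument-matching behavior described above can be checked with a small stand-in function. This is a sketch: hypotheticalPredict is an illustrative name that only copies the leading argument order object, p, q, newdata; it does not call predict.censorReg.

```r
# Stand-in with the argument order object, p, q, newdata.
# hypotheticalPredict is an illustrative name, not a package function.
hypotheticalPredict <- function(object, p, q, newdata) {
  # Return the arguments as matched, so the matching can be inspected
  list(object = object, p = p, q = q, newdata = newdata)
}

# Two positional arguments followed by named ones, as in the internal call
matched <- hypotheticalPredict("fit", "new rows", p = c(.1, .5, .9), q = NULL)

# Named arguments are matched first, so the second positional value
# fills the next unmatched formal argument, which is newdata
matched$newdata   # "new rows"
```

The same mechanism applies to errFun: any argument named in args.errFun is matched first, and the actual and fitted values fill the remaining formal arguments in order.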

The combination of predFun and errFun, and their arguments, should be appropriate for your model. For example, in a logistic regression (glm with family=binomial), args.predFun=list(type="response") puts predictions on the probability scale, and errFun could compute a weighted sum of squares. The defaults are appropriate for the usual linear least-squares regression.
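
One possible weighted sum of squares for the logistic case is sketched below. The name weightedSqErr and the choice of weights (the inverse binomial variance of each predicted probability) are assumptions made for illustration; the function returns a sum, following the form of the errFun shown under EXAMPLES.

```r
# Hypothetical error function for 0/1 responses and predicted probabilities.
# Each squared residual is divided by the binomial variance fitted * (1 - fitted).
# Predicted probabilities of exactly 0 or 1 would give a zero denominator.
weightedSqErr <- function(y, fitted) {
  sum((y - fitted)^2 / (fitted * (1 - fitted)))
}

# Small check on made-up values
weightedSqErr(c(0, 1, 1), c(0.2, 0.5, 0.8))
```

Such a function could be supplied as errFun = weightedSqErr, together with args.modelFit = list(family = binomial) and args.predFun = list(type = "response").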

REFERENCES:

Stone, M. (1974), "Cross-validatory choice and assessment of statistical predictions," Journal of the Royal Statistical Society, Ser. B, 36, pp. 111-147.

Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984), Classification and Regression Trees, Belmont, CA: Wadsworth International Group.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.

EXAMPLES:

crossValidation(ozone ~ ., air, lm, K = 10) 
 
crossValidation(skips ~ ., data = solder2, glm, 
  K = 10, args.modelFit = list(family = poisson)) 
 
# crossValidation.default method 
crossValidation(air$wind, air$ozone, smooth.spline, K = 10, predFun = 
 function(object, newdata) predict(object, x = newdata)$y) 
 
# model selection with smooth.spline 
attach(air) 
plot(ozone,temperature) 
tempErr <- rep(NA, 11) 
for(i in 1:11){ 
  res <- crossValidation(ozone, temperature, 
    smooth.spline, args.modelFit = list(df = i+1), 
    predFun = function(object, newdata){ predict(object, 
    x = newdata)$y}, K = 10) 
  tempErr[i] <- res$err 
  } 
argminErr <- which(tempErr == min(tempErr))[1] + 1 
lines(smooth.spline(ozone,temperature, df = argminErr)) 
# note: this simple example ignores the variability 
# in the CV estimates, and just picks the 
# minimum error as the winner 
 
crossValidation(NOx ~ C * E, data = ethanol, loess, K = 10, 
  args.modelFit = list(span = 1/2, degree = 2, 
  parametric = "C", drop.square = "C", 
  control = loess.control("direct"))) 
 
crossValidation(ozone^(1/3) ~ radiation + s(wind, df = 3), 
  data = air, modelFit = gam, K = 10) 
 
# supply the prediction function 
crossValidation(ozone ~ ., air, lm, K = 10, predFun = 
  function(object, newdata) predict.lm(object, 
  newdata, se.fit = T)$fit) 
 
# supply the error function 
crossValidation(ozone ~ ., air, lm, K = 10, errFun = 
  function(y, fitted) sum((y - fitted)^2))