Recursive Partitioning and Regression Trees

DESCRIPTION:

Fit an arbor model.

USAGE:

arbor(formula, data, weights, subset, na.action=na.arbor, method,
      cppFunctions, model=F, x=F, y=T, parms, control=arbor.control(),
      cost, nRandomSplitVars=0, ...)

REQUIRED ARGUMENTS:

formula
a formula, as in the lm function.

OPTIONAL ARGUMENTS:

data
an optional data frame in which to interpret the variables named in the formula.
weights
optional vector of case weights.
subset
optional expression saying that only a subset of the rows of the data should be used in the fit.
na.action
The default action deletes all observations for which y is missing, but keeps those in which one or more predictors are missing.
method
one of "anova", "poisson", "class", "exp", or a list implying a user-specified method written in S-PLUS. If method is missing, the routine tries to make an intelligent guess: if y is a survival object, method="exp" is assumed; if y is a factor, method="class" is assumed; otherwise method="anova" is assumed. However, the code cannot distinguish between two-column response input for Poisson and longitudinal data, so for multi-column input the method must be specified. It is wisest to specify the method directly, especially as more criteria are added to the function.

See manual for details on user specified method.

The "longitudinal" method is not implemented in this version of arbor.

cppFunctions
a named list of character strings which provides a flexible way for users to specify their own split, eval and error functions to be used in the partitioning algorithm. The split function is used to decide the best split for a node. error is used to compute the error at an individual observation. eval is used to compute the prediction value at a node and the node error. The default, which depends on method, is used for any function not specified.
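A structural sketch of such a list follows; the C++ code actually required for each function, and its expected interface, are described in the manual, so the string contents and the object name myFuns here are placeholders only:

```r
# Structural sketch only: each component is a character string of C++ code.
# The required contents of each string are documented in the manual.
myFuns <- list(
    split = "/* C++ code choosing the best split for a node */",
    error = "/* C++ code computing the error at one observation */"
)
# eval is not supplied, so the default for the chosen method is used.
fitU <- arbor(Kyphosis ~ Age + Number + Start, data=kyphosis,
    cppFunctions=myFuns)
```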
model
keep a copy of the model frame in the result. If the input value for model is a model frame (likely from an earlier call to the arbor function), then this frame is used rather than constructing new data.
x
keep a copy of the x matrix in the result.
y
keep a copy of the dependent variable in the result.
parms
optional list of parameters for the splitting function.

Anova and longitudinal methods have no parameters.

For Poisson splitting, the list components can include the coefficient of variation of the prior distribution on the rates (component shrink) and an error method (component method). method can be either "deviance" or "sqrt", and defaults to "deviance". shrink can be any positive numeric value; it defaults to 1 when method="deviance" and to 0 when method="sqrt".
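For example, a Poisson fit with shrinkage and the square-root error method might be specified as follows (the data frame mydata and its variable names are hypothetical):

```r
# Two-column response: observation time and event count.
# shrink=0.5 and method="sqrt" are the Poisson parms described above.
pfit <- arbor(cbind(followup, events) ~ age + trt, data=mydata,
    method="poisson", parms=list(shrink=0.5, method="sqrt"))
```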

Exponential splitting uses the same parameter options as Poisson.

For classification splitting, the list can contain any of: the vector of prior probabilities (component prior), the loss matrix (component loss) or the splitting index (component split). The priors must be positive and sum to 1. The loss matrix must have zeros on the diagonal and positive off-diagonal elements. The splitting index can be "gini" or "information". The default priors are proportional to the data counts, the losses default to 1, and the split defaults to "gini".
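As a sketch, a loss matrix for the Kyphosis data that weights the two kinds of misclassification unequally could be supplied as:

```r
# 2x2 loss matrix: zeros on the diagonal, asymmetric off-diagonal losses.
# One direction of misclassification costs 4, the other costs 1.
lmat <- matrix(c(0, 4, 1, 0), nrow=2)
cfit <- arbor(Kyphosis ~ Age + Number + Start, data=kyphosis,
    parms=list(loss=lmat))
```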

control
options that control details of the arbor algorithm.
cost
optional vector of variable costs, one value per predictor. In choosing the primary split variable, each variable's improvement is divided by its cost; this modified improvement is used to rank the variables and is the value listed in the output. Values must be greater than 0; the default value is 1. Costs are not used in defining surrogate splits.
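For instance, to make one predictor twice as expensive to split on as the others (one cost per predictor, in formula order):

```r
# Costs in the order of the predictors in the formula: Age, Number, Start.
# Number's improvements are halved before variables are ranked.
costfit <- arbor(Kyphosis ~ Age + Number + Start, data=kyphosis,
    cost=c(1, 2, 1))
```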
nRandomSplitVars
Number of variables to sample at each tree node as candidates for splitting. The default value of 0 means that all variables are candidate split variables at every node. A value > 0 gives a randomized tree. In most applications this argument is left at its default. To fit a random forest, see the function forest().
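A randomized tree that samples two candidate variables at each node can be sketched as:

```r
# Each node considers a random sample of 2 of the 3 predictors.
set.seed(42)  # results vary from run to run without a fixed seed
rfit <- arbor(Kyphosis ~ Age + Number + Start, data=kyphosis,
    nRandomSplitVars=2)
```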
...
arguments to arbor.control may also be specified in the call to arbor.

VALUE:

an object of class arbor, a superset of class tree.

REFERENCES:

Atkinson and Therneau (1997). An Introduction to Recursive Partitioning Using the RPART Routines. Technical Report.

Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, Vol. 16, No. 3, 199-231.

Breiman, L. (2001). Random Forests. University of California Statistics Dept. Tech. Report.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Monterey: Wadsworth and Brooks/Cole.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. New York: Springer.

SEE ALSO:


EXAMPLES:

fit <- arbor(Kyphosis ~ Age + Number + Start, data=kyphosis)
fit2 <- arbor(Kyphosis ~ Age + Number + Start, data=kyphosis,
    parms=list(prior=c(.65, .35), split='information'))
fit3 <- arbor(Kyphosis ~ Age + Number + Start, data=kyphosis,
    control=arbor.control(cp=.05))
par(mfrow=c(1,2))
plot(fit)
text(fit, use.n=T)
plot(fit2)
text(fit2, use.n=T)

#  return the model frame and use it in a new fit
fit4 <- arbor(cbind(time,status) ~ inst + age + sex +
    ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss,
    method="poisson", data=lung, model=T)
fit5 <- arbor(model=fit4$model, method=fit4$method, cp=.001, xval=0)