Tree Models

Tree-based models provide an alternative to linear and additive models for regression problems and to linear and additive logistic models for classification problems. Tree models are fit by successively splitting the data to form homogeneous subsets. The result is a hierarchical tree of decision rules useful for prediction or classification. The Tree Models dialog provides tools for fitting and examining tree models.
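The idea of "splitting the data to form homogeneous subsets" can be illustrated with a small sketch. The Python code below is not the S-PLUS implementation; it is a minimal illustration, assuming a single numeric predictor and a numeric response, in which homogeneity is measured by the sum of squared deviations from the node mean. The function names sum_of_squares and best_split and the sample data are hypothetical.

```python
def sum_of_squares(values):
    """Deviance of a regression node: squared deviations from the node mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (cut, deviance) for the split x < cut with the smallest total
    deviance over the two resulting subsets, or None if no split is possible."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # tied predictor values cannot be separated by a cut
        cut = (lo + hi) / 2.0
        left = [resp for _, resp in pairs[:i]]
        right = [resp for _, resp in pairs[i:]]
        dev = sum_of_squares(left) + sum_of_squares(right)
        if best is None or dev < best[1]:
            best = (cut, dev)
    return best


# Hypothetical data with two clearly separated groups.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
cut, dev = best_split(x, y)
print(cut, dev)
```

Applying the same search recursively within each subset, over all predictors, yields the hierarchy of decision rules.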

To fit a tree model

Choose Statistics > Tree > Tree Models. The dialog shown below appears.

Model page

[Figure: Tree Models dialog, Model page (tree1.gif)]

In the Tree Models dialog, the Model page has the following options:

Data

Data Set

Select a data set from the dropdown list or type the name of a data set. You can also type into the Data Set edit field any expression that evaluates to a data set.

Subset Rows

Enter an S-PLUS expression that identifies the rows to use in the analysis. To use all the rows in the data set, leave this field blank.

Omit Rows with Missing Values

Select this box to omit from the analysis any rows in the data set that contain missing values for any of the variables in the model.

Variables

Dependent Variables

Select a variable as the dependent variable in the formula. The variable name will appear in the formula field below, followed by a '~'.

Independent Variables

Select one or more variables as the independent variables, or predictors, in the formula. To select more than one variable, Ctrl-click the variables.

Formula

In the Formula field, enter a formula specifying the desired model. In its simplest form, a formula consists of the response variable, a tilde (~), and a list of predictor variables separated by plus signs (+), for example y ~ x1 + x2. A numeric response produces a regression tree, and a factor response produces a classification tree.

Create Formula

Click the Create Formula button to open a formula builder dialog used to construct a formula specifying the desired model. See the online Help section Building Formulas for more information.

Fitting Options

Min. No. of Obs. Before Split

Enter the minimum number of observations that must lie on each side of a cut on a variable, so that neither child node is smaller than this value. The default is 5.

Min. Node Size

Enter the smallest node size at which a split is still attempted. With the default of 10, a node continues to be split only while it contains at least 10 observations.

Min. Node Deviance

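Enter the minimum deviance a node must have to be split, expressed as a fraction of the root node deviance. Growing stops at nodes whose deviance falls below this value. The default value is 0.010. See the online Help for tree.control for details.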
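How the three fitting options interact can be sketched as follows. This Python fragment is an illustration only, not the S-PLUS code. It assumes that the minimum number of observations before a split applies to each child node, and that the minimum node deviance is measured relative to the deviance of the root node, as described in the online Help for tree.control. The function and argument names are hypothetical.

```python
def may_split(n_obs, node_dev, root_dev, minsize=10, mindev=0.01):
    """Decide whether growing continues at a node.

    Assumption: a node is split only if it holds at least `minsize`
    observations and its deviance is at least `mindev` times the
    deviance of the root node.
    """
    return n_obs >= minsize and node_dev >= mindev * root_dev


def allowed_cut_positions(n_obs, mincut=5):
    """List the cut positions i (after sorting on a variable) that leave
    i observations in one child and n_obs - i in the other, with both
    children holding at least `mincut` observations (an assumption)."""
    return list(range(mincut, n_obs - mincut + 1))


# A node with 12 observations and 5% of the root deviance is still split.
print(may_split(12, 5.0, 100.0))
# Only cuts that leave 5 or more observations on each side are considered.
print(allowed_cut_positions(12))
```

Smaller values of these options allow the tree to grow larger before growing stops.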

Save Model Object

In the Save As field, enter the name for the object in which to save the results of the analysis. If an object with this name already exists, its contents are overwritten. The model object can be used in later functions such as plotting.

Results page

[Figure: Tree Models dialog, Results page (tree2.gif)]

In the Tree Models dialog, the Results page has the following options:

Printed Results

Summary Description

Select this for a short description of the fitted model.

Full Tree

Select this to print the fitted tree with all its branches and leaves. This can lead to a large amount of output.

Saved Results

Save In

Enter the name of a data set in which a part of the analysis, such as fitted values and residuals, predictions, confidence intervals, or standard errors, is saved.

Misclassification Errors

Select this to save misclassification errors. See the online Help for residuals.tree for details.

Pearson Residuals

Select this to save the Pearson residuals, which are a rescaled version of the working residuals. Their sum of squares is the chi-squared statistic.

Deviance Residuals

Select to save the deviance residuals. These residuals are reasonable for use in detecting observations with unduly large influence in the fitting process, since they reflect the same criterion as used in the fitting.

Plot page

[Figure: Tree Models dialog, Plot page (tree3.gif)]

In the Tree Models dialog, the Plot page has the following options:

Branch Size

Proportional to Node Deviance

Select this to size branches in the plotted tree roughly in proportion to the deviance of the node.

Uniformly Sized

Select this to plot all branches with uniform size.

Branch Text

Add Text Labels

Select this to create and add text labels to the tree plot.

Labels

Select the type of label to be used. Choose from Response-Value, Node-Size, and Node-Deviance. See the online Help for text.tree for more details.

Prune/Shrink page

A standard approach to fitting tree models is to fit an overly large tree, and then use pruning or shrinking to simplify the tree. By default, the Prune/Shrink page allows you to see a plot of deviance versus complexity for a sequence of pruned or shrunk trees. Use this plot to determine an appropriate complexity for the tree, and then fit a tree model of a particular complexity. When a particular complexity is specified, the pruned or shrunk tree is used as the basis for summaries, plots, and predictions specified on other dialog pages.

Cost complexity pruning determines a nested sequence of subtrees by recursively "snipping" off the least important splits, based upon the cost-complexity measure. To create a tree of a specific complexity, specify the cost complexity parameter or the size of the returned tree. For classification trees, deviance or misclassification rate may be minimized.
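The trade-off behind the cost-complexity measure can be sketched in a few lines. The Python code below is an illustration under stated assumptions, not the S-PLUS algorithm: it assumes the nested sequence of subtrees has already been computed and that each subtree is summarized by its size (number of terminal nodes) and its deviance. The measure used is deviance plus the cost complexity parameter k times the size. The function names and the numbers in the sequence are hypothetical.

```python
def cost_complexity(deviance, size, k):
    """Penalized deviance: deviance plus k times the number of terminal nodes."""
    return deviance + k * size


def best_subtree(sequence, k):
    """Pick the subtree in the sequence with the smallest cost-complexity."""
    return min(sequence,
               key=lambda t: cost_complexity(t["deviance"], t["size"], k))


def subtree_of_size(sequence, size):
    """Pick the subtree with the requested number of terminal nodes, if any."""
    matches = [t for t in sequence if t["size"] == size]
    return matches[0] if matches else None


# Hypothetical nested sequence: larger trees always have smaller deviance.
sequence = [
    {"size": 1, "deviance": 100.0},
    {"size": 2, "deviance": 60.0},
    {"size": 3, "deviance": 45.0},
    {"size": 5, "deviance": 40.0},
]

for k in (0.0, 5.0, 20.0, 50.0):
    print(k, best_subtree(sequence, k)["size"])
```

Larger values of k penalize size more heavily and therefore select smaller trees. Specifying the size directly selects a member of the same sequence without choosing k.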

Optimal recursive shrinking shrinks lower nodes to their parent nodes based upon the magnitude of the difference between the fitted values of the lower nodes and the fitted values of their parent nodes. To create a tree of a specified complexity, specify the shrinkage parameter.
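The effect of shrinking a node toward its parent can be sketched as follows. This Python code is a simplified illustration, not the optimal recursive shrinking performed by shrink.tree: it assumes a single fixed weight k for every node, so that the shrunken fitted value of a node is k times its own fitted value plus (1 - k) times the shrunken value of its parent. The tree structure, node names, and values are hypothetical.

```python
def shrink(node, k, parent_fit=None):
    """Return {node name: shrunken fitted value} for a node and its descendants.

    Simplifying assumption: one weight k for all nodes. The root is left
    unchanged; every other node is pulled toward its parent's shrunken value.
    """
    if parent_fit is None:
        fit = node["mean"]
    else:
        fit = k * node["mean"] + (1.0 - k) * parent_fit
    result = {node["name"]: fit}
    for child in node.get("children", []):
        result.update(shrink(child, k, fit))
    return result


# Hypothetical fitted tree: each node carries its own fitted value.
tree = {
    "name": "root", "mean": 10.0, "children": [
        {"name": "left", "mean": 4.0, "children": [
            {"name": "left.left", "mean": 2.0},
            {"name": "left.right", "mean": 6.0},
        ]},
        {"name": "right", "mean": 16.0},
    ],
}

print(shrink(tree, 0.5))
```

With k equal to 1 the tree is unchanged, and with k equal to 0 every node takes the fitted value of the root, so k plays the role of a complexity control between those two extremes.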

[Figure: Tree Models dialog, Prune/Shrink page (tree4.gif)]

In the Tree Models dialog, the Prune/Shrink page has the following options:

Prune

Cost Complexity Pruning

Select this to specify pruning of the fitted tree.

Cost Complexity Parameter

Specify the cost complexity parameter for the pruned tree. This is the penalty parameter in a penalized log-likelihood. The default of NULL indicates that a sequence of trees of all possible sizes is desired.

Size of Returned Tree

This field is optional. Enter an integer specifying the desired size of the returned tree; that is, the desired number of terminal nodes. The best tree of that size in the cost complexity sequence is returned. This is a more intuitive complexity measure than the Cost Complexity Parameter.

Pruning Method

Select either deviance or misclass to determine the measure of node heterogeneity used to guide the pruning. For regression trees deviance is minimized. For classification trees deviance or misclassification rate may be minimized.
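The two measures of node heterogeneity for a classification tree can be compared with a short sketch. The Python code below is an illustration, not the S-PLUS implementation. It assumes a node is summarized by its class counts, computes the node deviance as minus twice the multinomial log-likelihood at the observed class proportions, and computes the misclassification count as the number of observations outside the majority class. The function names are hypothetical.

```python
import math


def node_deviance(counts):
    """-2 * sum over classes of n_k * log(p_k), with p_k = n_k / n.

    Classes with zero count contribute nothing to the sum.
    """
    n = sum(counts)
    return -2.0 * sum(c * math.log(c / n) for c in counts if c > 0)


def node_misclass(counts):
    """Number of observations that do not belong to the majority class."""
    return sum(counts) - max(counts)


# A pure node, a moderately mixed node, and an evenly mixed node.
for counts in ([10, 0], [8, 2], [5, 5]):
    print(counts, node_deviance(counts), node_misclass(counts))
```

Both measures are zero for a pure node and grow as the classes become more evenly mixed. Deviance also responds to changes in the class proportions that leave the majority class, and therefore the misclassification count, unchanged.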

Shrink

Optimal Recursive Shrinking

Select this to shrink the fitted tree.

Shrinkage Parameter

Enter a vector of numbers between 0 and 1. A sequence of shrunken trees is determined by optimal shrinking for each value in the vector. By default the vector (1/20, 2/19, 3/18, ..., 10/11) is used. This vector is expressed in the S-PLUS command syntax as (1:10)/(20:11).
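As a check on the default vector, the element-wise division (1:10)/(20:11) can be reproduced outside S-PLUS. The Python sketch below is only a worked illustration of that arithmetic; the variable names are hypothetical.

```python
from fractions import Fraction

# Element i of 1:10 is divided by element i of 20:11, that is, by 21 - i.
numerators = range(1, 11)         # 1, 2, ..., 10
denominators = range(20, 10, -1)  # 20, 19, ..., 11
ks = [Fraction(n, d) for n, d in zip(numerators, denominators)]

# Print each default shrinkage value as an exact fraction and as a decimal.
for k in ks:
    print(k, float(k))
```

The values increase from 1/20 to 10/11 and all lie strictly between 0 and 1, as the field requires.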

Plot Result

Select this to see a plot of deviance versus complexity for the sequence of trees. If a single tree is specified, this box is ignored.

New Data (Optional)

This field is optional. Enter a data set to use in evaluating the goodness of fit of the sequence of trees. Comparing the trees on data that were not used to fit the model gives a better estimate of how well each tree fits new data.

Save As

Enter the name for the object in which to save the results of the analysis.

Predict page

[Figure: Tree Models dialog, Predict page (tree5.gif)]

In the Tree Models dialog, the Predict page has the following options:

New Data

Enter the name of a data set containing the data for which predictions will be computed. Its column names must include the predictor names that appear in the formula of the tree model fitted on the Model page. By default, predictions are computed for the data used to fit the original tree.
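Why the column names of the new data must match the formula can be seen from a sketch of how a fitted tree produces a prediction. The Python code below is an illustration, not predict.tree itself: it assumes a tree stored as nested dictionaries with numeric splits of the form variable < cut, and it returns the fitted value of the terminal node each row falls into. The tree, the variable names x1 and x2, and the function names are hypothetical.

```python
def predict_row(node, row):
    """Drop one row down the tree and return the value of its terminal node."""
    while "split" in node:
        variable, cut = node["split"]
        node = node["left"] if row[variable] < cut else node["right"]
    return node["value"]


def predict(tree, rows, predictors):
    """Predict for each row after checking that every predictor is present."""
    for row in rows:
        missing = [name for name in predictors if name not in row]
        if missing:
            raise KeyError("new data lacks columns: " + ", ".join(missing))
    return [predict_row(tree, row) for row in rows]


# Hypothetical fitted tree for a model of the form y ~ x1 + x2.
tree = {
    "split": ("x1", 5.0),
    "left": {
        "split": ("x2", 2.0),
        "left": {"value": 1.0},
        "right": {"value": 2.0},
    },
    "right": {"value": 9.0},
}

new_data = [
    {"x1": 3.0, "x2": 1.0},
    {"x1": 3.0, "x2": 4.0},
    {"x1": 7.0, "x2": 0.0},
]
print(predict(tree, new_data, ["x1", "x2"]))
```

Each decision rule looks up a predictor by name, so a row that lacks one of the predictors cannot be routed to a terminal node.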

Prediction Type

Select one of vector, tree, or class. See the Help file for the S-PLUS function predict.tree for a detailed description of these options.

Save As

Enter the name for the object in which to save the results of the analysis.

Related S-PLUS language functions

tree, tree.control, plot.tree, text.tree, summary.tree, print.tree, prune.tree, shrink.tree, plot.tree.sequence, predict.tree, menuTree