predab.resample is a general-purpose function that is used by functions for specific models. It computes estimates of optimism of, and bias-corrected estimates of a vector of indexes of predictive accuracy, for a model with a specified design matrix, with or without fast backward step-down of predictors. If bw=TRUE, the design matrix x must have been created by ols, lrm, or cph. If bw=TRUE, predab.resample prints a matrix of asterisks showing which factors were selected at each repetition, along with a frequency distribution of the number of factors retained across re-samples.
predab.resample(fit.orig, fit, measure, method=c("boot","crossvalidation",".632","randomization"), bw=FALSE, B=50, pr=FALSE, rule="aic", type="residual", sls=.05, aics=0, strata=FALSE, tol=1e-12, non.slopes.in.x=TRUE, kint=1, cluster, subset, group=NULL, ...)
fit.orig: the original fit, made with the x=TRUE and y=TRUE options specified to the model fitting function. This model should be the FULL model, including all candidate variables, even those that were at some point excluded because of poor associations with the response.
fit: a function that fits the model. Its arguments are x, y, iter, penalty, penalty.matrix, xcol, and other arguments passed to predab.resample. If you don't want iter as an argument inside the definition of fit, add ... to the end of its argument list. iter is passed to fit to inform the function of the sampling repetition number (0=original sample). If bw=TRUE, fit should allow for the possibility of selecting no predictors, i.e., it should fit an intercept-only model if the model has intercept(s). fit must return objects coef and fail (fail=TRUE if fit failed due to singularity or non-convergence; these cases are excluded from summary statistics). fit must add design attributes to the returned object if bw=TRUE.

The penalty.matrix parameter is not used if penalty=0. The xcol vector is a vector of columns of X to be used in the current model fit. For ols and psm it includes a 1 for the intercept position. xcol is not defined if iter=0 unless the initial fit had been from a backward step-down. xcol is used, for example, to select the correct rows and columns of penalty.matrix for the current variables selected.
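As an illustration of the contract described for fit, here is a minimal sketch of a least-squares fitting function written in base R. The name my.fit is hypothetical, and the sketch assumes x holds slope columns only. It ignores penalty, penalty.matrix, and xcol (they are absorbed by ...) and adds no design attributes, so it is not suitable for bw=TRUE.

```r
# Hypothetical fitting function for illustration; not part of any package.
# x: design matrix of slope columns (no intercept column); y: response.
# iter: sampling repetition number (0 = original sample); unused here.
my.fit <- function(x, y, iter = 0, ...) {
  # Always include an intercept, so a zero-column x gives an
  # intercept-only fit.
  X <- cbind(Intercept = rep(1, length(y)), x)
  f <- lm.fit(X, y)
  # Flag the fit as failed when the design matrix is rank deficient.
  fail <- f$rank < ncol(X)
  list(coef = f$coefficients, fail = fail)
}
```

Calling my.fit(x[, 0, drop = FALSE], y) returns a single coefficient equal to the mean of y, which is the intercept-only case mentioned above.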
measure: a function that computes a vector of indexes of predictive accuracy for a given fit. For method=".632" or method="crossval", it will make the most sense for measure to compute only indexes that are independent of sample size. The measure function should take the following arguments or use ...: xbeta (X beta for current fit), y, evalfit, fit, iter, and fit.orig. iter is as in fit. evalfit is set to TRUE by predab.resample if the fit is being evaluated on the sample used to make the fit, FALSE otherwise; fit.orig is the fit object returned by the original fit on the whole sample. Using evalfit will sometimes save computations. For example, in bootstrapping the area under an ROC curve for a logistic regression model, lrm already computes the area if the fit is on the training sample. fit.orig is used to pass computed configuration parameters from the original fit such as quantiles of predicted probabilities that are used as cut points in other samples. The vector created by measure should have names() associated with it.
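A measure function matching the argument list described above might look like the following sketch. The name my.measure and the choice of indexes (R-square and mean squared error, both averages and so not tied to sample size) are assumptions made for illustration.

```r
# Hypothetical measure function for illustration; not part of any package.
# xbeta: linear predictor for the current fit; y: observed response.
# evalfit, fit, iter, and fit.orig are accepted but unused in this sketch.
my.measure <- function(xbeta, y, evalfit = FALSE, fit = NULL,
                       iter = 0, fit.orig = NULL, ...) {
  sse <- sum((y - xbeta)^2)
  sst <- sum((y - mean(y))^2)
  # Return a named vector, as the description requires.
  c("R-square" = 1 - sse / sst, MSE = sse / length(y))
}
```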
"boot"
for ordinary bootstrapping (Efron, 1983, Eq. 2.10). Use ".632" for Efron's .632 method (Efron, 1983, Section 6 and Eq. 6.10), "crossvalidation" for grouped cross-validation, "randomization" for the randomization method. May be abbreviated down to any level, e.g. "b", ".", "cross", "rand".
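Abbreviations of this kind can be resolved in R by partial matching with match.arg. The sketch below shows the idea. That predab.resample resolves method in exactly this way is an assumption, and resolve.method is a hypothetical helper.

```r
# The four method names from the usage line.
methods <- c("boot", "crossvalidation", ".632", "randomization")

# Hypothetical helper: expand an abbreviation to the full method name.
resolve.method <- function(m) match.arg(m, methods)

resolved <- vapply(c("b", ".", "cross", "rand"), resolve.method, character(1))
```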
bw: set to TRUE to do fast backward step-down for each training sample. Default is FALSE.
B: number of repetitions. Default is 50. For method="crossvalidation", this is also the number of groups the original sample is split into.
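To make the role of B under grouped cross-validation concrete, the sketch below splits n observations at random into B groups of nearly equal size. Each group would then serve once as the test sample. The helper cv.groups is hypothetical, and how predab.resample itself forms the groups is not stated here.

```r
# Hypothetical helper: randomly partition row indices 1..n into B groups.
cv.groups <- function(n, B) {
  split(sample.int(n), rep_len(seq_len(B), n))
}

set.seed(2)                      # arbitrary seed for reproducibility
groups <- cv.groups(n = 23, B = 5)
test.rows  <- groups[[1]]        # rows held out in the first repetition
train.rows <- setdiff(seq_len(23), test.rows)
```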
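pr: set to TRUE to print results for each sample. Default is FALSE.

rule: stopping rule for fast backward step-down, "aic" or "p". Default is "aic" to use Akaike's information criterion.

type: type of statistic used in the stopping rule for fast backward step-down, "residual" (the default) or "individual".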
sls: significance level used for stopping in fast backward step-down if rule="p". Default is .05.
aics: stopping criterion for rule="aic". Stops deleting factors when chi-square - 2 times d.f. falls below aics. Default is 0.
strata: set to TRUE if fit.orig has an x element that contains a "strata" attribute which is a vector that should be sampled the same way as the observations in x and y.
tol: tolerance for singularity checking. It is passed to fit and fastbw.
non.slopes.in.x: set to FALSE if the design matrix x does not have columns for intercepts and these columns are needed.
kint: for models with more than one intercept, the intercept to use. This affects the linear predictor that is passed to measure.
cluster: vector of cluster identifiers. This can be specified only if method="boot". If it is present, the bootstrap is done using sampling with replacement from the clusters rather than from the original records. If this vector is not the same length as the number of rows in the data matrix used in the fit, an attempt will be made to use naresid on fit.orig to conform cluster to the data. See bootcov for more about this.
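Sampling with replacement from clusters rather than from records can be sketched as follows. The helper cluster.boot.rows is hypothetical and only illustrates the idea of drawing whole clusters. It is not the code used by predab.resample or bootcov.

```r
# Hypothetical helper: one bootstrap sample of row indices obtained by
# drawing whole clusters with replacement.
cluster.boot.rows <- function(cluster) {
  ids <- unique(cluster)
  # Draw as many clusters as there are distinct clusters, with replacement.
  drawn <- ids[sample.int(length(ids), replace = TRUE)]
  # Keep every row of each drawn cluster; repeated draws repeat the rows.
  unlist(lapply(drawn, function(id) which(cluster == id)))
}

set.seed(3)                      # arbitrary seed for reproducibility
cl <- c("a", "a", "b", "b", "b", "c")
rows <- cluster.boot.rows(cl)
```

Because clusters differ in size, the number of rows in a cluster bootstrap sample can vary from one repetition to the next.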
subset: specifies a subset of the data on which the measure function computes measures of accuracy. The whole dataset is still used for all model development. For example, you may want to validate or calibrate a model by assessing the predictions on females when the fit was based on males and females. When you use cr.setup to build extra observations for fitting the continuation ratio ordinal logistic model, you can use subset to specify which cohort or observations to use for deriving indexes of predictive accuracy. For example, specify subset=cohort=="all" to validate the model for the first layer of the continuation ratio model (Prob(Y=0)).
...: other arguments passed to fit and measure.
For method=".632", the program stops with an error unless every observation is omitted from at least one bootstrap sample. Efron's ".632" method was developed for measures that are formulated in terms of per-observation contributions. In general, error measures (e.g., ROC areas) cannot be written in this way, so this function uses a heuristic extension to Efron's formulation in which it is assumed that the average error measure omitting the ith observation is the same as the average error measure omitting any other observation. Then weights are derived for each bootstrap repetition and weighted averages over the B repetitions can easily be computed.
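The requirement on omitted observations stated for method=".632" can be checked with a few lines of base R. This is a sketch under the assumption that the bootstrap samples are stored as an n by B matrix of row indices. The helper every.obs.omitted is hypothetical.

```r
# Hypothetical helper: TRUE when each of the n observations is absent
# from at least one of the bootstrap samples (one sample per column).
every.obs.omitted <- function(samples, n) {
  # present[i, b] is TRUE when observation i appears in sample b.
  present <- apply(samples, 2, function(s) seq_len(n) %in% s)
  all(rowSums(!present) > 0)
}

set.seed(1)                      # arbitrary seed for reproducibility
n <- 30; B <- 50
samples <- replicate(B, sample.int(n, n, replace = TRUE))  # n x B matrix
ok <- every.obs.omitted(samples, n)
```

With small B the check is more likely to fail, since some observation may appear in every sample.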
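A matrix with rows corresponding to the indexes computed by measure, and columns that include:
optimism: average training-test, except for method=".632", where it is .632 times (index.orig - test)
index.corrected: index.orig-optimism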
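The arithmetic behind the optimism and bias-corrected values can be written out directly. In the sketch below, bias.correct is a hypothetical helper, and training and test stand for the average index values over the repetitions.

```r
# Hypothetical helper reproducing the arithmetic described above.
# index.orig: index from the original fit; training and test: averages of
# the index over the training and test samples.
bias.correct <- function(index.orig, training, test, method = "boot") {
  optimism <- if (method == ".632") {
    0.632 * (index.orig - test)
  } else {
    training - test
  }
  c(optimism = optimism, index.corrected = index.orig - optimism)
}

# Made-up index values, for illustration only.
boot.res <- bias.correct(index.orig = 0.80, training = 0.82, test = 0.76)
e632.res <- bias.correct(index.orig = 0.80, training = 0.82, test = 0.76,
                         method = ".632")
```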
Frank Harrell
Department of Biostatistics, Vanderbilt University
f.harrell@vanderbilt.edu
Efron B, Tibshirani R (1997). Improvements on cross-validation: The .632+ bootstrap method. JASA 92:548–560.
# See the code for validate.ols for an example of the use of
# predab.resample