transcan
is a nonlinear additive transformation and imputation
function, and there are several functions for using and operating on
its results.
transcan
automatically transforms continuous and
categorical variables to have maximum correlation with the best linear
combination of the other variables. There is also an option to use a
substitute criterion - maximum correlation with the first principal
component of the other variables. Continuous variables are expanded
as restricted cubic splines and categorical variables are expanded as
contrasts (e.g., dummy variables). By default, the first canonical
variate is used to find optimum linear combinations of component
columns. This function is similar to
ace
except that
transformations for continuous variables are fitted using restricted
cubic splines, monotonicity restrictions are not allowed, and NAs are
allowed. When a variable has any NAs, transformed scores for that
variable are imputed using least squares multiple regression
incorporating optimum transformations, or NAs are optionally set to
constants. Shrinkage can be used to safeguard against overfitting
when imputing. Optionally, imputed values on the original scale are
also computed and returned. For this purpose, recursive partitioning
or multinomial logistic models can
optionally be used to impute categorical variables, using what is
predicted to be the most probable category.
By default,
transcan
imputes NAs with "best guess" expected values
of transformed variables, back transformed to the original scale.
Values thus imputed are most like conditional medians assuming the
transformations make variables' distributions symmetric (imputed
values are similar to conditional modes for categorical variables). By
instead specifying
n.impute
,
transcan
does approximate multiple imputation
from the distribution of each variable conditional on all other
variables. This is done by sampling
n.impute
residuals from the
transformed variable, with replacement (a la bootstrapping), or by
default, using Rubin's approximate Bayesian bootstrap, where a sample
of size n with replacement is selected from the residuals on n
non-missing values of the target variable, and then a sample of size m
with replacement is chosen from this sample, where m is the number of
missing values needing imputation for the current multiple imputation
repetition. Neither of these bootstrap procedures
assumes normality or even symmetry of residuals.
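The two-stage draw described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code that transcan runs: the helper name abb_draw and the toy residual vector are hypothetical.

```r
# Hypothetical helper (not part of Hmisc): approximate Bayesian bootstrap
# draw of m residuals from a pool of n observed residuals.
abb_draw <- function(res, m) {
  n <- length(res)
  # Stage 1: resample the n residuals with replacement
  stage1 <- res[sample.int(n, n, replace = TRUE)]
  # Stage 2: draw the m residuals needed for this imputation from stage 1
  stage1[sample.int(n, m, replace = TRUE)]
}

set.seed(1)
res   <- c(-1.2, -0.4, 0.1, 0.3, 2.5)  # toy residuals, deliberately skewed
draws <- abb_draw(res, m = 3)          # one residual per missing value
```

Because the draws are taken from the observed residuals themselves, no distributional shape is imposed on them.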
For sometimes-missing categorical variables, optimal scores are
computed by adding the "best guess" predicted mean score to random
residuals off this score. Then categories having scores closest to
these predicted scores are taken as the random multiple imputations
(
impcat="tree"
or
"rpart"
are not currently allowed with
n.impute
). The literature recommends using
n.impute=5
or greater.
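The nearest-score rule for categorical variables can be illustrated as follows. The scores, predicted values, and residual pool are made-up numbers, and nearest_category is a hypothetical helper, not an Hmisc function.

```r
# Hypothetical optimal scores for a three-level categorical variable
cat_scores <- c(a = -1.1, b = 0.2, c = 1.4)

# Return the category whose score is closest to each target score
nearest_category <- function(target, scores) {
  idx <- vapply(target,
                function(s) unname(which.min(abs(scores - s))),
                integer(1))
  names(scores)[idx]
}

set.seed(4)
pred  <- c(0.1, 1.0, -0.7)             # "best guess" predicted mean scores
res   <- c(-0.5, -0.1, 0.0, 0.2, 0.6)  # toy pool of residuals
# Add a randomly drawn residual to each predicted score, then snap to a category
noisy   <- pred + res[sample.int(length(res), length(pred), replace = TRUE)]
imputed <- nearest_category(noisy, cat_scores)
```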
transcan
provides only an approximation to multiple imputation,
especially since it "freezes" the imputation model before drawing the
multiple imputations rather than using different estimates of
regression coefficients for each imputation. For multiple imputation,
the
aregImpute
function provides a much better approximation to the
full Bayesian approach while still not requiring linearity assumptions.
When you specify
n.impute
to
transcan
you can use
fit.mult.impute
to re-fit any model
n.impute
times based on
n.impute
completed datasets (if there are any sometimes missing
variables not specified to
transcan
, some observations will still be
dropped from these fits). After fitting
n.impute
models,
fit.mult.impute
will return the fit object from the last imputation,
with
coefficients
replaced by the average of the
n.impute
coefficient vectors and with a component
var
equal to the
imputation-corrected variance-covariance matrix.
fit.mult.impute
can also use the object created by the
mice
function in the MICE
library to draw the multiple imputations, as well as objects created
by
aregImpute
.
The
summary
method for
transcan
prints the function call,
R-squares achieved in transforming each variable, and for each variable
the coefficients of all other transformed variables that are used to
estimate the transformation of the initial variable. If
imputed=TRUE
was used in the call to transcan, summary also uses the
describe
function to print a summary of imputed values. If
long=TRUE
, also prints all imputed values with observation
identifiers. There is also a simple function
print.transcan
which merely prints the transformation matrix and the function call. It
has an optional argument
long
, which if set to
TRUE
causes
detailed parameters to be printed. Instead of plotting while
transcan()
is running, you can plot the final transformations
after the fact using
plot.transcan
, if the option
trantab=TRUE
was specified to
transcan
. If in addition
the option
imputed=TRUE
was specified to
transcan
,
plot.transcan
will show the location of imputed values (including
multiples) along the axes.
impute
does imputations for a selected original data variable, on
the original scale (if
imputed=TRUE
was given to
transcan
). If you do not specify a variable to
impute
, it
will do imputations for all variables given to
transcan
which had
at least one missing value. This assumes that the original variables
are accessible (i.e., they have been
attach
ed) and that you want
the imputed variables to have the same names as the original variables.
If
n.impute
was specified to
transcan
you must tell
impute
which
imputation
to use.
predict
computes predicted variables and imputed values from a
matrix of new data. This matrix should have the same column variables
as the original matrix used with
transcan
, and in the same order
(unless a formula was used with
transcan
).
Function
is a generic function generator.
Function.transcan
creates S functions to transform variables using
transformations created by
transcan
. These functions are useful
for getting predicted values with predictors set to values on the original
scale.
Varcov
methods are defined here so that imputation-corrected
variance-covariance matrices are readily extracted from
fit.mult.impute
objects, and so that
fit.mult.impute
can easily
compute traditional covariance matrices for individual completed
datasets. Specific
Varcov
methods are defined for
lm
,
glm
, and
multinom
fits.
The subscript function preserves attributes.
The
invertTabulated
function does either inverse linear
interpolation or uses sampling to sample qualifying x-values having
y-values near the desired values. The latter is used to get inverse
values having a reasonable distribution (e.g., no floor or ceiling
effects) when the transformation has a flat or nearly flat segment,
resulting in a many-to-one transformation in that region. Sampling
weights are a combination of the frequency of occurrence of x-values
that are within
tolInverse
times the range of
y
and the squared
distance between the associated y-values and the target y-value (
aty
).
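The difference between the two inversion strategies can be seen with base R alone. The sketch below is an approximation of the idea, not the invertTabulated implementation: it uses approx with ties=max (an assumption made here so that a flat segment maps to its highest x-value) and, for the sampling branch, weights candidates by frequency only, leaving out the squared-distance component of the weights.

```r
x    <- 1:10
y    <- c(1, 2, 3, 4, 5, 5, 5, 5, 9, 10)  # flat segment at y = 5
freq <- c(1, 1, 1, 1, 1, 2, 3, 4, 1, 1)
aty  <- 5                                 # target transformed value
tol  <- 0.05 * diff(range(y))             # tolerance as a fraction of range(y)

# Inverse linear interpolation: swap the roles of x and y.
# With ties = max the flat segment collapses to its largest x-value.
inv_lin <- approx(y, x, xout = aty, rule = 2, ties = max)$y

# Sampling: pick among x-values whose y is within tol of the target,
# with probability proportional to frequency (distance weighting omitted).
ok  <- abs(y - aty) <= tol
idx <- which(ok)
set.seed(1)
inv_samp <- x[idx[sample.int(length(idx), 1000, replace = TRUE,
                             prob = freq[ok])]]
```

Interpolation always returns the same x-value on the flat segment, whereas sampling spreads the inverse values over all of x = 5, 6, 7, 8.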
transcan(x, method=c("canonical","pc"),
         categorical=NULL, asis=NULL, nk, imputed=FALSE, n.impute,
         boot.method=c('approximate bayesian', 'simple'),
         trantab=FALSE, transformed=FALSE,
         impcat=c("score", "multinom", "rpart", "tree"),
         mincut=40,
         inverse=c('linearInterp','sample'), tolInverse=.05,
         pr=TRUE, pl=TRUE, allpl=FALSE, show.na=TRUE,
         imputed.actual=c('none','datadensity','hist','qq','ecdf'),
         iter.max=50, eps=.1, curtail=TRUE,
         imp.con=FALSE, shrink=FALSE, init.cat="mode",
         nres=if(boot.method=='simple')200 else 400,
         data, subset, na.action, treeinfo=FALSE,
         rhsImp=c('mean','random'), details.impcat='', ...)

## S3 method for class 'transcan':
summary(object, long=FALSE, ...)

## S3 method for class 'transcan':
print(x, long=FALSE, ...)

## S3 method for class 'transcan':
plot(x, ...)

## S3 method for class 'transcan':
impute(x, var, imputation, name, where.in, data,
       where.out=1, frame.out, list.out=FALSE, pr=TRUE, check=TRUE, ...)

fit.mult.impute(formula, fitter, xtrans, data, n.impute, fit.reps=FALSE,
                derived, pr=TRUE, subset, ...)

## S3 method for class 'transcan':
predict(object, newdata, iter.max=50, eps=0.01, curtail=TRUE,
        type=c("transformed","original"),
        inverse, tolInverse, ...)

Function(object, ...)

## S3 method for class 'transcan':
Function(object, prefix=".", suffix="", where=1, ...)

invertTabulated(x, y, freq=rep(1,length(x)),
                aty, name='value',
                inverse=c('linearInterp','sample'),
                tolInverse=0.05, rule=2)

Varcov(object, ...)

## Default S3 method:
Varcov(object, regcoef.only=FALSE, ...)

## S3 method for class 'lm':
Varcov(object, ...)

## S3 method for class 'glm':
Varcov(object, ...)

## S3 method for class 'multinom':
Varcov(object, ...)

## S3 method for class 'fit.mult.impute':
Varcov(object, ...)
dimnames
). If row
names are present, they are used in forming the
names
attribute
of imputed values if
imputed=TRUE
.
x
may also be a formula, in which
case the model matrix is created automatically, using data in the calling
frame. Advantages of using a formula are that
categorical
variables
can be determined automatically by a variable being a
factor
variable, and variables with two unique levels are modeled
asis
.
Variables with 3 unique values are considered to be
categorical
if
a formula is specified. For a formula you may also specify that a
variable is to remain untransformed by enclosing its name with the
identity function, e.g.
I(x3)
. The user may add other variable names to the
asis
and
categorical
vectors. For
invertTabulated
,
x
is a
vector or a list with three components: the x vector, the
corresponding vector of transformed values, and the corresponding
vector of frequencies of the pair of original and transformed variables.
For
print
,
plot
,
impute
, and
predict
,
x
is an object created by
transcan
.
coefficients
and for which
Varcov
will return a
variance-covariance matrix. E.g.,
fitter=lm, glm, ols
. At present models
involving non-regression parameters (e.g., scale parameters in
parametric survival models) are not handled fully.
transcan
,
aregImpute
, or
mice
method="canonical"
or any abbreviation thereof, to use canonical
variates (the default).
method="pc"
transforms a variable instead so as to maximize
the correlation with the first principal component of the other
variables.
x
which are categorical,
for which the ordering of re-scored values is not necessarily preserved.
If
categorical
is omitted, it is assumed that all variables are
continuous (or binary). Set
categorical="*"
to treat all variables
as categorical.
lm.fit.qr
is used to impute missing values.
You may want to treat binary variables
asis
(this is automatic if
using a formula). If imputed=TRUE, you
may want to use
"categorical"
for binary variables if you want
to force imputed values to be one of the original data values.
Set
asis="*"
to treat all variables
asis
.
asis
) in a restricted cubic spline function. Default is 3 (yielding
2 parameters for a variable) if
n < 30
, 4 if
30 <= n < 100
, and 5 if
n >= 100
(4 parameters).
TRUE
to return a list containing imputed values on the original
scale.
If the transformation for a variable is non-monotonic, imputed
values are not unique.
transcan
uses the
approx
function,
which returns the highest value of the variable with the transformed
score equalling the imputed score.
imputed=TRUE
also causes original-scale imputed values to be shown as tick
marks on the top margin of each graph
when
show.na=TRUE
(for the final iteration only).
For categorical predictors, these imputed values are
jitter
ed so
that their frequencies can be visualized. When
n.impute
is used,
each NA will have
n.impute
tick marks.
n.impute=5
is frequently recommended.
boot.method="simple"
to use the usual
bootstrap one-stage sampling with replacement.
TRUE
to add an attribute
trantab
to the returned matrix. This
contains a vector of lists each with components
x
and
y
containing
the unique values and corresponding transformed values for the
columns of
x
. This is set up to be used easily with the
approx
function. You must specify
trantab=TRUE
if you want to later use the
predict.transcan
function with
type="original"
.
TRUE
to cause
transcan
to return an object
transformed
containing the matrix of transformed variables
impcat="score"
to impute the category
whose canonical variate score is closest to the predicted score.
Use
impcat="tree"
to impute categorical variables using the
tree()
function, using the values of all other transformed
predictors.
impcat="rpart"
will use
rpart
. A better but somewhat
slower approach is to use
impcat="multinom"
to fit a multinomial
logistic model to the categorical variable, at the last iteration of
the
transcan
algorithm. This uses the
multinom
function in the
nnet
library of the
MASS
package (which is assumed to have been
installed by the user) to fit a polytomous logistic model to the
current working transformations of all the other variables (using
conditional mean imputation for missing predictors). Multiple
imputations are made by drawing multinomial values from the vector of
predicted probabilities of category membership for the missing
categorical values.
imputed=TRUE
, there are categorical variables, and
impcat="tree"
,
mincut
specifies the lowest node size that will be allowed to be
split by
tree
. The default is 40.
invertTabulated
function
(see above) with the
"sample"
option, specify
inverse="sample"
.
freq
and by the distance measure, for determining the set of x
values having y values within a tolerance of the value of
aty
in
invertTabulated
. For
predict.transcan
,
inverse
and
tolInverse
are obtained from options that were specified to
transcan
by default. Otherwise, if not specified by the user, these
default to the defaults used to
invertTabulated
.
transcan
, set to
FALSE
to suppress printing r-squares
and shrinkage factors. For
impute.transcan
set to
FALSE
to suppress messages concerning the number of NAs imputed, or for
fit.mult.impute
set to
FALSE
to suppress printing variance
inflation factors accounting for imputation, rate of missing
information, and degrees of freedom.
FALSE
to suppress plotting the final transformations with
distribution of scores for imputed values (if
show.na=TRUE
).
TRUE
to plot transformations for intermediate iterations.
FALSE
to suppress the distribution of scores assigned to
missing values (as tick marks on the right margin of each graph).
See also
imputed
.
"none"
to suppress plotting of actual vs. imputed
values for all variables having any NAs. Other choices are
"datadensity"
to use
datadensity
to make a single plot,
"hist"
to make a series of back-to-back histograms,
"qq"
to make a series
of q-q plots, or
"ecdf"
to make a series of empirical cdfs. For
imputed.actual="datadensity"
for example you get
a rug plot of the non-missing values for the variable with beneath it
a rug plot of the imputed values.
When
imputed.actual
is not
"none"
,
imputed
is automatically set
to
TRUE
.
transcan
or
predict
.
For
predict
, only one iteration is used if there
are no NAs in the data or if
imp.con
was used.
transcan
and
predict
.
eps
is the
maximum change in transformed values from one iteration to the next.
If for a given iteration all new transformations
of variables differ by less than
eps
(with or without negating the
transformation to allow for "flipping") from the transformations in
the previous iteration, one more iteration is done for
transcan
.
During this
last iteration, individual transformations are not updated but
coefficients of transformations are. This improves stability of
coefficients of canonical variates on the right-hand-side.
eps
is ignored when
rhsImp="random"
.
transcan
, causes imputed values on the transformed scale to
be truncated so that their ranges are within the ranges of
non-imputed transformed values.
For
predict
,
curtail
defaults to
TRUE
to truncate predicted transformed
values to their ranges in the original fit (
xt
).
transcan
, set to
TRUE
to impute NAs on the original scales with
constants (medians or most frequent category codes). Set to a vector
of constants to instead always use these constants for imputation.
These imputed values are ignored when fitting the current working
transformation for a single variable.
FALSE
to use ordinary least squares or canonical variate estimates.
For the purposes of imputing NAs, you may want to set
shrink=TRUE
to avoid
overfitting when developing a prediction equation to predict each variable
from all the others (see details below).
"mode"
to use a dummy variable set to 1 if the value is the most
frequent value (this is the default).
Use
"random"
to use a random 0-1 variable. Set
to
"asis"
to use the original integer codes as starting scores.
n.impute
is specified. If the
dataset has fewer than
nres
observations, all residuals are saved.
Otherwise a random sample of the residuals of length
nres
without
replacement is saved. The default for
nres
is higher if
boot.method="approximate bayesian"
.
x
is a formula. The default
na.action
is
na.retain
(defined by
transcan
) which keeps all observations with
any
NA
s.
For
impute.transcan
,
data
is a data frame to use as the source of
variables to be imputed, rather than using
where.in
. For
fit.mult.impute
,
data
is mandatory and is a data frame containing
the data to be used in fitting the model but before imputations
are applied. Variables omitted from
data
are assumed to be
available from frame 1 and do not need to be imputed.
TRUE
to get additional information printed when
impcat="tree"
,
such as the predicted probabilities of category membership.
"random"
to use random draw imputation when a sometimes
missing variable is moved to be a predictor of other sometimes missing
variables. Default is
rhsImp="mean"
, which uses conditional mean
imputation on the transformed scale. Residuals used are residuals
from the transformed scale. When
"random"
is used,
transcan
runs
5 iterations and ignores
eps
.
transcan
object
an element
details.impcat
containing details of how the
categorical variable was multiply imputed.
scat1d
or to the
fitter
function (for
fit.mult.impute
)
summary
, set to
TRUE
to print all imputed values.
For
print
, set to
TRUE
to print details of transformations/imputations.
impute
, is a variable that was originally a column in
x
, for
which imputed values are to be filled in.
imputed=TRUE
must have been
used in
transcan
. Omit
var
to impute all variables, creating new
variables in
search
position
where
.
impute()
. Default is character
string version of the second argument (
var
) in the call to
impute
. For
invertTabulated
, is the name of variable being
transformed (used only for warning messages).
search
list to find variables that need to be imputed, when
all variables are to be imputed automatically by
impute.transcan
(i.e., when no input variable name is specified).
Default is first
search
position that contains the first variable to
be imputed.
search
list for storing variables with missing values
set to imputed values, for
impute.transcan
when all variables with
missing values are being imputed automatically.
where.out
you can specify an S frame
number into which individual new imputed variables will be written.
For example,
frame.out=1
is useful for putting new variables into a
temporary local frame when
impute
is called within another function
(see
fit.mult.impute
). See
assign
for details about frames.
var
is not specified, you can set
list.out=TRUE
to have
impute.transcan
return a list containing variables with needed
values imputed. This list will contain a single imputation.
FALSE
to suppress certain warning messages
transcan
. If a formula was originally specified to
transcan
(instead of a data matrix),
newdata
is optional and if
given must be a data frame; a model
frame is generated automatically from the previous formula. The
na.action
is handled automatically, and the levels for factor variables
must be the same and in the same order as were used in the original
variables specified in the formula given to
transcan
.
TRUE
to save all fit objects from the fit for each imputation in
fit.mult.impute
. Then the object returned will have a component
fits
which is a list whose
i
th element is the
i
th fit object.
derived=expression(ratio <- weight/height)
. For multiple derived
variables use the form
derived=expression({ratio <- weight/height;
product <- weight*height})
or put the expression on separate input
lines. To monitor the multiply-imputed derived
variables you can add to the
expression
a command such as
print(describe(ratio))
. See the example below.
trantab=TRUE
to
transcan
, specifying
type="original"
does the table look-ups with
linear interpolation to return the input matrix
x
but with imputed
values on the original scale inserted for NAs. For categorical variables,
the method used here is to select
the category code having a corresponding scaled value closest to the
predicted transformed value. This corresponds to the default
impcat
;
a problem in getting predicted
values for
tree
objects prevented using
tree
for this. Note:
imputed values thus returned when
type="original"
are single
expected value imputations even if
n.impute
is given.
transcan
, or an object to be
converted to S function code, typically a model fit object of some sort
x
, the name
of the new function will be
prefix
placed in front of the variable name,
and
suffix
placed in back of the name. The default is to use names
of the form
.varname
, where
varname
is the variable name.
search
list at which to store new functions (for
Function
).
Default is position 1 in the search list. See the
assign
function for more
documentation on the
where
argument.
x
for
invertTabulated
, if its first
argument
x
is not a list
x
and
y
if
x
is not a list. Default is a vector of ones.
approx
.
transcan
assumes
rule
is always
2
TRUE
to make
Varcov.default
delete positions in the covariance matrix for any non-regression
coefficients (e.g., log scale parameter from
psm
or
survreg
)
The starting approximation to the transformation for each variable
is taken to be the original coding of the variable. The initial
approximation for each missing value is taken to be the median of
the non-missing values for the variable (for continuous ones) or
the most frequent category (for categorical ones). Instead, if
imp.con
is
a vector, its values are used for imputing NAs. When using each
variable as a dependent variable, NAs on that variable cause all
observations to be temporarily deleted. Once a new working transformation
is found for the variable, along with a model to predict that transformation
from all the other variables, that latter model is used to impute
NAs in the selected dependent variable if
imp.con
is not specified.
When that variable is used
to predict a new dependent variable, the current working imputed values
are inserted. Transformations are updated after each variable becomes
a dependent variable, so the order of variables on
x
could conceivably
make a difference in the final estimates. For obtaining out-of-sample
predictions/transformations,
predict
uses the same iterative
procedure as
transcan
for imputation, with the same starting
values for fill-ins as were used by
transcan
. It also (by default)
uses a conservative approach of curtailing transformed variables to
be within the range of the original ones.
Even when
method="pc"
is specified, canonical variables are used
for imputing missing values.
Note that fitted transformations, when evaluated at imputed variable
values (on the original scale), will not precisely match the transformed
imputed values returned in
xt
. This is because
transcan
uses an
approximate method based on linear interpolation to back-solve for
imputed values on the original scale.
Shrinkage uses the method of Van Houwelingen and Le Cessie (1990) (similar to
Copas, 1983). The shrinkage factor is
[1-(1-R2)(n-1)/(n-k-1)]/R2
, where
R2
is the apparent R-squared for predicting the variable,
n
is the number
of non-missing values, and
k
is the effective number of degrees of freedom
(aside from intercepts). A heuristic estimate is used for
k
:
A - 1 + sum(max(0,Bi-1))/m + m
, where
A
is the number of d.f. required
to represent the variable being predicted, the
Bi
are the number of
columns required to represent all the other variables, and
m
is the
number of all other variables. Division by
m
is done because the
transformations for the other variables are fixed at their current
transformations the last time they were being predicted. The
+ m
term
comes from the number of coefficients estimated on the right hand side,
whether by least squares or canonical variates. If a shrinkage factor
is negative, it is set to 0. The shrinkage factor is the ratio of
the adjusted R-squared to the ordinary R-squared.
The adjusted R-squared is
1 - (1 - R2)(n-1)/(n-k-1)
, which is also set to
zero if it is negative. If
shrink=FALSE
and the adjusted R-squares are much
smaller than
the ordinary R-squares, you may want to run
transcan
with
shrink=TRUE
.
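The shrinkage formulas above can be checked numerically with a short base R function. The function name shrinkage_factor and the example inputs are hypothetical; the arithmetic follows the expressions given in the text, with A the d.f. of the variable being predicted and B the vector of column counts for the other variables.

```r
# Hypothetical helper: heuristic d.f., adjusted R-squared, and shrinkage factor
shrinkage_factor <- function(R2, n, A, B) {
  m <- length(B)                               # number of other variables
  k <- A - 1 + sum(pmax(0, B - 1)) / m + m     # heuristic effective d.f.
  adj <- 1 - (1 - R2) * (n - 1) / (n - k - 1)  # adjusted R-squared
  list(k       = k,
       rsq.adj = max(0, adj),                  # negative values are set to 0
       shrink  = max(0, adj / R2))             # ratio of adjusted to ordinary R2
}

# Moderate R-squared with ample data: mild shrinkage
s1 <- shrinkage_factor(R2 = 0.5, n = 101, A = 4, B = c(4, 1, 2))

# Weak R-squared with little data: the factor is truncated at zero
s2 <- shrinkage_factor(R2 = 0.05, n = 20, A = 4, B = c(4, 1, 2))
```

In the first case k = 3 + 4/3 + 3 and the shrinkage factor is about 0.92; in the second the adjusted R-squared is negative, so both it and the shrinkage factor become 0.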
Canonical variates are scaled to have variance of 1.0, by multiplying canonical
coefficients from
cancor
by
sqrt(n-1)
.
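The scaling can be verified directly with cancor from base R. The sketch below uses simulated data and a quadratic basis as a stand-in for a restricted cubic spline expansion; all variable names are illustrative.

```r
set.seed(2)
n <- 200
v <- runif(n)
basis  <- cbind(v, v^2)                                 # stand-in for a spline basis
others <- cbind(a = v^2 + rnorm(n) / 10, b = rnorm(n))  # the "other" variables

cc <- cancor(basis, others)

# First canonical variate on each side: center the columns, apply the
# first column of coefficients, then rescale by sqrt(n - 1)
tv <- scale(basis,  center = cc$xcenter, scale = FALSE) %*% cc$xcoef[, 1] * sqrt(n - 1)
ov <- scale(others, center = cc$ycenter, scale = FALSE) %*% cc$ycoef[, 1] * sqrt(n - 1)

var(as.vector(tv))                 # equals 1 up to rounding error
cor(as.vector(tv), as.vector(ov))  # equals cc$cor[1]
```

Without the sqrt(n - 1) factor the variates returned by cancor have unit sum of squares rather than unit variance.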
When specifying a non-Design library fitting function to
fit.mult.impute
(e.g.,
lm
,
glm
), running the result of
fit.mult.impute
through that fit's
summary
method will not use the
imputation-adjusted variances. You may obtain the new variances using
fit$var
or
Varcov(fit)
.
When you specify a Design function to
fit.mult.impute
(e.g.,
lrm, ols, cph, psm, bj
), automatically computed transformation
parameters (e.g., knot locations for
rcs
) that are estimated for the
first imputation are used for all other imputations. This ensures
that knot locations will not vary, which would change the meaning of
the regression coefficients.
Warning: even though
fit.mult.impute
takes imputation into account
when estimating variances of regression coefficients, it does not take
into account the variation that results from estimation of the shapes
and regression coefficients of the customized imputation equations.
Specifying
shrink=TRUE
solves a small part of this problem. To fully
account for all sources of variation you should consider putting the
transcan
invocation inside a bootstrap or loop, if execution time
allows. Better still, use
aregImpute
or one of the libraries such
as MICE that uses real Bayesian posterior realizations to multiply
impute missing values correctly.
It is strongly recommended that you use the Hmisc
naclus
function to
determine if there is a good basis for imputation.
naclus
will tell
you, for example, if systolic blood pressure is missing whenever
diastolic blood pressure is missing. If the only variable that is
well correlated with diastolic bp is systolic bp, there is no basis
for imputing diastolic bp in this case.
At present,
predict
does not work with multiple imputation.
When calling
fit.mult.impute
with
glm
as the
fitter
argument, if
you need to pass a
family
argument to
glm
do it by quoting the
family, e.g.,
family="binomial"
.
You should be able to use a variable in the formula given to
fit.mult.impute
as a numeric variable in the regression model even
though it was a factor variable in the invocation of
transcan
. Use
for example
fit.mult.impute(y ~ codes(x), lrm, trans)
(thanks to
Trevor Thompson
trevor@hp5.eushc.org).
transcan
, a list of class
transcan
with elements
call
(with the function call),
iter
(number of
iterations done) and
rsq
and
rsq.adj
containing the R-squares and
adjusted R-squares achieved in predicting each variable from all the
others. It also has elements
categorical
,
asis
,
coef
,
xcoef
,
parms
,
fillin
,
ranges
,
scale
, and
formula
containing respectively the values supplied for
categorical
and
asis
, the within-variable coefficients used to compute the first
canonical variate, the (possibly shrunk) across-variables coefficients
of the first canonical variate that predicts each variable in turn,
the parameters of the transformation (knots for splines, contrast
matrix for categorical variables), the initial estimates for missing
values (NA if variable never missing), the matrix of ranges of the
transformed variables (min and max in first and second row), a vector
of scales used to determine convergence for a transformation, the
formula (if
x
was a formula), and optionally a vector of shrinkage
factors used for predicting each variable from the others. For
"asis"
variables, the scale is the average absolute difference about
the median. For other variables it is unity, since canonical
variables are standardized. For
xcoef
, row
i
has the coefficients
to predict transformed variable
i
, with the column for the
coefficient of variable
i
set to NA. If
imputed=TRUE
was given, an
optional element
imputed
also appears. This is a list with the
vector of imputed values (on the original scale) for each variable
containing NAs. Matrices rather than vectors are returned if
n.impute
is given. If
trantab=TRUE
, the
trantab
element also
appears, as described above. If
n.impute > 0
,
transcan
also returns
a list
residuals
that can be used for future multiple imputation.
impute
returns a vector (the same
length as
var
) of class
"impute"
with NAs imputed.
predict
returns a matrix with the same number of columns or variables as were
in
x
.
fit.mult.impute
returns a fit object that is a modification of the
fit object created by fitting the completed dataset for the final
imputation. The
var
matrix in the fit object has the
imputation-corrected variance-covariance matrix.
coefficients
is
the average (over imputations) of the coefficient vectors,
variance.inflation.impute
is a vector containing the ratios of
the diagonals of the between-imputation variance matrix to the diagonals
of the average apparent (within-imputation) variance matrix.
missingInfo
is Rubin's "rate of missing information" and
dfmi
is Rubin's degrees of freedom for a t-statistic for testing
a single parameter. The last two objects are vectors corresponding to
the diagonal of the variance matrix.
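The way these components relate to the per-imputation fits can be sketched with the standard Rubin combining rule, total variance = W + (1 + 1/M) B, where W is the average within-imputation covariance matrix and B the between-imputation covariance matrix. That this is the exact rule coded in fit.mult.impute is an assumption here, and pool_rubin is a hypothetical helper written for illustration.

```r
# Hypothetical helper: combine M coefficient vectors and covariance matrices
pool_rubin <- function(coefs, vcovs) {
  M    <- length(coefs)
  cmat <- do.call(rbind, coefs)   # M x p matrix of coefficient vectors
  W    <- Reduce(`+`, vcovs) / M  # average within-imputation covariance
  B    <- cov(cmat)               # between-imputation covariance
  list(coefficients = colMeans(cmat),
       var = W + (1 + 1 / M) * B, # imputation-corrected covariance
       variance.inflation.impute = diag(B) / diag(W))
}

# Toy input: two imputations, two coefficients, identity within-variance
coefs  <- list(c(a = 1, b = 2), c(a = 3, b = 2))
vcovs  <- list(diag(2), diag(2))
pooled <- pool_rubin(coefs, vcovs)
```

Here coefficient a varies across imputations, so its pooled variance is inflated from 1 to 4, while b is identical in both fits and keeps variance 1.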
Frank Harrell
Department of Biostatistics
Vanderbilt University
f.harrell@vanderbilt.edu
Kuhfeld, Warren F: The PRINQUAL Procedure. SAS/STAT User's Guide, Fourth Edition, Volume 2, pp. 1265–1323, 1990.
Van Houwelingen JC, Le Cessie S: Predictive value of statistical models. Statistics in Medicine 9:1303–1325, 1990.
Copas JB: Regression, prediction and shrinkage. JRSS B 45:311–354, 1983.
He X, Shen L: Linear regression after spline transformation. Biometrika 84:474–481, 1997.
Little RJA, Rubin DB: Statistical Analysis with Missing Data. New York: Wiley, 1987.
Rubin DB, Schenker N: Multiple imputation in health-care databases: An overview and some applications. Stat in Med 10:585–598, 1991.
Faris PD, Ghali WA, et al: Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses. J Clin Epidem 55:184–191, 2002.
## Not run:
x <- cbind(age, disease, blood.pressure, pH)
#cbind will convert factor object `disease' to integer
par(mfrow=c(2,2))
x.trans <- transcan(x, categorical="disease", asis="pH",
                    transformed=TRUE, imputed=TRUE)
summary(x.trans)  #Summary distribution of imputed values, and R-squares
f <- lm(y ~ x.trans$transformed)   #use transformed values in a regression
#Now replace NAs in original variables with imputed values, if not
#using transformations
age            <- impute(x.trans, age)
disease        <- impute(x.trans, disease)
blood.pressure <- impute(x.trans, blood.pressure)
pH             <- impute(x.trans, pH)
#Do impute(x.trans) to impute all variables, storing new variables under
#the old names
summary(pH)       #uses summary.impute to tell about imputations
                  #and summary.default to tell about pH overall
# Get transformed and imputed values on some new data frame xnew
newx.trans     <- predict(x.trans, xnew)
w              <- predict(x.trans, xnew, type="original")
age            <- w[,"age"]            #inserts imputed values
blood.pressure <- w[,"blood.pressure"]
Function(x.trans)  #creates .age, .disease, .blood.pressure, .pH()
#Repeat first fit using a formula
x.trans <- transcan(~ age + disease + blood.pressure + I(pH),
                    imputed=TRUE)
age <- impute(x.trans, age)
predict(x.trans, expand.grid(age=50, disease="pneumonia",
        blood.pressure=60:260, pH=7.4))
z <- transcan(~ age + factor(disease.code),  # disease.code categorical
              transformed=TRUE, trantab=TRUE, imputed=TRUE, pl=FALSE)
plot(z$transformed)
## End(Not run)

# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
set.seed(1)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100)
y  <- x2 + 1*(x1=='c') + rnorm(100)
x1[1:20] <- NA
x2[18:23] <- NA
d <- data.frame(x1,x2,y)
n <- naclus(d)
plot(n); naplot(n)  # Show patterns of NAs
f  <- transcan(~y + x1 + x2, n.impute=10, shrink=FALSE, data=d)
options(digits=3)
summary(f)

f  <- transcan(~y + x1 + x2, n.impute=10, shrink=TRUE, data=d)
summary(f)

h <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
# Add ,fit.reps=TRUE to save all fit objects in h, then do something like:
# for(i in 1:length(h$fits)) print(summary(h$fits[[i]]))

diag(Varcov(h))

h.complete <- lm(y ~ x1 + x2, na.action=na.omit)
h.complete
diag(Varcov(h.complete))

# Note: had Design's ols function been used in place of lm, any
# function run on h (anova, summary, etc.) would have automatically
# used imputation-corrected variances and covariances

# Example demonstrating how using the multinomial logistic model
# to impute a categorical variable results in a frequency
# distribution of imputed values that matches the distribution
# of non-missing values of the categorical variable

## Not run:
set.seed(11)
x1 <- factor(sample(letters[1:4], 1000,TRUE))
x1[1:200] <- NA
table(x1)/sum(table(x1))
x2 <- runif(1000)
z  <- transcan(~ x1 + I(x2), n.impute=20, impcat='multinom')
table(z$imputed$x1)/sum(table(z$imputed$x1))
## End(Not run)

# Example where multiple imputations are for basic variables and
# modeling is done on variables derived from these

set.seed(137)
n <- 400
x1 <- runif(n)
x2 <- runif(n)
y  <- x1*x2 + x1/(1+x2) + rnorm(n)/3
x1[1:5] <- NA
d <- data.frame(x1,x2,y)
w <- transcan(~ x1 + x2 + y, n.impute=5, data=d)
# Add ,show.imputed.actual for graphical diagnostics
## Not run:
g <- fit.mult.impute(y ~ product + ratio, ols, w,
                     data=data.frame(x1,x2,y),
                     derived=expression({
                       product <- x1*x2
                       ratio   <- x1/(1+x2)
                       print(cbind(x1,x2,x1*x2,product)[1:6,])}))
## End(Not run)

# Here's a method for creating a permanent data frame containing
# one set of imputed values for each variable specified to transcan
# that had at least one NA, and also containing all the variables
# in an original data frame.  The following is based on the fact
# that the default output location for impute.transcan is
# given by where.out=1 (search position 1)

## Not run:
xt <- transcan(~. , data=mine,
               imputed=TRUE, shrink=TRUE, n.impute=10, trantab=TRUE)
attach(mine, pos=1, use.names=FALSE)
impute(xt, imputation=1) # use first imputation
# omit imputation= if using single imputation
detach(1, 'mine2')
## End(Not run)

# Example of using invertTabulated outside transcan
x    <- c(1,2,3,4,5,6,7,8,9,10)
y    <- c(1,2,3,4,5,5,5,5,9,10)
freq <- c(1,1,1,1,1,2,3,4,1,1)
# x=5,6,7,8 with prob. .1 .2 .3 .4 when y=5
# Within a tolerance of .05*(10-1) all y's match exactly
# so the distance measure does not play a role
set.seed(1)      # so can reproduce
for(inverse in c('linearInterp','sample'))
  print(table(invertTabulated(x, y, freq, rep(5,1000), inverse=inverse)))

# Test inverse='sample' when the estimated transformation is
# flat on the right.  First show default imputations
set.seed(3)
x <- rnorm(1000)
y <- pmin(x, 0)
x[1:500] <- NA
for(inverse in c('linearInterp','sample')) {
  par(mfrow=c(2,2))
  w <- transcan(~ x + y, imputed.actual='hist',
                inverse=inverse, curtail=FALSE,
                data=data.frame(x,y))
  if(inverse=='sample') next
  # cat('Click mouse on graph to proceed\n')
  # locator(1)
}