lrm function) to predict the probability that each subject remains in the study at that time period. Predictors for the first time period are those listed in the pformula formula. These are assumed to be baseline variables that are never missing. For later time periods, predictors include the baseline predictors plus the matrix of response (y) values for all earlier periods. These "previous responses" will have missing values imputed from the earlier steps.
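As a rough illustration of the propensity model for a single later period, the sketch below uses base R's glm with a binomial family as a stand-in for lrm. The simulated data, the variable names (in.study, y1) and the dropout mechanism are assumptions made up for this example; they are not part of rm.impute.

```r
# Hypothetical data: baseline covariates, a period-1 response, and an
# indicator of remaining in the study at period 2 (all simulated).
set.seed(1)
n        <- 200
treat    <- factor(sample(c("A", "B"), n, replace = TRUE))
x        <- runif(n)
y1       <- rnorm(n)
in.study <- rbinom(n, 1, plogis(1 + 0.8 * y1))

# Binary logistic propensity model: baseline predictors plus the
# earlier response (glm stands in for lrm here).
pfit <- glm(in.study ~ treat + x + y1, family = binomial)

# Predicted probability that each subject remains in the study
propensity <- fitted(pfit)
summary(propensity)
```

For the first period only the baseline predictors would enter the model, since no earlier responses exist.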
Missing responses for time period i are imputed, for one of the n.impute multiple imputations, as follows. The period i fitted propensity model described above is evaluated to obtain the predicted probability that each subject remained in the study until at least period i. The estimated propensity is divided into g quantile groups. If for period i within a propensity quantile group there are a subjects still in the study and b subjects who have dropped out, Rubin's approximate Bayesian bootstrap is used to estimate the predictive distribution of the response values for the b dropouts, given that the propensity for remaining in the study is approximately constant for all subjects (dropouts and non-dropouts) in the group. A sample of size a is selected with replacement from the a subjects still in the study from the propensity group. Then a sample of size b with replacement is selected from this sample of size a. These b responses are used to fill in the responses for the b dropouts in the quantile group for the current imputation and current time period.
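The two-stage resampling described above can be sketched in base R as follows. This is a minimal illustration for one period and one imputation, not the code used inside rm.impute; the helper name abb, the simulated propensities and the simulated responses are all assumptions.

```r
# Approximate Bayesian bootstrap for one propensity quantile group.
# 'observed' holds the responses of the a subjects still in the study;
# b is the number of dropouts in the same group (names are illustrative).
abb <- function(observed, b) {
  a <- length(observed)
  # Stage 1: draw a values with replacement from the a observed responses
  donors <- observed[sample.int(a, a, replace = TRUE)]
  # Stage 2: draw b values with replacement from that first sample
  donors[sample.int(a, b, replace = TRUE)]
}

set.seed(2)
propensity <- runif(100)                 # stand-in for fitted propensities
y          <- rnorm(100)                 # period-i responses
dropped    <- runif(100) > propensity    # TRUE = response is missing
y[dropped] <- NA

# Divide the estimated propensity into g quantile groups
g      <- 5
breaks <- quantile(propensity, probs = seq(0, 1, length.out = g + 1))
group  <- cut(propensity, breaks, include.lowest = TRUE, labels = FALSE)

# Impute within each group; a group with no observed responses is skipped
y.imputed <- y
for (k in seq_len(g)) {
  miss <- which(group == k & is.na(y))
  obs  <- y[group == k & !is.na(y)]
  if (length(miss) > 0 && length(obs) > 0)
    y.imputed[miss] <- abb(obs, length(miss))
}
```

Repeating the loop n.impute times with fresh random draws would give one filled-in response vector per imputation.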
If the right-hand side of a formula is specified for a univariate response summary (which may be the last response, mean, or area under the time-response curve), rm.impute goes on to fit rformula to this response summary for each of the multiple imputations using a fitting function fitter. After all n.impute imputations have been done, the average "apparent" covariance matrix and the between-imputation covariance matrix are computed to derive Rubin's multiple-imputation-corrected covariance matrix for the average of n.impute sets of regression coefficients. See fit.mult.impute for more details.
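The covariance correction can be sketched with the standard form of Rubin's combining rule, in which the total covariance is the average within-imputation covariance plus (1 + 1/m) times the between-imputation covariance. The coefficient matrix and covariance matrices below are simulated placeholders, and fit.mult.impute may differ in its details.

```r
# Combine m sets of coefficients and covariance matrices (Rubin's rule).
# 'coefs' is an m x p matrix, 'covs' a list of m p x p matrices;
# both are simulated placeholders, not output from rm.impute.
set.seed(3)
m <- 10; p <- 3
coefs <- matrix(rnorm(m * p, mean = 1, sd = 0.1), nrow = m)
covs  <- replicate(m, diag(0.05, p), simplify = FALSE)

beta.bar <- colMeans(coefs)              # averaged coefficients
W        <- Reduce(`+`, covs) / m        # average "apparent" covariance
B        <- cov(coefs)                   # between-imputation covariance
V        <- W + (1 + 1 / m) * B          # imputation-corrected covariance

sqrt(diag(V))                            # corrected standard errors
```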
The response variable y may be an array to handle multiple responses at each time period. This array has number of rows equal to the number of subjects, number of columns equal to the number of periods, and number of "pages" equal to the number of different response measurements. A utility function pbind is supplied for creating such arrays from a series of matrices. When multiple responses are present, all responses are used in the current propensity model, and the which, nk, rinteraction, and rint.with arguments will apply equally to all responses.
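To make the array layout concrete, the sketch below builds a subjects x periods x responses array with base R's array function rather than pbind. The matrices sbp and dbp and their values are hypothetical.

```r
# Two hypothetical response matrices: 5 subjects x 3 periods each
set.seed(4)
sbp <- matrix(round(rnorm(15, 130, 15)), nrow = 5,
              dimnames = list(NULL, c("1", "2", "3")))
dbp <- matrix(round(rnorm(15, 80, 10)), nrow = 5,
              dimnames = list(NULL, c("1", "2", "3")))

# Stack the matrices as "pages" of a 3-dimensional array
y <- array(c(sbp, dbp), dim = c(nrow(sbp), ncol(sbp), 2),
           dimnames = list(NULL, colnames(sbp), c("sbp", "dbp")))

dim(y)           # subjects, periods, responses
y[, , "dbp"]     # page holding the second response
```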
rm.impute(pformula, y, last,
          rformula, fitter=ols, which=c("last", "mean", "auc"),
          data=sys.parent(1), n.impute=10, g=5,
          nk=0, rinteraction, rint.with=c('all','recent'),
          pr=FALSE, pra=FALSE, npr,
          keep.prop=FALSE, keep.pfits=FALSE)

pbind(...)
rformula if fitter is one of the Design library fitting functions) can contain any of the Design library's transformation functions such as rcs, pol, etc.
which="auc", column names of y must contain numeric measurement times.
last=0.
pbind(newname1=y1,newname2=y2). The dimnames of the first argument (which will be converted to a matrix if it is a vector, for the unusual one-period case) will be used as the first two dimnames of the resulting array, and the names of the matrices will form the third vector of dimnames.
rm.impute will return only the multiple response imputations.
ols. If there are multiple response variables at each time period and you want to use a different fitter for different response variables, specify a list of nr fitting functions as this argument, where nr is the number of response variables.
rformula is given. The default is the last column of the response matrix.
pformula and rformula are not already available via attach()
n.impute should be.
"recent" to allow the variables in rinteraction to only interact with the response for the most recent time period, and not with the most recent and all previous responses (the default).
TRUE to print each logistic propensity model fit.
pr=TRUE, you can also set pra=TRUE to print the Design anova() results for each propensity model fit.
pr=TRUE, printing will be done for the first npr imputations.
TRUE to store the array propensity in the returned list. The dimensions for propensity are the same as Y.
TRUE to store all propensity model fits from lrm in the result returned by rm.impute.
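The three response summaries selectable through which can be sketched for a completed (already imputed) response matrix as follows. The matrix values are hypothetical, and the use of the trapezoidal rule for the area under the time-response curve is an assumption; the source does not state how rm.impute computes it.

```r
# Completed response matrix for 4 subjects; column names are the
# measurement times (hypothetical values).
y <- matrix(c(1, 2, 3, 4,
              2, 2, 2, 2,
              0, 1, 0, 1,
              4, 3, 2, 1), nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("0", "1", "2", "4")))
times <- as.numeric(colnames(y))

last.resp <- y[, ncol(y)]   # which = "last"
mean.resp <- rowMeans(y)    # which = "mean"

# which = "auc": trapezoidal rule over the measurement times (assumed)
auc <- apply(y, 1, function(yi)
  sum(diff(times) * (head(yi, -1) + tail(yi, -1)) / 2))
auc
```

This also shows why numeric column names are needed for the area summary: the spacing between measurement times enters the calculation directly.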
The algorithm used here will not correct for non-random dropout due to variables that are not included in the propensity model. A worst case would be having dropouts at period i due to unmeasured responses at period i.
Ironically, there must be a sufficient number of dropouts for the propensity score method to work, as the propensity models must have adequate numbers of dropouts and non-dropouts at each time period.
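Before running the imputation, one simple way to check that each period has enough dropouts and non-dropouts is to tabulate them from the last vector. The values below are made up for illustration, with last taken to be the last period having a response (0 meaning none).

```r
# 'last' = last period with a response for each subject (0 = none);
# values here are made up for illustration.
last     <- c(4, 4, 3, 0, 2, 4, 1, 4, 3, 4)
nperiods <- 4

# For each period, count subjects still in the study and dropouts
counts <- t(sapply(seq_len(nperiods), function(i)
  c(in.study = sum(last >= i), dropped = sum(last < i))))
rownames(counts) <- paste("period", seq_len(nperiods))
counts
```

A period where either column is very small is one where the propensity model is unlikely to be estimated reliably.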
Y and optionally fit (if rformula is given) and propensity (if keep.prop=TRUE). Y and propensity are arrays whose last dimension corresponds to the multiple imputations and whose first two dimensions correspond to y. Y is the multiply-imputed response array and fit is the imputation-corrected fit object. Note: Aside from the regression coefficient vector and covariance matrix, this fit object will have parameters from the fit of the response summary for the last imputation. If keep.pfits=TRUE, the returned list will also have an array of propensity fit objects (lrm objects) for all response periods and imputations. If there is more than one response variable at each time period, fit will be a list of nr fit objects for nr response variables.
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
f.harrell@vanderbilt.edu
Much valuable input was received from Chris Barker (Roche
Pharmaceuticals) and Phil Lavori (Stanford University).
Lavori PW, Dawson R, Shera D: A multiple imputation strategy for clinical trials with truncation of patient data. Stat in Med 14:1913–1925, 1995.
Rubin D, Shenker N: Multiple imputation in health-care data bases: An overview and some applications. Stat in Med 10:585–598, 1991.
Engels JM, Diehr P: Imputation of missing longitudinal data: a comparison of methods. J Clin Epi 56:968–976, 2003.
## Not run:
# Generate multiple imputes of the response matrix for later use
Y <- rm.impute(~treatment + pol(age,2)*sex, responses,
               last=lastvisit, data=mydata)$Y
# Do some analysis for each imputation
fits <- vector('list',10)
for(i in 1:10) {
  y <- Y[,,i]
  fits[[i]] <- my.analysis(X,y)
}
# Function to generate a 4-variate equal correlation pattern response
# with missing-at-random responses; missingness is a function of x and
# previous responses.
#
# pna is a function that computes the probability that a subject
# drops out at the current visit. For visit 1 pna is a function
# of treatment and baseline covariable x. For visits > 1 pna is
# a function of the matrix of responses for all previous visits.
#
# If second=TRUE we also generate a second response variable having
# NAs in the same positions as the first one. y2 is generated
# so that its NAs are completely unrelated to any y2 values if
# y2.treat.effect=0, as the pna function is only given the first
# response variable.
# y2 is N(0,1) for treat='A' and N(y2.treat.effect,1) for treat='B'.
testdf <- function(n=1500, seed=7, pna, second=FALSE, y2.treat.effect=0) {
  set.seed(seed)
  treat <- sample(c('A','B'),n,TRUE)
  x <- runif(n)
  nt <- 4
  mvrnorm <- function(n, p = 1, u = rep(0, p), S = diag(p)) {
    Z <- matrix(rnorm(n * p), p, n)
    t(u + t(chol(S)) %*% Z)
  }
  # Generate multivariate normal errors for n subjects at nt times
  # Assume equal correlations of rho=.5, independent subjects
  rho <- .5
  y <- mvrnorm(n, p=nt, S=diag(rep(1-rho,nt))+rho)
  y[treat=='B',] <- y[treat=='B',] + 1
  cat('\n\nTreatment-specific means for last period in response variable 1 before generating NAs:\n')
  print(tapply(y[,4], treat, mean, na.rm=TRUE))
  y[runif(n) < pna(treat, x), 1] <- NA
  y[is.na(y[,1]) | runif(n) < pna(treat, x, y[,1]), 2] <- NA
  y[is.na(y[,2]) | runif(n) < pna(treat, x, y[,1:2]), 3] <- NA
  y[is.na(y[,3]) | runif(n) < pna(treat, x, y[,1:3]), 4] <- NA
  last <- rep(4, n)
  last[is.na(y[,4])] <- 3
  last[is.na(y[,3])] <- 2
  last[is.na(y[,2])] <- 1
  last[is.na(y[,1])] <- 0
  cat('\nNumber of NAs for each time period:\n')
  print(apply(y, 2, function(x)sum(is.na(x))))
  cat('\n\nTreatment-specific means for last period in response variable 1 after excluding NAs:\n')
  print(tapply(y[,4], treat, mean, na.rm=TRUE))
  cat('\n\nNaive complete-case analysis:\n\n')
  prn(ols(y[,4] ~ pol(x,2) + treat))
  if(second) {
    y2 <- matrix(rnorm(n*4),ncol=4)
    y2[treat=='B',] <- y2[treat=='B',] + y2.treat.effect
    cat('\n\nTreatment-specific means for last period in response variable 2 before generating NAs:\n')
    print(tapply(y2[,4], treat, mean, na.rm=TRUE))
    y2[is.na(y[,1]),1] <- NA
    y2[is.na(y[,2]),2] <- NA
    y2[is.na(y[,3]),3] <- NA
    y2[is.na(y[,4]),4] <- NA
    cat('\n\nTreatment-specific means for last period in response variable 2 after excluding NAs:\n')
    print(tapply(y2[,4], treat, mean, na.rm=TRUE))
    y <- pbind(y1=y, y2=y2)
  }
  list(x=x, treat=treat, y=y, last=last)
}
pna <- function(treat, x, yprev) {
  # In this model for the probability of dropout just before the
  # current visit, the probability does not depend on the baseline
  # covariable x. For treat='B' the probability of dropout is a
  # constant 0.1. For treat='A' it is a curtailed quadratic
  # function of the previous visit's response.
  #
  # If no previous responses available, we are at first follow-up visit
  if(missing(yprev)) 0 else {
    if(is.matrix(yprev)) yprev <- yprev[,ncol(yprev)]
    ifelse(treat=='B', .1,
           pmax(0, pmin(1, .124 +.0835*yprev + .020868*yprev^2)))
  }
}
df <- testdf(pna = pna, second=TRUE)
g <- rm.impute(~ pol(x,2) + treat, df$y, last=df$last,
               rformula=~ pol(x,2) + treat,
               n.impute=10, g=4, nk=3,
               rinteraction='treat', rint.with='all',
               pr=TRUE, pra=TRUE, data=df, keep.prop=TRUE, keep.pfits=TRUE)
# Base propensity model is in.study ~ pol(x,2) + treat
# for visits 2,3,4, filled-in y's from previous visits will also be
# used as predictors, and these interact with treat.
# Restricted cubic spline with 3 knots is assumed for the propensity models
# To fit the multiply-imputed last (4th) response an additive model
# in quadratic x and treat is used
g$fit[[1]] # shows response fit for first response variable
# (y1), with variances adj. for imputation
page(g$Y) # show all 10 imputations for both responses x 4 periods
# Check for the first imputation how well propensity matching achieved
# balance in baseline and period 3 filled-in responses for
# dropouts and non-dropouts. For continuous variables show ECDFs
# using the Hmisc ecdf function, for first 4 imputations. Do this
# with and without stratifying on quintiles of propensity, and also
# show the estimated 3rd period response vs. propensity stratified
# by dropout status. Use only first response (y1) for all of this.
for(imp in 1:4) {
  y3 <- g$Y[,3,1,imp]
  prop3 <- g$propensity[,3,imp]
  prop3g <- cut2(prop3,g=5)
  ti <- paste('Imputation',imp)
  print(ecdf(~ y3, groups=df$last >= 3, subset=unclass(prop3g)<5))
  title(ti)
  print(ecdf(~ y3 | prop3g, groups=df$last >= 3,
             subset=unclass(prop3g)<5))
  # Not enough dropouts in highest quintile of propensity completing
  # visit 3
  title(ti)
  plsmo(prop3, y3, group=df$last >= 3, datadensity=TRUE, col=1:2)
  title(ti)
}
# Examine propensity fit for sixth imputation, 4th response period
f <- g$pfits[4,6][[1]]
dfr <- as.data.frame(df)
# Edit names of dfr so that responses called y.1, y.2, etc.
# For this example, these are already OK
dd <- datadist(dfr)
options(datadist='dd')
# datadist makes plot below work without specifying variable settings
plot(f, y.3=NA, treat=NA, conf.int=FALSE)
# Analyze multiple response variables. Both systolic.bp and
# diastolic.bp are matrices (columns = time periods)
f <- rm.impute(~treatment + pol(age,2)*sex,
               pbind(systolic.bp, diastolic.bp),
               last=lastvisit, data=mydata)
# To deal with a continuous and a binary endpoint you can specify
# pbind(systolic.bp, stroke), fitter=list(ols, lrm)
## End(Not run)