Impute Data under CGM

DESCRIPTION:

Methods for imputing data sets containing both factor and numeric data under a Conditional Gaussian Model, using data augmentation.

USAGE:

impCgm.default(object, nimpute = 3, margins, gauss, design, optData, 
    subset, prior = 0.5, start = NULL, iterOn1 = T, 
    control = daCgm.control(), contrasts = NULL, 
    return.type = "data.frame") 
impCgm.preCgm(object, nimpute = 3, margins, gauss, design, optData, 
    prior = 0.5, start = NULL, iterOn1 = T, 
    control = daCgm.control(), contrasts = NULL, 
    return.type = "data.frame") 
impCgm.missmodel(object, nimpute = 3, margins, gauss, design, optData, 
    prior = 0.5, start = NULL, iterOn1 = T, 
    control = daCgm.control(), contrasts = NULL, 
    return.type = "data.frame") 

REQUIRED ARGUMENTS:

object
for impCgm.default: a data frame or matrix containing the raw data. When a data frame is input and the margins argument is not provided, the loglinear part of the model is assumed to be a saturated model in which all factor variables are used to form the table. If the gauss argument is not provided, then all numeric variables in the data frame are included in the conditional Gaussian part of the model. When a matrix is input, you must provide the margins argument, which identifies the variables to use in the discrete part of the model. If the gauss argument is omitted, then all remaining variables in the matrix are used in the Gaussian part of the conditional Gaussian distribution.

for impCgm.preCgm, an object of class "preCgm" (produced by the preCgm function).

for impCgm.missmodel, an object of class "missmodel" containing the results of a previous analysis. Any of the functions mdCgm, completeCgm, emCgm, or daCgm may be used to produce the missmodel object.

OPTIONAL ARGUMENTS:

nimpute
an integer giving the number of imputations. nimpute is ignored if several chains are used to produce imputations; in that case the number of imputations is determined as described under the argument start below.
margins
the marginal totals to be fit in the log-linear model. A margin is described by the factors not summed over. Thus list(1:2, 3:4) would indicate fitting the 1,2 margin (summing over variables 3 and 4) and the 3,4 margin in a four-way table. This same model can be specified using the names of the variables (e.g., list(c("V1", "V2"), c("V3", "V4"))), or using formula notation, as in margins = ~V1:V2 + V3:V4.

If margins is not specified, a saturated model is fitted.

For impCgm.default: When a matrix is input as argument object, argument margins must be specified. When a data frame is input and argument margins is missing, then a saturated model involving all factor variables is fitted.

For impCgm.missmodel: If not given, argument margins defaults to the margins specified in the call statement of the input "missmodel" object.
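
As an illustration of what "a margin is described by the factors not summed over" means, the following sketch uses only base functions on a made-up 2 x 2 x 2 x 2 table of counts. The array counts and the names V1 to V4 are hypothetical, and nothing here calls impCgm itself.

```r
# Hypothetical four-way table of counts for factors V1..V4
counts <- array(1:16, dim = c(2, 2, 2, 2),
                dimnames = list(V1 = c("a", "b"), V2 = c("a", "b"),
                                V3 = c("a", "b"), V4 = c("a", "b")))

# The 1,2 margin keeps V1 and V2 and sums over V3 and V4
margin12 <- apply(counts, c(1, 2), sum)

# The 3,4 margin keeps V3 and V4 and sums over V1 and V2
margin34 <- apply(counts, c(3, 4), sum)

# Three ways of writing the same margins argument, as described above
margins.index <- list(1:2, 3:4)
margins.names <- list(c("V1", "V2"), c("V3", "V4"))
margins.form  <- ~ V1:V2 + V3:V4
```

Both margins have the same grand total as the full table; they differ only in which factors are summed out.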
gauss
identifies the variables to be used in the conditional Gaussian part of the model. These variables may be specified in three ways: as a vector of variable indices, e.g., c(1, 2, 4); as a vector of variable names, e.g., c("V1", "V2", "V4"); or using formula notation, e.g., ~V1+V2+V4. If argument gauss is omitted, then all numeric variables that do not appear in argument margins are used in the multivariate Gaussian model.
design
a formula giving the regression model for predicting the numeric variable cell means as a linear function of the factor variables and the variables provided in optData. Optionally, an ncell by m matrix may be input directly as the design matrix.

Let i=1, ..., ncell denote the cells in the loglinear model, and let mu(i) denote the vector of numeric variable means in cell i. Then the formula design provides the design matrix for predicting the cell means. As an example, let "V1" and "V2" be the names of the factor variables, and let "age" be a vector giving an average age for the subjects in each cell. Then formula design=~V1+V2 indicates a main effect model for the cell means, while design=~V1 + V2 + age indicates a main effect model for the cell means, adjusted for average cell age.

Optionally, an ncell by m matrix may be input. In this case, the regression model is obtained as a linear function of the columns of the input matrix.

If design is not specified, then the design matrix is taken to be an identity matrix.
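
The shape of the design matrix can be sketched with base functions. This is only an illustration under assumptions: the 2 x 3 layout, the age values, and the use of model.matrix are hypothetical stand-ins, and how impCgm builds the matrix internally is not shown here.

```r
# Cells of a hypothetical 2 x 3 table; the first factor varies fastest
cells <- expand.grid(V1 = c("lo", "hi"), V2 = c("a", "b", "c"))
ncell <- nrow(cells)                       # 6 cells

# Hypothetical cell-level predictor of the kind passed in optData
optData <- data.frame(age = c(31, 35, 40, 42, 55, 58))

# Main-effect model for the cell means: one row per cell
X.main <- model.matrix(~ V1 + V2, data = cells)

# Main-effect model adjusted for average cell age
X.age <- model.matrix(~ V1 + V2 + age, data = cbind(cells, optData))

# When design is not specified, the design matrix is an identity matrix,
# so every cell has its own unrestricted mean vector
X.default <- diag(ncell)
```

Each matrix has ncell rows; the number of columns m is the number of regression coefficients per numeric variable.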
optData
a data frame with ncell rows containing predictors to be used in computing the design matrix. In the example given in the description for argument design, the variable age would be input in argument optData.
subset
expression specifying which rows of the data should be used in the fit. This can be a logical vector (which is replicated to have length equal to the number of rows), a numeric vector indicating the observation numbers to be included, or a character vector of the row names to be included. All observations are included by default. If object is a data frame, this expression may use variables in the data frame.
prior
Gives the hyperparameters of the Dirichlet prior distribution assumed for the loglinear part of the model. Note that a noninformative prior is always assumed for the Gaussian parameters.

Supply either a character string, or an object of class "priorLoglin", or a vector of hyperparameters.

Valid character strings are "ml" (maximum likelihood), "noninformative", and "data.dependent". String matching is used, so the characters "m", "n", or "d" are sufficient. The values of the hyperparameters change with the algorithm (see for details). E.g. "noninformative" means a common value of 1 for EM, and a common value of 0.5 for DA.

A class "priorLoglin" object is created by routine priorLoglin.

If a vector of hyperparameters is supplied, the length of the vector equals the number of cells formed by the factor variables. The vector is ordered so that the levels of the first variable vary fastest, the second variable levels vary next fastest, etc. If a single numeric value is input, its value is replicated for all cells in the table. The hyperparameters for a data dependent prior (following an independence model) can be generated using routine dataDepPrior. See for details.

The default value is 0.5, which corresponds to "noninformative" for data augmentation.

For impCgm.missmodel: If not given, argument prior defaults to the prior specified in the call statement of the input "missmodel" object. If no prior is specified there, then the default (which depends on the algorithm) is used.
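
The cell ordering required for a vector of hyperparameters (levels of the first variable varying fastest) is the same ordering that expand.grid produces. The sketch below uses a hypothetical 2 x 3 table and base functions only; the value 2 placed on one cell is arbitrary.

```r
# Hypothetical factors: V1 with 2 levels, V2 with 3 levels
cells <- expand.grid(V1 = c("a", "b"), V2 = c("x", "y", "z"))
ncell <- nrow(cells)

# A single value replicated over all cells, as happens for a scalar prior
prior.vec <- rep(0.5, ncell)

# Locate one cell in the ordering and give it a different hyperparameter
idx <- which(cells$V1 == "b" & cells$V2 == "y")
prior.vec[idx] <- 2
```

Printing cells next to prior.vec is a quick way to check that each hyperparameter lines up with the intended cell.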
start
starting values of the parameters. The form of start depends on whether the imputations are generated from one long chain, or from several chains.

The parameters estimated by mdCgm are the cell means and variance-covariance matrix of a multivariate Gaussian distribution, and log-linear model cell probabilities.

Thus, for one long chain, start is a list with matrix component mu giving the matrix of means in each of its ncell columns (where the columns must be in the same order as the log-linear model cells, and the rows must be in the same order as the continuous variables), a matrix component sigma giving the variance-covariance matrix, and a vector pi giving the cell probabilities. If structural zeros appear in the contingency table, start$pi must contain zeros to indicate the structural zeros; see for details. For one long chain, you must supply the argument nimpute.

For several chains, start may be a list of such lists, a class "cgm" object, or a list of "cgm" objects.

For a list of lists, the number of imputations equals the length of the outer list.

A class "cgm" object is the paramIter component of a class "missmodel" object, produced by routines such as mdCgm, daCgm, and emCgm. The number of imputations equals the number of rows in the matrix paramIter.

If a list of class "cgm" objects is input, the estimates in the final row of each paramIter component are used to start a chain. The number of imputations equals the number of "cgm" objects.

Starting values for cells that are structural zeros in the table should be zero.

In most cases the default starting values are a vector of 1s for pi (eventually normalized so they add to 1), together with a matrix of means and a diagonal matrix of variances estimated from the numeric observations with no missing values.

For impCgm.missmodel: If start, margins, and gauss are not specified, then argument start defaults to the final estimates in the input "missmodel" object. If either margins or gauss is specified, then start must be provided. Also notice that when argument margins is specified, care must be taken to ensure that structural zeros in these final estimates are also structural zeros in the new model.
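
The list form of start can be sketched as follows for a hypothetical model with two numeric variables and four cells, the third of which is a structural zero. The numbers and the names Y1 and Y2 are placeholders, not recommended starting values.

```r
nvar  <- 2   # hypothetical number of continuous variables
ncell <- 4   # hypothetical number of log-linear model cells

# mu: one column per cell, one row per continuous variable
mu <- matrix(0, nrow = nvar, ncol = ncell,
             dimnames = list(c("Y1", "Y2"), NULL))

# sigma: common variance-covariance matrix
sigma <- diag(nvar)

# pi: cell probabilities, with 0 marking the structural zero in cell 3
cellprob <- c(1, 1, 0, 1)
cellprob <- cellprob / sum(cellprob)

# One long chain: a single list (nimpute must then be supplied)
start.one <- list(mu = mu, sigma = sigma, pi = cellprob)

# Several chains: a list of such lists; the number of imputations
# equals the length of the outer list (3 here)
start.many <- list(start.one, start.one, start.one)
```

In practice each inner list would usually hold different starting values so that the chains begin from dispersed points.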
iterOn1
logical flag which determines whether the data augmentation algorithm is iterated before producing (1) the first imputation (in one long chain) or (2) each of the imputations (for parallel chains). The default value is TRUE.

In particular, for one long chain, if iterOn1 is FALSE, then the first imputation is drawn under the parameters given in start. If iterOn1 is TRUE, then data augmentation starts from start and runs for control$niter iterations before producing the first imputation. Each remaining imputation is produced after data augmentation runs for control$niter further iterations.

Similarly, for parallel chains, if iterOn1 is FALSE, then the imputations are drawn under the parameters given in the start matrix. If iterOn1 is TRUE, then data augmentation starts from each row of start, and runs for control$niter iterations before producing each of the imputations.
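
For one long chain, the effect of iterOn1 on when imputations are drawn can be written as simple arithmetic. The helper imputeSchedule below is hypothetical and not part of the library; it assumes that the spacing of control$niter iterations between successive imputations applies whether iterOn1 is TRUE or FALSE.

```r
# Cumulative number of data augmentation iterations completed
# when each imputation is drawn from one long chain
imputeSchedule <- function(nimpute, niter, iterOn1 = TRUE) {
  first <- if (iterOn1) niter else 0   # iterations before the first draw
  first + niter * (seq_len(nimpute) - 1)
}

imputeSchedule(3, 100, iterOn1 = TRUE)    # 100 200 300
imputeSchedule(3, 100, iterOn1 = FALSE)   # 0 100 200
```

With iterOn1 = FALSE the first imputation costs no iterations, which is appropriate only when start already represents a draw from a converged chain.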
control
A list of parameters used to control the algorithm; see for details.

For impCgm.missmodel: if not given, argument control defaults to the control parameters specified in the call statement of the input "missmodel" object, but only if these are of the correct class. If these are not given (or are not of the correct class), then the argument control defaults to the daCgm.control values.
contrasts
a list giving contrasts for some or all of the factors appearing in the design formula. The elements of the list should have the same name as the variable and should be either a contrast matrix (specifically, any full-rank matrix with as many rows as there are levels in the factor), or else a function to compute such a matrix given the number of levels.
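
A contrasts list of the kind described can be assembled from standard contrast functions. In this sketch the factor names AGE and SEX are borrowed from the examples in this help file, and the choice of Helmert and treatment contrasts is arbitrary.

```r
# A contrast matrix for a 5-level factor: 5 rows, full column rank
age.contr <- contr.helmert(5)

# List elements are named after the factors in the design formula.
# AGE is given as a matrix; SEX is given as a function that computes
# the matrix from the number of levels.
contrasts.list <- list(AGE = age.contr, SEX = contr.treatment)

# The matrix a function-valued element would yield for a 2-level factor
sex.contr <- contrasts.list$SEX(2)
```

Factors in the design formula that are not named in the list keep their default contrasts.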
return.type
character string. If "data.frame" (the default), the returned object is a data frame whose variables may inherit from class "miVariable". If "matrix", then an "miVariable" object containing a matrix is returned.

VALUE:

a data frame containing "miVariable" objects, or an "miVariable" object containing a matrix, depending on the value of return.type.

SIDE EFFECTS:

All methods create the data set .Random.seed if it does not already exist; otherwise its value is updated.

DETAILS:

Computations in the impCgm function are made more efficient by first calculating a preCgm object. Therefore, if a preCgm object already exists (e.g. through using the preCgm function before calling emCgm or daCgm), then it will save computation time to pass in this object instead of the original data.

See the help file for for additional details.

REFERENCES:

Best, N. G., Cowles, M. K. and Vines, S. K. (1997), CODA Convergence, Diagnosis and Output Analysis Software for Gibbs sampling output, Version 0.4, Cambridge: Medical Research Council Biostatistics Unit.

Gilks, W. R., Richardson, S. and Spiegelhalter, D. J., editors (1996), Markov Chain Monte Carlo in Practice, London: Chapman and Hall.

Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman and Hall.

EXAMPLES:

# First generate starting values
# Categorical variables LAN, AGE, PRI, SEX, GRD specify a 5 dimensional
# contingency table with 4*5*5*2*5 = 1000 cells
# Specify loglinear model with all main effects and 2-variable associations:
margins.form <- ~ LAN + AGE + PRI + SEX + GRD +
             LAN:AGE + LAN:PRI + LAN:SEX + LAN:GRD +
             AGE:PRI + AGE:SEX + AGE:GRD +
             PRI:SEX + PRI:GRD +
             SEX:GRD

# Linear contrast
lc <- c(-2,-1,0,1,2)
design.form <- ~ LAN + C(AGE,lc,1) + C(PRI,lc,1) + SEX + C(GRD,lc,1)
language.pre <- preCgm(language)

# Set hyperparameter to 1.05 to ensure a mode in the
# interior of the parameter space
language.em <- emCgm(language.pre, margins = margins.form,
                     design = design.form, prior = 1.05)

# 5 imputations produced by parallel chains, each
# started from one row of a matrix of starting values,
# and run for 100 iterations
start.langEM <- matrix(rep(language.em$paramIter[2, ], 5), nrow = 5, byrow = T)
language.imp <- impCgm(language, margins = margins.form,
                       design = design.form, prior = 1.05,
                       start = start.langEM, control = list(niter = 100))

# Single chain
# The following are equivalent:
impCgm.default(language, nimpute = 5, margins = margins.form,
                       design = design.form, prior = 1.05,
                       start = language.em$paramIter[2, ])
language.pre <- preCgm(data = language)
impCgm.preCgm(object = language.pre, nimpute = 5, margins = margins.form,
                       design = design.form, prior = 1.05,
                       start = language.em$paramIter[2, ])
impCgm.missmodel(language.em, nimpute = 5)