Impute Factor Data

DESCRIPTION:

Methods for imputing factor data under a loglinear model, using data augmentation.

USAGE:

impLoglin.default(object, frequency, nimpute = 3, margins, subset, 
    prior = 0.5, start = NULL, iterOn1 = T, 
    control = daLoglin.control(), return.type = "data.frame") 
impLoglin.preLoglin(object, nimpute = 3, margins, 
    prior = 0.5, start = NULL, iterOn1 = T, 
    control = daLoglin.control(), return.type = "data.frame") 
impLoglin.missmodel(object, nimpute = 3, margins,  
    prior = 0.5, start = NULL, iterOn1 = T, 
    control = daLoglin.control(), return.type = "data.frame") 

REQUIRED ARGUMENTS:

object
for impLoglin.default: a data frame or matrix containing the raw data. When a data frame is input, the table is specified by the levels of the factor variables. When a matrix is input, it is assumed that the levels of a variable form a sequence of integers from one to the maximum value of the variable.

for impLoglin.preLoglin, an object of class "preLoglin" (produced by the preLoglin function).

for impLoglin.missmodel, an object of class "missmodel" containing the results of a previous log-linear analysis. Any of the functions mdLoglin, completeLoglin, emLoglin, or daLoglin may be used to produce the missmodel object.

OPTIONAL ARGUMENTS:

frequency
The frequency of the corresponding row in argument object. If object is a data frame and this is the (unquoted) name of a variable in the data frame, then that variable is used. If omitted, all frequencies are assumed to be 1 (unless specified in argument margins).
nimpute
an integer number of imputations. If several chains are used to produce the imputations, nimpute is ignored and the number of imputations is determined by the argument start, as described below.
margins
a formula or a list of vectors containing the marginal totals to be fit. A margin is described by the factors not summed over. Thus list(1:2, 3:4) would indicate fitting the 1,2 margin (summing over variables 3 and 4) and the 3,4 margin in a four-way table. This same model can be specified using the names of the variables (e.g., list(c("V1", "V2"), c("V3", "V4"))), or using formula notation, as in margins = ~V1:V2 + V3:V4. When formula notation is used, the argument frequency can be included as the dependent variable (as in margins = frequency~V1:V2 + V3:V4). If margins is not specified, a saturated model (a single interaction term containing all table variables) is fit.

For impLoglin.default: when a matrix is input, every column in the matrix is used to define the table. When a data frame is input, the table is defined by the "factor" variables in the data frame.

For impLoglin.missmodel: if not given, argument margins defaults to the margins specified in the call statement of the input "missmodel" object.
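
To make the notion of a margin concrete, the following sketch uses only base functions and is not part of impLoglin. The array tab and its dimension names are made up for illustration. It computes the 1,2 and 3,4 margins of a four-way table by summing over the remaining variables, which is what list(1:2, 3:4) describes.

```r
# Hypothetical 2 x 2 x 2 x 2 table of counts (illustrative data only)
tab <- array(1:16, dim = c(2, 2, 2, 2),
             dimnames = list(V1 = c("a", "b"), V2 = c("a", "b"),
                             V3 = c("a", "b"), V4 = c("a", "b")))

# The 1,2 margin keeps V1 and V2 and sums over V3 and V4
margin12 <- apply(tab, c(1, 2), sum)

# The 3,4 margin keeps V3 and V4 and sums over V1 and V2
margin34 <- apply(tab, c(3, 4), sum)

# Each margin accounts for every observation in the table
sum(margin12) == sum(tab)
sum(margin34) == sum(tab)
```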
subset
expression specifying which rows of the data should be used in the fit. This can be a logical vector (which is replicated to have length equal to the number of rows), a numeric vector indicating the observation numbers to be included, or a character vector of the row names to be included. All observations are included by default. If object is a data frame, this expression may use variables in the data frame.
prior
specifies Dirichlet prior hyperparameters. Supply either a character string, or an object of class "priorLoglin", or an array of hyperparameters.

Valid character strings are "ml" (maximum likelihood) or "noninformative". String matching is used, so the characters "m" or "n" are sufficient. The values of the hyperparameters change with the algorithm (see for details). For example, "noninformative" means a common value of 1 for EM, and a common value of 0.5 for DA.

A class "priorLoglin" object is created by routine priorLoglin.

See argument start for the order to use in specifying a vector of hyperparameters. If a single numeric value is input, its value is replicated for all cells in the table. The hyperparameters for a data dependent prior (following an independence model) can be generated using routine dataDepPrior. See for details.

The default value is "noninformative". When a class "missmodel" object is input, any value specified in a previous call has priority over the default value (but not over any currently specified value).

Structural zeros must be coded as missing (NA) when a vector of hyperparameters is input as argument prior.

For impLoglin.missmodel: If not given, argument prior defaults to the prior specified in the call statement of the input "missmodel" object. If none was specified there, prior defaults to 0.5.
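
As an illustration of the vector form of prior, the sketch below builds a hyperparameter vector for a hypothetical 2 x 3 table using base functions only. The table dimensions and the choice of which cell is a structural zero are assumptions made for the example.

```r
# Hypothetical 2 x 3 table (6 cells); a single value replicated for all cells
ncell <- 2 * 3
hyper <- rep(0.5, ncell)

# Mark the cell V1 = 2, V2 = 3 as a structural zero by coding it NA.
# With the first variable varying fastest, that cell sits at
# position 2 + (3 - 1) * 2 = 6 in the vector.
hyper[2 + (3 - 1) * 2] <- NA
hyper
```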
start
starting values of the parameters. The form of start depends on whether the imputations are generated from one long chain, or from several chains.

For one long chain, start is a vector of cell probabilities. The length of start equals the number of distinct combinations of the factor variable levels. The ordering is such that the first variable varies the fastest, then the second variable, etc. For one long chain, you must supply the argument nimpute.

For several chains, start may be a list of such vectors, a class "Loglin" object, or a list of "Loglin" objects.

For a list of vectors, the number of imputations equals the length of the list.

A class "Loglin" object is the paramIter component of a class "missmodel" object, produced by routines such as mdLoglin, daLoglin, and emLoglin. This is a matrix with as many rows as there are saved imputations.

If a list of class "Loglin" objects is input, the estimates in the final row of each paramIter component are used to start a chain. The number of imputations equals the number of "Loglin" objects.

Starting values for cells that are structural zeros in the table should be zero.

The default starting values are all equal to one divided by the number of cells in the table.

For impLoglin.missmodel: If not given and if argument margins is not specified, then argument start defaults to the final estimates in the input "missmodel" object. If argument margins is specified, then argument start must be provided. Also notice that when argument margins is specified, care must be taken to ensure that structural zeros in these final estimates are also structural zeros in the new model.
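
The cell ordering and the default starting values described for start can be sketched with base functions. The factor names and levels here are hypothetical, and the code does not call impLoglin; it only shows how a start vector of the expected length and order might be laid out.

```r
# Hypothetical factors: V1 with 2 levels, V2 with 3 levels
cells <- expand.grid(V1 = c("no", "yes"), V2 = c("low", "mid", "high"))

# expand.grid varies the first variable fastest, which is the
# ordering described for the cell probabilities in start
cells

# Uniform starting probabilities: one divided by the number of cells
ncell <- nrow(cells)
start.unif <- rep(1 / ncell, ncell)

# Several chains: a list of such vectors, one per imputation (here 3)
start.list <- list(start.unif, start.unif, start.unif)
length(start.list)
```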
iterOn1
logical flag which determines whether the data augmentation algorithm is iterated before producing (1) the first imputation (in one long chain) or (2) each of the imputations (for parallel chains). The default value is TRUE.

In particular, for one long chain, if iterOn1 is FALSE, then the first imputation is drawn under the parameter given in start. If iterOn1 is TRUE, then data augmentation starts from start, and runs for control$niter iterations before producing the first imputation. Each remaining imputation is produced after data augmentation runs for control$niter further iterations.

Similarly, for parallel chains, if iterOn1 is FALSE, then the imputations are drawn under the parameters given in the start matrix. If iterOn1 is TRUE, then data augmentation starts from each row of start, and runs for control$niter iterations before producing each of the imputations.
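
The schedule for one long chain can be summarized with a small helper. The function iterBeforeImputation is hypothetical and not part of the library. It also assumes that the spacing of niter iterations between imputations applies when iterOn1 is FALSE, which the description above implies but does not state outright.

```r
# Hypothetical helper (not part of the library): cumulative number of
# data augmentation iterations run before each imputation in one long chain
iterBeforeImputation <- function(nimpute, niter, iterOn1 = TRUE) {
  # iterations run before the first imputation
  first <- if (iterOn1) niter else 0
  # each later imputation follows niter further iterations
  first + niter * (seq_len(nimpute) - 1)
}

iterBeforeImputation(3, 100, TRUE)   # 100 200 300
iterBeforeImputation(3, 100, FALSE)  # 0 100 200
```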
control
A list of parameters used to control the algorithm; see daLoglin.control for details.

For impLoglin.missmodel: if not given, argument control defaults to the control parameters specified in the call statement of the input "missmodel" object, but only if these are of the correct class. If these are not given (or cannot be used), then the argument control defaults to daLoglin.control().
return.type
character, which determines the structure of the returned value for ungrouped data. If "data.frame" (the default), the returned object is a data frame whose variables may inherit from class "miVariable". If "matrix", then an "miVariable" containing a matrix is returned.

VALUE:

The structure of the returned object depends on the structure of the original data and upon the argument return.type.

Suppose the original incomplete data set is ungrouped, i.e. the frequency argument is all 1s. Then the returned object is a data frame containing "miVariable" objects, or an "miVariable" object containing a matrix, depending on the value of return.type.

If the original incomplete data set is grouped, i.e. argument object consists of a matrix or data frame of the unique level combinations, and a frequency vector gives the number of times each combination occurs, then the returned object is a list of class "miList", each of whose components is a similar data frame.

See miVariable and miList for details.
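
The difference between ungrouped and grouped input can be shown with base functions. The data frame raw is made up, and missing values are left out to keep the sketch short; real input to impLoglin would normally contain NAs.

```r
# Hypothetical ungrouped data: one row per observation
raw <- data.frame(V1 = factor(c("a", "a", "b", "a", "b")),
                  V2 = factor(c("x", "x", "y", "y", "y")))

# Grouped form: one row per level combination plus a frequency column
grouped <- as.data.frame(table(raw))
names(grouped)[3] <- "count"

# Keep only the combinations that actually occur
grouped <- grouped[grouped$count > 0, ]
grouped
```

In the grouped form, the count column plays the role of the frequency argument.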

SIDE EFFECTS:

All methods create the data set .Random.seed if it does not already exist; otherwise, its value is updated.

DETAILS:

Computations in the impLoglin function are made more efficient by first calculating a preLoglin object. Therefore, if a preLoglin object already exists (e.g. through using the preLoglin function before calling emLoglin or daLoglin), then it will save computation time to pass in this object instead of the original data.

See the help file for for additional details.

REFERENCES:

Best, N. G., Cowles, M. K. and Vines, S. K. (1997), CODA Convergence, Diagnosis and Output Analysis Software for Gibbs sampling output, Version 0.4, Cambridge: Medical Research Council Biostatistics Unit.

Gilks, W. R., Richardson, S. and Spiegelhalter, D. J., editors (1996), Markov Chain Monte Carlo in Practice, London: Chapman and Hall.

Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman and Hall.

EXAMPLES:

# create starting values
crime.em <- emLoglin(crime, frequency = count) 
start.crime <- t(matrix(rep(crime.em$paramIter[2, ], 5), ncol = 5)) 
crime.imp <- impLoglin(crime, frequency = count, prior = 0.5,  
                       start = start.crime, control = list(niter = 100)) 
# look at second completed data set
miSubscript(crime.imp, 2)
# The following are equivalent:
impLoglin.default(crime, frequency = count, nimpute = 5,  
                  start = as.vector(crime.em$paramIter[2, ])) 
crime.pre <- preLoglin(data = crime, frequency = count) 
impLoglin.preLoglin(object = crime.pre, nimpute = 5,  
                 start = as.vector(crime.em$paramIter[2, ])) 
impLoglin.missmodel(crime.em, nimpute = 5)