Estimates for Conditional Gaussian Models

DESCRIPTION:

Estimates parameters for a conditional Gaussian model. There are four methods for handling missing values.

USAGE:

mdCgm(object, margins, gauss, design, optData, subset, 
    prior = <<see below>>, na.proc = "fail", start = NULL, control, 
    contrasts = NULL) 

REQUIRED ARGUMENTS:

object
an object of class "preCgm" or "missmodel", or a data frame or matrix containing the raw data.

When a data frame is input and if the margins argument is not provided, then the loglinear part of the model is assumed to be a saturated model in which all factor variables are used to form the table. If the gauss argument is not provided, then all numeric variables in the data frame are included in the conditional Gaussian part of the model.

When a matrix is input, you must provide the margins argument, which identifies the variables to use in the discrete part of the model. If the gauss argument is omitted, then all remaining variables in the matrix are used in the Gaussian part of the conditional Gaussian distribution.

If a class "missmodel" object is input, then the paramIter component of the "missmodel" object must be of class "cgm".

OPTIONAL ARGUMENTS:

margins
the marginal totals to be fit in the log-linear model. A margin is described by the factors not summed over. Thus list(1:2, 3:4) would indicate fitting the 1,2 margin (summing over variables 3 and 4) and the 3,4 margin in a four-way table. This same model can be specified using the names of the variables (e.g., list(c("V1", "V2"), c("V3", "V4"))), or using formula notation, as in margins = ~V1:V2 + V3:V4.

If margins is not specified, a saturated model is fitted.

When a matrix is input as argument object, argument margins must be specified. When a data frame is input and argument margins is missing, a saturated model involving all factor variables is fitted.

If a class "missmodel" object is input, then if margins is not given, argument margins defaults to the margins specified in the call statement of the input "missmodel" object.
gauss
identifies the variables to be used in the conditional Gaussian part of the model. These variables may be specified in three ways: as a vector of variable indices, e.g., c(1, 2, 4); as a vector of variable names, e.g., c("V1", "V2", "V4"); or using formula notation, e.g., ~V1+V2+V4. If argument gauss is omitted, then all numeric variables that do not appear in argument margins are used in the multivariate Gaussian model.
design
a formula giving the regression model for predicting the numeric variable cell means as a linear function of the factor variables and the variables provided in optData. Optionally, an ncell by m matrix may be input directly as the design matrix.

Let i=1, ..., ncell denote the cells in the loglinear model, and let mu(i) denote the vector of numeric variable means in cell i. Then the formula design provides the design matrix for predicting the cell means. As an example, let "V1" and "V2" be the names of the factor variables, and let "age" be a vector giving an average age for the subjects in each cell. Then formula design=~V1+V2 indicates a main effect model for the cell means, while design=~V1 + V2 + age indicates a main effect model for the cell means, adjusted for average cell age.

Optionally, an ncell by m matrix may be input. In this case, the regression model is obtained as a linear function of the columns of the input matrix.

If design is not specified, then the design matrix is taken to be an identity matrix.
optData
a data frame with ncell rows containing predictors to be used in computing the design matrix. In the example given in the description for argument design, the variable age would be input in argument optData.
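
The relationship between design and optData can be sketched with base functions alone. This is only an illustration of what a main-effects design matrix adjusted for age looks like; the factor levels and the age values are made up, and it is an assumption that mdCgm builds its matrix in a comparable way.

```r
# Hypothetical layout: V1 has 2 levels and V2 has 3, so ncell = 6.
# expand.grid lists the cells with the first factor varying fastest.
cells <- expand.grid(V1 = factor(c("a", "b")),
                     V2 = factor(c("x", "y", "z")))

# Hypothetical optData: one average age per cell (illustrative values only).
optData <- data.frame(age = c(31, 35, 40, 42, 47, 52))

# Main effects of V1 and V2, adjusted for average cell age.
X <- model.matrix(~ V1 + V2 + age, data = cbind(cells, optData))

dim(X)   # ncell rows; columns: intercept, 1 for V1, 2 for V2, 1 for age
```

Each row of X corresponds to one cell of the contingency table, which is why optData must have exactly ncell rows.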
subset
expression specifying which rows of the data should be used in the fit. This can be a logical vector (which is replicated to have length equal to the number of rows), a numeric vector indicating the observation numbers to be included, or a character vector of the row names to be included. All observations are included by default. If object is a data frame, this expression may use variables in the data frame.
prior
Gives the hyperparameters of the Dirichlet prior distribution assumed for the loglinear part of the model. Note that a noninformative prior is always assumed for the Gaussian parameters.

Supply either a character string, or an object of class "priorLoglin", or a vector of hyperparameters.

Valid character strings are "ml" (maximum likelihood), "noninformative", and "data.dependent". Partial matching is used, so the characters "m", "n", or "d" are sufficient. The hyperparameter values implied by each string depend on the algorithm. For example, "noninformative" means a common value of 1 for EM and a common value of 0.5 for DA.

A class "priorLoglin" object is created by routine priorLoglin.

If a vector of hyperparameters is supplied, its length must equal the number of cells formed by the factor variables. The vector is ordered so that the levels of the first variable vary fastest, the levels of the second variable vary next fastest, and so on. If a single numeric value is input, it is replicated for all cells in the table. The hyperparameters for a data-dependent prior (following an independence model) can be generated using routine dataDepPrior.

The default value is "noninformative". When a class "missmodel" object is input, any value specified in a previous call has priority over the default value (but not over any currently specified value).

Structural zeros must be coded as missing (NA) when a vector of hyperparameters is input as argument prior.

If a class "missmodel" object is input and argument prior is not given, then argument prior defaults to the prior probabilities specified in the call statement of the input "missmodel" object. If these are not specified, then the default (which depends on the algorithm) is used.
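
The cell ordering expected for a hyperparameter vector (first variable varying fastest) is the ordering produced by expand.grid, so a small sketch can make it concrete. The factor names, levels, and the position of the structural zero are hypothetical; only the value 1.05 is taken from the examples on this page.

```r
# Hypothetical table: V1 (2 levels) by V2 (3 levels), so 6 cells.
cells <- expand.grid(V1 = c("a", "b"), V2 = c("x", "y", "z"))

# One hyperparameter per cell, in the same order as the rows of 'cells'.
# NA marks an assumed structural zero in cell (V1 = "a", V2 = "z").
alpha <- c(1.05, 1.05, 1.05, 1.05, NA, 1.05)

# Side-by-side view to check which value belongs to which cell.
prior.table <- cbind(cells, alpha)
prior.table
```

A vector such as alpha would then be passed as prior = alpha. Supplying a single number instead (for example prior = 1.05) gives every cell the same value.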
na.proc
character, the method to use in handling missing data. Possible values are:
"fail"
stop with an error message if missing values are encountered,
"omit"
omit observations with missing values,
"em"
use an EM algorithm, and
"da"
use a data augmentation algorithm.

When argument object is a class "preCgm" or "missmodel" object, argument na.proc must be either "da" or "em".
start
Either a list or a "cgm" object of starting values of the model parameters. The parameters estimated by mdCgm are the cell means and variance-covariance matrix of a multivariate Gaussian distribution, and the log-linear model cell probabilities.

Thus, start may be a list with three components: a matrix mu whose ncell columns give the cell means (the columns must be in the same order as the log-linear model cells, and the rows in the same order as the continuous variables), a matrix sigma giving the variance-covariance matrix, and a vector pi giving the cell probabilities. If structural zeros appear in the contingency table, start$pi must contain zeros in the corresponding cells.

Alternatively, a class "cgm" object created as the paramIter component of the class "missmodel" object may be input for the starting values. Routines mdCgm, daCgm, and emCgm may be used to create an appropriate "missmodel" object.

In most cases the default starting values are equal to a vector of 1s for pi, and a matrix of means and a diagonal matrix of variances calculated from the numeric observations with no missing values.

When argument object is a class "missmodel" object, start defaults to the final estimates in the input "missmodel" object.
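
A starting-value list with the three components described above might be assembled as follows. The dimensions and numbers are hypothetical placeholders; only the component names (mu, sigma, pi) and their shapes come from the description of argument start.

```r
ncell <- 6   # hypothetical number of cells in the log-linear model
p     <- 2   # hypothetical number of continuous variables

start <- list(
  mu    = matrix(0, nrow = p, ncol = ncell),  # one column of means per cell
  sigma = diag(p),                            # common p x p covariance matrix
  pi    = rep(1, ncell)                       # vector of 1s, as in the default
)

# Assumed structural zero in the fifth cell: mark it with 0 in pi.
start$pi[5] <- 0
```

The columns of mu follow the same cell order as pi, that is, with the levels of the first factor varying fastest.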
control
A list of parameters used to control the algorithm. If not given, these default to the emCgm.control values or to the daCgm.control values, as appropriate. See the help files for emCgm.control and daCgm.control for details.

When a class "missmodel" object is input, the control values specified in a previous call have priority over the default values (but not over any currently specified values), but only if they are of the required class ("da" or "em").
contrasts
a list giving contrasts for some or all of the factors appearing in the design formula. The elements of the list should have the same name as the variable and should be either a contrast matrix (specifically, any full-rank matrix with as many rows as there are levels in the factor), or else a function to compute such a matrix given the number of levels.

VALUE:

an object of class "missmodel" is returned.

SIDE EFFECTS:

The function mdCgm creates the data set .Random.seed if it does not already exist; otherwise it updates its value.

DETAILS:

The mdCgm function estimates parameters of a conditional Gaussian model (also known as a "general location model") in which the factor variables are modeled according to a hierarchical log-linear model, and, conditional upon the factor variables, the distribution of the numeric variables is multivariate normal. In hierarchical models the inclusion of an interaction effect automatically means that all corresponding lower-level effects are included in the model. For example, for factors A, B, and C, inclusion of A:B:C automatically means that A, B, C, A:B, A:C, and B:C are also included in the model.
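
The full set of effects implied by a three-way interaction can be listed with the standard formula tools. This sketch uses the base function terms and the `*` operator, which expands to all main effects and interactions; it illustrates the hierarchy principle only and makes no claim about how mdCgm parses its margins argument.

```r
# A*B*C expands to every main effect and interaction among A, B, and C,
# which is exactly the hierarchical model implied by including A:B:C.
effects <- attr(terms(~ A * B * C), "term.labels")
effects
```

The result contains the seven effects named above: three main effects, three two-way interactions, and the three-way interaction.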

mdCgm handles missing values in one of four ways as indicated by the argument na.proc.

A Dirichlet prior distribution may be specified for the parameters in the log-linear model. A noninformative prior is always assumed for the parameters in the multivariate normal distribution.
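
For orientation, the model described above can be written compactly. The notation here is an assumption chosen for illustration and is not taken from this help file: w is the vector of cell counts, n the number of observations, y the vector of p numeric variables, and alpha the Dirichlet hyperparameters supplied through argument prior. The symbols pi, mu, and Sigma correspond to the pi, mu, and sigma components of argument start.

```latex
\begin{align*}
w \mid \pi &\sim \mathrm{Multinomial}(n, \pi),
  \qquad \pi = (\pi_1, \dots, \pi_{\mathrm{ncell}}), \\
y \mid \text{cell } i &\sim N_p(\mu_i, \Sigma),
  \qquad i = 1, \dots, \mathrm{ncell}, \\
p(\pi) &\propto \prod_{i=1}^{\mathrm{ncell}} \pi_i^{\alpha_i - 1}.
\end{align*}
```

The cell probabilities are further restricted to satisfy the log-linear model given by margins, and the cell means are restricted to the linear model given by design. The covariance matrix is common to all cells.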

Because the emCgm function is often called more than once, it is usually preferable to precompute quantities used by emCgm. This may be done using the preCgm function.

REFERENCES:

Agresti, A. (1990), Categorical Data Analysis, John Wiley & Sons, New York.

Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975), Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, MA.

Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, Chapman & Hall, London.

EXAMPLES:

mdCgm(object = language)             # fails by default
                                     # because language has missing data
# Fit model on part of data with no missing values:
mdCgm(language[,c("LAN", "SEX", "HGPA","FLAS")],
                                     subset=!(is.na(SEX) | is.na(HGPA)))
# Equivalent to:
completeCgm(language[,c("LAN", "SEX", "HGPA", "FLAS")],
            subset=!(is.na(SEX) | is.na(HGPA)),prior=1)
mdCgm(object = stlouis[,-1], margins = ~D1:D2+risk,
            gauss = ~verbal1+verbal2,
            design = ~D1+D2+risk, na.proc = "em",
            subset = verbal2 > 100 | is.na(verbal2))
# Equivalent to:
emCgm(object = stlouis[,-1], margins = ~D1:D2+risk,
            gauss = ~verbal1+verbal2, design = ~D1+D2+risk,
            subset = verbal2 > 100 | is.na(verbal2))

# PreProcess
language.s <- preCgm(language)

# Categorical variables LAN, AGE, PRI, SEX, GRD specify a 5 dimensional
# contingency table with 4*5*5*2*5= 1000 cells.
# Specify loglinear model with all main effects and 2-variable associations:
margins.form <- ~ LAN + AGE + PRI + SEX + GRD +
             LAN:AGE + LAN:PRI + LAN:SEX + LAN:GRD +
             AGE:PRI + AGE:SEX + AGE:GRD +
             PRI:SEX + PRI:GRD +
             SEX:GRD

# linear contrast
lc <- c(-2,-1,0,1,2)

# Set up contrasts to get dummy-coded design matrix
options(contrasts= c("contr.treatment", "contr.poly"))
design.form <- ~ LAN + C(AGE,lc,1) + C(PRI,lc,1) + SEX + C(GRD,lc,1)

# Set hyperparameter to 1.05 to ensure a mode in the
# interior of the parameter space
language.em <- mdCgm(language.s, margins = margins.form,
                     design = design.form, prior = 1.05, na.proc= "em")
# same as:
emCgm(language.s, margins = margins.form,
                     design = design.form, prior = 1.05)

# Data augmentation
language.da <- mdCgm(language.em,
                     control = list(niter = 1000, save = 100:1000),
                     na.proc = "da")
# same as:
daCgm(language.em, control = list(niter = 1000, save = 100:1000))