Fit a Generalized Estimating Equation Model

DESCRIPTION:

Returns an object of class "gee" that represents a fit of a Generalized Estimating Equation (GEE) model.

USAGE:

gee(formula, cluster, variance, data=sys.parent(), family=gaussian,
    link=NULL, correlation="independent", start=NULL, contrasts=NULL,
    subset, na.action=na.omit, control=list(algorithm=2))

REQUIRED ARGUMENTS:

formula
a formula object, with the response on the left of a ~ operator, and the terms separated by + operators on the right. A constant offset term can be added to the linear predictor, as in the glm function.
cluster
a two-column matrix of integers that identifies the cluster id and record id of each observation. Specify it as cbind(cluster.id, record.id), where both variables must be in the search path or in the data frame supplied as data. The first column is the cluster id and the second column is the record id; each row identifies one observation within a cluster. Observations within the same cluster may be correlated, while observations from different clusters are uncorrelated. All of the observations within the same cluster are assumed to have the same variance. The record id matters when the data are unbalanced or the correlation is coordinate-dependent, i.e. discrete time AR, stationary, nonstationary and unstructured correlation. In these cases, the ordering of observations within a cluster affects the results. For unbalanced data, the set of all unique record ids constitutes a complete cluster, and each cluster is a subset of the complete cluster. When modeling repeated measures with coordinate-independent correlation structures such as independent, exchangeable and continuous AR, the record id can be arbitrary. Only for balanced data does the default generate a vector of integers from 1 to the cluster size for record.id. In all other cases, provide two variables for this argument. The function recordDesign is useful for creating record ids for some data.
variance
a character string that specifies a variance structure, or a numeric initial value for the scale parameter. The character options are "glm.scale" and "glm.1". Enter "glm.scale" to indicate that the variance follows the structure of a generalized linear model multiplied by a scale parameter. If the initial value of the scale parameter is known, enter that value instead. If the scale parameter is known to be exactly equal to 1, enter "glm.1". Use varDesign for more complicated variance structures.

OPTIONAL ARGUMENTS:

data
a data frame in which to interpret the variables included in specifying arguments formula, subset, cluster, variance, and correlation. If data is missing, then the variables should be on the search list.
family
a glm "family" object or character string identifying the family. Families supported are "gaussian", "binomial", "poisson", "Gamma", "inverse.gaussian". The default is "gaussian".
link
a character string identifying the link function corresponding to family. For example, use "log" for poisson, and "logit" or "probit" for binomial. Not all links are supported for each family. The supported links are "logit", "probit", "cloglog", "log", "identity", "power(x)", "inverse", and "1/mu^2". The "gaussian" and "inverse.gaussian" families have only one supported link, "identity" and "1/mu^2", respectively. The "power(x)" link is parameterized by a non-negative real value x and may be used with all families.
correlation
a character string or a list that specifies the correlation structure. Use geeDesign for more complicated correlation structures. Typically, a character string is entered to specify a correlation structure. The following character strings are permitted:
"AR"
autoregressive correlation with discrete occasions,
"contAR"
autoregressive correlation with continuous occasions,
"exchangeable"
exchangeable correlation,
"independent"
independent correlation (default),
"stationary"
stationary correlation,
"nonstationary"
nonstationary correlation,
"unstruct"
unstructured correlation.

The correlation r is parameterized by a: under "AR", r=a^d, and under "contAR", r=exp(-d/a), where d is the difference between two time points. For other structures, each cell is either r=a or r=0, e.g. the cells outside the non-zero bands in "stationary" are 0.
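
The two autoregressive parameterizations above can be illustrated outside of gee. The following is a minimal sketch in base R, whose syntax is shared with S. The helper names ar.corr and contar.corr and the time points are hypothetical and are not part of this library.

```r
# Hypothetical helpers for illustration; they are not part of the gee library.
# d is the matrix of absolute differences between time points.
ar.corr <- function(times, a) {
  d <- abs(outer(times, times, "-"))
  a^d                       # discrete-time AR: r = a^d
}
contar.corr <- function(times, a) {
  d <- abs(outer(times, times, "-"))
  exp(-d / a)               # continuous-time AR: r = exp(-d/a)
}

R.ar  <- ar.corr(1:4, a = 0.5)             # equally spaced occasions
R.car <- contar.corr(c(0, 0.5, 2), a = 1)  # unequally spaced times
```

In both cases the diagonal is 1 because d=0 there, and the correlation decays as the time points move apart.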

The "stationary" and "nonstationary" structures require a numeric parameter to specify the number of bands; the default is 1. Except for "independent" and "exchangeable", all structures are either coordinate or covariate dependent and thus require additional variables to identify the time variable, locate missed experiments in unbalanced data or resolve the ordering of occasions in balanced data. By default, the integer record.id in cluster is used for indexing the discrete time "AR", "nonstationary", "stationary" and "unstruct" cases. If the data are not both balanced and sorted by cluster and record id, and the structure is one of "AR", "contAR", "nonstationary", "stationary" or "unstruct", enter a list for the correlation argument with the following component names:
"type": one of the above correlation structures.
"x.layer": a name of the factor or variable to identify the levels or coordinate of observations within clusters. If "x.layer" is not provided, the record.id in cluster will be used whenever the specified type requires a variable or an index. This default might not be suitable for certain correlation structures.
"par": the parameter value, such as the number of bands required by the "stationary" or "nonstationary" correlation structures.
Here are some examples of how to specify the correlation argument:
correlation = "AR"
correlation = list(type = "contAR", x.layer = "time")
correlation = list(type="stationary", par=2)
start
a list of vectors containing initial values for the regression and/or correlation parameters. The vectors in the list must be named "regression" and "correlation".
contrasts
a list giving contrasts for some or all of the factors appearing in the model formula. The elements of the list should have the same name as the variable and should be either a contrast matrix (specifically, any full-rank matrix with as many rows as there are levels in the factor), or else a function to compute such a matrix given the number of levels.
subset
an expression saying which subset of the rows of the data should be used in the fit. This can be a logical vector (which is replicated to have length equal to the number of observations), or a numeric vector indicating which observation numbers are to be included, or a character vector of the row names to be included. All observations are included by default.
na.action
a function to filter missing data. The default is "na.omit".
control
a list of control variables or a geeControl object to control the iteration procedures. These include algorithm, tolerance.reg, tolerance.cor, maxit, trace, and sorted.
Three algorithms are available:
Enter 0, for "GEE0" or GEE with fixed correlation during iterations.
Enter 1, for "GEE1" or GEE with moment estimators for covariance parameters.
Enter 2, for "GEE2" or GEE with paired estimating equations (default).
Other options include the convergence criteria, data sorting, and a flag to print the iteration information. To replace the default values, enter a list of these arguments or a call to the function geeControl. See the help file for the geeControl function for details.

VALUE:

An object of class "gee" is returned. See gee.object for details.

DETAILS:

A complete specification of a GEE model includes the mean, the variance, and a working correlation matrix. A simple form of GEE model uses the mean and variance structures of a Generalized Linear Model, which are specified by the arguments family, link and variance. These are similar, but not identical, to the family and link arguments of the glm function. The link function of a family relates the mean to the linear predictor, which contains the regression parameters of interest. The linear predictor is specified by the required argument formula.

The argument cluster identifies independent clusters by cluster id and record id. All of the observations within the same cluster are assumed to have the same variance. For some simple cases, the function recordDesign can sort the data and generate these variables.
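
For a balanced layout, the two-column cluster matrix can also be built by hand. This is a minimal sketch in base R (syntax shared with S); the cluster count and cluster size are illustrative values, and the data are assumed to be already sorted by cluster and then by record.

```r
# Balanced layout: 3 clusters with 4 records each (illustrative values).
n.cluster <- 3
n.record  <- 4

# Cluster id repeats within a cluster: 1 1 1 1 2 2 2 2 3 3 3 3
cluster.id <- rep(1:n.cluster, rep(n.record, n.cluster))
# Record id restarts in every cluster: 1 2 3 4 1 2 3 4 1 2 3 4
record.id <- rep(1:n.record, n.cluster)

# Two-column matrix in the form expected by the cluster argument
cluster <- cbind(cluster.id, record.id)
```

Each row of cluster then identifies one observation, with the cluster id in the first column and the record id in the second.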

By default, the initial values of the regression coefficients are estimated by glm, assuming independent clusters. If the initial values are known, supply them through the argument start.

For each family, the variance is assumed to be a known function of the mean multiplied by a scale parameter. If the scale is exactly 1, set the argument variance to "glm.1". If the scale is a known constant, enter a positive number for variance. Otherwise, the scale is an unknown parameter, so enter "glm.scale".
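
In symbols, the variance structure described above can be written as follows. The notation is an assumption introduced here for illustration and does not come from this help file: y_ij is record j in cluster i, mu_ij is its mean, phi is the scale parameter and V is the variance function of the family.

```latex
\operatorname{Var}(y_{ij}) = \phi \, V(\mu_{ij})
```

For the standard generalized linear model families, V(mu) is 1 for gaussian, mu for poisson, mu(1-mu) for binomial, mu^2 for Gamma and mu^3 for inverse.gaussian. Setting variance="glm.1" corresponds to phi=1.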

The correlation matrix is parameterized by a vector. By default, the estimates of the regression coefficients and correlation parameters are obtained from a pair of estimating equations (Prentice, 1988), with Fisher scoring used in the iterations. As a result, estimates of the regression coefficients and the correlation vector, together with their variance estimates, are available. This algorithm is called GEE2. In Liang and Zeger (1986), the correlation vector is estimated by the method of moments; this algorithm is called GEE1. Alternatively, the covariance can be held fixed during the iterations and the nuisance parameters estimated by the method of moments after the regression parameters have converged; this algorithm is called GEE0. Select the algorithm through the control argument, e.g. control=list(algorithm=2). The control argument can also set other parameters such as the convergence criteria. In addition, if the data have already been sorted, enter control=list(sorted=T) to avoid unnecessary sorting.
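
For orientation, the estimating equation for the regression parameters in its standard form from Liang and Zeger (1986) is sketched below. The notation is not taken from this help file, and how the library implements each algorithm is not stated here: K is the number of clusters, y_i and mu_i are the response and mean vectors of cluster i, beta is the regression parameter vector, alpha is the correlation parameter vector, A_i is the diagonal matrix of variance functions and R_i(alpha) is the working correlation matrix.

```latex
\sum_{i=1}^{K} D_i^{\top} V_i^{-1} \,(y_i - \mu_i) = 0,
\qquad
D_i = \frac{\partial \mu_i}{\partial \beta},
\qquad
V_i = \phi \, A_i^{1/2} R_i(\alpha) A_i^{1/2}
```

The three algorithms differ in how alpha is handled: held fixed during the iterations (GEE0), estimated by the method of moments (GEE1), or estimated from a second estimating equation paired with the one above (GEE2).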

The argument correlation specifies the working correlation matrix. In general, stationary and nonstationary require a variable and a parameter to identify the corresponding parameterization. In this case, enter correlation as a list with components type, x.layer and par, where type is one of the correlation structures, x.layer is the name of a variable that identifies records, and par is an integer giving the number of bands of the stationary or nonstationary structure. Note that the variable named in x.layer must be a column in the data frame data or be in the search path. The default record id can be used for x.layer only when the data are balanced and sorted by cluster id and record id. The default uses the integers from 1 to the size of an individual cluster, and this record id is the same for all clusters. For unbalanced data, an x.layer that identifies the records within a cluster is required. For discrete AR, the time variable usually serves as this x.layer, and it might not be the record id. On the other hand, a continuous time variable is not necessarily the same as a discrete time or a record id, and should be avoided for such purposes in discrete or continuous AR. For example, when unstructured correlation is applied to unbalanced longitudinal data, a continuous time variable cannot take the place of the record id in identifying the parameters of the unstructured correlation across clusters. A correct x.layer is therefore required.
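
The role of par as the number of bands can be pictured with a small banded matrix. This is a minimal sketch in base R (syntax shared with S). It assumes that a stationary structure has one parameter per lag and zeros beyond the last band. The helper name stationary.corr and the parameter values are hypothetical and are not part of this library.

```r
# Hypothetical helper for illustration; it is not part of the gee library.
# Builds an n x n banded correlation matrix in which lag k (k = 1..length(a))
# has correlation a[k] and every cell beyond the bands is 0.
stationary.corr <- function(n, a) {
  m <- length(a)                        # number of bands, i.e. "par"
  stopifnot(m < n)
  toeplitz(c(1, a, rep(0, n - 1 - m)))  # first row: 1, a[1..m], then zeros
}

# Two bands for clusters of size 5, as in correlation=list(type="stationary", par=2)
R2 <- stationary.corr(5, a = c(0.6, 0.3))
```

Because each cell depends only on the lag between two records, the records must be indexed consistently across clusters, which is why a correct x.layer matters for unbalanced data.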

To fit a more complicated model, see geeDesign, which provides advanced methods for variance structures and correlation structures in modeling overdispersed and hierarchical data.

REFERENCES:

Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.

Prentice, R. L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics 44, 1033-1048.

EXAMPLES:

## Create clusterID and recordID and sort the data; add an offset 
## based on the no. of weeks of observation, baseline=8, treatment=2
Seizure.Subject <- recordDesign("Subject",data.frame(Seizure,
   offset=rep(log(c(8,2,2,2,2)),59)))

gee.out <- gee(y~group+offset(offset),cluster=cbind(clusterID,recordID),
   variance="glm.scale",data=Seizure.Subject,family=poisson,link=log,
   correlation="exchangeable",contrasts=list(group=contr.treatment),
   control=geeControl(trace = T))

## Add baseline indicator to isolate a baseline effect
Seizure1.Subject <- data.frame(Seizure.Subject,post=rep(c(0,1,1,1,1),59))

gee.out  <- gee(y~group*post+offset(offset),cluster=cbind(clusterID,recordID),
   variance="glm.scale",family="poisson",link="log",data=Seizure1.Subject,
   correlation=list(type="stationary",par=4),subset=Subject!=49,
   contrasts=list(group=contr.treatment))

summary(gee.out)

## For a known scale in variance structure
SpruceGrpd.Subject <- recordDesign("Subject",na.omit(SpruceGrpd))

gee.out <- gee(y~Time + group, cluster=cbind(clusterID,recordID),
   variance=0.02, family=Gamma,link="power(1.5)",data=SpruceGrpd.Subject,
   correlation=list(type="contAR", x.layer="Time"), 
   contrasts=list(group=contr.treatment), control=list(algorithm=2))