Fit a Spatial Linear Regression Model

DESCRIPTION:

Returns an object of class slm that represents a fit of a spatial linear (generalized least squares) regression model.

USAGE:

slm(formula, cov.family, data=<<see below>>, subset=<<see below>>,  
    spatial.arglist=NULL, na.action=na.fail, model=F, x=F, 
    y=F, contrasts=NULL, ...)  

REQUIRED ARGUMENTS:

formula
a formula object, with the response on the left of a `~' operator, and the terms, separated by + operators, on the right.
cov.family
an object of class "cov.family" giving the spatial covariance family to be fit. Valid values are: CAR (conditional auto-regression), SAR (simultaneous auto-regression), or MA (moving average). These are S-PLUS objects containing functions required by the slm fitting algorithm. The covariance model is defined by argument cov.family and is further defined by the variables listed in argument spatial.arglist.

OPTIONAL ARGUMENTS:

data
a data frame in which to interpret the variables named in the formula or in the subset argument. If this is missing, then the variables in the formula should be on the search list. This may also be a single number to handle some special cases -- see NAMES below for details.
subset
this can be a logical vector (with length equal to the number of observations), or a numeric vector indicating which observation numbers are to be included, or a character vector of the row names to be included in the model. All observations are included by default.
spatial.arglist
a list containing arguments required by (and further defining) the spatial model as specified by argument cov.family. Instead of entering these arguments individually, spatial.arglist is used to allow the algorithms to be generalized to different kinds of models. For all of the models currently fit by slm, the spatial.arglist argument contains the following variables:

REQUIRED

neighbor - an object of class "spatial.neighbor" containing the neighbors and weights to be used when defining the covariance model (see spatial.neighbor).

OPTIONAL

region.id - a vector containing the rows currently available in the spatial neighbor object. Argument region.id must be given whenever argument subset is given and rows have previously been removed from the spatial neighbor object. This is described below in the DETAILS section. Also see the help file for spatial.subset.

weights - the cov.family uses the neighbor argument to determine a covariance matrix for the residuals. All current types for cov.family allow the specification of a diagonal matrix of weights in the parameterization of the covariance matrix. See the help file for the chosen cov.family (CAR, SAR, or MA) for the parameterization. If specified, vector weights contains these diagonal values. If omitted, weights equal to 1 are used.

start - vector of starting values for the optimization algorithm. Since a profile likelihood is optimized, only starting values for the covariance matrix parameters (vector parameters in the output) can be provided. If not provided, these typically default to zero, but this depends upon the cov.family.

print.level - if TRUE, then the function evaluations are printed as the optimization algorithm proceeds. This can be quite useful for checking on convergence of the algorithm to the maximum likelihood estimates.

na.action
a function to filter missing data. This is applied to the model frame after any subset argument has been used. The default (with na.fail) is to create an error if any missing values are found. A possible alternative is na.omit, which deletes observations that contain one or more missing values.
model
logical flag: if TRUE, the model frame is returned in component model.
x
logical flag: if TRUE, the model matrix is returned in component x.
y
logical flag: if TRUE, the response is returned in component y.
qr
logical flag: if TRUE, the QR decomposition of the model matrix is returned in component qr.
contrasts
a list of contrasts for some or all of the factors appearing in the model formula. Each element of the list should have the same name as the corresponding factor variable, and should be either a contrast matrix (specifically, any full-rank matrix with as many rows as there are levels in the factor), or a function to compute such a matrix given the number of levels.
...
additional arguments which can be passed to the function slm.nlminb and which affect the iterative estimation algorithm. In particular, various algorithmic control values can be passed, along with the lower and upper bounds of the parameters.
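
As a rough sketch of how the optional arguments combine, the call below reuses the sids data frame and the sids.neighbor object from the EXAMPLES section. The object name sids.fit and the starting value 0.1 are illustrative assumptions; in particular, a single starting value assumes that the SAR family has one covariance parameter.

```s
# Sketch only: assumes sids and sids.neighbor (see EXAMPLES) are available.
sids.fit <- slm(sid.ft ~ nwbirths.ft, cov.family=SAR, data=sids,
     model=T,                      # keep the model frame in sids.fit$model
     spatial.arglist=list(neighbor=sids.neighbor,
          weights=1/sids$births,   # diagonal weights, as in EXAMPLES
          start=0.1,               # assumed: one covariance parameter
          print.level=T))          # print evaluations as the optimizer runs
```

With print.level=T the printed evaluations give a quick check on whether the optimizer is converging before the returned object is examined.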

VALUE:

an object of class "slm". Objects of class "slm" contain most elements available in class "lm" objects (but they do not inherit from class "lm" objects), and they also contain items returned by the function nlminb. These elements are as follows:
parameters
final values of the parameters over which the optimization takes place. These are the parameters used in defining the covariance structure.
objective
the final value of the objective (-log-likelihood).
message
a statement of the reason for termination.
grad.norm
the final norm of the objective gradient. If there are active bounds, then components corresponding to active bounds are excluded from the norm calculation. If the number of active bounds is equal to the number of parameters, NA will be returned.
iterations
the total number of iterations before termination.
f.evals
the total number of residual evaluations before termination.
g.evals
the total number of jacobian evaluations before termination.
scale
the final value of the scale vector for the minimization.
coefficients
the coefficients of the generalized least-squares fit of the response to the columns of the model matrix. The names of the coefficients are the names of the single-degree-of-freedom effects (the columns of the model matrix). If the model was overdetermined and singular.ok was true, there will be missing values in the coefficients corresponding to inestimable coefficients.
residuals
the residuals from the fit. These are not ordinary residuals. See the cov.family.object help file or the CAR, SAR, or MA help files for more information.
fitted.values
the fitted values from the fit. These are the linear trend, X%*%beta, where X contains the independent variables, and beta contains the coefficients of the linear model.
rank
the computed rank (number of linearly independent columns) of the model matrix. If the rank is less than the dimension of R, the columns of R will have been pivoted, and missing values will have been inserted in the coefficients. The upper-left rank rows and columns of R are the nonsingular part of the fit, and the remaining columns of the first rank rows give the aliasing information (see alias).
assign
the list of assignments of coefficients (and effects) to the terms in the model. The names of this list are the names of the terms. The ith element of the list is the vector saying which coefficients correspond to the ith term. It may be of length 0 if there were no estimable effects for the term.
call
an image of the call that produced the object, but with the arguments all named and with the actual formula included as the formula argument.
contrasts
a list containing sufficient information to construct the contrasts used to fit any factors occurring in the model. The list contains entries that are either matrices or character vectors. When a factor is coded by contrasts, the corresponding contrast matrix is stored in this list. Factors that appear only as dummy variables and variables in the model that are matrices correspond to character vectors in the list. The character vector has the level names for a factor or the column labels for a matrix.
df.residual
the number of degrees of freedom for residuals.
model
optionally the model frame, if model=TRUE.
x
optionally the model matrix, if x=TRUE.
y
optionally the response, if y=TRUE.
weights
the optional weights (from argument spatial.arglist) used in the model.
tau2
the residual variance estimate.
cov.coef
the variance-covariance matrix for the coefficients. It is assumed that the estimated coefficients are independent of the covariance matrix parameters (true in the CAR, SAR, and MA models).
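
The components listed above can be examined directly. The following sketch assumes the object sids.maslm created in the EXAMPLES section; the names est and se are hypothetical, and the ratio column is only a rough guide that relies on the independence assumption stated for cov.coef.

```s
# Sketch only: sids.maslm is the MA fit created in the EXAMPLES section.
sids.maslm$parameters      # estimated covariance parameter(s)
sids.maslm$objective       # negative log-likelihood at termination
sids.maslm$message         # reason the optimizer stopped
sids.maslm$tau2            # residual variance estimate

# Approximate standard errors for the regression coefficients,
# taken from the diagonal of the documented cov.coef component.
est <- sids.maslm$coefficients
se <- sqrt(diag(sids.maslm$cov.coef))
cbind(estimate=est, std.error=se, ratio=est/se)
```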

DETAILS:

slm computes maximum likelihood estimates for spatial regression models (these are equivalent to generalized least squares estimates) using finite difference derivatives and a quasi-Newton optimization algorithm. In such models one assumes a linear model,

E(y | x) = x beta,

for the means of the dependent variable given the fixed covariate values, but the errors are assumed to arise from a multivariate normal distribution with a covariance structure as specified by the covariance structure model cov.family. See the help files for the MA, CAR and SAR objects for the types of covariance structures available. For the usual model based on independent errors, lm may be used.

The sparse matrix routines of Kundert (1988) are used in solving linear systems and computing determinants required by the likelihood function. The use of these routines makes the algorithm much more efficient than would otherwise be the case. Even so, the CPU time required by the algorithm can be quite large, so lattices with more than, say, 200 to 400 regions should be handled carefully to ensure that enough CPU time will be available.

A profile likelihood is computed. In this likelihood an equation for the linear model parameters (beta) is obtained for known covariance model parameters. Substituting this equation back into the likelihood, the "profile" likelihood is obtained as a function of the covariance model parameters alone. Because a profile likelihood is used, there is a relatively small number of parameters to optimize, making the use of finite difference derivatives more attractive.
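
The profiling step can be written out in generic form. The notation below is an assumption made for illustration: theta stands for the covariance model parameters (component parameters), V(theta) for the covariance matrix up to the scale tau2 as defined by the chosen cov.family, X for the model matrix, and n for the number of observations. Whether slm divides by n or by the residual degrees of freedom when reporting tau2 is not stated here.

```latex
% Negative log-likelihood of the Gaussian model y ~ N(X beta, tau^2 V(theta))
-\log L(\beta, \tau^2, \theta)
  = \frac{n}{2}\log(2\pi\tau^2) + \frac{1}{2}\log\lvert V(\theta)\rvert
  + \frac{1}{2\tau^2}\,(y - X\beta)^{T} V(\theta)^{-1} (y - X\beta)

% For fixed theta, the minimizers are the generalized least squares solutions
\hat\beta(\theta) = \bigl(X^{T} V(\theta)^{-1} X\bigr)^{-1} X^{T} V(\theta)^{-1} y,
\qquad
\hat\tau^2(\theta) = \frac{1}{n}\,
  \bigl(y - X\hat\beta(\theta)\bigr)^{T} V(\theta)^{-1} \bigl(y - X\hat\beta(\theta)\bigr)

% Substituting back leaves a function of theta alone
-\log L_p(\theta)
  = \frac{n}{2}\log\hat\tau^2(\theta) + \frac{1}{2}\log\lvert V(\theta)\rvert
  + \frac{n}{2}\,(1 + \log 2\pi)
```

Only theta remains to be optimized numerically, which is why finite difference derivatives are practical here.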

Subsetting operations on the spatial data frame are more difficult because the spatial neighbor object must also be subset. This means that a correspondence must be maintained between the "data" object, which contains the fixed covariates, and the "neighbor" object, which maintains information about neighbor relationships. The region.id variable of argument spatial.arglist provides this correspondence. In the following, for the sake of clarity, we suppose that the linear model is specified via a data frame argument data. Vector region.id must be the same length as the vectors in the linear model, and the i-th element of region.id must "name" the region for the i-th row of data in exactly the same manner that the row.id and col.id values in the "spatial.neighbor" object name a region. The elements of region.id are then keys to the row.id and col.id columns of object neighbor. If rows of object data are removed, the names of these rows are given by the elements of object region.id, and these are the same names used in the row.id and col.id columns of object neighbor. The corresponding rows in neighbor can then be removed by the subsetting operation.

If the subset argument is present, it is evaluated in the context of the data frame, like the terms in formula. It is also used in the computation of subsets for any of the arguments contained in spatial.arglist, including variables neighbor and region.id. The specific action of subset on the model arguments is as follows: the model frame is computed on all rows, then the appropriate subset is extracted. A variety of special cases make such an interpretation desirable (e.g., the use of lag or other functions that may need more than the data used in the fit to be fully defined). On the other hand, if you meant the subset to avoid computing undefined values or to escape warning messages, you may be surprised. For example,

slm(y ~ log(x), cov.family = SAR, data = mydata, subset = x > 0)

will still generate warnings from log. If this is a problem, do the subsetting on the data frame directly:

slm(y ~ log(x), cov.family = SAR, data = mydata[mydata$x > 0,])

The subset argument acts on variable neighbor of the spatial.arglist argument as follows: Let region.id of spatial.arglist identify the row numbers in neighbor of spatial.arglist corresponding to the rows of the data frame given in argument data. Then region.id[subset] is a listing of the row and column numbers to be used in neighbor. Rows and columns of neighbor not in the vector region.id[subset] are removed.
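
The rule can be illustrated on a small hypothetical lattice. In the sketch below the data frame nb is only a stand-in with row.id, col.id and weights columns, not a real "spatial.neighbor" object, and the names keep.rows, keep.id and keep are assumptions made for the illustration. slm performs the equivalent step itself when subset and region.id are supplied.

```s
# Hypothetical stand-in for a neighbor object: one row per neighbor pair.
nb <- data.frame(row.id=c(1, 1, 2, 2, 3, 3, 4, 4),
                 col.id=c(2, 3, 1, 4, 1, 4, 2, 3),
                 weights=rep(0.5, 8))
region.id <- 1:4                   # region named by each row of data
keep.rows <- c(T, T, F, T)         # plays the role of subset: drop row 3
keep.id <- region.id[keep.rows]    # regions that remain: 1, 2, 4

# A neighbor pair is kept only if both of its regions remain.
keep <- match(nb$row.id, keep.id, nomatch=0) > 0 &
        match(nb$col.id, keep.id, nomatch=0) > 0
nb[keep, ]
```

Every pair involving region 3 is removed, so the reduced neighbor information stays aligned with the reduced data.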

As in lm, the formula argument is passed around unevaluated; that is, the variables mentioned in the formula in slm will be defined when the model frame is computed, not when slm is initially called. In particular, if data is given, all these names should be defined as variables in that data frame.

Generic functions such as print have methods to show the results of the fit.

NAMES. Variables occurring in a formula are evaluated differently from arguments to S-PLUS functions, because the formula is an object that is passed around unevaluated from one function to another. Functions such as slm that finally arrange to evaluate the variables in the formula try to establish a context based on the data argument. (More precisely, the function model.frame.default does the actual evaluation, assuming that its caller behaves in the way described here.) If the data argument to slm is missing or is an object (typically, a data frame), then the local context for variable names is the frame of the function that called slm, or the top-level expression frame if the user called slm directly. Names in the formula can refer to variables in the local context as well as global variables or variables in the data object.

The data argument can also be a number, in which case that number defines the local context. This can arise, for example, if a function is written to call slm, perhaps in a loop, but the local context is definitely not that function. In this case, the function can set data to sys.parent(), and the local context will be the next function up the calling stack. See the third example below. A numeric value for data can also be supplied if a local context is being explicitly created by a call to new.frame. Notice that supplying data as a number implies that this is the only local context; local variables in any other function will not be available when the model frame is evaluated. This is potentially subtle. Fortunately, it is not something the ordinary user of slm needs to worry about. It is relevant for those writing functions that call slm or other such model-fitting functions.

REFERENCES:

Cliff, A. D. and Ord, J. K. (1981). Spatial Processes - Models and Applications. Pion Limited. London.

Cressie, N. A. C. (1993). Statistics for Spatial Data. (Revised Edition). Wiley, New York.

Haining, R. (1990). Spatial Data Analysis in the Social and Environmental Sciences. Cambridge University Press. Cambridge.

Kundert, Kenneth S. and Sangiovanni-Vincentelli, Alberto (1988). A Sparse Linear Equation Solver. Department of EE and CS, University of California, Berkeley.

Ripley, B. D. (1981). Spatial Statistics. Wiley, New York.

There is a vast literature on spatial regression and generalized least squares; the references above are just a small sample of what is available.

EXAMPLES:

sids.maslm <- slm(sid.ft ~ nwbirths.ft, cov.family=MA, data=sids,  
     spatial.arglist=list(neighbor=sids.neighbor)) 
sids.sarslm <- slm(sid.ft ~ nwbirths.ft, cov.family=SAR, data=sids,  
     subset=c(-5,-1), spatial.arglist=list(neighbor=sids.neighbor,  
     region.id=1:100, weights=1/sids$births)) 
# myfit calls slm, using the caller of myfit as the local context
# for variables in the formula (see aov for an actual example)
myfit <- function(formula, cov.family, data=sys.parent(), ...) {
    # ... (code elided) ...
    fit <- slm(formula, cov.family, data, ...)
    # ... (code elided) ...
}