Big Data Generalized Linear Model

DESCRIPTION:

Fits a generalized linear model (GLM) to a big data object using the same syntax as the glm function. Users typically do not call this function directly; it is invoked by glm when the data argument is of class "bdFrame".

This function requires the bigdata library section to be loaded.

USAGE:

bdGlm(formula, family=gaussian, data, weights, subset, na.action,
      control=glm.control(...), contrasts=NULL, correlation=TRUE)

REQUIRED ARGUMENTS:

formula
a formula object, with the response on the left of a ~ operator and the terms, separated by + operators, on the right. The response must be a single numeric variable.
data
a bdFrame in which to interpret the variables named in the formula, subset, and weights arguments.

OPTIONAL ARGUMENTS:

family
a family object. This is a list of expressions for defining the link, variance function, initialization values, and iterative weights for the generalized linear model. Supported families are: gaussian, binomial, poisson, quasi, inverse.gaussian, and Gamma. Functions like binomial produce a family object and can be given without the parentheses. Family functions can take arguments, as in binomial(link=probit). For more details, see the help files for family and family.object.
weights
the weights for the fitting criterion. This must be one of the columns in data. By default, all observations are weighted equally.
subset
an expression defining which subset of the rows in the data to use in the fit. This can be a logical vector, which is replicated to have length equal to the number of observations, or a numeric vector indicating which observation numbers to include. All observations are included by default.
na.action
a function to filter missing data. This is applied to the model.frame after any subset argument has been applied. The default is na.fail, which returns an error if any missing values are found. An alternative is na.exclude, which deletes observations that contain one or more missing values.
control
a list of iteration and algorithmic constants. See glm.control for their names and default values. These can also be given directly as arguments to bdGlm itself, instead of through control.
contrasts
a list of contrasts to be used for some or all of the factors appearing as variables in the model formula. The names of the list should be the names of the corresponding variables. The elements of the list should be either contrast-type matrices (matrices with as many rows as levels of the factor, and with columns linearly independent of each other and of a column of ones), or else they should be functions that compute such contrast matrices. See the help file for contr.helmert for examples.
correlation
a logical value indicating whether to return the computed correlation matrix for the coefficients in the model. The matrix is p by p, where p is the number of coefficients in the model, so it can become large when the model includes factors with many levels. To avoid extracting this matrix, specify correlation=F.

VALUE:

an object of class "bdGlm". The methods for this object provide big data equivalents of the corresponding methods for a glm object. Methods available include: print, summary, coef, plot, residuals, fitted, predict, anova and deviance. While the methods behave the same, the actual structures of glm and bdGlm objects differ.

DETAILS:

The bdGlm function is typically not called directly by a user. It is invoked through a call to glm when the data argument is a big data object (an object of class "bdFrame").

Evaluation of the formula and creation of the model matrix, including the handling of contrasts, weights, subset and na.action, are done in the same way as for an ordinary glm model.

Limitations of bdGlm relative to glm:
- the response variable cannot be a matrix
- the predictors cannot contain terms using: offset, poly, bs, ns
- there are no ordered factors for bigdata objects, so the contrasts for ordered factors are never used
- the glm arguments start, method, model, x and y do not work in bdGlm
- user-defined families are not supported
- the predict method, predict.bdGlm, does not support arguments other than object, newdata and type
- the predict method does not support type="terms"
- the plot method, plot.bdGlm, does not create a normal QQ residuals plot. The plots that are produced are hexbin scatterplots.
- the anova method, anova.bdGlm, only produces a summary ANOVA when given a single model object

REFERENCES:

Chambers, J. M. and Hastie, T. J. (1993). Statistical Models in S. London: Chapman and Hall.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. London: Chapman and Hall.

EXAMPLES:

# Convert kyphosis to a bdFrame; glm then invokes bdGlm because data is a bdFrame:
bigkyphosis <- as.bdFrame(kyphosis)
bigGlm <- glm(Kyphosis ~ Age + Number, family=binomial, data=bigkyphosis)
# Check class of the model object:
class(bigGlm)
# Print the model object:
bigGlm
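# A further sketch, assuming the bigkyphosis and bigGlm objects created
# above. It uses only the methods and arguments named in this help file
# and has not been checked against a particular release.
# Summarize the fit, then extract the coefficients and the deviance:
summary(bigGlm)
coef(bigGlm)
deviance(bigGlm)
# Predict on the response scale (predict.bdGlm accepts only
# object, newdata and type):
predict(bigGlm, type="response")
# Call bdGlm directly with a probit link and skip the coefficient
# correlation matrix (bigProbit is an illustrative name):
bigProbit <- bdGlm(Kyphosis ~ Age + Number, family=binomial(link=probit),
      data=bigkyphosis, correlation=F)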