Big Data Principal Component Analysis

DESCRIPTION:

Returns an object of class bdPrincomp containing the standard deviations of the principal components, the loadings, and, optionally, the scores.

This function requires the bigdata library section to be loaded.

USAGE:

bdPrincomp(x, data=NULL, covlist=NULL, scores=T, cor=F, na.action, subset)

REQUIRED ARGUMENTS:

At least one of x or data must be given.

OPTIONAL ARGUMENTS:

x
a bdFrame or formula. If a bdFrame, the columns should correspond to variables and the rows to observations. If a formula, no variables may appear on the left (response) side.
data
a bdFrame. Usually, this is used only when x is a formula, although it might be used instead of x.
covlist
This argument is not currently supported in bdPrincomp. It is in the function signature for consistency with princomp, but the function will stop with an error message if it is not NULL.
scores
logical value. If scores is TRUE, then a bdFrame of the scores for all of the components is returned. If scores is FALSE, then no scores are computed.
cor
logical flag: if TRUE, then the principal components are based on the correlation matrix rather than the covariance matrix. That is, the variables are scaled to have unit variance.
na.action
function to handle missing values.
subset
the subset of the observations to use.

VALUE:

an object of class "bdPrincomp", which is a list with the following components:
sdev
vector of standard deviations of the principal components.
loadings
orthogonal matrix of class "loadings" giving the loadings. The first column is the linear combination of columns of x defining the first principal component, etc.
center
vector of centers for the variables.
scale
vector of numbers by which the variables are scaled. If cor is FALSE, these are all 1. If cor is TRUE, scale contains the standard deviations of the input data variables.
n.obs
the number of observations on which the estimates are based.
formula
the formula. This is not present if a formula was not used.
call
the call to bdPrincomp.
bdModel
a bdModel object used by predict.bdPrincomp to compute predictions on new data.
bdPredictions
a bdFrame containing principal component scores for the data.
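The components above fit together in a simple way: the scores are the centered and scaled data rotated by the loadings. The following is a minimal sketch of that relationship, not the bdPrincomp implementation. It assumes base R's princomp as a stand-in (it stores the scores in a scores component, whereas bdPrincomp returns them in bdPredictions), and manual.scores is a name introduced only for illustration.

```r
# Assumption: base R's princomp() stands in for bdPrincomp in this sketch.
fit <- princomp(state.x77, cor = TRUE)

# Center and scale the data with the stored vectors, then rotate by the loadings.
z <- scale(state.x77, center = fit$center, scale = fit$scale)
manual.scores <- z %*% unclass(fit$loadings)

# Largest absolute difference between the reconstruction and the stored scores.
max(abs(manual.scores - fit$scores))
```

The difference printed on the last line should be at the level of floating-point rounding error.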

DETAILS:

The results of princomp and bdPrincomp agree to double-precision accuracy with one exception: The signs of the loadings are not determined uniquely in principal components analysis; therefore, they might differ.
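The sign ambiguity can be seen directly: negating a loading vector leaves the standard deviation of the derived variable unchanged. The sketch below uses base R's princomp as a stand-in for bdPrincomp (an assumption), and align.signs is a hypothetical helper, not part of either function, that fixes a sign convention so two sets of loadings can be compared.

```r
# Assumption: base R's princomp() stands in for bdPrincomp in this sketch.
fit <- princomp(state.x77, cor = TRUE)
L <- unclass(fit$loadings)

# Flip the sign of the first loading vector.
L.flipped <- L
L.flipped[, 1] <- -L.flipped[, 1]

# The derived variable has the same standard deviation either way.
z <- scale(state.x77, center = fit$center, scale = fit$scale)
sd.original <- sd(drop(z %*% L[, 1]))
sd.flipped <- sd(drop(z %*% L.flipped[, 1]))

# Hypothetical helper: make the largest-magnitude entry of each column
# positive so that loadings from two fits can be compared directly.
align.signs <- function(m) {
  flip <- apply(m, 2, function(v) sign(v[which.max(abs(v))]))
  sweep(m, 2, flip, "*")
}

# After alignment the two loading matrices coincide.
max(abs(align.signs(L) - align.signs(L.flipped)))
```

Applying a convention like this to both sets of loadings before comparing them avoids spurious differences that are only due to sign.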

BACKGROUND:

Principal component analysis defines a rotation of the variables of x. The first derived direction (a linear combination of the variables) is chosen to maximize the standard deviation of the derived variable, the second to maximize the standard deviation among directions uncorrelated with the first, and so on.
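One way to make this concrete is through the eigendecomposition of the covariance matrix: the eigenvectors give the directions and the square roots of the eigenvalues give the standard deviations. The sketch below illustrates this with base R's princomp (an assumption standing in for bdPrincomp). The divisor n matches base R's princomp; whether bdPrincomp uses the same divisor is not stated here.

```r
x <- state.x77
n <- nrow(x)

# Covariance matrix with divisor n, as used by base R's princomp().
S <- cov(x) * (n - 1) / n
e <- eigen(S, symmetric = TRUE)

fit <- princomp(x)

# Component standard deviations are the square roots of the eigenvalues.
cbind(eigen = sqrt(e$values), princomp = fit$sdev)

# Any other unit-length direction gives a derived variable with no larger
# spread; here an equal-weight direction is compared with the first component.
w <- rep(1 / sqrt(ncol(x)), ncol(x))
sd.w <- sqrt(drop(t(w) %*% S %*% w))
sd.w <= fit$sdev[1]

# Scores on different components are uncorrelated.
cor12 <- cor(fit$scores[, 1], fit$scores[, 2])
```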

Principal component analysis is often used as a data reduction technique, sometimes in conjunction with regression. If the variables are not all in the same units, you should scale the columns of the input before performing the principal component analysis because a variable with large variance relative to the others will dominate the first principal component.
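The effect of scaling can be checked by fitting the same data with and without cor=TRUE. The sketch below uses base R's princomp as a stand-in for bdPrincomp (an assumption); the variable names vars, prop.cov, and prop.cor are introduced only for illustration.

```r
# Assumption: base R's princomp() stands in for bdPrincomp in this sketch.
fit.cov <- princomp(state.x77, cor = FALSE)
fit.cor <- princomp(state.x77, cor = TRUE)

# With the covariance matrix, the highest-variance variable tends to
# dominate the first component.
vars <- apply(state.x77, 2, var)
top.var <- names(which.max(vars))
top.loading <- names(which.max(abs(unclass(fit.cov$loadings)[, 1])))
c(largest.variance = top.var, largest.loading = top.loading)

# Proportion of total variance attributed to the first component in each fit.
prop.cov <- fit.cov$sdev[1]^2 / sum(fit.cov$sdev^2)
prop.cor <- fit.cor$sdev[1]^2 / sum(fit.cor$sdev^2)
c(covariance = prop.cov, correlation = prop.cor)
```

The variables in state.x77 are measured in very different units, so the unscaled fit concentrates most of the variance in a single component, while the correlation-based fit spreads it across several.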

REFERENCES:

Many multivariate statistics books (and some regression texts) include a discussion of principal components. Below are a few examples:

Dillon, W. R. and Goldstein, M. (1984). Multivariate Analysis, Methods and Applications. Wiley, New York.

Johnson, R. A. and Wichern, D. W. (1982). Applied Multivariate Statistical Analysis. Prentice-Hall, Englewood Cliffs, New Jersey.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, London.

EXAMPLES:

x <- bdPrincomp(as.bdFrame(state.x77))