Compute Correlations or Covariances

DESCRIPTION:

Compute correlations or covariances for the columns of a data set.

This function requires the bigdata library section to be loaded.

USAGE:

bd.cor(data, x.columns=NULL, y.columns=NULL, cov=F)

REQUIRED ARGUMENTS:

data
a bdFrame or data.frame.

OPTIONAL ARGUMENTS:

x.columns
a character vector of column names from data that determines the rows in the output. The correlation (or covariance) will be computed between the columns of data specified in x.columns and in y.columns. If missing, all numeric columns of data will be used.
y.columns
a character vector of column names in data that determines the columns in the output. If missing, all numeric columns of data will be used.
cov
a logical value; if FALSE (the default) correlations are computed, if TRUE covariances are computed.

VALUE:

an object of class "bdFrame" or "data.frame" (the same class as the input data) containing the correlations or covariances for the specified variables. The first column in the output contains the names of the target columns.

DETAILS:

The covariance of two variables, X and Y, is the average value of the product of the deviation of X from its mean and the deviation of Y from its mean. The variables are positively associated if, when X is larger than its mean, Y tends to be larger than its mean as well (or, when X is smaller than its mean, Y tends to be smaller than its mean as well). In this case, the covariance is a positive number. The variables are negatively associated if, when X is larger than its mean, Y tends to be smaller than its mean (or vice versa). Here, the covariance is a negative number. The scale of the covariance depends on the scale of the data values in X and Y; it is possible to have very large or very small covariance values.

The correlation of two variables is a dimensionless measure of association based on the covariance; it is the covariance divided by the product of the standard deviations of the two variables. Correlation always lies in the range [-1, 1] and does not depend on the scale of the data values. The variables X and Y are positively associated if their correlation is close to 1 and negatively associated if it is close to -1. Because of these properties, correlation is often a more useful measure of association than covariance.
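
The two definitions above can be checked by hand on a small example. The following is a minimal sketch using only the base functions mean, sum, var, and cor rather than bd.cor; the vectors x and y are made-up illustration data, and the n - 1 divisor is the default used by var, assumed here rather than taken from the bd.cor documentation.

```r
# Hypothetical toy data; any two numeric vectors of equal length will do.
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# Covariance: sum of the products of the deviations from the means,
# divided by n - 1 (the default divisor used by var).
cov.xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation: covariance divided by the product of the standard deviations.
cor.xy <- cov.xy / (sqrt(var(x)) * sqrt(var(y)))

# Rescaling x changes the covariance but leaves the correlation unchanged.
cov.scaled <- var(100 * x, y)
cor.scaled <- cor(100 * x, y)
```

Here cov.xy agrees with var(x, y) and cor.xy agrees with cor(x, y). Multiplying x by 100 multiplies the covariance by 100, while the correlation stays the same.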

Correlation measures the strength of the linear relationship between two variables. If you create a scatter plot for two variables that have correlation near 1, the points will appear as a line with positive slope. Likewise, if you create a scatter plot for two variables that have correlation near -1, you will see points along a line with negative slope.

A correlation near zero implies that two variables do not have a linear relationship. However, this does not necessarily mean the variables are completely unrelated. It is possible, for example, that the variables are related quadratically or cubically, associations which are not detected by the correlation measure.
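
The contrast between a linear and a nonlinear relationship can be sketched with the base function cor. The vectors below are hypothetical illustration data, not taken from fuel.frame: y.line depends on x exactly linearly, and y.quad depends on x exactly quadratically.

```r
# Hypothetical data: x is symmetric about zero.
x <- -5:5

# An exact linear relationship with positive slope.
y.line <- 3 * x + 2

# An exact quadratic relationship; y.quad is completely determined by x.
y.quad <- x^2

# Correlation of x with each derived variable.
cor.line <- cor(x, y.line)
cor.quad <- cor(x, y.quad)
```

In this sketch cor.line is 1 up to rounding error, while cor.quad is essentially zero even though y.quad is a deterministic function of x.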

EXAMPLES:

# Compute correlations of numeric variables in fuel.frame with the
#    variable Fuel:
bd.cor(fuel.frame, "Fuel")

# Compute correlations only between Fuel and Disp.
bd.cor(fuel.frame, "Fuel", "Disp.")

# Compute covariance of numeric variables in fuel.frame with the
#   variable Fuel:
bd.cor(fuel.frame, "Fuel", cov=T)