Row and Column Summaries

DESCRIPTION:

Compute means, sums, variances, or standard deviations by row or column, or across dimensions of an array. These are generic functions; methods currently exist for data.frame, resamp, series, bdFrame, bdTimeSeries, and bdSignalSeries objects.

USAGE:

colMeans(x,  na.rm=F, dims=1, weights, freq, n) 
colSums(x,   na.rm=F, dims=1, weights, freq, n) 
colVars(x,   na.rm=F, dims=1, unbiased=T, SumSquares=F, weights, freq, n) 
colStdevs(x, na.rm=F, dims=1, unbiased=T, SumSquares=F, weights, freq, n) 
rowMeans(x,  na.rm=F, dims=1, weights, freq, n) 
rowSums(x,   na.rm=F, dims=1, weights, freq, n) 
rowVars(x,   na.rm=F, dims=1, unbiased=T, SumSquares=F, weights, freq, n) 
rowStdevs(x, na.rm=F, dims=1, unbiased=T, SumSquares=F, weights, freq, n)
sd(x, na.rm=F)

REQUIRED ARGUMENTS:

x
a matrix, array, vector, data frame, timeSeries object, or an object for which a method has been written.

OPTIONAL ARGUMENTS:

na.rm
if FALSE, missing values (NA) in the input result in missing values in the corresponding elements of the output. If TRUE, missing values are omitted from the calculations.
dims
integer -- the number of dimensions to treat as "rows". If x is an array with more than two dimensions, dims determines which dimensions are summarized. For example, if x has 5 dimensions and dims=3, then rowMeans returns a 3-dimensional array consisting of the means across the remaining 2 dimensions, and colMeans returns a 2-dimensional array consisting of the means across the first 3 dimensions.

For big data objects (for example, the big data versions of colMeans, colSums, colVars, and colStdevs), only dims=1 is allowed.

unbiased
if TRUE, then variances are sample variances, e.g.
sum((x-mean(x))^2)/(n-1)
for a vector, where n is the length of the vector. This is unbiased if the values in x are obtained by simple random sampling. If FALSE, the definition
sum((x-mean(x))^2)/n
is used instead.
SumSquares
if TRUE, then unnormalized sums of squares are returned, with no division by either n or (n-1). If this is TRUE then unbiased is ignored.
weights
vector of weights, with the same number of observations as x (if x is a matrix, the number of rows for colMeans and the number of columns for rowMeans). If present, the unbiased argument is ignored and the definition of the variance used is
sum(weights * (x-mean(x, weights=weights))^2)
if SumSquares=T and
sum(weights * (x-mean(x, weights=weights))^2)/sum(weights)
otherwise.
freq
vector of positive integers, with the same number of observations as x. If present, the kth row of x is repeated freq[k] times. The effect is similar to the weights argument, except that this does not cause the unbiased argument to be ignored, and division is by (sum(freq)-1) rather than (n-1).
n
number of rows; if supplied this overrides the actual number of rows of an object. This is useful for obtaining summaries on regular subsets of the data.
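
The definitions given above for weights and freq can be checked by hand for a single vector. The following is a minimal sketch that uses only base functions (sum, rep, var) rather than the summary functions themselves. The data and the names x, w, f, wmean, wvar, fvar, and fvar.expanded are illustrative, and the weighted mean is written out explicitly instead of assuming that mean accepts a weights argument.

```r
x <- c(2, 4, 4, 7)
w <- c(1, 2, 1, 0.5)     # illustrative weights
f <- c(1, 3, 2, 1)       # illustrative frequencies

# Weighted variance as defined for the weights argument (SumSquares=F)
wmean <- sum(w * x) / sum(w)
wvar  <- sum(w * (x - wmean)^2) / sum(w)

# Frequencies: repeat each observation freq[k] times, then take the
# ordinary sample variance of the expanded vector
fvar.expanded <- var(rep(x, f))

# Same value from the frequency formula, dividing by sum(freq) - 1
fmean <- sum(f * x) / sum(f)
fvar  <- sum(f * (x - fmean)^2) / (sum(f) - 1)
```

Here fvar and fvar.expanded agree, which is the sense in which freq behaves like repeated observations, while wvar divides by sum(weights) regardless of the unbiased setting.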

VALUE:

Means, sums, variances, standard deviations, or sums of squares by row or column. This is normally a vector, but is a matrix or array if x is an array and the value of dims implies that the result has at least two dimensions.

If n is supplied then a vector without names is returned (dims is ignored). Otherwise the result has names or dimnames if these are found in x.

DETAILS:

colVars(x) is equivalent to diag(var(x)) if x is a matrix, but is faster (and uses column names).

Supplying n improves speed, largely because names are discarded. However, the primary use of n is to compute summaries for a vector without turning it into an array first.
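
To illustrate what supplying n implies, the grouping can be reproduced by reshaping the vector explicitly. This is a sketch using only matrix and apply, assuming (as in the examples in this help file) that n is the number of rows of the implied matrix; the names m, grp.means, and stride.means are illustrative.

```r
x <- 1:10
m <- matrix(x, nrow=5)             # the 5-row, 2-column layout implied by n=5

# Column summaries: groups of 5 consecutive observations,
# corresponding to colMeans(x, n=5)
grp.means <- apply(m, 2, mean)

# Row summaries: every fifth observation,
# corresponding to rowMeans(x, n=5)
stride.means <- apply(m, 1, mean)
```

Supplying n avoids building the intermediate matrix and its dimnames.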

Variances are computed by the numerically accurate corrected two-pass method described in Chan, Golub, and LeVeque (1983). Summations are done by adding results for groups of size 256, then adding the group sums; this is motivated by the numerically accurate pairwise summation method described in the same article.
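
The two ideas in the preceding paragraph can be sketched for a single vector as follows. This is an illustration of the techniques, not the internal implementation of these functions; twoPassVar and blockSum are hypothetical names, and the group size argument block is an assumption that defaults to the value of 256 mentioned above.

```r
# Corrected two-pass variance for a single vector (illustrative)
twoPassVar <- function(x) {
  n <- length(x)
  m <- sum(x) / n                  # pass 1: the mean
  d <- x - m                       # pass 2: deviations from the mean
  # sum(d) is zero in exact arithmetic; subtracting sum(d)^2/n
  # corrects for rounding error in the computed mean
  (sum(d^2) - sum(d)^2 / n) / (n - 1)
}

# Grouped summation: sum groups of `block` values, then add the group sums
blockSum <- function(x, block=256) {
  n <- length(x)
  if(n == 0) return(0)
  starts <- seq(1, n, by=block)
  partial <- numeric(length(starts))
  for(i in seq(along=starts)) {
    end <- min(starts[i] + block - 1, n)
    partial[i] <- sum(x[starts[i]:end])
  }
  sum(partial)
}
```

For example, twoPassVar(c(1e8+1, 1e8+2, 1e8+3, 1e8+4)) should agree with var on the same data, a case where the one-pass formula based on sum(x^2) loses accuracy.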

REFERENCES:

Chan, T., Golub, G., and LeVeque, R. (1983). Algorithms for computing the sample variance: analysis and recommendations. The American Statistician, 37, 242-247.

EXAMPLES:

x <- matrix(1:12, 4) 
rowMeans(x) 
colMeans(x) 
 
## Summaries for regular subsets of a vector 
x <- 1:10 
colMeans(x, n=5)           # groups of 5 consecutive observations 
rowMeans(x, n=5)           # groups of every fifth observation 
 
 
## Higher-dimensional array 
x <- array(runif(24), dim=c(2,3,4)) 
rowMeans(x)                  # vector of length 2. 
rowMeans(x, dims=2)          # 2x3 matrix. 
apply(x, 1:2, mean)          # same as previous 
colMeans(x)                  # 3x4 matrix. 
colMeans(x, dims=2)          # vector of length 4. 
colMeans(aperm(x, c(2,1,3))) # 2x4 matrix 
colVars(x[1,,])              # vector of length 4 
diag(var(x[1,,]))            # same as previous 
 
 
### Investigate the distribution of the sample mean and t-statistic 
### when the underlying population is not normal 
x <- rexp(1000 * 20)  # 1000 samples of size 20 
means <- colMeans(x, n=20) 
stdevs <- colStdevs(x, n=20) 
qqnorm(means) 
plot(means, stdevs) # These would be independent for a normal population 
qqnorm( (means - 1) / stdevs ) 
 
# The first three lines in that study could be replaced with 
x <- matrix(rexp(1000 * 20), 20)  # 1000 samples of size 20 
means <- colMeans(x) 
stdevs <- colStdevs(x) 
 
 
### Bootstrap the sample mean 
y <- runif(10) 
indices <- sample(1:10, 10*1000, replace=T) # 1000 samples 
 
# One way -- make use of the argument "n" 
colMeans(y[indices], n=10) 
 
# Alternative (slower) 
boot.y <- y[indices] 
dim(boot.y) <- c(10, 1000) 
colMeans(boot.y) 
 
# Same as previous, but much slower 
apply(boot.y, 2, mean)