Empirical Quantiles

DESCRIPTION:

Returns a vector or bdVector of the desired quantiles of the data.

USAGE:

quantile(x, probs = 0:4/4, na.rm = F, ...)
quantile.default(x, probs = 0:4/4, na.rm = F, 
         alpha = 1, rule = 1, weights = NULL, freq = NULL) 

REQUIRED ARGUMENTS:

x
vector or bdVector of data. Missing values are not allowed unless na.rm=TRUE.

OPTIONAL ARGUMENTS:

probs
vector or bdVector of desired probability levels. Values must be between 0 and 1 inclusive. The default produces a "five number summary": the minimum, lower quartile, median, upper quartile, and maximum of x.
na.rm
logical flag; indicates whether missing values are removed before computation.
...
methods may have other arguments. The following arguments are available in the default method.
alpha
value between 0 and 1 which determines the definition of the quantiles; smaller numbers give wider quantiles.
rule
integer describing the rule to be used for values of probs that are near 0 or 1. If rule is 1, NAs are supplied for any such points. If rule is 2, the extreme values of x are used. If rule is 3, linear extrapolation is used. This option is irrelevant if alpha=1.
weights
vector of weights the same length as x, or NULL if no weights. Quantiles are calculated for the weighted distribution with probabilities proportional to weights on the values of x.
freq
vector of positive integers, the same length as x, giving frequencies. If supplied, the results are equivalent to supplying rep(x, freq) instead of x. The effect is similar to that of the weights argument, except that values are actually repeated, so the quantiles returned may be exactly equal to a repeated value of x rather than interpolated between adjacent values.

VALUE:

vector or bdVector of empirical quantiles of x, one for each probability level in probs.

DETAILS:

The algorithm linearly interpolates between order statistics of x, assuming that the ith order statistic is the (i-alpha)/(n+1-2*alpha) quantile if no weights are present, where n=length(x). The algorithm uses partial sorting, so it can quickly find a few quantiles even in large datasets. In the unweighted case the result corresponds to the y component of

approx((1:n - alpha) / (n + 1 - 2 * alpha), 
       sort(x), probs, rule=rule) 
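
The interpolation described above can be sketched with base functions only. The helper name qsketch is hypothetical and is not part of the library; the sketch assumes no weights, no missing values, and at least two observations, and it exercises only rule=1 and rule=2.

```r
# Hypothetical helper, not the actual implementation of quantile.default.
# Assumes no weights, no missing values, and length(x) >= 2.
qsketch <- function(x, probs, alpha = 1, rule = 1) {
  n <- length(x)
  # probability level assigned to each order statistic
  pos <- (1:n - alpha) / (n + 1 - 2 * alpha)
  # linear interpolation between the (pos, sorted x) pairs
  approx(pos, sort(x), xout = probs, rule = rule)$y
}

x <- c(7, 1, 5, 3, 9)
qsketch(x, 0:4/4)                               # 1 3 5 7 9
qsketch(x, c(.1, .5, .9), alpha = 0)            # NA 5 NA
qsketch(x, c(.1, .5, .9), alpha = 0, rule = 2)  # 1 5 9
```

With alpha=0 and n=5 the order statistics sit at 1/6, ..., 5/6, so the levels .1 and .9 fall outside that range; rule=1 returns NA there and rule=2 returns the extremes of x.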

If x contains randomly-generated values from a distribution, then alpha=1 gives quantiles which are biased (they tend to be too narrow), alpha=1/3 gives approximately median-unbiased estimates of the quantiles of the distribution, and alpha=0 matches the correct probabilities for a new observation "X" from that distribution, i.e.
prob(X < quantile(x, p, alpha=0)) = p
(the relationship is exact if p=k/(n+1) for some integer k and the distribution is continuous, and approximate otherwise).
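
One way to see why smaller values of alpha give wider quantiles is to compare the probability levels assigned to the smallest and largest observations. The following is only an illustration of the formula (i-alpha)/(n+1-2*alpha) for n=9; the function name pos is hypothetical.

```r
# Probability levels assigned to the order statistics of a sample of
# size n = 9, for three values of alpha (illustration only).
n <- 9
pos <- function(alpha) (1:n - alpha) / (n + 1 - 2 * alpha)

range(pos(1))    # 0 and 1
range(pos(1/3))  # 1/14 and 13/14
range(pos(0))    # 0.1 and 0.9
```

With alpha=1 the sample extremes are treated as the 0 and 1 quantiles, so no estimate can fall outside the observed range. With alpha=0 the extremes are only the 0.1 and 0.9 quantiles, so estimates for more extreme levels must lie beyond them (or be NA under rule=1).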

If weights are present, then alpha=.5 corresponds to interpolating between the midpoints of segments of the step function with step widths proportional to weights. For other values of alpha the horizontal positions of those midpoints are transformed linearly; for alpha=1 the horizontal positions of the two extreme midpoints are at 0 and 1.
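
The alpha=.5 case can be sketched as follows. The helper name wq.half is hypothetical and covers only alpha=.5; it assumes positive weights, no ties in x, and no missing values, and it is not the actual implementation.

```r
# Hypothetical helper for the alpha = .5 case only.
# Assumes positive weights, no ties in x, and no missing values.
wq.half <- function(x, probs, weights, rule = 1) {
  o <- order(x)
  w <- weights[o] / sum(weights)   # step widths, summing to 1
  mid <- cumsum(w) - w / 2         # midpoint of each step
  # interpolate between the (midpoint, sorted x) pairs
  approx(mid, x[o], xout = probs, rule = rule)$y
}

x <- c(30, 10, 20)
w <- c(1, 1, 2)                    # the weight 2 belongs to the value 20
wq.half(x, c(.125, .3125, .5, .875), w)   # 10 15 20 30
```

Here the normalized step widths are .25, .5, .25, so the midpoints fall at .125, .5, and .875. With equal weights the midpoints reduce to (i-.5)/n, the unweighted alpha=.5 positions.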

If weights are present and there are ties in x, then the corresponding weights are averaged, so that results are independent of the order of observations.

If both weights and frequencies are supplied, then x and weights are replicated using the frequencies. This may use a lot of memory.

REFERENCES:

Hyndman, R. J. and Fan, Y. (1996), "Sample Quantiles in Statistical Packages," The American Statistician, 50, 361-364.

EXAMPLES:

quantile(car.miles)        # five number summary 

quantile(testscores[,1], c(.33,.67))  # 33% and 67% quantiles of  
                                      # data from testscores 

diff(quantile(testscores[,1], c(.25, .75))) # interquartile range 
 
# create function iqr 
iqr <- function (x) diff(quantile(x, c(.25, .75))) 
iqr(car.miles) # returns 23 
 
set.seed(2); x <- runif(9) 
probs <- seq(0, 1, length=101) 
plot(probs, quantile(x, probs, alpha=1), type="l", ylim=c(-.14,1)) 
lines(probs, quantile(x, probs, alpha=.5), col=2) 
lines(probs, quantile(x, probs, alpha=0), col=3) 
lines(probs, quantile(x, probs, alpha=0, rule=3), col=3, lty=3) 
 
# weighted distributions 
plot(probs, quantile(sort(x), probs, weights=1:9, alpha=.5), 
     type="l", ylim=0:1) 
w <- 1:9 / sum(1:9) 
points(cumsum(w)-w/2, sort(x)) 
lines(cumsum(w), sort(x), type="S", col=2) 
lines(probs, quantile(sort(x), probs, weights=1:9, alpha=1), col=3) 
 
# Frequencies 
quantile(rep(x, 1:9))   # For reference 
quantile(x, freq = 1:9) # This should match the previous