Estimate Probability Density Function

DESCRIPTION:

Returns x and y coordinates of a non-parametric estimate of the probability density of the data. Options include the choice of the window to use and the number of points at which to estimate the density. Weights may also be supplied. This is a generic function; there are currently no methods with individual help files.

USAGE:

density(x, n = 50, window = "g", na.rm = F, width = "hb", 
        from=<<see below>>, to=<<see below>>, cut=<<see below>>, 
        weights = NULL, freq = NULL) 

REQUIRED ARGUMENTS:

x
vector or bdVector of observations from the distribution whose density is to be estimated. Missing values are allowed if na.rm is TRUE.

OPTIONAL ARGUMENTS:

n
the number of equally spaced points at which to estimate the density.
window
character string giving the type of window used in the computations. One of: "cosine", "gaussian", "3gaussian", "rectangular", "triangular" (one character is sufficient).
na.rm
logical flag: if TRUE then missing values are removed before estimation, if FALSE they are not allowed.
width
width of the window. This may be a numeric value, a function to apply to x which returns a bandwidth, or a character string specifying which built-in bandwidth method to use. Available bandwidth methods are histogram bin (hb), normal reference density (nrd), biased cross-validation (bcv), unbiased cross-validation (ucv), and Sheather & Jones pilot estimation of derivatives (sj). The argument weights is not used in these calculations.

The standard deviation of a Gaussian window is width/4. For the other windows, width is the width of the interval on which the window is non-zero.

from, to
the n estimated values of density are equally spaced between from and to. The default is the range of the data extended by width*cut.
cut
The fraction of the window width that the x values are to be extended by. The default is .75 for the Gaussian windows and .5 for the other windows. Ignored if from and to are used.
weights
vector of same length as x for computing a weighted density estimate. See DETAILS, below.
freq
vector of non-negative integers the same length as x, giving frequencies; density(x, freq=f) is equivalent to density(rep(x, f)).
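
The default output grid described under from, to, and cut can be sketched as follows. This is a language-neutral illustration in Python, not the S-PLUS implementation; the function name output_grid is hypothetical, and it assumes the range is extended by width*cut on each side.

```python
def output_grid(data, width, cut, n=50):
    """Return n equally spaced points over the data range extended by width*cut.

    Assumption: the extension is applied on both sides of the range.
    """
    lo = min(data) - width * cut
    hi = max(data) + width * cut
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


# Gaussian default cut = 0.75, so the range [1, 4] is extended by 2 * 0.75 = 1.5.
grid = output_grid([1.0, 2.0, 4.0], width=2.0, cut=0.75, n=5)
print(grid)  # [-0.5, 1.0, 2.5, 4.0, 5.5]
```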

VALUE:

list with two components, x and y, suitable for giving as an argument to one of the plotting functions.
x
vector or bdVector of n points at which the density is estimated.
y
density estimate at each x point.

DETAILS:

Missing values are excluded if na.rm is TRUE, and cause an error otherwise.

When the data is a bdVector, the data is aggregated before smoothing. The range of x is divided into 1000 bins, and the mean of x is computed in each bin. A weighted density estimate is then computed on the bin means, weighted by the bin counts. This gives values that differ somewhat from those obtained when density is applied to the unaggregated data. The values are generally close enough to be indistinguishable when used in a plot, but the difference could be important when density is used for prediction or optimization.
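
The aggregation step can be illustrated with a short sketch. It is written in Python for illustration only and is not the bdVector code; the function name aggregate is hypothetical, and the use of equal-width bins, the treatment of bin edges, and the dropping of empty bins are assumptions.

```python
def aggregate(data, nbins=1000):
    """Bin the data over its range; return (bin means, bin counts) for non-empty bins.

    Assumptions: equal-width bins, with the maximum value placed in the last bin.
    """
    lo, hi = min(data), max(data)
    span = hi - lo
    sums = [0.0] * nbins
    counts = [0] * nbins
    for value in data:
        if span > 0:
            index = min(int((value - lo) / span * nbins), nbins - 1)
        else:
            index = 0  # all observations identical
        sums[index] += value
        counts[index] += 1
    means = [s / c for s, c in zip(sums, counts) if c > 0]
    weights = [c for c in counts if c > 0]
    return means, weights


# The bin means and counts then play the roles of x and weights
# in a weighted density estimate.
means, weights = aggregate([0.0, 0.1, 0.9, 1.0], nbins=2)
```

Because each bin is replaced by its mean, the aggregated estimate keeps the overall mean of the data but loses the spread within each bin, which is one way to see why the result differs slightly from the unaggregated estimate.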

These are kernel estimates. For each x value in the output, the window is centered on that x and the heights of the window at each datapoint are summed. This sum, after a normalization, is the corresponding y value in the output: the value at x[i] is
y[i]=1/N*sum(K(x[i]-X))
where K is the kernel function specified by window and width, X is the input data, and N is the length of X. In the presence of weights the value is
y[i]=1/sum(weights)*sum(weights*K(x[i]-X)).
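
As a minimal sketch of the two formulas above, the following Python code evaluates the weighted sum directly. It is an illustration, not the S-PLUS implementation: the function names are hypothetical, K is taken to be an untruncated Gaussian with standard deviation width/4, and the truncation discussed below is ignored.

```python
import math


def gaussian_kernel(u, width):
    """Gaussian kernel with standard deviation width/4 (no truncation)."""
    sd = width / 4.0
    return math.exp(-0.5 * (u / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def kernel_density(grid, data, width, weights=None):
    """y[i] = 1/sum(weights) * sum(weights * K(x[i] - X)) for each x[i] in grid."""
    if weights is None:
        weights = [1.0] * len(data)  # unweighted case: reduces to 1/N * sum(K)
    total = sum(weights)
    return [
        sum(w * gaussian_kernel(g - d, width) for w, d in zip(weights, data)) / total
        for g in grid
    ]


data = [0.0, 1.0, 3.0]
grid = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
y = kernel_density(grid, data, width=4.0)
```

With weights omitted, every observation gets weight 1 and the normalizing constant is N, so the second formula reduces to the first.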

The "gaussian" window is truncated at 4 standard deviations (and then scaled appropriately to adjust for the truncated area). This differs from the S+6.0 version of density, which truncates at 3 standard deviations. The "3gaussian" window option allows for 3 s.d. truncation.
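
The truncation and rescaling can be sketched as below. This is an illustrative Python version under stated assumptions, not the library's code: the function name is hypothetical, and treating a point exactly at the cutoff as inside the window is an arbitrary choice.

```python
import math


def truncated_gaussian_kernel(u, sd, cutoff=4.0):
    """Gaussian density set to zero beyond cutoff*sd and rescaled to integrate to 1.

    cutoff=4.0 corresponds to the "gaussian" window, cutoff=3.0 to "3gaussian".
    Assumption: a point exactly at the cutoff counts as inside the window.
    """
    if abs(u) > cutoff * sd:
        return 0.0
    # Probability mass of a standard normal within +/- cutoff standard deviations.
    mass = math.erf(cutoff / math.sqrt(2.0))
    return math.exp(-0.5 * (u / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi) * mass)
```

Dividing by the retained mass is what makes the truncated kernel integrate to 1. The correction is very small at 4 standard deviations and somewhat larger at 3.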

weights and freq are equivalent except that freq affects width calculations and weights does not.
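
When width is a fixed number, no width calculation takes place, so a weighted estimate with weights equal to the frequencies should match the estimate on the replicated data. The Python sketch below checks this for an untruncated Gaussian kernel; the function name kde and the sample values are hypothetical and chosen only for illustration.

```python
import math


def kde(grid, data, width, weights=None):
    """Weighted Gaussian kernel estimate with standard deviation width/4."""
    sd = width / 4.0
    if weights is None:
        weights = [1.0] * len(data)
    total = float(sum(weights))
    norm = sd * math.sqrt(2.0 * math.pi)
    return [
        sum(w * math.exp(-0.5 * ((g - d) / sd) ** 2) / norm
            for w, d in zip(weights, data)) / total
        for g in grid
    ]


data = [1.0, 2.0, 5.0]
freq = [2, 1, 3]
# Equivalent of rep(x, f): repeat each observation by its frequency.
replicated = [d for d, f in zip(data, freq) for _ in range(f)]

grid = [0.0, 2.0, 4.0]
weighted = kde(grid, data, width=3.0, weights=freq)
expanded = kde(grid, replicated, width=3.0)
```

The two results agree up to floating-point rounding. With a data-driven width such as "hb" or "nrd", the bandwidth itself would be computed from the replicated data under freq but from the original x under weights, which is where the two arguments diverge.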

If from==to and n is not supplied, then n defaults to 1.

BACKGROUND:

Density estimation is essentially a smoothing operation. Inevitably there is a trade-off between bias in the estimate and the estimate's variability: wide windows produce smooth estimates that may hide local features of the density, while narrow windows follow the data more closely but produce more variable estimates.

REFERENCES:

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.

Wegman, E. J. (1972). Nonparametric probability density estimation. Technometrics, 14, 533-546.

Venables, W.N. and Ripley, B.D. (1997) Modern Applied Statistics with S-PLUS, Second Edition. Springer-Verlag.

BUGS:

Note that the gaussian kernel is set to zero outside of 4 standard deviations (was 3 in earlier versions of density), and the kernel scaled accordingly to integrate to 1.

For kernels which are discontinuous ("r", "g", and "3gaussian"), analytical results are indeterminate at output points where the difference between output and data values falls at a discontinuity point of the kernel, and numerical results may vary between machines.

Results from this version of density differ from S+6.0 and earlier, because of the change to 4 standard deviations, and because this version is in double precision. Both changes address the discontinuity problem. With the default values for from, to, and cut and window = "r" or "g", the extreme output x-values (from and to) previously fell on discontinuity points. The change for the gaussian window moves that discontinuity point outside the output range. The change to double precision makes results for the "r" window more consistent across machines in our test cases, but results are still indeterminate.

EXAMPLES:

plot(density(rnorm(20)), type="b") 

den.co2 <- density(co2, width=4) 
hist(co2) 
den.co2$y <- den.co2$y*length(co2)*2 
   # multiply density by length of series and width of histogram bar 
lines(den.co2)