density(x, n = 50, window = "g", na.rm = F, width = "hb", from=<<see below>>, to=<<see below>>, cut=<<see below>>, weights = NULL, freq = NULL)
x: bdVector of observations from the distribution whose density is to be estimated. Missing values are allowed if na.rm is TRUE.
window: one of "cosine", "gaussian", "3gaussian", "rectangular", "triangular" (one character is sufficient).
na.rm: if TRUE, missing values are removed before estimation; if FALSE, they are not allowed.
width: a number, a function of x which returns a bandwidth, or a character string specifying which built-in bandwidth method to use. Available bandwidth methods are histogram bin ("hb"), normal reference density ("nrd"), biased cross-validation ("bcv"), unbiased cross-validation ("ucv"), and Sheather & Jones pilot estimation of derivatives ("sj"). The argument weights is not used in these calculations. The standard deviation of a Gaussian window is width/4. For the other windows, width is the width of the interval on which the window is non-zero.
from, to: the density is estimated at points between from and to. The default is the range of the data extended by width*cut.
cut: the fraction of width that the x values are to be extended by. The default is .75 for the Gaussian windows and .5 for the other windows. Ignored if from and to are used.
weights: vector of the same length as x, for computing a weighted density estimate. See DETAILS, below.
freq: vector of the same length as x, giving frequencies; density(x, freq=f) is equivalent to density(rep(x, f)).
The result is a list with components x and y, suitable for giving as an argument to one of the plotting functions.
x: bdVector of n points at which the density is estimated.
y: the density estimate at each x point.
Missing values are excluded if na.rm is TRUE, and cause an error otherwise.
When the data is a bdVector, the data is aggregated before smoothing. The range of the x variable is divided into 1000 bins, and the mean of x is computed in each bin. A weighted density estimate is then computed on the bin means, weighted by the bin counts. This gives values that differ somewhat from those obtained when density is applied to the unaggregated data. The values are generally close enough to be indistinguishable when used in a plot, but the difference could be important when density is used for prediction or optimization.
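The aggregation step described above can be sketched as follows. This is a minimal illustration in Python rather than S, and it is not the S-PLUS implementation: equal-width bins, the handling of the maximum value, and the helper name `aggregate` are all assumptions made for the sketch.

```python
def aggregate(data, nbins=1000):
    """Return (bin means, bin counts) for the non-empty bins over the range of data.

    Assumption: bins have equal width and the maximum value goes in the last bin.
    """
    lo, hi = min(data), max(data)
    span = (hi - lo) or 1.0          # avoid dividing by zero when all values are equal
    sums = [0.0] * nbins
    counts = [0] * nbins
    for v in data:
        b = min(int((v - lo) / span * nbins), nbins - 1)
        sums[b] += v
        counts[b] += 1
    means = [s / c for s, c in zip(sums, counts) if c > 0]
    kept = [c for c in counts if c > 0]
    return means, kept

data = [0.0, 0.1001, 0.1004, 0.5, 1.0]
means, counts = aggregate(data)      # two of the values share a bin
```

The bin means and counts would then be passed to a weighted kernel estimate as the data and the weights. Because nearby observations are replaced by their bin mean, the result can differ slightly from an estimate on the raw data.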
These are kernel estimates. For each x value in the output, the window is centered on that x and the heights of the window at each data point are summed. This sum, after normalization, is the corresponding y value in the output: the value at x[i] is

    y[i] = 1/N * sum(K(x[i] - X))

where K is the kernel function specified by window and width, X is the input data, and N is the length of X. In the presence of weights the value is

    y[i] = 1/sum(weights) * sum(weights * K(x[i] - X))
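The two formulas above can be written out directly. The following is a hedged sketch in Python rather than S. The function names `gaussian_kernel` and `kernel_density` are hypothetical, and the Gaussian window is left untruncated for simplicity, so the values will not exactly match those of density.

```python
import math

def gaussian_kernel(u, width):
    """Gaussian window with standard deviation width/4 (untruncated, an assumption)."""
    sd = width / 4.0
    return math.exp(-0.5 * (u / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def kernel_density(grid, data, width, weights=None):
    """Evaluate y[i] = 1/sum(weights) * sum(weights * K(x[i] - X)) at each grid point."""
    if weights is None:
        weights = [1.0] * len(data)  # equal weights reduce to 1/N * sum(K(x[i] - X))
    total = sum(weights)
    return [
        sum(w * gaussian_kernel(xi - xj, width) for xj, w in zip(data, weights)) / total
        for xi in grid
    ]

data = [1.0, 2.0, 2.5, 4.0]
grid = [i * 0.05 for i in range(-40, 141)]   # -2.0 to 7.0 in steps of 0.05
y = kernel_density(grid, data, width=2.0)
area = sum(y) * 0.05  # Riemann-sum check that the estimate integrates to about 1
```

Multiplying every weight by the same constant leaves the estimate unchanged, since the constant cancels against sum(weights).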
The "gaussian" window is truncated at 4 standard deviations (and then scaled appropriately to adjust for the truncated area). This differs from the S+6.0 version of density, which truncates at 3 standard deviations. The "3gaussian" window option allows for truncation at 3 standard deviations.
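Truncating and rescaling a Gaussian window can be illustrated as below. This is a sketch in Python rather than S, under the assumption that "scaled appropriately" means dividing by the probability mass kept inside the cutoff; the function name `truncated_gaussian` is hypothetical.

```python
import math

def truncated_gaussian(u, sd, cutoff=4.0):
    """Gaussian density set to zero beyond cutoff*sd and rescaled to integrate to 1."""
    if abs(u) > cutoff * sd:
        return 0.0
    kept_area = math.erf(cutoff / math.sqrt(2.0))   # P(|Z| <= cutoff) for a standard normal
    return math.exp(-0.5 * (u / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi)) / kept_area

# Numerical check of the total area for both cutoffs (Riemann sum on a fine grid)
step = 0.001
area4 = sum(truncated_gaussian(i * step, 1.0) for i in range(-5000, 5001)) * step
area3 = sum(truncated_gaussian(i * step, 1.0, cutoff=3.0) for i in range(-5000, 5001)) * step
```

The jump at the cutoff is much smaller at 4 standard deviations than at 3, because the Gaussian density is far closer to zero there.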
weights and freq are equivalent except that freq affects width calculations and weights does not.
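When the window width is fixed by hand, so that no width calculation is involved, using frequencies as weights gives the same value as replicating the data. The sketch below checks this at one output point. It is written in Python rather than S, uses an untruncated Gaussian window as a simplifying assumption, and the helper name `weighted_estimate` is hypothetical.

```python
import math

def weighted_estimate(x0, data, weights, sd):
    """1/sum(weights) * sum(weights * K(x0 - X)) with a Gaussian K of the given sd."""
    def k(u):
        return math.exp(-0.5 * (u / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
    return sum(w * k(x0 - xj) for xj, w in zip(data, weights)) / sum(weights)

x = [1.0, 2.0, 4.0]
f = [3, 1, 2]                                                   # frequencies
replicated = [xj for xj, fj in zip(x, f) for _ in range(fj)]    # like rep(x, f)

sd = 0.5  # fixed by hand, so the data-driven width methods play no role
by_freq = weighted_estimate(1.5, x, f, sd)
by_rep = weighted_estimate(1.5, replicated, [1] * len(replicated), sd)
```

The two values agree up to rounding. A difference between weights and freq can only arise when width is computed from the data, as stated above.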
If from==to and n is not supplied, then n defaults to 1.
Density estimation is essentially a smoothing operation. Inevitably there is a trade-off between bias in the estimate and the estimate's variability: wide windows produce smooth estimates that may hide local features of the density.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Wegman, E. J. (1972). Nonparametric probability density estimation. Technometrics, 14, 533-546.
Venables, W. N. and Ripley, B. D. (1997). Modern Applied Statistics with S-PLUS, Second Edition. Springer-Verlag.
Note that the gaussian kernel is set to zero outside of 4 standard deviations (it was 3 in earlier versions of density), and the kernel is scaled accordingly to integrate to 1.
For kernels which are discontinuous ("r", "g", and "3gaussian"), analytical results are indeterminate at output points where the difference between output and data values falls at a discontinuity point of the kernel, and numerical results may vary between machines.
Results from this version of density differ from S+6.0 and earlier, because of the change to 4 standard deviations, and because this version is in double precision. Both changes address the discontinuity problem. With the default values for from, to, and cut and window = "r" or "g", the extreme output x-values (from and to) previously fell on discontinuity points. The change for the gaussian window moves that discontinuity point outside the output range. The change to double precision makes results for the "r" window more consistent across machines in our test cases, but results are still indeterminate.
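The arithmetic behind the discontinuity remark can be checked with the documented defaults: a Gaussian window has standard deviation width/4 and default cut of .75, and a rectangular window is non-zero on an interval of length width with default cut of .5. The sketch below is in Python rather than S. The helper name `default_range` is hypothetical, and the symmetric extension by width*cut on each side is an assumption.

```python
def default_range(data, width, cut):
    """Range of the data extended by width*cut on each side (assumed symmetric)."""
    pad = width * cut
    return min(data) - pad, max(data) + pad

data = [2.0, 3.5, 5.0]
width = 2.0

# Gaussian window: standard deviation is width/4, default cut is 0.75
lo_g, hi_g = default_range(data, width, cut=0.75)
sd = width / 4.0
dist_in_sd = (min(data) - lo_g) / sd   # distance from 'from' to the smallest data point, in sds

# Rectangular window: non-zero on an interval of total length width, default cut is 0.5
lo_r, hi_r = default_range(data, width, cut=0.5)
half_width = width / 2.0
dist_r = min(data) - lo_r              # distance from 'from' to the smallest data point
```

Under these assumptions the default from lies exactly 3 standard deviations from the smallest data point for the Gaussian window, which was the old truncation point, and exactly half a window width away for the rectangular window, which is its edge. Truncating at 4 standard deviations moves the Gaussian discontinuity beyond the default output range.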
plot(density(rnorm(20)), type="b")
den.co2 <- density(co2, width=4)
hist(co2)
# multiply density by length of series and width of histogram bar
den.co2$y <- den.co2$y*length(co2)*2
lines(den.co2)