Saddlepoint calculations

DESCRIPTION:

Saddlepoint approximations to the distribution of the mean of observations from a discrete distribution, or of a linear combination of multiple group means.

USAGE:

pDiscreteMean(q, values, size = <<see below>>, weights = NULL, 
             group = NULL, conv.factor = 0, ...) 
qDiscreteMean(p, values, size = <<see below>>, weights = NULL, 
             group = NULL, conv.factor = 0, ...) 
dDiscreteMean(x, values, size = <<see below>>, weights = NULL, 
             group = NULL, conv.factor = 0, ...) 
saddlepointP(tau, L, size = <<see below>>, weights = NULL, 
             group = NULL, mean = T, conv.factor = 0) 
saddlepointD(tau, L, size = <<see below>>, weights = NULL, 
             group = NULL, mean = T, conv.factor = 0) 
saddlepointPSolve(p, L, size = <<see below>>, weights = NULL, 
             group = NULL, mean = T, conv.factor = 0,  
             initial, tol = 1E-6, tol.tau = tol, maxiter = 100) 

REQUIRED ARGUMENTS:

p
vector of probabilities
q
vector of quantiles
x
vector of quantiles
tau
vector of tilting parameters
L, values
vector of possible values; these functions calculate the distribution of a sample mean for observations chosen with replacement from these values.

OPTIONAL ARGUMENTS:

size
sample size; the default value is n (the length of L). See "DETAILS", below.
weights
vector of probabilities of length n; if supplied then sampling is with these (unequal) probabilities on the values in L.
group
vector of length n indicating stratified sampling or multiple-group problems; unique values of this vector determine the groups. In the current implementation, only one of group and size may be supplied.
mean
logical, if TRUE then calculations are for the sample mean, or sum of sample means for groups. If FALSE, then calculations are for the sample sum or sum of group sample sums.
conv.factor
convolution factor; see "DETAILS", below.
initial
vector the same length as p; initial values used in iteratively solving for tau.
tol
tolerance for solving for tau on the scale of p.
tol.tau
tolerance for solving for tau on the scale of tau.
maxiter
maximum number of iterations allowed for finding values of tau which bracket the solution for each p (after the root is bracketed additional iterations may be performed).
...
arguments to control numerical convergence. For pDiscreteMean and dDiscreteMean these are any other arguments acceptable to tiltMeanSolve. For qDiscreteMean, these are any other arguments acceptable to saddlepointPSolve.

VALUE:

density (dDiscreteMean and saddlepointD), probability (pDiscreteMean and saddlepointP), quantile (qDiscreteMean), or saddlepoint tilting parameter (saddlepointPSolve) for the mean or sum of random values from a discrete distribution.

The output is a vector of the same length as the primary input (p, q, x, or tau).

"density" is a misnomer, as the distribution is not continuous. However, if the values in the discrete distribution are themselves drawn from a continuous distribution, then this distribution is practically continuous (Hall 1986); the "density" is the density for a continuous approximation to the distribution.

DETAILS:

Suppose that Y is the mean of size observations sampled with replacement from L. Then
(tiltMean(tau, L), saddlepointP(tau, L, size))
are parametric equations in tau that trace the saddlepoint estimate of the cumulative distribution function of Y.
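
The role of tau as the curve parameter can be illustrated with a minimal, self-contained sketch in R. It assumes plain exponential tilting, in which value L[i] receives weight proportional to exp(tau * L[i]). The helper name tiltedMean is hypothetical, and tiltMean itself may center or scale tau differently.

```r
# Hypothetical helper: mean of L under exponential tilting with parameter tau.
# Assumes tilted weights proportional to w * exp(tau * L); this is an
# illustration only, not the tiltMean implementation.
tiltedMean <- function(tau, L, w = rep(1 / length(L), length(L))) {
  sapply(tau, function(t) {
    a <- t * L
    tw <- w * exp(a - max(a))   # subtract max(a) to avoid overflow
    sum(tw * L) / sum(tw)
  })
}

L <- c(0, 3, 6, 10)
tau <- c(-1, 0, 1)
q <- tiltedMean(tau, L)
# tau = 0 recovers the ordinary mean of L; the tilted mean increases with
# tau, so each tau corresponds to exactly one point q on the curve.
```

Because the tilted mean is strictly increasing in tau, sweeping tau traces each point of the distribution curve exactly once.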

If group is supplied, then calculations are for the distribution of the sum of group means, or sum of group sums. Arbitrary sample sizes within groups are not supported. In the sum of group means case, the tilting parameter used for group g is tau / (n[g]/n), which is consistent with tiltMean.

The standard saddlepoint estimate for density is due to Daniels (1954); see also Kolassa (1997). The cumulative distribution function estimate used here is formula (3.8) in Barndorff-Nielsen (1986), often referred to in the literature as the "r*" approximation. It is similar to the Lugannani and Rice saddlepoint approximation (see Kolassa 1997). The cdf approximation is modified to avoid numerical problems near the center of the distribution.
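
For orientation, the textbook forms of these approximations can be sketched in a few lines of R. This is a sketch under stated assumptions, not the code used by saddlepointD or saddlepointP: it assumes equal weights, no groups, conv.factor = 0, plain exponential tilting, and tau not equal to 0, and it omits the modification near the center. The names cgf and saddleSketch are hypothetical, and L below is stand-in data.

```r
# Cumulant generating function K of one draw from L (equal weights) and its
# first two derivatives, evaluated at a scalar tau.
cgf <- function(tau, L) {
  a <- tau * L
  m <- max(a)
  e <- exp(a - m)                 # shifted to avoid overflow
  p <- e / sum(e)                 # tilted probabilities
  mu <- sum(p * L)                # K'(tau): tilted mean
  list(K = m + log(mean(e)), K1 = mu, K2 = sum(p * (L - mu)^2))
}

# Textbook saddlepoint density and cdf approximations for the mean of `size`
# draws, parameterized by tau (tau != 0). q = K'(tau) is the point evaluated.
saddleSketch <- function(tau, L, size = length(L)) {
  k <- cgf(tau, L)
  r <- sign(tau) * sqrt(2 * size * (tau * k$K1 - k$K))
  u <- tau * sqrt(size * k$K2)
  c(q = k$K1,
    density = sqrt(size / (2 * pi * k$K2)) * exp(size * (k$K - tau * k$K1)),
    cdfLR = pnorm(r) + dnorm(r) * (1 / r - 1 / u),   # Lugannani-Rice form
    cdfRstar = pnorm(r + log(u / r) / r))            # r* form
}

L <- qexp(ppoints(30))            # deterministic, exponential-like values
out <- sapply(c(-0.5, -0.2, 0.2, 0.5), saddleSketch, L = L)
# Each column of `out` holds q, the density, and the two cdf forms for one tau.
```

In the central part of the distribution the two cdf forms typically agree closely; they differ in how they behave far out in the tails.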

These estimates are for continuous distributions, though here they are applied to discrete distributions. If the sample is reasonably large and observations (L) are not lattice-valued this should not matter, but for small samples the estimates may break down, and for lattice-valued observations (e.g. integers) the estimates do not reflect the discrete steps in the actual cdf.

The conv.factor argument convolves the distribution of the sum (or mean) of size observations (chosen from L with probabilities weights) with a single normally distributed observation with variance conv.factor*var(L,weights,unbiased=F). This serves three purposes. First, it provides some smoothing.
Second, it inflates the variance of the distribution, and may be used to get (nearly) unbiased variances. Recall that the usual estimate of sample variance (var(x,unbiased=T)) uses a denominator of (n-1) rather than n, where n is the sample size; this corresponds to a variance inflation factor of n/(n-1). Here the expected value of the variance for the mean of size independent observations without weights from a distribution with variance sigma^2 is ((n-1)/n) * sigma^2 * (size+conv.factor)/size^2. With size=n and conv.factor=n/(n-1), this simplifies to sigma^2/n.
Third, the argument makes estimates reliable in extreme cases, when size is very small and L or weights is skewed (see "EXAMPLES"). Saddlepoint density and distribution estimates break down in the tails for all discrete distributions when size is fixed: the density approximation approaches infinity as tau approaches plus or minus infinity; the r* cdf approximation approaches 0 as tau approaches infinity and 1 as tau approaches negative infinity (the Lugannani-Rice approximation approaches negative infinity as tau approaches positive infinity and positive infinity as tau approaches negative infinity). On most examples, however, the approximations fail only in extreme regions of the tails, and may not fail at all up to machine precision. In case of questionable results, set conv.factor to a small positive value, say 0.1, to get the correct tail behavior.
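
The variance arithmetic in the second point can be checked numerically. The values of n and sigma2 below are arbitrary illustrative choices, not defaults of any function.

```r
# Check the variance-inflation arithmetic for size = n and
# conv.factor = n/(n-1); n and sigma2 are arbitrary illustrative values.
n <- 30
size <- n
conv.factor <- n / (n - 1)
sigma2 <- 1
expectedVar <- ((n - 1) / n) * sigma2 * (size + conv.factor) / size^2
all.equal(expectedVar, sigma2 / n)   # TRUE
```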

saddlepointP produces a warning if the cdf approximation is decreasing at any tau value.

pDiscreteMean calls tiltMeanSolve to calculate tau for given quantiles, then calls saddlepointP. dDiscreteMean calls tiltMeanSolve, then saddlepointD, and qDiscreteMean calls saddlepointPSolve, then tiltMean.

saddlepointPSolve uses a bracketed secant method to iteratively solve for tau.
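
As a rough illustration of what a bracketed secant iteration looks like, the sketch below uses false position with the Illinois modification. The exact variant used by saddlepointPSolve is not documented here, so the choice of variant, the helper name bracketedSecant, and the test function f are all assumptions made for illustration.

```r
# Hypothetical helper: bracketed secant (false position with the Illinois
# modification) for a root of f on [lo, hi]; f(lo) and f(hi) must differ
# in sign. Stops when |f(x)| < tol, i.e. tolerance on the scale of f.
bracketedSecant <- function(f, lo, hi, tol = 1e-6, maxiter = 100) {
  flo <- f(lo)
  fhi <- f(hi)
  stopifnot(flo * fhi < 0)
  for (i in seq_len(maxiter)) {
    x <- hi - fhi * (hi - lo) / (fhi - flo)   # secant step inside the bracket
    fx <- f(x)
    if (abs(fx) < tol) return(x)
    if (fx * fhi < 0) {
      lo <- hi                 # sign change between x and hi: keep hi as lo
      flo <- fhi
    } else {
      flo <- flo / 2           # Illinois step: damp the retained endpoint
    }
    hi <- x
    fhi <- fx
  }
  x
}

# Stand-in problem: solve pnorm(x) = 0.975 for x.
f <- function(x) pnorm(x) - 0.975
root <- bracketedSecant(f, 0, 5)
```

Keeping the root bracketed guarantees that every iterate stays between two points where the function has opposite signs, which a plain secant iteration does not.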

REFERENCES:

Barndorff-Nielsen, O.E. (1986), "Inference on full or partial parameters based on the standardized signed log likelihood ratio," Biometrika, 73, 307-322.

Daniels, H.E. (1954), "Saddlepoint approximations in statistics," Annals of Mathematical Statistics, 25, 631-650.

Hall, P. (1986), "On the number of bootstrap simulations required to construct a confidence interval," Annals of Statistics, 14, 1453-1462.

Hesterberg, T.C. (1994), "Saddlepoint Quantiles and Distribution Curves, with Bootstrap Applications," Computational Statistics, 9(3), 207-212.

Kolassa, J.E. (1997), Series Approximation Methods in Statistics, second edition, Lecture Notes in Statistics, no. 88, Springer-Verlag.

SEE ALSO:

tiltMean, tiltMeanSolve

EXAMPLES:

set.seed(0) 
x <- rexp(30) 
p <- c(.01, .025, .05, .5, .95, .975, .99) 
tau <- saddlepointPSolve(p, x) 
plot(tiltMean(tau, x)$q, p)  # saddlepoint distribution curve 
tau2 <- seq(min(tau), max(tau), length = 200) 
lines(tiltMean(tau2, x)$q, saddlepointP(tau2, x)) 
 
# variance decreases as sample size increases (use qDiscreteMean) 
points(qDiscreteMean(p, x, size = 50), p, col = 3) 
p2 <- seq(.01, .99, by = .005) 
lines(qDiscreteMean(p2, x, size = 50), p2, col = 3) 
 
# Find the saddlepoint cdf and density estimates at given quantile values
q <- 1:40/20 # quantile values .05, .1, ..., 2 
plot(q, dDiscreteMean(q, x), type = "l") # density 
lines(q, pDiscreteMean(q, x), col = 2)   # cdf 
 
# Stratified sampling 
set.seed(0) 
gs <- c(10, 20, 10) 
L1 <- rnorm(gs[1], mean = 0, sd = 1) 
L2 <- rnorm(gs[2], mean = 1, sd = 1) 
L3 <- rnorm(gs[3], mean = 2, sd = 3) 
L <- c(L1, L2, L3) 
group <- rep(1:3, gs) 
p <- 1:9/10 
plot(qDiscreteMean(p, L, group = group), p)
p2 <- seq(.1, .9, by = .01)
lines(qDiscreteMean(p2, L, group = group), p2)
 
# An example showing failure of the approximations in the tails 
L <- c(0,3,6,10) 
taup <- c(seq(-4,-1,length=10),seq(-1,1,length=50),seq(1,4,length=10)) 
taud <- seq(-3,3,length=100) 
plot(taup, saddlepointP(taup, L), type = "l") # cdf: warning messages
lines(taud, saddlepointD(taud, L), col = 3)   # density
 
# Improve with a convolution 
plot(taup, saddlepointP(taup, L, conv.factor = .1), type = "l") # cdf
lines(taud, saddlepointD(taud, L, conv.factor = .1), col = 3)   # density