describe is a generic method that invokes describe.data.frame, describe.matrix, describe.vector, or describe.formula. describe.vector is the basic function for handling a single variable. This function determines whether the variable is character, factor, category, binary, discrete numeric, or continuous numeric, and prints a concise statistical summary according to each. A numeric variable is deemed discrete if it has <= 10 unique values. In this case, quantiles are not printed. A frequency table is printed for any non-binary variable if it has no more than 20 unique values. For any variable with at least 20 unique values, the 5 lowest and highest values are printed.
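The thresholds above can be sketched in base R. This is only an illustration of the stated rules, not the code used by describe.vector; the helper name summary_plan is hypothetical, and the check for binary variables is deliberately left out.

```r
# Hypothetical helper (not part of Hmisc): applies the unique-value
# thresholds stated in the text to decide which summaries would be shown.
summary_plan <- function(x) {
  n_unique <- length(unique(x[!is.na(x)]))
  list(
    n_unique        = n_unique,
    discrete        = n_unique <= 10,  # "discrete" numeric variable
    show_quantiles  = n_unique > 10,   # quantiles are omitted when discrete
    show_freq_table = n_unique <= 20,  # stated for non-binary variables
    show_extremes   = n_unique >= 20   # 5 lowest and 5 highest values
  )
}

plan_small <- summary_plan(c(1, 2, 2, 3, NA))  # 3 unique values
plan_large <- summary_plan(seq_len(50))        # 50 unique values
```

As the rules are worded, a variable with exactly 20 unique values meets both the frequency-table condition and the lowest/highest condition.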
describe is especially useful for describing data frames created by sas.get, as SAS labels, formats, value labels, and frequencies of special missing values are printed. For a binary variable, the sum (number of 1's) and mean (proportion of 1's) are printed. If the first argument is a formula, a model frame is created and passed to describe.data.frame. If a variable is of class "impute", a count of the number of imputed values is printed. If a date variable has an attribute partial.date (this is set up by sas.get), counts of how many partial dates are actually present (missing month, missing day, missing both) are also presented. If a variable was created by the special-purpose function substi (which substitutes values of a second variable if the first variable is NA), the frequency table of substitutions is also printed.
A latex method exists for converting the describe object to a LaTeX file. For numeric variables having at least 20 unique values, describe saves in its returned object the frequencies of 100 evenly spaced bins running from the minimum observed value to the maximum. latex inserts a spike histogram displaying these frequency counts in the tabular material using the LaTeX picture environment. For example output see http://biostat.mc.vanderbilt.edu/twiki/pub/Main/Hmisc/counties.pdf.
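To make the binning concrete, here is a minimal base R sketch that counts observations in 100 evenly spaced bins between the observed minimum and maximum. It is an assumption about how such counts could be computed, not the Hmisc implementation; in particular, the handling of values that fall exactly on a bin edge may differ from what describe does.

```r
set.seed(1)
x <- runif(200)

rng    <- range(x)   # observed minimum and maximum
n_bins <- 100

# Map each value to a bin index in 1..100; the maximum would land in a
# 101st bin, so it is clamped into the last one.
idx   <- pmin(floor((x - rng[1]) / diff(rng) * n_bins) + 1, n_bins)
count <- tabulate(idx, nbins = n_bins)

# Shaped like the range/count pair described for the returned object
interval_freq <- list(range = rng, count = count)
```

Each element of count would correspond to the height of one spike in the histogram.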
Sample weights may be specified to any of the functions, resulting in weighted means, quantiles, and frequency tables.
## S3 method for class 'vector':
describe(x, descript, exclude.missing=TRUE, digits=4, weights, normwt, ...)

## S3 method for class 'matrix':
describe(x, descript, exclude.missing=TRUE, digits=4, ...)

## S3 method for class 'data.frame':
describe(x, descript, exclude.missing=TRUE, digits=4, ...)

## S3 method for class 'formula':
describe(x, descript, data, subset, na.action, digits=4, weights, ...)

## S3 method for class 'describe':
print(x, condense=TRUE, ...)

## S3 method for class 'describe':
latex(object, title=NULL, condense=TRUE,
      file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'),
      append=FALSE, size='small', tabular=TRUE, ...)

## S3 method for class 'describe.single':
latex(object, title=NULL, condense=TRUE, vname, file, append=FALSE, size='small', tabular=TRUE, ...)
For a data frame, the describe.data.frame function is automatically invoked. For a matrix, describe.matrix is called. For a formula, describe.data.frame(model.frame(x)) is invoked. The formula may or may not have a response variable. For print or latex, x is an object created by describe.
descript defaults to a character representation of the formula.
weights
times.
normwt=FALSE results in the use of weights as weights in computing various statistics. In this case the sample size is assumed to be equal to the sum of weights. Specify normwt=TRUE to divide weights by a constant so that weights sum to the number of observations (length of vectors specified to describe). In this case the number of observations is taken to be the actual number of records given to describe.
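The effect of normwt can be illustrated with plain arithmetic in base R. This is a sketch of the normalization described above, with made-up data, and does not call describe itself.

```r
x <- c(10, 20, 30, 40)   # illustrative data
w <- c(2, 1, 1, 4)       # illustrative weights

# normwt=FALSE: weights are used as given, so the sample size is sum(w)
n_unnormalized <- sum(w)

# normwt=TRUE: rescale the weights so that they sum to the number of records
w_norm       <- w * length(x) / sum(w)
n_normalized <- sum(w_norm)

# Rescaling by a constant leaves the weighted mean unchanged
mean_raw  <- weighted.mean(x, w)
mean_norm <- weighted.mean(x, w_norm)
```

Only the sample size implied by the weights changes under normalization; statistics that are ratios of weighted sums, such as the weighted mean, are unaffected.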
describe
na.action defaults to na.retain which does not delete any NAs from the data frame. Use na.action=na.omit or na.delete to drop any observation with any NA before processing.
describe.default
which are passed to calls to format for numeric variables. For example if using R POSIXct or Date date/time formats, specifying describe(d,format='%d%b%y') will print date/time variables as "01Jan00" (%y is a two-digit year; use %Y for "01Jan2000"). This is useful for omitting the time component. See the help file for format.POSIXct or format.Date for more information. For latex methods, ... is ignored.
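The format strings can be tried directly with base R's format before passing them to describe. The sketch below uses an arbitrary date and time; note that %y yields a two-digit year and %Y a four-digit year, and that the month abbreviation produced by %b depends on the locale.

```r
d  <- as.Date("2000-01-01")
dt <- as.POSIXct("2000-01-01 11:17:11", tz = "UTC")

short_year <- format(d, "%d%b%y")   # two-digit year, e.g. "01Jan00" in an English locale
long_year  <- format(d, "%d%b%Y")   # four-digit year
no_time    <- format(dt, "%d%b%Y")  # time component is dropped for POSIXct
```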
descript element of the describe object, prefixed by "describe". Set file="" to send LaTeX code to standard output instead of a file.
TRUE to have latex append text to an existing file named file
"small", the default, or "normalsize", "tiny", "scriptsize", etc.) for the describe output in LaTeX.
FALSE to use verbatim rather than tabular environment for the summary statistics output. By default, tabular is used if the output is not too wide.
latex.describe.single
If options(na.detail.response=TRUE) has been set and na.action is "na.delete" or "na.keep", summary statistics on the response variable are printed separately for missing and non-missing values of each predictor. The default summary function returns the number of non-missing response values and the mean of the last column of the response values, with a names attribute of c("N","Mean"). When the response is a Surv object and the mean is used, this will result in the crude proportion of events being used to summarize the response. The actual summary function can be designated through options(na.fun.response = "function name").
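A function of the kind described above might look like the following base R sketch. The name na_summary and the data are hypothetical, and this is a stand-in written from the description, not the default function shipped with the package. The response is a two-column matrix (time, event status), mimicking the layout of a Surv object, so the mean of the last column is the crude proportion of events.

```r
# Hypothetical stand-in for the default summary: N non-missing values and
# the mean of the last column of the response.
na_summary <- function(y) {
  y    <- as.matrix(y)
  last <- y[, ncol(y)]
  c(N = sum(!is.na(last)), Mean = mean(last, na.rm = TRUE))
}

time   <- c(5, 8, 12, 3, 9, 7)
status <- c(1, 0, 1, 1, 0, 1)       # 1 = event
age    <- c(50, NA, 61, NA, 45, 70) # predictor with missing values
y      <- cbind(time, status)

# Summarize the response separately for non-missing ("FALSE") and
# missing ("TRUE") values of the predictor
by_missing <- lapply(split(seq_len(nrow(y)), is.na(age)),
                     function(i) na_summary(y[i, , drop = FALSE]))
```

A function with the same return shape could be named in options(na.fun.response = ...) to replace the default.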
descript, counts, values. The list is of class describe. If the input object was a matrix or a data frame, the list is a list of lists, one list for each variable analyzed. latex returns a standard latex object. For numeric variables having at least 20 unique values, an additional component intervalFreq is included. This component is a list with two elements, range (containing two values) and count, a vector of 100 integer frequency counts.
Frank Harrell
Vanderbilt University
f.harrell@vanderbilt.edu
set.seed(1)
describe(runif(200),dig=2)    #single variable, continuous
                              #get quantiles .05,.10,...

dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE))
describe(dfr)

## Not run:
d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE)
describe(d)      #describe entire data frame
attach(d, 1)
describe(relig)  #Has special missing values .D .F .M .R .T
                 #attr(relig,"label") is "Religious preference"

#relig : Religious preference  Format:relig
#    n missing  D  F M R T unique
# 4038     263 45 33 7 2 1      8
#
#0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%)
#3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%)
#5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%)

# Method for describing part of a data frame:
describe(death.time ~ age*sex + rcs(blood.pressure))
describe(~ age+sex)
describe(~ age+sex, weights=freqs)  # weighted analysis

fit <- lrm(y ~ age*sex + log(height))
describe(formula(fit))
describe(y ~ age*sex, na.action=na.delete)
# report on number deleted for each variable
options(na.detail.response=TRUE)
# keep missings separately for each x, report on dist of y by x=NA
describe(y ~ age*sex)
options(na.fun.response="quantile")
describe(y ~ age*sex)   # same but use quantiles of y by x=NA

d <- describe(my.data.frame)
d$age                   # print description for just age
d[c('age','sex')]       # print description for two variables
d[sort(names(d))]       # print in alphabetic order by var. names
d2 <- d[20:30]          # keep variables 20-30
page(d2)                # pop-up window for these variables

# Test date/time formats and suppression of times when they don't vary
library(chron)
d <- data.frame(a=chron((1:20)+.1),
                b=chron((1:20)+(1:20)/100),
                d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                              hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
                f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                              hour=1:20,min=1:20,sec=1:20),
                g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
describe(d)

## End(Not run)