cleanup.import will correct errors and shrink the size of data frames created by the S-Plus File ... Import dialog or by other methods such as scan and read.table. By default, double precision numeric variables are changed to single precision (S-Plus only) or to integer when they contain no fractional components. Infinite values or values greater than 1e20 in absolute value are set to NA. This solves problems of importing Excel spreadsheets that contain occasional character values for numeric columns, as S-Plus converts these to Inf without warning. There is also an option to convert variable names to lower case and to add labels to variables. The latter can be made easier by importing a CNTLOUT dataset created by SAS PROC FORMAT and using the sasdict option as shown in the example below. cleanup.import can also transform character or factor variables to dates.
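The clean-up rules described for cleanup.import can be approximated in base R. The following is a minimal sketch under stated assumptions, not the cleanup.import implementation: the helper name tidy_column is hypothetical, the threshold mirrors the documented default big=1e20, and the factor check follows the behaviour documented for force.numeric.

```r
# Hypothetical helper illustrating the documented rules; not package code.
tidy_column <- function(x, big = 1e20) {
  # A factor whose levels are all numbers or "" is converted to numeric
  if (is.factor(x)) {
    lev <- levels(x)
    nonblank <- lev[lev != ""]
    if (any(is.na(suppressWarnings(as.numeric(nonblank))))) return(x)
    chr <- as.character(x)
    chr[which(chr == "")] <- NA        # blank level becomes NA
    x <- as.numeric(chr)
  }
  if (!is.numeric(x)) return(x)
  # Infinite values and values larger than `big` in absolute value become NA
  x[is.infinite(x) | (!is.na(x) & abs(x) > big)] <- NA
  # Store as integer when no fractional parts remain
  ok <- x[!is.na(x)]
  if (length(ok) > 0 && all(ok == floor(ok)) &&
      all(abs(ok) <= .Machine$integer.max)) {
    storage.mode(x) <- "integer"
  }
  x
}

d <- data.frame(a = c(1, 2, Inf),
                b = c(0.5, 1e30, 3),
                f = factor(c("1.5", "", "3")),
                s = c("u", "v", "w"))
d[] <- lapply(d, tidy_column)
str(d)
```

Columns that are neither numeric nor numeric-looking factors, such as s here, are returned unchanged.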
upData is a function facilitating the updating of a data frame without attaching it in search position one. New variables can be added, old variables can be modified, variables can be removed or renamed, and "labels" and "units" attributes can be provided. Various checks are made for errors and inconsistencies, with warnings issued to help the user. Levels of factor variables can be replaced, especially using the list notation of the standard merge.levels function. Unless force.single is set to FALSE, upData also converts double precision vectors to single precision (if not under R), or to integer if no fractional values are present in a vector.
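The list notation for replacing factor levels has a close analogue in base R, where assigning a named list to levels() merges old levels into new ones. A minimal sketch, using the same values as the y variable in the example for this page; it shows base R behaviour and is not necessarily how upData works internally.

```r
y <- factor(c("a", "b1", "b2"))

# Each list element names a new level and lists the old levels it absorbs
levels(y) <- list(a = "a", b = c("b1", "b2"))

levels(y)
table(y)
```

Here the old levels b1 and b2 are both mapped to the new level b, while a is kept.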
Both cleanup.import and upData will fix a problem with data frames created under S-Plus before version 5 that are used in S-Plus 5 or later. The problem was caused by use of the label function to set a variable's class to "labelled". These classes are removed as the S version 4 language does not support multiple inheritance. Failure to run data frames through one of the two functions when these conditions apply will result in simple numeric variables being set to factor in some cases. Extraneous "AsIs" classes are also removed.
For S-Plus, a function exportDataStripped is provided that allows exporting of data to other systems by removing the attributes label, imputed, format, units, and comment. It calls exportData after stripping these attributes; without this step, exportData will fail.
csv.get reads comma-separated text data files, allowing optional translation to lower case for variable names after making them valid S names. The original names, which may not be legal S names, are taken to be variable labels. Character or factor variables containing dates can be converted to date variables. cleanup.import is invoked to finish the job.
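The name handling and date conversion described for csv.get can be sketched in base R. This is an illustration under stated assumptions, not the csv.get implementation: the CSV text and column names are made up, and the date layout "%Y-%m-%d" is the yyyy-mm-dd format.

```r
# Made-up CSV content with names that are not valid S names
csv_text <- "Patient ID,Systolic BP,Visit Date\n1,120,2004-03-15\n2,135,2004-04-01\n"
raw <- read.csv(text = csv_text, check.names = FALSE)

orig <- names(raw)                          # original, possibly non-legal names
names(raw) <- tolower(make.names(orig))     # valid names, then lower case

# Convert the date column before attaching labels, since as.Date()
# returns a new object without the old attributes
raw$visit.date <- as.Date(as.character(raw$visit.date), format = "%Y-%m-%d")

# Keep each original name as a "label" attribute on its column
for (i in seq_along(raw)) attr(raw[[i]], "label") <- orig[i]

names(raw)
attr(raw$systolic.bp, "label")
class(raw$visit.date)
```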
cleanup.import(obj, labels, lowernames=FALSE, force.single=TRUE, force.numeric=TRUE, rmnames=TRUE, big=1e20, sasdict, pr, datevars=NULL, dateformat='
upData(object, ..., rename, drop, labels, units, levels, force.single=TRUE, lowernames=FALSE, moveUnits=FALSE)
exportDataStripped(data, ...)
csv.get(file, lowernames=FALSE, datevars=NULL, dateformat='%F', allow=NULL, ...)
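force.single
Set force.single=FALSE to prevent the default conversion of double precision variables to single precision (S-Plus only). force.single=TRUE will also convert vectors having only integer values to have a storage mode of integer, in R or S-Plus.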
force.numeric
By default, cleanup.import will check each factor variable to see if the levels contain only numeric values and "". In that case, the variable will be converted to numeric, with "" converted to NA. Set force.numeric=FALSE to prevent this behavior.
labels
For cleanup.import, character values that are taken to be variable labels, in the same order as the variables in obj. For upData, labels is a named list or named vector with variables in no specific order.
lowernames
Set to TRUE to change variable names to lower case. upData does this before applying any other changes, so variable names given inside arguments to upData need to be lower case if lowernames==TRUE.
pr
For cleanup.import, set to TRUE or FALSE to force or prevent printing of the current variable number being processed. By default, such messages are printed if the product of the number of variables and number of observations in obj exceeds 500,000.
datevars
Names (after lowernames is applied) of variables to consider as a factor or character vector containing dates in a format matching dateformat. The default is "%F" which uses the yyyy-mm-dd format.
dateformat
For cleanup.import, the input format for the date variables.
...
For upData, one or more expressions of the form variable=expression, to derive new variables or change old ones. For exportDataStripped, optional arguments that are passed to exportData. For csv.get, arguments to pass to read.csv.
rename
For example, to rename variables age and sex to Age and gender, respectively, specify rename=list(age="Age", sex="gender") or rename=c(age=...).
units
"units" attributes of variables, in no specific order.
levels
A list of "levels" attributes for factor variables, in no specific order. The values in this list may be character vectors redefining levels (in order) or another list (see merge.levels if using S-Plus).
moveUnits
Set to TRUE to look for units of measurement in variable labels and move them to a "units" attribute. If an expression in a label is enclosed in parentheses or brackets, it is assumed to be units if moveUnits=TRUE.
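The idea behind moveUnits, pulling a parenthesized or bracketed expression out of a label and storing it as units, can be sketched with a regular expression. This is a hedged illustration only: the helper name move_units and the exact pattern (a trailing expression at the end of the label) are assumptions, not the rule upData actually applies.

```r
# Hypothetical helper; the pattern only handles a trailing (...) or [...]
move_units <- function(label) {
  pat <- "\\s*[\\(\\[]([^\\)\\]]*)[\\)\\]]\\s*$"
  if (!grepl(pat, label, perl = TRUE)) {
    return(list(label = label, units = NULL))
  }
  # Capture the text inside the delimiters, then strip it from the label
  units <- sub(paste0("^.*?", pat), "\\1", label, perl = TRUE)
  list(label = sub(pat, "", label, perl = TRUE), units = units)
}

res1 <- move_units("Systolic Blood Pressure (mmHg)")
res2 <- move_units("Weight [kg]")
res3 <- move_units("Sex")
str(res1)
```

A label with no parenthesized or bracketed expression, such as "Sex", is returned as is with no units.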
Frank Harrell, Vanderbilt University
## Not run:
dat <- read.table('myfile.asc')
dat <- cleanup.import(dat)
## End(Not run)

dat <- data.frame(a=(1:3)/7, y=c('a','b1','b2'), z=1:3)
dat2 <- upData(dat, x=x^2, x=x-5, m=x/10,
               rename=c(a='x'), drop='z',
               labels=c(x='X', y='test'),
               levels=list(y=list(a='a',b=c('b1','b2'))))
dat2
describe(dat2)
dat <- dat2    # copy to original name and delete dat2 if OK
rm(dat2)

# If you import a SAS dataset created by PROC CONTENTS CNTLOUT=x.datadict,
# the LABELs from this dataset can be added to the data. Let's also
# convert names to lower case for the main data file
## Not run:
mydata2 <- cleanup.import(mydata2, lowernames=TRUE, sasdict=datadict)
## End(Not run)