sas.codes and these may be added back to the levels of a factor variable using the code.levels function.
Information about special missing values may be captured in an attribute of each variable having special missing values. This attribute is called special.miss, and such variables are given class special.miss. There are print, [], format, and is.special.miss methods for such variables.
The chron function is used to set up date, time, and date-time variables. If using S-Plus 5 or 6 or later, the timeDate function is used instead. Under R, POSIXct is used for dates and date-times. For times without dates, these still need to be stored in date-time format in POSIX. Such SAS time variables are given a major class of timePOSIXt and a format.timePOSIXt function so that the date portion (which will always be 1/1/1970) will not print by default.
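The idea of storing a time without a date can be illustrated in base R. This is a minimal sketch, not the code sas.get uses: it assumes a SAS time value is the number of seconds since midnight, and it uses a plain POSIXct object with an explicit time-only format rather than the timePOSIXt class.

```r
# Assumption: a SAS time value is seconds since midnight (11:13:45 here).
secs <- 11 * 3600 + 13 * 60 + 45

# Store the time as a date-time on the POSIX origin, in GMT so that no
# time-zone shift is applied.
t1 <- as.POSIXct(secs, origin = "1970-01-01", tz = "GMT")

# A full date-time format shows the placeholder date ...
full <- format(t1, "%Y-%m-%d %H:%M:%S")
# ... while a time-only format hides it.
time_only <- format(t1, "%H:%M:%S")

print(full)
print(time_only)
```

A class-specific format method can apply the time-only format automatically, which is the role the text above gives to format.timePOSIXt.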
If a date variable represents a partial date (.5 added if month missing, .25 added if day missing, .75 if both), an attribute partial.date is added to the variable, and the variable also becomes a class imputed variable.
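The fractional coding of partial dates can be decoded from the fractional part of the numeric value. The helper below is hypothetical (partial_kind is not part of sas.get) and only illustrates the coding stated above, assuming the date is stored as a whole number of days plus the added fraction.

```r
# Hypothetical helper: report which part of a date is missing, based on
# the fraction added to the numeric date value.
partial_kind <- function(x) {
  frac <- x - floor(x)
  ifelse(frac == 0.5,  "month missing",
  ifelse(frac == 0.25, "day missing",
  ifelse(frac == 0.75, "month and day missing", "complete")))
}

# Made-up date values: one complete, three partial.
dates <- c(15402, 15402.5, 15402.25, 15402.75)
print(partial_kind(dates))
```

The fractions .25, .5, and .75 are represented exactly in floating point, so the equality comparisons are safe here.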
The describe function uses information about partial dates and special missing values.
There is an option to automatically uncompress (or gunzip) compressed
SAS datasets.
sas.get(library, member, variables, ifs, format.library=library,
        formats=T, recode=formats, special.miss=F, id, as.is=.5,
        check.unique.id=T, force.single=F, dates="sas", keep.log=T,
        log.file="_temp_.log", macro=sas.get.macro, data.frame.out=T,
        clean.up=T, quiet=F, temp=tempfile("SaS"), sasprog="sas",
        uncompress=F)

is.special.miss(x, code)
x[...]
print(x)
format(x)
sas.codes(x)
x <- code.levels(x)
sas.get with special.miss=T or with recode in effect.
sas.contents to get the variables in the SAS dataset. If you have retrieved a subset of the variables in the SAS dataset and wish to retrieve the same list of variables from another dataset, you can program the value of variables; see one of the last examples.
Set formats to F to keep sas.get from telling the SAS macro to retrieve value label formats from format.library. When you do not specify formats or recode, sas.get will set formats to T if a SAS format catalog (.sct or .sc2) file exists in format.library.
Value label formats, if present, are stored as the formats attribute of the returned object (see below). A format is used if it is referred to by one or more variables in the dataset, if it contains no ranges of values (i.e., it identifies value labels for single values), and if it is a character format or a numeric format that is not used just to label missing values. If you set recode to TRUE, 1, or 2, formats defaults to TRUE.
To fetch the values and labels for variable x in the dataset d you could type:
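f <- attr(d$x, "format")        # name of the format attached to x
formats <- attr(d, "formats")   # all format definitions for the dataset
# f holds the format name as a string, so index with [[f]] rather than $f
formats[[f]]$values; formats[[f]]$labels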
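The lookup can be tried without SAS on a mock object. The sketch below builds a small list by hand that imitates the documented attributes; the format name "rating" and its values are invented for illustration. The format definitions are indexed with [[f]] because f holds the format name as a character string.

```r
# Mock variable carrying a "format" attribute (format name is hypothetical).
x <- c(1, 3, 2)
attr(x, "format") <- "rating"

# Mock dataset (a list) carrying a "formats" attribute with one definition.
d <- list(x = x)
attr(d, "formats") <- list(
  rating = list(values = c(1, 2, 3), labels = c("good", "better", "best"))
)

f    <- attr(d$x, "format")    # "rating"
fmts <- attr(d, "formats")
vals <- fmts[[f]]$values
labs <- fmts[[f]]$labels
print(vals)
print(labs)
```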
Defaults to TRUE if formats is TRUE. If it is TRUE, variables that have an appropriate format (see above) are recoded as factor objects, which map the values to the value labels for the format. Alternatively, set recode to 1 to use labels of the form value:label, e.g. 1:good 2:better 3:best. Set recode to 2 to use labels such as good(1) better(2) best(3). Since sas.codes and code.levels add flexibility, the usual choice for recode is TRUE.
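The three labeling styles can be reproduced with base R's factor and paste. This is only a sketch of the resulting level names, using made-up values and labels; it is not the code sas.get runs.

```r
values <- c(1, 2, 3)                     # original SAS codes (made up)
labels <- c("good", "better", "best")    # PROC FORMAT value labels (made up)
x      <- c(2, 1, 3, 2)                  # raw data as stored in SAS

# recode=TRUE style: plain value labels
f_plain <- factor(x, levels = values, labels = labels)

# recode=1 style: value:label
f_1 <- factor(x, levels = values,
              labels = paste(values, labels, sep = ":"))

# recode=2 style: label(value)
f_2 <- factor(x, levels = values,
              labels = paste0(labels, "(", values, ")"))

print(levels(f_plain))
print(levels(f_1))
print(levels(f_2))
```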
Set special.miss to TRUE. This will cause the special.miss attribute and the special.miss class to be added to each variable that has at least one special missing value. Suppose that variable y was .E in observation 3 and .G in observation 544. The special.miss attribute for y then has the value
list(codes=c("E","G"), obs=c(3,544))
To fetch this information for y you would say, for example:
s <- attr(y, "special.miss")
s$codes; s$obs
or use is.special.miss(x) or the print.special.miss method, which will replace NA values for the variable with E or G if they correspond to special missing values. The describe function uses this information in printing a data summary.
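The attribute layout described above can be explored on a mock vector. In this sketch, is_special is a hypothetical stand-in written for illustration; it is not the real is.special.miss method, and the data are invented.

```r
# Mock variable: 600 observations; obs 3 was .E and obs 544 was .G in SAS.
y <- rep(1, 600)
y[c(3, 544)] <- NA
attr(y, "special.miss") <- list(codes = c("E", "G"), obs = c(3, 544))

# Hypothetical stand-in: TRUE where an observation has the given special
# missing code, or any special missing code if `code` is omitted.
is_special <- function(x, code) {
  s   <- attr(x, "special.miss")
  out <- rep(FALSE, length(x))
  hit <- if (missing(code)) s$obs else s$obs[s$codes == code]
  out[hit] <- TRUE
  out
}

any_special <- which(is_special(y))        # 3 544
g_only      <- which(is_special(y, "G"))   # 544
print(any_special)
print(g_only)
```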
row.names attribute of a data frame, but the id variable is still retained as a variable in the data frame. (If data.frame.out is FALSE, this will be the attribute "id" of the S-PLUS dataset.) You can also specify a vector of variable names as the id parameter. After fetching the data from SAS, all these variables will be converted to character format and concatenated (with a space as a separator) to form a (hopefully) unique ID variable.
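Building an ID from several variables amounts to pasting their character forms together with a space. The sketch below uses an invented data frame with columns site and patient; it illustrates the concatenation only and is not taken from sas.get.

```r
# Invented data: neither column is unique on its own.
d <- data.frame(site    = c(1, 1, 2),
                patient = c("A", "B", "A"),
                stringsAsFactors = FALSE)

# Convert each ID variable to character and join with a space.
id <- paste(as.character(d$site), as.character(d$patient), sep = " ")

# The combined ID is unique here, so it can serve as row names.
all_unique <- anyDuplicated(id) == 0
if (all_unique) row.names(d) <- id
print(d)
```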
If data.frame.out=T, SAS character variables are converted to S factor objects if as.is=F or if as.is is a number between 0 and 1 inclusive and the number of unique values of the variable is less than the number of observations (n) times as.is. The default for as.is is .5, so character variables are converted to factors only if they have fewer than n/2 unique values. The primary purpose of this is to keep unique identification variables as character values in the data frame instead of using more space to store both the integer factor codes and the factor labels.
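The conversion rule can be written as a one-line test. The function name convert_to_factor is hypothetical; the sketch encodes only the rule stated above (convert when as.is=F, or when the number of unique values is below n times as.is).

```r
# Hypothetical helper: should a character variable become a factor?
convert_to_factor <- function(x, as.is = 0.5) {
  if (identical(as.is, FALSE)) return(TRUE)   # as.is=F: always convert
  length(unique(x)) < length(x) * as.is       # fewer than n * as.is unique values
}

sex <- rep(c("F", "M"), 5)     # 2 unique values in 10 observations
ids <- paste0("P", 1:10)       # 10 unique values in 10 observations

print(convert_to_factor(sex))                 # TRUE:  2 < 5
print(convert_to_factor(ids))                 # FALSE: 10 is not < 5
print(convert_to_factor(ids, as.is = FALSE))  # TRUE
```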
If id is specified, the row names are checked for uniqueness if check.unique.id=T. If any are duplicated, a warning is printed. Note that if a data frame is being created with duplicate row names, statements such as my.data.frame["B23",] will retrieve only the first row with a row name of "B23".
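The first-match behavior can be seen in R with a matrix. A matrix is used here because R data frames refuse duplicate row names, so this sketch is an analogy to the S-PLUS data frame case rather than a reproduction of it.

```r
ids <- c("B23", "A01", "B23")

# Detect duplicated IDs, as a uniqueness check would.
dup <- unique(ids[duplicated(ids)])
print(dup)

# Character indexing matches only the first row named "B23".
m <- matrix(1:6, nrow = 3, dimnames = list(ids, c("x", "y")))
first_b23 <- m["B23", ]
print(first_b23)
```

The third row, which shares the name "B23", is never returned by name-based indexing; it can only be reached by position.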
SAS numeric variables with LENGTHs > 4 are stored as S double precision numerics, which allow for the same precision as a SAS LENGTH 8 variable. Set force.single=T to store every numeric variable in single precision (7 digits of precision). This option is useful when the creator of the SAS dataset has failed to use a LENGTH statement.
One of "sas", "yearfrac", "yearfrac2", "yymmdd".
If a SAS variable has a date format (one of "DATE", "MMDDYY", "YYMMDD", "DDMMYY", "YYQ", "MONYY", "JULIAN"), it will be converted to the format specified by dates before being given to S-PLUS. "sas" gives days from 1/1/1960 (from 1/1/1970 if using chron), "yearfrac" gives days from 1/1/1900 divided by 365.25, "yearfrac2" gives year plus fraction of current year, and "yymmdd" gives a 6 digit number YYMMDD (year%%100, month, day). Note that S-PLUS will store these as numbers, not as character strings. If dates="sas" and a variable has one of the SAS date formats listed above, the variable will be given a class of "date" to work with Terry Therneau's implementation of the "date" class in S. If the chron package or timeDate function is available, these are used instead.
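The four numeric representations can be computed for a single date in base R. This is a sketch of the arithmetic only, not the SAS macro's code. The "yearfrac2" line assumes the fraction is the zero-based day of the year divided by the number of days in that year; the exact definition used by the macro is not stated above.

```r
d <- as.Date("2002-03-03")

# "sas": days from 1/1/1960
sas_days <- as.numeric(d - as.Date("1960-01-01"))

# "yearfrac": days from 1/1/1900 divided by 365.25
yearfrac <- as.numeric(d - as.Date("1900-01-01")) / 365.25

# "yearfrac2": year plus fraction of current year (assumed definition)
lt   <- as.POSIXlt(d)
year <- lt$year + 1900
days_in_year <- as.numeric(as.Date(paste0(year + 1, "-01-01")) -
                           as.Date(paste0(year, "-01-01")))
yearfrac2 <- year + lt$yday / days_in_year

# "yymmdd": year %% 100, month, day packed into one number
yymmdd <- (year %% 100) * 10000 + (lt$mon + 1) * 100 + lt$mday

print(c(sas = sas_days, yearfrac = yearfrac,
        yearfrac2 = yearfrac2, yymmdd = yymmdd))
```

Because the result is numeric, the "yymmdd" value for this date prints as 20303, without the leading zero of 020303.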
If keep.log is FALSE, delete the SAS log file upon completion.
If data.frame.out is TRUE, the return value will be an S-PLUS data frame, otherwise it will be a list.
If clean.up is TRUE, remove all temporary files when finished. You may want to keep these while debugging the SAS macro.
If quiet is FALSE, print the contents of the SAS log file if there has been an error.
Set uncompress to T to automatically invoke the UNIX gunzip command (if member.ssd01.gz exists) or the uncompress command (if member.ssd01.Z exists) to uncompress the SAS dataset before proceeding. This assumes you have the file permissions to allow uncompressing in place. If the file is already uncompressed, this option is ignored.
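The decision described above can be sketched as a small function that returns the command to run, if any. The name uncompress_cmd is hypothetical and the function only builds a command string; it does not execute anything and is not the code sas.get uses.

```r
# Hypothetical helper: choose the UNIX command needed to uncompress
# member.ssd01 in `library`, or NULL if nothing needs to be done.
uncompress_cmd <- function(library, member) {
  base <- file.path(library, paste0(member, ".ssd01"))
  if (file.exists(base)) return(NULL)            # already uncompressed
  gz <- paste0(base, ".gz")
  z  <- paste0(base, ".Z")
  if (file.exists(gz)) return(paste("gunzip", gz))
  if (file.exists(z))  return(paste("uncompress", z))
  NULL                                           # no compressed file found
}

# Demonstration in a temporary directory with an empty placeholder file.
lib <- tempfile("saslib")
dir.create(lib)
invisible(file.create(file.path(lib, "mice.ssd01.gz")))
cmd <- uncompress_cmd(lib, "mice")
print(cmd)
```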
If you specify where, each individual variable is placed into a separate object (whose name is the name of the variable) using the assign function with the where argument. For example, you can put each variable in its own file in a directory, which in some cases may save memory over attaching a data frame.
If code is omitted, is.special.miss will return a T for each observation that has any special missing value.
If you specify special.miss=T and there are no special missing values in the SAS dataset, the SAS step will fail.
For variables having a PROC FORMAT VALUE format with some of the levels undefined, sas.get will interpret those values as NA if you are using recode.
The SAS macro sas_get uses record lengths of up to 4096 in two places. If you are exporting records that are very long (because of a large number of variables and/or long character variables), you may want to edit these LRECLs to quadruple them, for example.
If data.frame.out is TRUE, the output will be a data frame resembling the SAS dataset. If id was specified, that column of the data frame will be used as the row names of the data frame. Each variable in the data frame or vector in the list will have the attributes label and format containing SAS labels and formats. Underscores in formats are converted to periods. Formats for character variables have $ placed in front of their names.
If formats is TRUE and there are any appropriate format definitions in format.library, the returned object will have attribute formats containing lists named the same as the format names (with periods substituted for underscores and character formats prefixed by $). Each of these lists has a vector called values and one called labels with the PROC FORMAT; VALUE ... definitions.
If data.frame.out is FALSE, the output will be a list of vectors, each containing a variable from the SAS dataset. If id was specified, that element of the list will be used as the id attribute of the entire list.
You must be able to run SAS (by typing sas) on your system. If the S-PLUS command !sas does not start SAS, then this function cannot work. If you are reading time or date-time variables, you will need to execute the command library(chron) to print those variables or the data frame if the timeDate function is not available.
Terry Therneau, Mayo Clinic
Frank Harrell, Vanderbilt University
Bill Dunlap, University of Washington and Insightful Corp.
Michael W. Kattan, Cleveland Clinic Foundation
SAS Institute Inc. (1990). SAS Language: Reference, Version 6. First Edition. SAS Institute Inc., Cary, North Carolina.
SAS Institute Inc. (1988). SAS Technical Report P-176, Using the SAS System, Release 6.03, under UNIX Operating Systems and Derivatives. SAS Institute Inc., Cary, North Carolina.
SAS Institute Inc. (1985). SAS Introductory Guide. Third Edition. SAS Institute Inc., Cary, North Carolina.
sas.contents("saslib", "mice")
# [1] "dose"   "ld50"   "strain" "lab_no"
# attr(, "n"):
# [1] 117
mice <- sas.get("saslib", mem="mice", var=c("dose", "strain", "ld50"))
plot(mice$dose, mice$ld50)

nude.mice <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
                     ifs="if strain='nude'")

nude.mice.dl <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
                        var=c("dose", "ld50"), ifs="if strain='nude'")

# Get a dataset from current directory, recode PROC FORMAT; VALUE ...
# variables into factors with labels of the form "good(1)" "better(2)",
# get special missing values, recode missing codes .D and .R into new
# factor levels "Don't know" and "Refused to answer" for variable q1
d <- sas.get(".", "mydata", recode=2, special.miss=T)
attach(d)
nl <- length(levels(q1))
lev <- c(levels(q1), "Don't know", "Refused")
q1.new <- as.integer(q1)
q1.new[is.special.miss(q1,"D")] <- nl+1
q1.new[is.special.miss(q1,"R")] <- nl+2
q1.new <- factor(q1.new, 1:(nl+2), lev)
# Note: would like to use factor() in place of as.integer ... but
# factor in this case adds "NA" as a category level

d <- sas.get(".", "mydata")
sas.codes(d$x)    # for PROC FORMATted variables returns original data codes
d$x <- code.levels(d$x)   # or attach(d); x <- code.levels(x)
# This makes levels such as "good" "better" "best" into e.g.
# "1:good" "2:better" "3:best", if the original SAS values were 1,2,3

# Retrieve the same variables from another dataset (or an update of
# the original dataset)
mydata2 <- sas.get('mydata2', var=names(d))
# This only works if none of the original SAS variable names contained _

# Code from Don MacQueen to generate SAS dataset to test import of
# date, time, date-time variables
# data ssd.test;
#   d1='3mar2002'd ;
#   dt1='3mar2002 9:31:02'dt;
#   t1='11:13:45't;
#   output;
#
#   d1='3jun2002'd ;
#   dt1='3jun2002 9:42:07'dt;
#   t1='11:14:13't;
#   output;
#   format d1 mmddyy10. dt1 datetime. t1 time.;
# run;