openData
function returns a data handle that
can be used to read or write sequential blocks of data from or to the file.
openData(file=NULL, type="", rowsToRead=1000, openType="read", keep=character(), drop=character(), colNames=character(), rowNamesCol=-1, filter=character(), format=character(), delimiter="", startCol=1, endCol=-1, startRow=1, endRow=-1, pageNumber=-1, colNameRow=-1, server="", user="", password="", database="", table="", sqlQuery=character(0), rowNames=F, stringsAsFactors=T, sortFactorLevels=T, valueLabelAsNumber=F, centuryCutoff=1930, separateDelimiters=T, quote=T, decimal.point=".", thousands.separator=",", time.format = "", time.zone="GMT", use.locale=F, scanLines=max(startRow, 256), maxLineWidth=0, na.string=<<see below>>, colTypes=character(0), sasFormats="")
type
argument below), S-PLUS assumes the file is of
that type. This can be overridden by providing
type
explicitly. The
file
argument is not required if
importing from or exporting to a relational database.
readNextDataRows
. If this is set to zero, the entire
data set is read at once. If
filter
is supplied,
the filter is applied before the rows are read; this means that filtered blocks
(except the last) are all of size
rowsToRead
.
"read"
if the data file is to be
read and
"write"
if the data file is to be opened for
writing. Only the first character is necessary and case is ignored.
keep
and
drop
can be
given.
keep
and
drop
can be
given.
colNames
is specified,
colNameRow
is automatically set to zero, which tells S-PLUS not to look for column names in the data. If there are column names in the data, you must set
startRow
to the appropriate value, so the column names row is not read in as data. If the
colNames
argument is specified,
""
is replaced by
coln
, where
n
is the column number.
colNames
. Column names containing
special characters such as "." should be surrounded by single quotes, such as
filter = "'Disp.' > 300"
. See the
help
file for information on the form the filter expression should take.
type="FASCII"
. You must use a format string together
with the
"FASCII"
file type if the columns in your
data file are not separated by delimiters. See the
and
help files for
information on the form this string can take.
\n
and
\t
are the only multi-character delimiters allowed, and denote a newline and a tab,
respectively. For any other multi-character string, only the first character is
used as the delimiter.
Double quotes are reserved characters and therefore cannot be used as standard
delimiters.
When importing data, if a delimiter is not supplied, S-PLUS searches the file automatically for the following (in the order given): tabs, commas, semicolons, and vertical bars. If none of these are detected, blank spaces are treated as delimiters.
-1
means that the
last column in the file is used.
scanLines
must be at least this value.
-1
means that the last row
in the file is used.
colNameRow=0
to prevent S-PLUS from searching for a
row of column names. In a delimited ASCII file, the column names row must come
before the first data row to be read (
startRow
).
""
if
type="DB2"
.
""
if
type="ORACLE"
.
database
to import from or export to. When exporting
data, if a table by the specified name does not already exist,
openData
creates it. When working with a
database,
table
cannot be specified in conjunction
with the
sqlQuery
argument (see below).
table
argument above.
rowNames=TRUE
, the row names
are exported to the data file.
stringsAsFactors=TRUE
, strings
are converted to factors when imported.
sortFactorLevels=TRUE
, the levels
for all factors created from character strings are sorted. Otherwise, the levels
are defined in the order that they are read in from the data file. This argument
only applies if the entire data set is read at once
(
rowsToRead=0
). See the DETAILS section below for
information on how factor levels are handled when reading in blocks.
valueLabelAsNumber=TRUE
, SAS and
SPSS variables with labels are imported as numbers. Otherwise, the value labels
are imported.
separateDelimiters=TRUE
,
repeated delimiters indicate columns with missing values.
Otherwise, repeated delimiters are treated as one delimiter.
The option
separateDelimiters=FALSE
is most often used to treat multiple blank spaces as one delimiter.
quote=TRUE
, quotes are
placed around character strings when exporting to ASCII text files
(
type="ASCII"
).
(.)
.
(,)
.
options("time.in.format")
or
options("time.out.format")
use.locale=TRUE
, the
default values of
decimal.point
and
thousands.separator
come from the current
locale set by
Sys.setlocale
, and the default
value of
time.zone
is
options()$time.zone
. Otherwise, the
default values are as described above.
scanLines=-1
means to scan the entire file, which may take a long
time for large files, but is the safest option.
The problem with setting this argument to scan less than the entire file
is that
importData
may detect the wrong column types, and
read some of the data incorrectly.
For example, suppose a particular column in a file only contains integers
for the first thousand rows, and then contains arbitrary strings.
If
scanLines=-1
, the column type will be detected as character or factor,
and imported that way.
If
scanLines=100
, the column type will be numeric,
and
importData
will attempt to import all of the values in that form:
elements that cannot be parsed as numbers will be read as
NA
.
"NA"
when reading,
and
""
when writing.
Regardless of the value specified for this argument, an empty string is always read as a missing value.
"numeric"
,
"character"
,
"factor"
,
and
"timeDate"
.
NOTE
section for more detail.
readNextDataRows
or
writeNextDataRows
function.
This function, along with
readNextDataRows
and
writeNextDataRows
, provides the capability of reading
and writing arbitrarily large data sets with S-PLUS.
See the
readNextDataRows
help file for a simple
example of computing statistics from an external data set by reading it in blocks.
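As a rough sketch of block-wise reading, the loop below counts the rows of an external file 100 at a time. It assumes the file "Quinidine.ssd01" already exists (it is created in the example at the end of this help file) and, as an unverified assumption, that readNextDataRows returns NULL or a zero-row data frame once the data is exhausted; consult the readNextDataRows help file for the actual end-of-data behavior.

```s
# Open the external data set, reading 100 rows per block.
dh <- openData("Quinidine.ssd01", rowsToRead=100)
nTotal <- 0
repeat {
    blk <- readNextDataRows(dh)
    # ASSUMPTION: an exhausted handle yields NULL or a zero-row data frame.
    if (is.null(blk) || nrow(blk) == 0) break
    # Accumulate a running statistic; here, simply the row count.
    nTotal <- nTotal + nrow(blk)
}
closeData(dh)
nTotal
```

Only one block is held in memory at a time, so the same pattern applies to files too large to import in a single call.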
When reading character data with
stringsAsFactors=TRUE
(the default), the levels are accumulated as each block is read. The levels for the
factor variables in the final data frame imported by
readNextDataRows
contain all possible values that have
appeared in that variable. When reading in blocks, the levels are ordered by
the first appearance of each value in the data set.
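The accumulation of levels across blocks can be illustrated in plain S. This is only an illustration of the ordering described above, not the code S-PLUS uses internally; the block contents are made up.

```s
# Two hypothetical blocks of character data from the same column.
block1 <- c("b", "a", "b")
block2 <- c("c", "a")

levs <- character(0)
for (blk in list(block1, block2)) {
    # Append only the values not yet seen, in order of first appearance.
    levs <- c(levs, unique(blk[is.na(match(blk, levs))]))
}
levs         # "b" "a" "c"  (order of appearance, as when reading in blocks)
sort(levs)   # "a" "b" "c"  (sorted, as with sortFactorLevels=TRUE and rowsToRead=0)
```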
When importing SAS data, if
valueLabelAsNumber
is
FALSE
(the default), S-PLUS attempts to get value labels from the file specified by
sasFormats
or it looks in the same directory where the data file (specified by the
file
argument) is located to find one of these files (in this order):
Windows: formats.sas7bcat, formats.stc, formats.xpt, formats.tpt, formats.sas7bdat
Unix/Linux: formats.stc, formats.xpt, formats.tpt, formats.sas7bdat
where
.sas7bcat
is a SAS catalog file,
.sas7bdat
is a SAS data file,
.stc
is a SAS CPORT transport file, and
.xpt
or
.tpt
is an older style SAS Transport file (prior to SAS version 7).
S-PLUS cannot read
.sas7bcat
(catalog) files created on Unix/Linux, but it can read catalog files created on Windows, even when running the Unix version of S-PLUS. As a workaround, Unix users must convert their SAS catalog file to either a CPORT transport file or a SAS data file.
If
sasFormats
is not given and none of the default format files (listed above) can be found, then S-PLUS behaves as if
valueLabelAsNumber=T
, even though some SAS data variables may have been associated with user-defined formats when the data set was created.
If
valueLabelAsNumber=TRUE
, there is no attempt to open a SAS format file, even if
sasFormats
is specified.
valueLabelAsNumber
also controls whether value labels that may exist in an SPSS data file are used when importing that data file.
If
sasFormats
is a file with a
.loc
extension, then that file must contain the names of the format files to be used. User-defined formats are read from all files listed in the
.loc
file. This is useful if you have data that uses formats from various format files.
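A minimal sketch of supplying a format file when opening SAS data follows. The file names "mydata.sas7bdat" and "myformats.loc" are hypothetical, and the layout of the .loc file itself is not shown because it is not specified here; only arguments listed in the usage above are used.

```s
# Hypothetical file names; substitute your own data and format files.
dh <- openData("mydata.sas7bdat",
               sasFormats="myformats.loc",   # lists the format files to use
               valueLabelAsNumber=F,         # import value labels, not numbers
               rowsToRead=1000)
blk <- readNextDataRows(dh)
closeData(dh)
```

With valueLabelAsNumber=T the sasFormats argument would be ignored, as noted above.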
# First create an external SAS data set.
exportData(Quinidine, "Quinidine.ssd01")
# Open the external data set for subsequent reads.
dh <- openData("Quinidine.ssd01", rowsToRead=100, drop="Subject")
# Get variable names and type.
getDataInfo(dh)
# Read the first 100 observations.
df100 <- readNextDataRows(dh)
# Close the external data file.
closeData(dh)