bdFrame
.
importData(file=NULL, type="", keep=character(0), drop=character(0), colNames=character(0), rowNamesCol=-1, filter=character(), format=character(0), delimiter=<<see below>>, startCol=1, endCol=-1, startRow=1, endRow=-1, pageNumber=-1, colNameRow=-1, server="", user="", password="", database="", table="", stringsAsFactors=<<see below>>, sortFactorLevels=T, valueLabelAsNumber=F, centuryCutoff=1930, separateDelimiters=T, odbcConnection=character(0), odbcSqlQuery=character(0), sqlQuery=character(0), readAsTable=F, colNamesUpperCase=F, time.in.format=character(0), decimal.point=".", thousands.separator=",", time.zone="GMT", use.locale=F, sqlReturnData=T, scanLines=max(startRow, 256), maxLineWidth=0, na.string="NA", colTypes=character(0), colStringWidths = integer(0), sasFormats="", bigdata=F)
type argument below), S-PLUS assumes the file is of that type. This can be overridden by providing type explicitly. The file argument is not required if importing from a relational database.
Microsoft Access file. This file type is available only in S-PLUS for Windows. See the DETAILS section for information on import/export limitations. File suffix: .mdb or .accdb (Access 2007 only).
file argument should be specified.
file argument should be specified.
file argument should be specified.
This option is available only in S-PLUS for Windows.
file argument should be specified.
"ODBC", you must also specify the argument odbcConnection. You can also specify the argument odbcSqlQuery with this type. For more information, see the ODBC arguments described below.
file argument should be specified.
data.dump and data.restore functions. File suffix: sdd.
file argument should be specified.
keep and drop can be given.
The variable names to keep can also be placed in a file and separated by spaces, commas, or newlines. In this case, you can pass the name of the file to keep as a character string beginning with an at sign ("@"). For example, keep="@keeplist.txt".
keep and drop can be given.
The variable names to drop can also be placed in a file and separated by spaces, commas, or newlines. In this case, you can pass the name of the file to drop as a character string beginning with an at sign ("@"). For example, drop="@droplist.txt".
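The keep and drop conventions described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive recipe: the file names fuel.txt and keeplist.txt and the object names fuelKeep and fuelDrop are hypothetical, and the built-in fuel.frame data set is used only because it also appears in the examples at the end of this file.

```s
# Write the built-in fuel.frame data set to a delimited ASCII file
# (hypothetical file name).
exportData(fuel.frame, "fuel.txt")

# Put the names of the variables to keep in a separate file,
# separated by commas (hypothetical file name).
cat("Weight,Mileage\n", file="keeplist.txt")

# Pass the list file to keep as a string that begins with "@".
fuelKeep <- importData("fuel.txt", type="ASCII", keep="@keeplist.txt")

# Alternatively, name the variables to leave out directly with drop.
fuelDrop <- importData("fuel.txt", type="ASCII", drop=c("Fuel", "Type"))
```

Only one of keep and drop is given in each call, as the descriptions above require.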
colNames is specified, colNameRow is automatically set to zero, which tells S-PLUS not to look for column names in the data. If there are column names in the data, you must set startRow to the appropriate value, so the column names row is not read in as data. If the colNames argument is specified, "" is replaced by coln, where n is the column number.
bdFrame.
This argument is not supported when reading a big data object (bigdata=T).
colNames. Column names containing special characters such as "." should be surrounded by single quotes, such as filter = "'Disp.' > 300".
The function can be used to obtain the original variable names.
The logical operators that are available to use in the filter expression are: ==, !=, <, >, <=, >=, &, |, and !. Thus, to select all rows that do not have missing values in the id column, type id!=NA.
To import all rows corresponding to 10-year-old children who weigh less than 150 pounds, define your filter as filter = "Age==10 & Weight<150". Note that the entire filter expression must be within quotes. In the filter expression, the variable name must be on the left side of the logical operator; i.e., type Age>12 instead of 12<Age.
The wildcard characters ? (for single characters) and * (for strings of arbitrary length) can be used to select subgroups of character variables. For example, the logical expression account==????22 selects all rows for which the account variable is six characters long and ends in 22. The expression id==3* selects all rows for which id starts with 3, regardless of the length of the string.
You can use the built-in variable @rownum to import specific row numbers. For example, the expression @rownum<200 imports the first 199 rows of the data file.
Three functions permit random sampling within the filter expression:
samp.rand accepts a single numeric argument prop, where 0<=prop<=1. Rows are selected randomly from the data file with a probability of prop.
samp.fixed accepts two numeric arguments, sample.size and total.observations. The first row is drawn from the data file with a probability of sample.size/total.observations. The ith row is drawn with a probability of (sample.size - i)/(total.observations - i), where i=1,2,...,sample.size.
samp.syst accepts a single numeric argument n. Every nth row is selected systematically from the data file after a random start.
The sampling functions use the S-PLUS random number generator to create random samples. You can therefore use the function to produce the same data sample repeatedly.
Expressions are evaluated from left to right, so you can sample a subset of the rows in your data file by first subsetting and then sampling. For example, to import a random sample of half the rows corresponding to high school graduates, use the expression schooling>=12 & samp.rand(0.5).
Note that the filter is not evaluated by S-PLUS. Thus, expressions containing built-in S-PLUS functions such as mean are not allowed. One special exception to this deals with missing values: you can use NA to denote missing values in the logical expressions, though you cannot use NA-specific functions such as is.na and na.exclude.
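The filter rules above can be tried on a small file. The following is a minimal sketch under stated assumptions: it reuses the state.txt file name and the state.x77 data from the examples at the end of this file, and the object names first10 and lowIllit are hypothetical.

```s
# Write the built-in state.x77 matrix to a delimited ASCII file.
exportData(state.x77, "state.txt")

# Use the built-in @rownum variable to restrict the import by row number.
first10 <- importData("state.txt", type="ASCII", filter="@rownum<=10")

# Combine two conditions with &. The variable name is on the left of
# each operator and the whole expression is inside one pair of quotes.
lowIllit <- importData("state.txt", type="ASCII",
    filter="Illiteracy<1 & Murder>5")
```

Because the filter is not evaluated by S-PLUS itself, both expressions use only the operators and the @rownum variable listed above.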
type="FASCII". You must use a format string together with the "FASCII" file type if the columns in your data file are not separated by delimiters.
A valid format string includes a percent sign (%) followed by the data type for each column in the data file. Available data types are: s, which denotes a character string; f, which denotes a numeric value; and the asterisk *, which denotes a skipped column. The elements in the string are separated by commas. For example, the format string %s,%f,%*,%f imports the first column of the data file as type "character", the second and fourth columns as "numeric", and skips the third column altogether. If a variable is designated as "numeric" and the value of a cell cannot be interpreted as a number, the cell is filled in with a missing value. Incomplete rows are also filled in with missing values.
In the format string, you can also specify integers that define the width of each field. For example, the format string %4f,%6s,%3*,%6f reads the first four characters in each row as a numeric column. The next six characters in each row are read as a character string, the next three are skipped, and then six more characters are imported as another numeric column.
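A width-based format string can be tried on a small hand-built file. This is a sketch only: the file name widths.fix, the record contents, and the object name widths are hypothetical and chosen so that each record is exactly 2 + 3 + 2 + 2 characters wide.

```s
# Build three fixed-width records: a 2-character code, a 3-digit number,
# 2 filler characters to skip, and a 2-digit number.
recs <- c("AL001xx21", "AK002xx22", "AZ003xx23")
cat(paste(recs, collapse="\n"), "\n", sep="", file="widths.fix")

# %2s reads the code, %3f the first number, %2* skips the filler,
# and %2f reads the last number.
widths <- importData("widths.fix", type="FASCII", format="%2s,%3f,%2*,%2f")
```

The skipped field does not produce a column, so the result is expected to have one character (or factor) column and two numeric columns.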
\n and \t are the only multi-character delimiters allowed, and denote a newline and a tab, respectively. For any other multi-character string, only the first character is used as the delimiter.
Double quotes are reserved characters and therefore cannot be used as standard delimiters.
If a delimiter is not supplied, S-PLUS searches the file automatically for the
following (in the order given): tabs, commas, semicolons, and vertical bars. If
none of these are detected, blank spaces are treated as delimiters.
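As an illustration of supplying the delimiter explicitly rather than relying on the automatic search, the sketch below writes a small semicolon-delimited file and imports it. The file name semi.txt, its contents, and the object name semi are hypothetical.

```s
# A two-column file whose fields are separated by semicolons.
cat("id;score\n1;2.5\n2;3.5\n", file="semi.txt")

# Name the delimiter explicitly so that no automatic detection is needed.
semi <- importData("semi.txt", type="ASCII", delimiter=";")
```

Omitting delimiter here would leave the choice to the search order described above, in which semicolons are tried after tabs and commas.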
-1 means that the last column in the file is used.
scanLines must be at least this value.
-1 means that the last row in the file is used.
Can be used to specify which dataset to retrieve from a SAS Transport file. Specify the number of the dataset in the file (e.g., 1 to get the first, 2 to get the second, and so on). If you know the exact name of the dataset to retrieve, you can use the table argument. See its description for more information.
bdFrame. If you do not specify a row, S-PLUS attempts to locate column names in the first row of the file; specify colNameRow=0 to prevent S-PLUS from searching for a row of column names. In a delimited ASCII file, the column names row must come before the first data row to be read (startRow).
type="DIRECT-SQL", and you are accessing a non-default instance of SQL Server, specify server="SERVERNAME\\INSTANCE". To access the default instance, use server="SERVERNAME".
This should be left as the empty string "" if type="DB2".
type="ORACLE" and you are using Remote OS Authentication, specify password="self" and no user argument.
"" if type="ORACLE".
table cannot be specified in conjunction with the sqlQuery argument (see below).
table can be used to specify which dataset to retrieve from a SAS Transport file. To specify which dataset to import, set table to the dataset name. The dataset name must match exactly, including case. If you do not know the name, use the pageNumber argument and specify the number of the dataset in the file. If you omit the table argument, the first dataset in the file is imported.
stringsAsFactors=TRUE, strings are converted to factors when imported. The default is TRUE unless you set options(stringsAsFactors=FALSE).
sortFactorLevels=TRUE, the levels for all factors created from character strings are sorted. Otherwise, the order of the levels is not specified. In previous versions of S-PLUS, there were situations where importing with sortFactorLevels=FALSE was significantly faster, but this is no longer true.
This argument is not supported when reading a big data object (bigdata=T).
valueLabelAsNumber=TRUE returns the actual data values (either numeric or character). If valueLabelAsNumber=FALSE, the value labels are imported.
separateDelimiters=TRUE, repeated delimiters indicate columns with missing values. Otherwise, repeated delimiters are treated as one delimiter. The option separateDelimiters=FALSE is most often used to treat multiple blank spaces as one delimiter.
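The two settings can be contrasted with a short sketch. The file names gaps.txt and spaces.txt, their contents, and the object names gaps and spaces are hypothetical, and the comments describe the behavior documented above rather than verified output.

```s
# Comma-delimited file in which the second record has an empty middle field.
cat("a,b,c\n1,2,3\n4,,6\n", file="gaps.txt")

# With separateDelimiters=T, the two adjacent commas mark a missing value.
gaps <- importData("gaps.txt", type="ASCII", delimiter=",",
    separateDelimiters=T)

# Space-delimited file padded with runs of blanks.
cat("a b c\n1   2 3\n4 5   6\n", file="spaces.txt")

# With separateDelimiters=F, each run of blanks counts as one delimiter.
spaces <- importData("spaces.txt", type="ASCII", delimiter=" ",
    separateDelimiters=F)
```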
type="ODBC" and is functional only in S-PLUS for Windows. See the DETAILS section below for information on the form this string should take.
type="ODBC", and in S-PLUS for Windows type is not "ODBC").
This cannot be given in conjunction with the table argument above.
readAsTable=TRUE, the arguments separateDelimiters, startRow, and startCol are set to TRUE. This forces S-PLUS to read the entire file as a single table.
colNamesUpperCase=TRUE, column names are imported in all uppercase letters. Variable names from SAS and versions of SPSS earlier than v12 are converted to lower case letters unless this argument is set to TRUE.
options("time.in.format").
(.).
(,).
bigdata=T).
use.locale=TRUE, the default values of decimal.point and thousands.separator come from the current locale set by Sys.setlocale, and the default value of time.zone is options()$time.zone. Otherwise, the default values are as described above.
sqlReturnData=TRUE (the default), any SQL query expression is evaluated and the resulting data is returned. If sqlReturnData=FALSE, the SQL query is executed for effect only and NULL is returned. The function may also be used to execute an SQL query for effect.
Do not set sqlReturnData=TRUE for SQL statements that have side effects (e.g., INSERT statements). Note that when sqlReturnData=TRUE, the SQL may be executed twice: a small "trial" run may be done to determine the column types before the full result is extracted.
scanLines=-1 means to scan the entire file, which may take a long time for large files, but is the safest option. The problem with setting this argument to scan less than the entire file is that importData may detect the wrong column types, and read some of the data incorrectly. For example, suppose a particular column in a file only contains integers for the first thousand rows, and then contains arbitrary strings. If scanLines=-1, the column type will be detected as character or factor, and imported that way. If scanLines=100, the column type will be numeric, and importData will attempt to import all of the values in that form: elements that cannot be parsed as numbers will be read as NA.
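The effect of scanning the whole file can be sketched with a file whose second column turns non-numeric only after the default scan limit of 256 lines. The file name mixed.txt, the column names id and code, and the object name fullScan are hypothetical.

```s
# 300 numeric values followed by one non-numeric value in the second column.
vals <- c(as.character(1:300), "unknown")
cat(paste(c("id,code", paste(1:301, vals, sep=",")), collapse="\n"),
    "\n", sep="", file="mixed.txt")

# Scan every line before choosing column types, so that the late
# non-numeric value is seen when the type of "code" is decided.
fullScan <- importData("mixed.txt", type="ASCII", delimiter=",",
    scanLines=-1)
```

With the default scanLines, the same call would decide the type of the second column from the first 256 lines only, which is the situation the paragraph above warns about.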
An advantage of having this limit is that it prevents accidentally reading an arbitrary binary file as a text file, and getting garbage. It is seldom necessary to set this argument for normal text files.
"numeric", "character", "factor", and "timeDate".
"character" columns. The element values are ignored for non-character columns. The column string width of a big data character column is the maximum number of characters that can be stored in the column without truncation. This argument is only supported when reading a big data object (bigdata=T); otherwise it is ignored.
NOTE section for more detail.
TRUE, the data is read into a big data object of type bdFrame. Otherwise, it is read into a data.frame object. This argument can be used only if the bigdata library section has been loaded.
bdFrame.
importData function causes creation of the data set .Random.seed if it does not already exist; otherwise its value is updated.
If you try to import data and encounter problems, S-PLUS displays a warning describing the problems encountered.
When constructing a data.frame object (bigdata=F), variable names are run through the make.names function to ensure that appropriate S-PLUS names are created. Variable names that contain underscores (_) are converted so that they contain periods (.) instead.
The filter and sqlQuery arguments must be written using the original variable names in the data file or database. The function can be used before the data is imported to obtain the original variable names.
To access a database on a remote server, S-PLUS must establish a communication link to the server across the network. If type="ODBC" in S-PLUS for Windows, the information required to create this link is contained in the ODBC connection string. This string consists of one or more attributes that specify how a driver connects to a data source. An attribute identifies a specific piece of information that the driver needs to know before it can make the appropriate data source connection. Each driver may have a different set of attributes, but the connection string is always of the form:
DSN=dataSourceName [;SERVER=value] [;PWD=value] [;UID= value] [;
You must specify the data source name. However, all other attributes are optional. If you
do not specify a particular attribute, it defaults to the value specified in the
relevant DSN tab of the ODBC Data Source Administrator.
The S-PLUS GUI obfuscates ODBC passwords in the History Log by replacing the PWD in the odbcConnection argument with ******. To reuse the code from the History Log at the command line, replace the ****** with the appropriate password.
There is a difference between bigdata=F and bigdata=T when importing empty strings as factors. If bigdata=F, it is possible to import a factor level that is an empty string. If bigdata=T, any such factor levels are read as NA.
You can also use the executeSQL function to run SQL queries. Both executeSQL and importData are designed to execute arbitrary SQL queries: either stored procedures or simple statements that return data. They are not designed to execute arbitrary SQL "programs." If you need to work with a long succession of SQL statements, it is best to convert the statements to a SQL stored procedure and then call the procedure using either executeSQL or importData.
Tests performed on importing/exporting databases have shown the following:
32k is the maximum import/export character string length in the Memo field.
varchar columns with a maximum string length of 32,672 characters.
varchar2), 2000(nchar), 1000(nvarchar2)
varchar2), >=4000(CLOB)
char), 2000(nchar), 1000(nvarchar2)
char), >255(long)
text), 255(varchar), 255(nvarchar), 255(char), 255(nchar)
varchar), >=8000(text)
text), 7999(varchar), 3999(nvarchar), 7999(char), 3999(nchar)
varchar), >255(text)
type="STATASE".
type="MATLAB" to create a pre-MATLAB 7 version file; otherwise, specify type="MATLAB7" to export the MATLAB 7 file format.
Use the functions and to import data in sequential blocks of a specified size.
Use the function to retrieve the names of all data sets, sheets, or tables in a specified data file or database.
When importing SAS data, if valueLabelAsNumber is FALSE (the default), S-PLUS attempts to get value labels from the file specified by sasFormats or it looks in the same directory where the data file (specified by the file argument) is located to find one of these files (in this order):
Windows: formats.sas7bcat, formats.stc, formats.xpt, formats.tpt, formats.sas7bdat
Unix/Linux: formats.stc, formats.xpt, formats.tpt, formats.sas7bdat
where .sas7bcat is a SAS catalog file, .sas7bdat is a SAS data file, .stc is a SAS CPORT transport file, and .xpt or .tpt is an older style SAS Transport file (prior to SAS version 7).
S-PLUS cannot read .sas7bcat (catalog) files created on Unix/Linux, but it can read catalog files created on Windows, even when running the Unix version of S-PLUS. As a workaround, Unix users must convert their SAS catalog file to either a CPORT transport file or a SAS data file.
If sasFormats is not given and none of the default format files (listed above) can be found, then S-PLUS behaves as if valueLabelAsNumber=T, even though some SAS data variables may have been associated with user-defined formats when the data set was created.
If valueLabelAsNumber=TRUE, there is no attempt to open a SAS format file, even if sasFormats is specified.
valueLabelAsNumber also controls whether or not value labels that may exist in an SPSS data file are used when importing that data file.
If sasFormats is a file with a .loc extension, then that file must contain the names of the format files to be used. User-defined formats are read from all files listed in the .loc file. This is useful if you have data that uses formats from various format files.
# Create an ASCII file.
exportData(state.x77, "state.txt")
# Now import it.
state.new <- importData("state.txt", type="ASCII")

# Create a fixed format ASCII file.
cat(paste(paste(state.abb[1:8], 1:8, LETTERS[11:18], 21:28, sep=""),
    collapse="\n"), "\n", file="fdata.fix")
# Now import it.
fdata <- importData("fdata.fix", type="FASCII", format="%2s,%1f,%1s,%2f")

# Create a text file containing a column name with a special character
# ("Disp." in fuel.frame has a period)
exportData(fuel.frame, file = "ff.txt", rowNames = TRUE)
# Put single quotes around column names that have special characters when importing
importData(file = "ff.txt", filter = " 'Disp.' > 300 ")

# Examples for an Oracle database with S-PLUS for Solaris.
#
# The following two commands are equivalent and return all data
# from the emp table.
importData(type="oracle", user="scott", password="tiger", table="emp",
    server="ORACLEDB")
importData(type="oracle", user="scott", password="tiger", server="ORACLEDB",
    sqlQuery = "SELECT * FROM emp")

# The following example demonstrates that you must put your
# entire filter string in quotes.
# First, create an ASCII file.
exportData(state.x77, "state.txt")
# Next, import the new ASCII file.
state.new <- importData("state.txt", type="ASCII")
# Finally, import a subset of the rows from the ASCII file by specifying
# a filter. Note that the entire filter string is in quotes:
state2 <- importData("state.txt", filter = "Illiteracy == 0.6 & Murder > 5")