Import Data

DESCRIPTION:

Import data from a file or database into a data frame or a bdFrame.

USAGE:

importData(file=NULL, type="", keep=character(0), drop=character(0), 
   colNames=character(0), rowNamesCol=-1, filter=character(), 
   format=character(0), delimiter=<<see below>>, 
   startCol=1, endCol=-1, startRow=1, endRow=-1, pageNumber=-1, colNameRow=-1, 
   server="", user="", password="", database="", table="", 
   stringsAsFactors=<<see below>>, sortFactorLevels=T, 
   valueLabelAsNumber=F, centuryCutoff=1930, separateDelimiters=T,
   odbcConnection=character(0), odbcSqlQuery=character(0), 
   sqlQuery=character(0), readAsTable=F, colNamesUpperCase=F, 
   time.in.format=character(0), decimal.point=".", thousands.separator=",",
   time.zone="GMT", use.locale=F, sqlReturnData=T,
   scanLines=max(startRow, 256), maxLineWidth=0, na.string="NA",
   colTypes=character(0), colStringWidths = integer(0), sasFormats="",
   bigdata=F) 

OPTIONAL ARGUMENTS:

file
a character string specifying the name of the file to import. If the file has a known suffix (see the type argument below), S-PLUS assumes the file is of that type. This can be overridden by providing type explicitly. The file argument is not required if importing from a relational database.
type
a character string specifying the type of file to import. Possible values are listed here; the case of the character string is ignored.
"ACCESS"

Microsoft Access file. This file type is available only in S-PLUS for Windows. See the DETAILS section for information on import/export limitations. File suffix: mdb, or accdb (Access 2007 only).

"ASCII"
ASCII text. 32k is the maximum import/export character string length. File suffix: asc, csv, or txt. S-PLUS for Windows also recognizes the suffixes dat and prn.
"DBASE"
dBase or other xBase file. File suffix: dbf.
"DIRECT-DB2"
DB2 database connection. See the DETAILS section for information on import/export limitations. No file argument should be specified.
"DIRECT-ORACLE"
Oracle database connection. See the DETAILS section for information on import/export limitations. No file argument should be specified.
"DIRECT-SQL"
Microsoft SQL Server database connection. See the DETAILS section for information on import/export limitations. No file argument should be specified. This option is available only in S-PLUS for Windows.
"DIRECT-SYBASE"
Sybase database connection. No file argument should be specified.
"EPI"
Epi Info file. File suffix: rec.
"EXCEL" or "EXCELX"
Microsoft Excel worksheet file. 32k is the maximum import/export character string length. Note that "EXCELX" and the new file extension, xlsx, are for files imported from or exported to Excel 2007. File suffix: xls or xlsx.
"FASCII"
Formatted ASCII text file. File suffix: fix. S-PLUS for Windows also recognizes the suffix fsc.
"FOXPRO"
FoxPro file. This file type is available only in S-PLUS for Windows. File suffix: dbf.
"GAUSS"
Aptech Gauss data file (old format). Unix exports GAUSS files in GAUSS96 format. File suffix: dat.
"GAUSS96"
Aptech Gauss data file (new format). File suffix: dat.
"LOTUS"
Lotus 123 worksheet file. File suffix: wk* or wr*.
"MATLAB"
Matlab file. The file must contain a single, full (i.e. not sparse), matrix. File suffix: mat.
"MINITAB"
Minitab workbook. In S-PLUS for UNIX, the file must be from Minitab Version 11 or earlier. File suffix: mtw.
"ODBC"
ODBC connection. This file type is available only in S-PLUS for Windows. When you specify "ODBC" , you must also specify the argument odbcConnection. You can also specify the argument odbcSqlQuery with this type. For more information, see the ODBC arguments described below.
"ORACLE"
Oracle database connection. Same as "DIRECT-ORACLE". No file argument should be specified.
"PARADOX"
Paradox data file. This file type is available only in S-PLUS for Windows. File suffix: db.
"QUATTRO"
Quattro Pro worksheet file. File suffix: wq* or wb*.
"SAS", "SAS6WIN"
SAS Version 6.x data file from Windows or OS/2. File suffix: sd2.
"SAS1", "SAS6UX32"
SAS Version 6.x data file from HP, IBM or Sun UNIX. File suffix: ssd01.
"SAS4", "SAS6UX64"
SAS Version 6.x data file from Digital/Compaq/HP Tru64 UNIX. File suffix: ssd04.
"SAS7"
SAS Version 7 or later data file from Windows or UNIX. File suffix: sas7bdat.
"SAS7WIN"
SAS Version 7 or later data file from Windows. This is equivalent to the "SAS7" file type since the platform is autodetected. File suffix: sas7bdat.
"SAS7UX32"
SAS Version 7 or later data file from Solaris (SPARC), HP-UX, or IBM AIX. This is equivalent to the "SAS7" file type since the platform is autodetected. File suffix: sas7bdat.
"SAS7UX64"
SAS Version 7 or later data file from Digital/Compaq/HP Tru64 UNIX. This is equivalent to the "SAS7" file type since the platform is autodetected. File suffix: sas7bdat.
"SAS_CPORT"
SAS CPORT transport file. CPORT files created either on Windows or Unix systems by SAS versions 7.01-9.01 can be imported. File suffix: stc or cpt.
"SAS_TPT"
SAS transport file. File suffix: tpt or xpt.
"SIGMAPLOT"
SigmaPlot data file. This file type is available only in S-PLUS for Windows. File suffix: jnb.
"SPLUS"
S-PLUS transport file. Files in this format can also be written and read with the data.dump and data.restore functions. File suffix: sdd.
"SPSS"
SPSS data file. File suffix: sav.
"SPSSP"
SPSS portable data file. File suffix: por.
"STATA"
Stata Version 2.0 or later data file. File suffix: dta.
"SYBASE"
Same as "DIRECT-SYBASE". Sybase database connection. No file argument should be specified.
"SYSTAT"
Systat data file. File suffix: sys. S-PLUS for Windows also recognizes the suffix syd.

keep
a character vector of column names or a numeric vector of column numbers. The columns specified are imported from the data file. Only one of keep and drop can be given.

The variable names to keep can also be placed in a file and separated by spaces, commas, or newlines. In this case, you can pass the name of the file to keep as a character string beginning with an at sign "@". For example, keep="@keeplist.txt".

drop
a character vector of column names or a numeric vector of column numbers. The columns specified are NOT imported from the data file. Only one of keep and drop can be given.

The variable names to drop can also be placed in a file and separated by spaces, commas, or newlines. In this case, you can pass the name of the file to drop as a character string beginning with an at sign "@". For example, drop="@droplist.txt".
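
As a minimal sketch of these forms, the following writes a small comma-delimited file and imports it with keep and drop. The file names kids.csv and keeplist.txt and the column names are hypothetical, chosen only for illustration. The header row and delimiter are given explicitly rather than left to autodetection.

```s
# Write a small comma-delimited file (hypothetical name and columns).
cat("id,age,weight,height\n1,10,140,55\n2,11,150,57\n3,12,160,60\n",
    file="kids.csv")

# Import only the id and weight columns, selected by name.
kept <- importData("kids.csv", type="ASCII", delimiter=",",
    colNameRow=1, startRow=2, keep=c("id", "weight"))

# Import everything except the fourth column, selected by number.
dropped <- importData("kids.csv", type="ASCII", delimiter=",",
    colNameRow=1, startRow=2, drop=4)

# Put the names to keep in a file and pass the file name with "@".
cat("id weight\n", file="keeplist.txt")
kept2 <- importData("kids.csv", type="ASCII", delimiter=",",
    colNameRow=1, startRow=2, keep="@keeplist.txt")
```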

colNames
a character vector of names to use for the imported columns. When colNames is specified, colNameRow is automatically set to zero, which tells S-PLUS not to look for column names in the data. If there are column names in the data, you must set startRow to the appropriate value, so the column names row is not read in as data. Any empty string "" in colNames is replaced by coln, where n is the column number.
rowNamesCol
an integer denoting the column that should be used for row names. The specified column is dropped from the resulting data frame or bdFrame. This argument is not supported when reading a big data object (bigdata=T).
filter
a logical expression that specifies the rows to be imported from the data file. The filter must be written in terms of the original column names in the file and not in terms of the variable names specified by colNames. Column names containing special characters such as "." should be surrounded by single quotes, such as filter = "'Disp.' > 300". The function can be used to obtain the original variable names.

The logical operators that are available to use in the filter expression are: ==, !=, <, >, <=, >=, &, |, and !. Thus, to select all rows that do not have missing values in the id column, type id!=NA. To import all rows corresponding to 10-year-old children who weigh less than 150 pounds, define your filter as filter = "Age==10 & Weight<150". Note that the entire filter expression must be within quotes. In the filter expression, the variable name must be on the left side of the logical operator; i.e., type Age>12 instead of 12<Age.

The wildcard characters ? (for single characters) and * (for strings of arbitrary length) can be used to select subgroups of character variables. For example, the logical expression account==????22 selects all rows for which the account variable is six characters long and ends in 22. The expression id==3* selects all rows for which id starts with 3, regardless of the length of the string.

You can use the built-in variable @rownum to import specific row numbers. For example, the expression @rownum<200 imports the first 199 rows of the data file.

Three functions permit random sampling within the filter expression:

samp.rand accepts a single numeric argument prop, where 0<=prop<=1. Rows are selected randomly from the data file with a probability of prop.

samp.fixed accepts two numeric arguments, sample.size and total.observations. The first row is drawn from the data file with a probability of sample.size/total.observations. The ith row is drawn with a probability of (sample.size - i)/(total.observations - i), where i=1,2,...,sample.size.

samp.syst accepts a single numeric argument n. Every nth row is selected systematically from the data file after a random start.

The sampling functions use the S-PLUS random number generator to create random samples. You can therefore use the set.seed function to produce the same data sample repeatedly.

Expressions are evaluated from left to right, so you can sample a subset of the rows in your data file by first subsetting and then sampling. For example, to import a random sample of half the rows corresponding to high school graduates, use the expression schooling>=12 & samp.rand(0.5).

Note that the filter is not evaluated by S-PLUS. Thus, expressions containing built-in S-PLUS functions such as mean are not allowed. One special exception to this deals with missing values: you can use NA to denote missing values in the logical expressions, though you cannot use NA-specific functions such as is.na and na.exclude.
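
The filter forms described above can be sketched as follows. The file kids.csv and its columns are hypothetical, and the use of set.seed to make the sample repeatable is an assumption based on the statement that the sampling functions use the S-PLUS random number generator.

```s
# Write a small comma-delimited file (hypothetical name and columns).
cat("id,age,weight\nA100,10,140\nA200,10,160\nB300,12,120\nB400,10,145\n",
    file="kids.csv")

# Logical operators: 10-year-olds who weigh less than 150 pounds.
light <- importData("kids.csv", type="ASCII", delimiter=",",
    colNameRow=1, startRow=2, filter="age==10 & weight<150")

# Wildcard: rows whose id starts with A, whatever its length.
a.ids <- importData("kids.csv", type="ASCII", delimiter=",",
    colNameRow=1, startRow=2, filter="id==A*")

# Built-in row counter: restrict the import by row number.
early <- importData("kids.csv", type="ASCII", delimiter=",",
    colNameRow=1, startRow=2, filter="@rownum<3")

# Random sampling: each row is kept with probability 0.5.
# Setting the seed first (assumed here) makes the sample repeatable.
set.seed(42)
half <- importData("kids.csv", type="ASCII", delimiter=",",
    colNameRow=1, startRow=2, filter="samp.rand(0.5)")
```

Note that in every call the whole filter expression is a single quoted string and each variable name appears on the left side of its operator.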

format
a character string specifying the format to use when type="FASCII". You must use a format string together with the "FASCII" file type if the columns in your data file are not separated by delimiters.

A valid format string includes a percent % sign followed by the data type for each column in the data file. Available data types are: s, which denotes a character string; f, which denotes a numeric value; and the asterisk *, which denotes a skipped column. The elements in the string are separated by commas. For example, the format string %s,%f,%*,%f imports the first column of the data file as type "character", the second and fourth columns as "numeric", and skips the third column altogether. If a variable is designated as "numeric" and the value of a cell cannot be interpreted as a number, the cell is filled in with a missing value. Incomplete rows are also filled in with missing values.

In the format string, you can also specify integers that define the width of each field. For example, the format string %4f,%6s,%3*,%6f reads the first four characters in each row as a numeric column. The next six characters in each row are read as a character string, the next three are skipped, and then six more characters are imported as another numeric column.
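
A minimal sketch of the width form, using the format string from the paragraph above. The file name parts.fix and its contents are hypothetical; each line is exactly 4 + 6 + 3 + 6 = 19 characters wide.

```s
# Each line: 4-character number, 6-character string,
# 3 characters to skip, 6-character number.
cat("1001widgetxxx012.50\n1002gadgetyyy007.25\n", file="parts.fix")

# The skipped field (%3*) produces no column, so three columns result.
parts <- importData("parts.fix", type="FASCII",
    format="%4f,%6s,%3*,%6f")
```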

delimiter
a character string specifying the character to use as a delimiter in an ASCII input file. The expressions \n and \t are the only multi-character delimiters allowed, and denote a newline and a tab, respectively. For any other multi-character string, only the first character is used as the delimiter. Double quotes are reserved characters and therefore cannot be used as standard delimiters. If a delimiter is not supplied, S-PLUS searches the file automatically for the following (in the order given): tabs, commas, semicolons, and vertical bars. If none of these are detected, blank spaces are treated as delimiters.
startCol
an integer specifying the first column to be imported from the data file.
endCol
an integer specifying the final column to be imported from the data file. The default value of -1 means that the last column in the file is used.
startRow
an integer specifying the first row to be imported from the data file. The value of the argument scanLines must be at least this value.
endRow
an integer specifying the final row to be imported from the data file. The default value of -1 means that the last row in the file is used.
pageNumber
an integer specifying the page number of the spreadsheet. By default, the first page is used.

pageNumber can also be used to specify which dataset to retrieve from a SAS Transport file. Specify the number of the dataset in the file (e.g., 1 to get the first, 2 to get the second, and so on). If you know the exact name of the dataset to retrieve, you can use the table argument. See its description for more information.

colNameRow
an integer denoting the row that should be used for column names. The specified row is dropped from the resulting data frame or bdFrame. If you do not specify a row, S-PLUS attempts to locate column names in the first row of the file; specify colNameRow=0 to prevent S-PLUS from searching for a row of column names. In a delimited ASCII file, the column names row must come before the first data row to be read (startRow).
server
a character string specifying the database server when importing from a relational database. If type="DIRECT-SQL", and you are accessing a non-default instance of SQL Server, specify server="SERVERNAME\\INSTANCE". To access the default instance, use server="SERVERNAME".

This should be left as the empty string "" if type="DIRECT-DB2".

user
a character string specifying the user name when importing from a relational database.
password
a character string specifying the user's password for accessing the database when importing from a relational database. If type="ORACLE" and you are using Remote OS Authentication, specify password="self" and no user argument.
database
a character string specifying the database to use when importing from a relational database. This should be left as the empty string "" if type="ORACLE".
table
a character string specifying the name of the table to import from a relational database. When importing from a database, table cannot be specified in conjunction with the sqlQuery argument (see below).

table can be used to specify which dataset to retrieve from a SAS Transport file. To specify which dataset to import, set the table to the dataset name. The dataset name must match exactly, including case. If you do not know the name, use the pageNumber argument and specify the number of the dataset in the file. If you omit the table argument, the first dataset in the file is imported.

stringsAsFactors
a logical value. If stringsAsFactors=TRUE, strings are converted to factors when imported. The default is TRUE unless you set options(stringsAsFactors=FALSE).
sortFactorLevels
a logical value. If sortFactorLevels=TRUE, the levels for all factors created from character strings are sorted. Otherwise, the order of the levels is not specified. In previous versions of S-PLUS, there were situations where importing with sortFactorLevels=FALSE was significantly faster, but this is no longer true. This argument is not supported when reading a big data object (bigdata=T).
valueLabelAsNumber
a logical value. When importing SAS or SPSS variables that have value labels, valueLabelAsNumber=TRUE returns the actual data values (either numeric or character). If valueLabelAsNumber=FALSE, the value labels are imported.
centuryCutoff
a numeric value that specifies the origin for two-digit dates in ASCII text files. Dates with two digit years are assigned to the 100-year span that starts with this value. The default value of 1930 means that the date 6/15/30 is read as June 15, 1930 while 12/29/29 is read as December 29, 2029.
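
The rule can be written out as arithmetic. The helper expandYear below is hypothetical (it is not an S-PLUS function) and only mirrors the behavior described for centuryCutoff, assuming the 100-year span begins exactly at the cutoff year.

```s
# Hypothetical helper, for illustration only: map a two-digit year
# to the 100-year span that starts at centuryCutoff.
expandYear <- function(yy, centuryCutoff=1930) {
    # Century containing the cutoff (1900 for a cutoff of 1930).
    century <- centuryCutoff - centuryCutoff %% 100
    year <- century + yy
    # Years that would fall before the cutoff move to the next century.
    ifelse(year < centuryCutoff, year + 100, year)
}

expandYear(30)   # 1930, as in 6/15/30
expandYear(29)   # 2029, as in 12/29/29
```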
separateDelimiters
a logical value that specifies how repeated consecutive delimiter characters are treated when reading ASCII text files. If separateDelimiters=TRUE, repeated delimiters indicate columns with missing values. Otherwise, repeated delimiters are treated as one delimiter. The option separateDelimiters=FALSE is most often used to treat multiple blank spaces as one delimiter.
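
A minimal sketch of the two settings. The file names gaps.csv and spaced.txt and their contents are hypothetical.

```s
# Two consecutive commas in the first data row mark an empty field.
cat("a,b,c\n1,,3\n4,5,6\n", file="gaps.csv")

# With separateDelimiters=T the empty field is a missing value in column b.
gaps <- importData("gaps.csv", type="ASCII", delimiter=",",
    colNameRow=1, startRow=2, separateDelimiters=T)

# Columns padded with several blanks: treat each run of blanks as one delimiter.
cat("x   y\n1   2\n3   4\n", file="spaced.txt")
spaced <- importData("spaced.txt", type="ASCII", delimiter=" ",
    colNameRow=1, startRow=2, separateDelimiters=F)
```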
odbcConnection
a character string containing an ODBC connection string. This argument is required if type="ODBC" and is functional only in S-PLUS for Windows. See the DETAILS section below for information on the form this string should take.
odbcSqlQuery
a character string containing an optional SQL query when importing data from an ODBC connection. If no query is specified, the first table in the data source is used. This argument is functional only when type="ODBC", and only in S-PLUS for Windows.
sqlQuery
a character string specifying the SQL query to execute when importing from a database connection (other than ODBC; that is, type is not "ODBC"). This cannot be given in conjunction with the table argument above.
readAsTable
a logical value. If readAsTable=TRUE, the arguments separateDelimiters, startRow, and startCol are set to TRUE. This forces S-PLUS to read the entire file as a single table.
colNamesUpperCase
a logical flag. If colNamesUpperCase=TRUE, column names are imported in all uppercase letters. Variable names from SAS and versions of SPSS earlier than v12 are converted to lower case letters unless this argument is set to TRUE.
time.in.format
a character string specifying the format used to interpret date/time data when importing from ASCII or FASCII text files. By default, this is determined by options("time.in.format").
decimal.point
a single character specifying the decimal point character for ASCII data files. By default, this is the period (.).
thousands.separator
a single character specifying the thousands separator character for ASCII data files. By default, this is the comma (,).
time.zone
a string naming the time zone any dates in the input are assumed to be in. Currently, time zone information in the data file is ignored. This argument is not supported when reading a big data object (bigdata=T).
use.locale
a logical value. If use.locale=TRUE, the default values of decimal.point and thousands.separator come from the current locale set by Sys.setlocale, and the default value of time.zone is options()$time.zone. Otherwise, the default values are as described above.
sqlReturnData
a logical value. If sqlReturnData=TRUE (the default), any SQL query expression is evaluated and the resulting data is returned. If sqlReturnData=FALSE, the SQL query is executed for effect only and NULL is returned. The executeSQL function may also be used to execute an SQL query for effect.

Do not set sqlReturnData=TRUE for SQL statements that have side effects (e.g., INSERT statements). Note that when sqlReturnData=TRUE, the SQL may be executed twice: a small "trial" run may be done to determine the column types before the full result is extracted.

scanLines
an integer giving the number of lines that are scanned from an ASCII input file before the import to determine the column names, types, and widths. Specifying a negative value such as scanLines=-1 means to scan the entire file, which may take a long time for large files, but is the safest option.

The problem with setting this argument to scan less than the entire file is that importData may detect the wrong column types, and read some of the data incorrectly. For example, suppose a particular column in a file only contains integers for the first thousand rows, and then contains arbitrary strings. If scanLines=-1, the column type will be detected as character or factor, and imported that way. If scanLines=100, the column type will be numeric, and importData will attempt to import all of the values in that form: elements that cannot be parsed as numbers will be read as NA.
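
The situation described above can be reproduced with a small file. The file name codes.txt and its contents are hypothetical; the column holds integers in its first 300 rows and a string in the last row.

```s
# One column: 300 integers followed by one string.
cat("code\n", paste(1:300, collapse="\n"), "\nunknown\n",
    sep="", file="codes.txt")

# Scanning only 100 lines: the column is typed from integers alone,
# so the final string cannot be parsed and is read as NA.
partial <- importData("codes.txt", type="ASCII", delimiter=",",
    colNameRow=1, startRow=2, scanLines=100)

# Scanning the whole file: the string is seen before the import,
# so the column is imported as character or factor.
full <- importData("codes.txt", type="ASCII", delimiter=",",
    colNameRow=1, startRow=2, scanLines=-1)
```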

maxLineWidth
an integer giving the maximum line width expected when reading ASCII text files. If a line is read that is longer than this value, an error is signaled. The default of 0, or any number less than 32768, is treated as 32768.

An advantage of having this limit is that it prevents accidentally reading an arbitrary binary file as a text file, and getting garbage. It is seldom necessary to set this argument for normal text files.

na.string
a character string that will be read as a missing value when reading an ASCII text file. No matter what value is specified for this argument, an empty string value will always be read as a missing value.
colTypes
a character vector of column types to use for the imported columns. This can contain values from: "numeric", "character", "factor", and "timeDate".
colStringWidths
an integer vector of column string widths to use for the imported "character" columns. The element values are ignored for non-character columns. The column string width of a big data character column is the maximum number of characters that can be stored in the column without truncation. This argument is supported only when reading a big data object (bigdata=T); otherwise it is ignored.
sasFormats
specifies the SAS formats file. See the NOTE section for more detail.
bigdata
a logical value; if TRUE, the data is read into a big data object of type bdFrame. Otherwise, it is read into a data.frame object. This argument can be used only if the bigdata library section has been loaded.

VALUE:

a data frame or bdFrame.

SIDE EFFECTS:

The importData function causes creation of the data set .Random.seed if it does not already exist; otherwise its value is updated.

DETAILS:

If you try to import data and encounter problems, S-PLUS displays a warning describing the problems encountered.

When constructing a data.frame object (bigdata=F), variable names are run through the make.names function to ensure that appropriate S-PLUS names are created. Variable names that contain underscores (_) are converted so that they contain periods (.) instead. The filter and sqlQuery arguments must be written using the original variable names in the data file or database. The function can be used before the data is imported to obtain the original variable names.

To access a database on a remote server, S-PLUS must establish a communication link to the server across the network. If type="ODBC" in S-PLUS for Windows, the information required to create this link is contained in the ODBC connection string. This string consists of one or more attributes that specify how a driver connects to a data source. An attribute identifies a specific piece of information that the driver needs to know before it can make the appropriate data source connection. Each driver may have a different set of attributes, but the connection string is always of the form:

DSN=dataSourceName [;SERVER=value] [;PWD=value] [;UID=value] [;attribute=value]

You must specify the data source name. However, all other attributes are optional. If you do not specify a particular attribute, it defaults to the value specified in the relevant DSN tab of the ODBC Data Source Administrator.
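
A minimal sketch of assembling such a string. The data source name, user name, password, and table name are hypothetical placeholders, and the import call is left commented out because it requires a configured data source in S-PLUS for Windows.

```s
# Hypothetical connection details, for illustration only.
dsn <- "salesDSN"
uid <- "analyst"
pwd <- "secret"

# Build "DSN=...;UID=...;PWD=..." from its attributes.
conn <- paste("DSN=", dsn, ";UID=", uid, ";PWD=", pwd, sep="")

# With a real data source, the string is passed as odbcConnection:
# orders <- importData(type="ODBC", odbcConnection=conn,
#     odbcSqlQuery="SELECT * FROM orders")
```

Omitting UID and PWD from the string leaves those attributes at the values configured for the data source.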

The S-PLUS GUI obfuscates ODBC passwords in the History Log by replacing the PWD in the odbcConnection argument with ******. To reuse the code from the History log at the command line, replace the ****** with the appropriate password.

There is a difference between bigdata=F and bigdata=T when importing empty strings as factors. If bigdata=F, it is possible to import a factor level that is an empty string. If bigdata=T, any such factor levels are read as NA.

You can also use the executeSQL function to run SQL queries. Both executeSQL and importData are designed to execute arbitrary SQL queries: either stored procedures or simple statements that return data. They are not designed to execute arbitrary SQL "programs." If you need to work with a long succession of SQL statements, it is best to convert the statements to a SQL stored procedure and then call the procedure using either executeSQL or importData.

Tests performed on importing/exporting databases have shown the following:

Access 2000
32k is the maximum import/export character string length in the Memo field.
Access 2007
Uses the .accdb file extension. Access 2007 files cannot be read by earlier versions of Access. Note that importing Access 2007 files requires that you have ODBC drivers.
DIRECT-DB2
Exporting data to DB2 creates varchar columns with a maximum string length of 32,672 characters.
DIRECT-Oracle
Maximum character string length by column type:
import: 3999 (varchar2), 2000 (nchar), 1000 (nvarchar2)
export: <4000 (varchar2), >=4000 (CLOB)
If you have two or more columns of long strings, either with strings over 1,333 characters, S-PLUS writes empty rows to the database.
Oracle via ODBC
Same as "DIRECT-Oracle" except the maximum character string length by column type:
import: 1999 (char), 2000 (nchar), 1000 (nvarchar2)
export: <=255 (char), >255 (long)
DIRECT-SQL
Maximum character string length by column type for SQL Server 2000:
import: 4096 (text), 255 (varchar), 255 (nvarchar), 255 (char), 255 (nchar)
export: <8000 (varchar), >=8000 (text)
If you export data with strings over 255 characters, the string is truncated. The longer the string, the greater the truncation.
SQL Server 2000 using ODBC
The maximum string length varies with column type:
import: 7999 (text), 7999 (varchar), 3999 (nvarchar), 7999 (char), 3999 (nchar)
export: <=255 (varchar), >255 (text)
STATA
export: you are limited to 2,047 variables. For larger STATA datasets (up to 32,767 variables), specify type="STATASE".
MATLAB
export: specify type="MATLAB" to create a pre-MATLAB 7 version file; otherwise, specify type="MATLAB7" to export the MATLAB 7 file format.

NOTE:

Use the functions and to import data in sequential blocks of a specified size.

Use the function to retrieve the names of all data sets, sheets, or tables in a specified data file or database.

When importing SAS data, if valueLabelAsNumber is FALSE (the default), S-PLUS attempts to get value labels from the file specified by sasFormats or it looks in the same directory where the data file (specified by the file argument) is located to find one of these files (in this order):

Windows: formats.sas7bcat, formats.stc, formats.xpt, formats.tpt, formats.sas7bdat

Unix/Linux: formats.stc, formats.xpt, formats.tpt, formats.sas7bdat.

where .sas7bcat is a SAS catalog file, .sas7bdat is a SAS data file, .stc is a SAS CPORT transport file, and .xpt or .tpt is an older style SAS Transport file (prior to SAS version 7).

S-PLUS cannot read .sas7bcat (catalog) files created on Unix/Linux, but it can read catalog files created on Windows, even when running the Unix version of S-PLUS. As a workaround, Unix users must convert their SAS catalog file to either a CPORT transport file or a SAS data file.

If sasFormats is not given and none of the default format files (listed above) can be found, then S-PLUS behaves as if valueLabelAsNumber=T, even though some SAS data variables may have been associated with user-defined formats when the data set was created.

If valueLabelAsNumber=TRUE, there is no attempt to open a SAS format file, even if sasFormats is specified. valueLabelAsNumber also controls whether value labels that may exist in an SPSS data file are used when importing that data file.

If sasFormats is a file with a .loc extension, then that file must contain the names of the format files to be used. User-defined formats are read from all files listed in the .loc file. This is useful if you have data that uses formats from various format files.

EXAMPLES:

# Create an ASCII file.
exportData(state.x77, "state.txt") 
# Now import it. 
state.new <- importData("state.txt", type="ASCII") 

# Create a fixed format ASCII file. 
cat(paste(paste(state.abb[1:8], 1:8, LETTERS[11:18], 21:28, 
        sep=""), collapse="\n"), "\n", file="fdata.fix") 
# Now import it. 
fdata <- importData("fdata.fix", type="FASCII", 
        format="%2s,%1f,%1s,%2f") 

# Create a text file containing a column name with a special character
# ("Disp." in fuel.frame has a period).
exportData( fuel.frame, file = "ff.txt", rowNames = TRUE )
# Put single quotes around column names that have special characters when importing
importData( file = "ff.txt", filter = " 'Disp.' > 300 " )

# Examples for an Oracle database with S-PLUS for Solaris.
#
# The following two commands are equivalent and return all data 
# from the emp table.
importData(type="oracle", user="scott", password="tiger", 
   table="emp", server="ORACLEDB")
importData(type="oracle", user="scott", password="tiger",
   server="ORACLEDB", sqlQuery = "SELECT * FROM emp")

# The following example demonstrates that you must put your
# entire filter string in quotes.
# First, create an ASCII file.   
exportData(state.x77, "state.txt") 
# Next, import the new ASCII file.   
state.new <- importData("state.txt", type="ASCII") 
# Finally, import a subset of the rows from the ASCII file by specifying
# a filter. Note that the entire filter string is in quotes:  
state2 <- importData("state.txt", filter = "Illiteracy == 0.6 & Murder > 5")