Open an External Data File

DESCRIPTION:

Opens an external data file, either for importing its contents into S-PLUS or for exporting data from S-PLUS to the file. The openData function returns a data handle that can be used to read or write sequential blocks of data from or to the file.

USAGE:

openData(file=NULL, type="", rowsToRead=1000, openType="read",
   keep=character(), drop=character(), colNames=character(),
   rowNamesCol=-1, filter=character(), format=character(),
   delimiter="", startCol=1, endCol=-1, startRow=1, endRow=-1,
   pageNumber=-1, colNameRow=-1, server="", user="", password="",
   database="", table="", sqlQuery=character(0), rowNames=F, 
   stringsAsFactors=T, sortFactorLevels=T, valueLabelAsNumber=F, 
   centuryCutoff=1930, separateDelimiters=T, quote=T, decimal.point=".", 
   thousands.separator=",", time.format = "", time.zone="GMT", use.locale=F,
   scanLines=max(startRow, 256), maxLineWidth=0,
   na.string=<<see below>>, colTypes=character(0), sasFormats="")

OPTIONAL ARGUMENTS:

file
a character string specifying the name of the file to import or the name of the file to create upon export. If the file has a known suffix (see the type argument below), S-PLUS assumes the file is of that type. This can be overridden by providing type explicitly. The file argument is not required if importing from or exporting to a relational database.
type
a character string specifying the type of file to import or export. The case of the character string is ignored. Possible values are listed in the help files for importData and exportData.
rowsToRead
the number of rows to read with each call to readNextDataRows. If this is set to zero, the entire data set is read at once. If filter is supplied, the filter is applied before the rows are read; this means that filtered blocks (except the last) are all of size rowsToRead.
openType
a character string, "read" if the data file is to be read and "write" if the data file is to be opened for writing. Only the first character is necessary and case is ignored.
keep
a character vector of column names or a numeric vector of column numbers. The columns specified are imported from or exported to the data file. Only one of keep and drop can be given.
drop
a character vector of column names or a numeric vector of column numbers. The columns specified are NOT imported from or exported to the data file. Only one of keep and drop can be given.
colNames
a character vector of names to use for the imported columns. When colNames is specified, colNameRow is automatically set to zero, which tells S-PLUS not to look for column names in the data. If the data file does contain a row of column names, you must set startRow so that this row is not read in as data. Any empty name ("") in colNames is replaced by coln, where n is the column number.
rowNamesCol
an integer denoting the column that should be used for row names. The specified column is dropped from the resulting data frame.
filter
a logical expression that specifies the rows to be imported from or exported to the data file. The filter must be written in terms of the original column names in the file and not in terms of the variable names specified by colNames. Column names containing special characters such as "." should be surrounded by single quotes, such as filter = "'Disp.' > 300". See the importData help file for information on the form the filter expression should take.
format
a character string specifying the format to use when type="FASCII". You must use a format string together with the "FASCII" file type if the columns in your data file are not separated by delimiters. See the importData and exportData help files for information on the form this string can take.
delimiter
a character string specifying the character to use as a delimiter in an ASCII input file. The expressions \n and \t are the only multi-character delimiters allowed, and denote a newline and a tab, respectively. For any other multi-character string, only the first character is used as the delimiter. Double quotes are reserved characters and therefore cannot be used as standard delimiters.

When importing data, if a delimiter is not supplied, S-PLUS searches the file automatically for the following (in the order given): tabs, commas, semicolons, and vertical bars. If none of these are detected, blank spaces are treated as delimiters.

startCol
an integer specifying the first column to be imported from the data file.
endCol
an integer specifying the final column to be imported from the data file. The default value of -1 means that the last column in the file is used.
startRow
an integer specifying the first row to be imported from the data file. The value of the argument scanLines must be at least this value.
endRow
an integer specifying the final row to be imported from the data file. The default value of -1 means that the last row in the file is used.
pageNumber
an integer specifying the page number of the spreadsheet. By default, the first page is used.
colNameRow
an integer denoting the row that should be used for column names. The specified row is dropped from the resulting data frame. If you do not specify a row, S-PLUS attempts to locate column names in the first row of the file; specify colNameRow=0 to prevent S-PLUS from searching for a row of column names. In a delimited ASCII file, the column names row must come before the first data row to be read (startRow).
server
a character string specifying the database server when importing from or exporting to a relational database. This should be left as the empty string "" if type="DB2".
user
a character string specifying the user name when importing from or exporting to a relational database.
password
a character string specifying the user's password for accessing the database when importing from or exporting to a relational database.
database
a character string specifying the database to use when importing from or exporting to a relational database. This should be left as the empty string "" if type="ORACLE".
table
a character string specifying the name of the table in database to import from or export to. When exporting data, if a table by the specified name does not already exist, openData creates it. When working with a database, table cannot be specified in conjunction with the sqlQuery argument (see below).
sqlQuery
a character string specifying the SQL query to execute when reading from a database. This cannot be given in conjunction with the table argument above.
rowNames
a logical flag. If rowNames=TRUE, the row names are exported to the data file.
stringsAsFactors
a logical value. If stringsAsFactors=TRUE, strings are converted to factors when imported.
sortFactorLevels
a logical value. If sortFactorLevels=TRUE, the levels for all factors created from character strings are sorted. Otherwise, the levels are defined in the order that they are read in from the data file. This argument only applies if the entire data set is read at once (rowsToRead=0). See the DETAILS section below for information on how factor levels are handled when reading in blocks.
valueLabelAsNumber
a logical value. If valueLabelAsNumber=TRUE, SAS and SPSS variables with labels are imported as numbers. Otherwise, the value labels are imported.
centuryCutoff
a numeric value that specifies the origin for two-digit dates in ASCII text files. Dates with two digit years are assigned to the 100-year span that starts with this value. The default value of 1930 means that the date 6/15/30 is read as June 15, 1930 while 12/29/29 is read as December 29, 2029.
separateDelimiters
a logical value that specifies how repeated consecutive delimiter characters are treated when reading ASCII text files. If separateDelimiters=TRUE, repeated delimiters indicate columns with missing values. Otherwise, repeated delimiters are treated as one delimiter. The option separateDelimiters=FALSE is most often used to treat multiple blank spaces as one delimiter.
quote
a logical flag. If quote=TRUE, quotes are placed around character strings when exporting to ASCII text files (type="ASCII").
decimal.point
a single character specifying the decimal point character for ASCII data files. By default, this is the period (.).
thousands.separator
a single character specifying the thousands separator character for ASCII data files. By default, this is the comma (,).
time.format
a character string specifying the format used to interpret date/time data when importing from or exporting to ASCII or FASCII text files. By default, this is determined by options("time.in.format") when importing and by options("time.out.format") when exporting.
time.zone
a string naming the time zone any dates in the input are assumed to be in. Currently, time zone information in the data file is ignored.
use.locale
a logical value. If use.locale=TRUE, the default values of decimal.point and thousands.separator come from the current locale set by Sys.setlocale, and the default value of time.zone is options()$time.zone. Otherwise, the default values are as described above.
scanLines
an integer giving the number of lines that are scanned from an ASCII input file before the import is performed, in order to determine the column names, types, and widths. Specifying a negative value such as scanLines=-1 scans the entire file, which may take a long time for large files but is the safest option.

The problem with scanning fewer lines than the entire file is that importData may detect the wrong column types and read some of the data incorrectly. For example, suppose a particular column in a file contains only integers for the first thousand rows, and then contains arbitrary strings. If scanLines=-1, the column type is detected as character or factor, and the column is imported that way. If scanLines=100, the column type is detected as numeric, and importData attempts to import all of the values in that form: elements that cannot be parsed as numbers are read as NA.

maxLineWidth
an integer giving the maximum line width expected when reading ASCII text files. If a line is read that is longer than this value, an error is signaled. The default of 0, or any number less than 32768, is treated as 32768.
na.string
a character string that will be read as a missing value when reading an ASCII text file, or written to represent a missing value when writing an ASCII text file. The default value is "NA" when reading, and "" when writing.

No matter what value is specified for this argument, an empty string is always read as a missing value when reading.

colTypes
a character vector of column types to use for the imported columns. This can contain values from: "numeric", "character", "factor", and "timeDate".
sasFormats
specifies the SAS formats file. See the NOTE section for more detail.

VALUE:

a data handle object, typically passed to either the readNextDataRows or writeNextDataRows function.

DETAILS:

This function, along with readNextDataRows and writeNextDataRows, provides the capability of reading and writing arbitrarily large data sets with S-PLUS. See the readNextDataRows help file for a simple example of computing statistics from an external data set by reading it in blocks.

When reading character data with stringsAsFactors=TRUE (the default), the levels are accumulated as each block is read. The levels for the factor variables in the final data frame imported by readNextDataRows contain all values that have appeared in that variable so far. When reading in blocks, the levels are ordered by first appearance in the data set.
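
The block-wise reading described above can be sketched as a loop. This is only an illustration under stated assumptions: how readNextDataRows signals the end of the data is not documented in this help file, so the loop stops on a NULL or zero-row result, or after a block shorter than rowsToRead (which, with no filter, can only be the last block). The variable names are arbitrary.

# Create the external file, then count its rows 100 at a time.
exportData(Quinidine, "Quinidine.ssd01")
dh <- openData("Quinidine.ssd01", rowsToRead=100)
nRows <- 0
repeat {
   block <- readNextDataRows(dh)
   # Assumption: NULL or a zero-row data frame means no more data.
   if (is.null(block) || nrow(block) == 0) break
   nRows <- nRows + nrow(block)
   # A block shorter than rowsToRead must be the last one.
   if (nrow(block) < 100) break
}
closeData(dh)
nRows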

NOTE:

When importing SAS data, if valueLabelAsNumber is FALSE (the default), S-PLUS attempts to get value labels from the file specified by sasFormats. If sasFormats is not given, S-PLUS looks in the directory that contains the data file (specified by the file argument) for one of these files (in this order):

Windows: formats.sas7bcat, formats.stc, formats.xpt, formats.tpt, formats.sas7bdat

Unix/Linux: formats.stc, formats.xpt, formats.tpt, formats.sas7bdat.

where .sas7bcat is a SAS catalog file, .sas7bdat is a SAS data file, .stc is a SAS CPORT transport file, and .xpt or .tpt is an older style SAS Transport file (prior to SAS version 7).

S-PLUS cannot read .sas7bcat (catalog) files created on Unix/Linux, but it can read catalog files created on Windows, even when running the Unix version of S-PLUS. As a workaround, Unix users must convert their SAS catalog file to either a CPORT transport file or a SAS data file.

If sasFormats is not given and none of the default format files (listed above) can be found, then S-PLUS behaves as if valueLabelAsNumber=T, even though some SAS data variables may have been associated with user-defined formats when the data set was created.

If valueLabelAsNumber=TRUE, there is no attempt to open a SAS format file, even if sasFormats is specified. valueLabelAsNumber also controls whether value labels that exist in an SPSS data file are used when importing that data file.

If sasFormats is a file with a .loc extension, then that file must contain the names of the format files to be used. User-defined formats are read from all files listed in the .loc file. This is useful if you have data that uses formats from various format files.

SEE ALSO:

importData, exportData, readNextDataRows, writeNextDataRows, getDataInfo, closeData.

EXAMPLES:

# First create an external SAS data set.
exportData(Quinidine, "Quinidine.ssd01")

# Open the external data set for subsequent reads.
dh <- openData("Quinidine.ssd01", rowsToRead=100, drop="Subject")

# Get variable names and type.
getDataInfo(dh)

# Read the first 100 observations.
df100 <- readNextDataRows(dh)

# Close the external data file.
closeData(dh)
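
# Import only selected rows of a delimited ASCII file in one read.
# This is a sketch: it assumes the sample data frame fuel.frame
# (with a column named Disp.) is available; the file name
# "fuel.txt" and the object names dh2 and bigDisp are arbitrary.
exportData(fuel.frame, "fuel.txt", type="ASCII")

# rowsToRead=0 reads the whole data set with a single call to
# readNextDataRows. The filter uses the column name as it appears
# in the file, quoted because it contains a ".".
dh2 <- openData("fuel.txt", type="ASCII", rowsToRead=0,
   filter="'Disp.' > 300")
bigDisp <- readNextDataRows(dh2)
closeData(dh2)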