Input Data from a File or Connection

DESCRIPTION:

Reads data from a file or connection, or interactively from standard input. Options control how the input is read and how the data is structured in S-PLUS.

USAGE:

scan(file="", what=numeric(), n=<<see below>>, sep="", 
     multi.line=F, flush=F, append=F, skip=0, widths=NULL, 
     strip.white=<<see below>>,
     scan.as.integer=F, locale) 

OPTIONAL ARGUMENTS:

file
the file or S-PLUS connection object to be scanned. If file is missing or empty (""), data is read from standard input. In this case, scan prompts with the index for the next data item, and input can be terminated by a blank line. For more details on reading data from connections, see the help file for file.
what
a vector of mode "numeric", "character", or "complex", or a list of vectors of these modes. Objects of mode "logical" are not allowed. The scan function reads all fields in the file as data of the same mode as what. Thus, what=character() or what="" reads data as character fields. If what is missing, scan interprets all fields as numeric.

If what is a list, then each record is considered to have length(what) fields, and the mode of each field is the mode of the corresponding component in what. When widths is given as a vector of length greater than one, what must be a list the same length as widths.

n
the maximum number of items to read from the file (the number of records times the fields per record). If omitted, the function reads to the end of file, or to an empty line if reading from standard input.
sep
a single-character separator, often "\t" for tabs or "\n" for newlines. If omitted, any amount of white space (blanks, tabs, and possibly newlines) can separate fields. If the widths argument is specified, sep specifies the separator to insert into fixed-format records. By default, sep="".
multi.line
logical value. If multi.line=FALSE, all fields must appear on one line. If scan reaches the end of a line without reading all the fields, an error occurs. Thus, the number of fields on each line must be a multiple of the length of what, unless flush=TRUE. This is useful for checking that no fields have been omitted. If multi.line=TRUE, reading continues and the positions of newlines are disregarded. By default, multi.line=FALSE.
flush
logical value. If flush=TRUE, the scan function flushes to the end of the line after reading the last of the fields requested. This allows you to include comments that are not read by scan after the last field. It also prevents multiple sets of items from being placed on one line. By default, flush=FALSE.
append
logical value. If append=TRUE, the returned object includes all of the elements in the what argument, with the input data for the respective fields appended to each component. If append=FALSE, the data in what is ignored and only the modes matter. By default, append=FALSE.
skip
the number of initial lines of the file that should be skipped prior to reading. By default, skip=0 and reading begins at the top of the file.
widths
a vector of integer field widths corresponding to items in the what argument. The widths argument provides for common fixed-format input. If widths is not NULL, then as scan reads the characters in a record, it automatically inserts a sep character after reading widths[1] characters; widths[1] represents the width of the first field. The scan function then inserts another sep after widths[2] characters, and so on, allowing the record to be read as if your input was originally delimited by the sep character. The default sep used when widths is supplied is "\001" (binary 1); if your input contains this character, you should set the sep argument to a character that is not contained anywhere in the input.

One caveat: the widths vector you specify must correspond exactly to the field widths in your input. If it does not, you may get "field undecipherable" errors in seemingly odd places, or the input may be read silently but incorrectly. By default, widths=NULL. Note that if widths has a length greater than one, the what argument must be a list of the same length.

strip.white
a vector of logical values corresponding to items in the what argument. The strip.white argument allows you to strip leading and trailing white space from character fields; scan always strips numeric fields in this way. If strip.white is not NULL, it must either have length 1, in which case the single logical value applies to all fields read, or have the same length as what, in which case it specifies which fields to strip. For example, if strip.white[1]=TRUE and field 1 is character, scan strips the leading and trailing white space from field 1. If widths is specified, strip.white=TRUE by default and all fields are stripped. Otherwise, strip.white=NULL by default and no fields are stripped. If you read free-format input by leaving sep unspecified, then strip.white has no effect.
scan.as.integer
a logical value that determines how scan treats components of class "integer" in the what argument. With the default, scan.as.integer=FALSE, such components are read as double precision; with scan.as.integer=TRUE, they are read as integers. This argument exists because previous versions of S-PLUS parsed what=1 as double precision, whereas it is now parsed as an integer.
locale
character string as used in the Sys.setlocale function. If given, numbers are read according to the conventions of that locale.

VALUE:

a list or vector like the what argument if it is present, and a numeric vector if what is omitted.

DETAILS:

It is possible to read files that contain more than one mode by specifying a list as the what argument. For example, if the fields in the file myfile are alternately numeric and character, the command scan("myfile", what=list(0,"")) reads them and returns an object of mode "list" that has a numeric vector and a character vector as its two elements.

The elements of what can be anything, as long as you have numbers where you want numeric fields, character data where you want character fields, and complex numbers where you want complex fields. A NULL component in what causes the corresponding field to be skipped during input. The elements are used only to decide the kind of field, unless append=TRUE. Note that scan retains the names attribute of the list, if any. Thus, the command z <- scan("myfile", what=list(pop=0, city="")) allows you to refer to z$pop and z$city.
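
As a minimal sketch of the named-list form, the following writes a small scratch file and reads it back. The file contents and the variable f are made up for this illustration, and the sketch assumes that tempfile, cat, and unlink are available for creating and removing the scratch file.

```r
# Illustrative data only: two records, each a number followed by a word.
f <- tempfile()
cat("8400000 NewYork\n3900000 LosAngeles\n", file=f)

# The names given in what carry over to the components of the result.
z <- scan(f, what=list(pop=0, city=""))
z$pop    # numeric component
z$city   # character component

unlink(f)   # remove the scratch file
```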

Any numeric field containing the characters NA is returned as a missing value. If the field separator (the sep argument) is given and the field is empty, the returned value is NA for a numeric or complex field and "" for a character field.
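
The empty-field rule can be illustrated with a short sketch. The comma-separated data and the component names id, txt, and val are hypothetical, chosen only to show one empty numeric field and one empty character field.

```r
# Illustrative data only: record 2 has an empty character field,
# record 3 has an empty numeric field.
f <- tempfile()
cat("1,a,10\n2,,20\n,c,30\n", file=f)

z <- scan(f, what=list(id=0, txt="", val=0), sep=",")
is.na(z$id)   # flags the empty numeric field in record 3
z$txt == ""   # flags the empty character field in record 2

unlink(f)   # remove the scratch file
```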

The main use of separators is to allow white space inside character fields. For example, suppose that in the file read by the command above, each numeric field is followed by a tab, with text filling out the rest of the line. The command z <- scan("myfile", what=list(pop=0, city=""), sep="\t") allows blanks in the city name. With no separator, arbitrary white space can be included by quoting the whole string. With a separator, quotes are not used; if the separator character is to be included in a string, it must be escaped by a preceding backslash.
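
A minimal sketch of the tab-separated case follows. The scratch file and its two records are invented for the illustration; only the scan call itself comes from the text above.

```r
# Illustrative data only: a number, a tab, then a city name containing a blank.
f <- tempfile()
cat("8400000\tNew York\n870000\tSan Francisco\n", file=f)

# With sep="\t", blanks inside the second field are kept as part of the field.
z <- scan(f, what=list(pop=0, city=""), sep="\t")
z$city

unlink(f)   # remove the scratch file
```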

Fields of mode "logical" cannot be read directly. Instead, read them as character fields and convert them by using expressions such as x=="T". Any field that cannot be interpreted according to the mode(s) supplied to scan causes an error.
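
The conversion described above might look like the following sketch, which assumes the logical values are stored in the file as the single letters T and F. The file contents are hypothetical.

```r
# Illustrative data only: logical values written as the letters T and F.
f <- tempfile()
cat("T F T T\n", file=f)

x <- scan(f, what="")   # read the fields as character data
flags <- x == "T"       # convert to a logical vector
flags

unlink(f)   # remove the scratch file
```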

The scan function employs C scan formats to read numeric data, rather than using the S-PLUS parser (the parse function). Exponential notation must use "e"; numbers that use "d" or other letters will be read incorrectly. You will need to change your data from the "d" notation to the "e" notation with, for instance, the sed utility in UNIX.

As it reads more and more records, scan allocates more space to accommodate the growing vectors. If you supply a what argument that is identical in size to the result you expect, S-PLUS uses that space and does not have to perform memory allocations. This may produce significant memory savings when dealing with large files of data.

The make.fields function preprocesses files that have fixed-format fields and places separators after each field. It can be used as a separate step instead of using the widths argument with scan. The advantage of using widths is that you do not need to create any temporary files.

The read.table function reads data from a file and returns a data frame. It is often a better choice than scan if the data are in a regular table format with rows of equal length. The count.fields function returns the number of fields in each line of a file, which is useful for determining if read.table is appropriate. The count.fields function is also helpful when using scan to return a list, because it lets you check that the number of fields in each line is a multiple of the length of what. The readline function also accepts data interactively.
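
One way to use count.fields as a check before calling scan is sketched below. The file contents and the variable names template, nf, and ok are hypothetical; the sketch assumes count.fields is called with its default white-space separator.

```r
# Illustrative data only: the third line is missing a field.
f <- tempfile()
cat("a 1 2\nb 3 4\nc 5\n", file=f)

template <- list(name="", 0, 0)   # the what argument intended for scan
nf <- count.fields(f)             # number of fields on each line
nf

# Every line should hold a multiple of length(template) fields.
ok <- all(nf %% length(template) == 0)
ok   # false for this file, so the file should be fixed before calling scan

unlink(f)   # remove the scratch file
```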

EXAMPLES:

# Read numeric values from standard input.
num <- scan() 
# Read a label and two numeric fields to make a matrix. 
z <- scan("myfile", list(name="", 0, 0)) 
mat <- cbind(z[[2]], z[[3]]) 
dimnames(mat) <- list(z$name, c("X","Y")) 
# Like the previous example, but read the numeric columns as integers.
z <- scan("myfile", list(name="", 0, 0), scan.as.integer = T) 
# Read in a vector of character data. 
personnel <- scan("person", what="") 
# Create a list with two NULL components, a character component, 
# and a numeric component. Fields are separated by tabs. 
ff <- scan("myfile", what=list(NULL, name="", data=0, NULL),  
              multi.line=T, sep="\t") 
# Delete NULL components from ff.
ff <- ff[sapply(ff, length) > 0] 
# Save in single precision, skip the first five lines of the file.
scan("myfile", single(0), skip=5) 
# Example of reading a fixed format file using the widths and
# strip.white arguments. Blanks are read as NA for numeric fields. 
# Assignment can be suppressed for a field using NULL in the what argument. 
# For this example, the file 'dfile' contains the following lines: 
# 01giraffe.9346H01-04 
# 88donkey .1220M00-15 
# 77ant         L04-04 
# 20gerbil .1220L01-12 
# 22swallow.2333L01-03 
# 12lemming     L01-23 
mydf.what <- list(code=0, name="", x=0, s="", n1=0, NULL, n2=0) 
mydf.widths <- c(2, 7, 5, 1, 2, 1, 2) 
# strip.white defaults to TRUE if widths is specified. 
# You can also use strip.white = c(F, T, F, F, F, F, F). 
mydf <- scan("dfile", what=mydf.what, widths=mydf.widths) 
mydf 
# This produces the following output: 
# $code: 
# [1]  1 88 77 20 22 12 
# $name: 
# [1] "giraffe" "donkey"  "ant"     "gerbil"  "swallow" "lemming" 
# $x: 
# [1] 0.9346 0.1220     NA 0.1220 0.2333     NA 
# $s: 
# [1] "H" "M" "L" "L" "L" "L" 
# $n1: 
# [1] 1 0 4 1 1 1 
# [[6]]: 
# NULL 
# $n2: 
# [1]  4 15  4 12  3 23 
# Now with strip.white=F: 
mydf <- scan("dfile", what=mydf.what, widths=mydf.widths, strip.white=F) 
mydf$name 
# The resulting list is like the one above, except that the 
# character fields are not stripped. The name component is: 
# [1] "giraffe" "donkey " "ant    " "gerbil " "swallow" "lemming"