Filter Rows
A character string containing a logical expression for selecting the rows to be included in the return value. Available logical operators are: ==, !=, <, >, <=, >=, &, | and !. For example: `"Age == 10 & Weight < 150"'.
The wildcard characters ? (for single characters) and * (for strings of arbitrary length) can be used to select subgroups of character string variables. For example: `"account == ????22"'.
Use NA to denote missing values, for example, "Id != NA" (this is a special syntax for the importData and exportData functions only).
Use the built-in variable `@rownum' to specify specific row numbers to select. For example: `"@rownum < 200"' gives the first 199 rows.
The filter expression should have the variable on the left. For example, `age > 12' rather than `12 < age'.
Three functions are available within the filter expression to permit sampling, as follows:
samp.rand
allows random sampling. Takes a single argument, prop. Each case is selected with probability equal to prop.
samp.fixed
selects a random sample of fixed size. Takes two arguments, sample.size and total.observations. The first case is drawn with a probability of sample.size/total.observations, and the succeeding ith case is drawn with a probability of (sample.size - hits)/(total.observations - i), where hits is the number of cases already selected.
samp.syst
performs a systematic sample of every nth case after a random start. Takes a single argument, n.
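The selection rule described for samp.fixed can be sketched outside S-PLUS to show why it always yields exactly sample.size cases. The following is a minimal Python illustration, not the implementation used by the filter; the function name samp_fixed_flags and the rng parameter are hypothetical, and it assumes sample.size is no larger than total.observations.

```python
import random

def samp_fixed_flags(sample_size, total_observations, rng=None):
    """Return one True/False selection flag per case, in file order.

    Hypothetical helper: mirrors the probabilities given in the text,
    it is not the code used by the filter itself.
    """
    rng = rng or random.Random()
    hits = 0          # cases selected so far
    flags = []
    for i in range(total_observations):
        # i cases have been examined so far; for i == 0 this is
        # sample_size / total_observations, as stated for the first case.
        p = (sample_size - hits) / (total_observations - i)
        selected = rng.random() < p
        hits += selected
        flags.append(selected)
    return flags

flags = samp_fixed_flags(5, 20, random.Random(0))
n_selected = sum(flags)
```

Once hits reaches sample_size the probability drops to 0, and when the remaining cases are all needed it rises to 1, so the count of selected cases is fixed even though which cases are chosen is random.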
Expressions are evaluated from left to right, so you can sample from a subset of your cases by subsetting first, then sampling. For example, to take a random sample of half of high school graduates, use `schooling >= 12 & samp.rand(.5)'.
Note that the filter string is not evaluated by S-PLUS. This means that `filter="Age > mean(Age)"' is not allowed. Also note that the filter must be written in terms of the original variable names in the data set, not in terms of the variable names specified in colNames. The getDataInfo function can be used to get the original variable names.
Case Selection
You can select cases by entering a case-selection statement in the Filter Information box in the Filter dialog. The case-selection, or "where", statement has the following form:
where variable expression relational operator condition
You can specify a single variable or an expression involving several variables. All of the usual arithmetic operators ( + - / * () ) are available for use in variable expressions.
The following relational operators are available:
Operator | Meaning
= | equals
!= | not equal
< | less than
> | greater than
<= | less than or equal
>= | greater than or equal
& | and
| | or
! | not
Examples of selection conditions given by "where" expressions are:
where sex = 1 & age < 50
where (income + benefits) / famsize < 4500
where income1 >= 20000 | income2 >= 20000
where income1 >= 20000 & income2 >= 20000
where dept = "auto loan"
Note that strings used in case-selection expressions need not be enclosed in quotes unless they contain embedded blanks.
Wildcards * or ? are available to select subgroups of string variables. For example:
where account = ????22
where id = 3*
The first statement selects any accounts that have 2s as the 5th and 6th characters in the string, while the second statement selects strings of any length that begin with 3.
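To get a feel for which values such patterns pick out, the wildcard behavior can be approximated with Python's standard fnmatch module, where ? also matches one character and * matches any run of characters. This is only an analogy: the sample values are made up, and details such as case sensitivity or how strings longer than six characters are treated by ????22 are not stated here and may differ in the filter.

```python
from fnmatch import fnmatchcase

# Made-up sample values for illustration only.
accounts = ["100022", "100023", "9922", "ABCD22"]
ids = ["3", "31", "3000", "13"]

# Roughly: where account = ????22
matching_accounts = [a for a in accounts if fnmatchcase(a, "????22")]

# Roughly: where id = 3*
matching_ids = [i for i in ids if fnmatchcase(i, "3*")]

print(matching_accounts)
print(matching_ids)
```

In this approximation, ????22 requires exactly four characters before the trailing 22, so the four-character value 9922 is not matched.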
The comma operator is used to list different values of the same variable name that are used as selection criteria. It allows you to bypass lengthy OR expressions when giving lists of conditional values, for example:
where state = CA,WA,OR,AZ,NV
where caseid != 22*,30??,4?00
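One way to read the comma operator is as shorthand for an OR over the listed values or patterns. The Python sketch below illustrates that reading using the standard fnmatch module; the helper name matches_any is hypothetical, and treating != with a list as "matches none of the patterns" is an assumption rather than documented behavior.

```python
from fnmatch import fnmatchcase

def matches_any(value, pattern_list):
    """True if value matches at least one comma-separated wildcard pattern."""
    return any(fnmatchcase(value, p.strip()) for p in pattern_list.split(","))

# Roughly: where state = CA,WA,OR,AZ,NV
keep_state = matches_any("WA", "CA,WA,OR,AZ,NV")

# Roughly: where caseid != 22*,30??,4?00
# Assumption: a case is kept only if it matches none of the patterns.
keep_case = not matches_any("3015", "22*,30??,4?00")
```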
You can test whether a variable is missing by comparing it to the special internal variable _missing. For example:
where income != _missing & age != _missing
Three functions are available for sampling.
The first, samp_rand(prop) allows for simple random sampling. Each case is selected with a probability equal to prop.
The second, samp_fixed(sample_size,total_observations), selects a random sample of fixed size. The first case is drawn with a probability of sample_size/total_observations, and the succeeding ith case is drawn with a probability of (sample_size - hits) / (total_observations - i), where hits is the number of cases already selected.
Finally, a third function, samp_syst(n), performs a systematic sample of every nth case after a random start.
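As an illustration of systematic sampling, the Python sketch below picks a random starting case and then every nth case after it. It is not the filter's own code: the name samp_syst_flags and the rng parameter are hypothetical, and drawing the start uniformly from the first n cases is an assumption, since the text only says the start is random.

```python
import random

def samp_syst_flags(n, total_observations, rng=None):
    """Return one True/False flag per case: every nth case after a random start."""
    rng = rng or random.Random()
    # Assumption: the start is drawn uniformly from the first n cases.
    start = rng.randrange(n)
    return [i % n == start for i in range(total_observations)]

flags = samp_syst_flags(4, 20, random.Random(0))
picked = [i for i, selected in enumerate(flags) if selected]
```

Here picked holds the zero-based positions of the selected cases, which are always exactly n apart.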
Expressions are evaluated from left to right, so you can sample from a subset of your cases by subsetting them first and then sampling. For instance, to take a random half of high school graduates, use:
where schooling >= 12 & samp_rand(.5)
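The effect of subsetting first and sampling second can be mimicked in Python, where the and operator also evaluates its left operand first and skips the right one when the left is false. This is a sketch of the idea only: the case records and the helper name select_cases are made up, and the filter's internal evaluation may differ in detail.

```python
import random

def select_cases(cases, prop, rng):
    """Keep each case with schooling >= 12 with probability prop."""
    selected = []
    for case in cases:
        # The random draw is only made for cases that pass the subset test.
        if case["schooling"] >= 12 and rng.random() < prop:
            selected.append(case)
    return selected

# Made-up data: schooling cycles through 8..16 years.
cases = [{"id": i, "schooling": 8 + i % 9} for i in range(1000)]
graduates = [c for c in cases if c["schooling"] >= 12]
sample = select_cases(cases, 0.5, random.Random(1))
```

Every case in sample is a graduate, and on average about half of the graduates are selected.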