An Introduction to the S Language
Patrick Burns -- August 2002, revised March 2010
What is S?
S started as a research project at Bell Labs a few decades ago.
It is a language that was developed for data analysis, statistical modeling,
simulation and graphics.
However, it is a general purpose language with some powerful features -- it
could (and does) have uses far removed from data analysis.
In particular, it should be used for many of the tasks that spreadsheets
are currently used for.
If a task is non-trivial to do in a spreadsheet, then it can almost always
be done more productively (and safely) in the S language.
Spreadsheet Addiction
talks about problems with spreadsheets and how the S language is often
a better tool.
Flavors of S
There are (at least) three ways to get S:
S+ -- a commercial version sold by
TIBCO
R -- a free, open source version from
the R Project
One of the supported versions of R.
This can be a suitable option if you are in an enterprise that forbids
or severely restricts unsupported software.
See the list on the
links page.
A brief comparison is given later in this document.
Why the S Language?
S is not just a statistics package; it's a language.
S is designed to operate the way that problems are thought about.
S is both flexible and powerful.
The Importance of Being a Language
Though the distinction between a package and a language is subtle,
that subtle difference has a massive impact.
With a package you can perform some set number of tasks -- often
with some options that can be varied.
A language allows you to specify the performance of new tasks.
Your retort may be, "But I won't want to create
a new form of regression."
Yes, S does allow you to create new forms of regression
(and many people have), but S also allows you to easily
perform the same sort of standard regression on your 5 datasets
(or maybe it is 500 datasets).
The key is abstraction. You easily see that your 5 regressions
are really the same -- there is merely different data
involved with each.
In your mind you have abstracted the specific tasks
so that they all look similar.
Once you've seen the abstraction, it is simple to teach S the abstraction.
Languages are all about abstraction.
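As a minimal sketch of teaching S this abstraction (written here for R), the code below fits the same regression to each of five datasets. The datasets are simulated stand-ins, and the names datasets, fit.one, fits and coef.table are made up for this illustration; lm is the standard linear regression function.

```r
# Simulated stand-ins for "your 5 datasets"; real data would be read in instead.
set.seed(1)
datasets <- lapply(1:5, function(i) {
  x <- rnorm(30)
  y <- 2 + 3 * x + rnorm(30)
  data.frame(x = x, y = y)
})

# The abstraction: one function that fits the same regression to any dataset.
fit.one <- function(dat) lm(y ~ x, data = dat)

# State the abstraction once, then apply it to every dataset.
fits <- lapply(datasets, fit.one)

# Collect the results: one column of coefficients per dataset.
coef.table <- sapply(fits, coef)
print(coef.table)
```

Nothing in fit.one or the lapply call depends on there being 5 datasets, so the same code would serve for 500.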
The Way We Think
One of the goals of S, and one that I think has largely been achieved,
is that the language should mirror the way
that people think. A simple example: suppose we think that weight is
a function of (dependent on) height and girth.
The S formula to express this is:
weight ~ height + girth
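A rough sketch of how such a formula might be used in R follows. The data frame people is simulated purely for illustration, and the coefficients used to generate it are arbitrary assumptions; lm and all.vars are standard R functions.

```r
# Simulated data; the numbers used to generate it are arbitrary.
set.seed(42)
people <- data.frame(height = rnorm(50, mean = 170, sd = 10),
                     girth  = rnorm(50, mean = 90, sd = 8))
people$weight <- -60 + 0.5 * people$height + 0.6 * people$girth +
  rnorm(50, sd = 2)

# The formula is an ordinary object that can be stored and inspected.
model <- weight ~ height + girth
all.vars(model)   # the variable names mentioned in the formula

# lm() reads the formula as "weight depends on height and girth".
fit <- lm(model, data = people)
coef(fit)
```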
Another feature of S is that it is vector-oriented: objects
are generally treated as a whole, as humans tend to
think of the situation, rather than as a collection of individual numbers.
Suppose that we want to change the heights from inches to centimeters.
In S the command could be:
height.cm <- 2.54 * height.inches
Here height.inches is an object that contains some number -- one or
millions -- of heights.
S hides from the user that this is a series of multiplications,
and instead acts the way we think: multiply whatever
is in inches by 2.54 to get centimeters.
Experience with C or Fortran can ironically make it harder to
use S efficiently.
The C-before-S gang tend to translate the problem into
"programming" rather than thinking about the problem in the
"natural" way.
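To make the contrast concrete, here is a small sketch in R with three made-up heights. The first version treats the heights as a whole; the second is the element-by-element translation that a C or Fortran habit tends to produce. The name height.cm.loop is introduced only for this illustration.

```r
height.inches <- c(60, 65.5, 72)   # made-up values

# The natural way: one statement for the whole object.
height.cm <- 2.54 * height.inches
height.cm   # 152.40 166.37 182.88

# The "programming" way: an explicit loop over individual numbers.
height.cm.loop <- numeric(length(height.inches))
for (i in seq_along(height.inches)) {
  height.cm.loop[i] <- 2.54 * height.inches[i]
}

# Both give the same answer; the first is shorter and typically faster.
all.equal(height.cm, height.cm.loop)
```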
A Moveable Feast
Flexibility and power abound in S.
For instance, it is easy to call C and Fortran functionality from S.
S does not insist that everything is done in its language,
so you can mix tools -- picking the best tool for each particular task.
The pieces of code that are written in the S language are always
available to the user, so a minor change to the task usually requires
only a minor change to the code -- a change that can be
carried out in a minor amount of time.
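As a small illustration in R, the sketch below prints the definition of the standard function sd and then makes a copy with one default changed. The name my.sd is hypothetical, and changing a default through formals is only one of several ways to make such a modification.

```r
# Typing a function's name prints its definition when it is written in the
# language itself (the exact text shown depends on the R version).
print(sd)

# A minor change to the task: ignore missing values by default.
# 'my.sd' is a copy of sd with the default for na.rm changed.
my.sd <- sd
formals(my.sd)$na.rm <- TRUE

x <- c(1, 2, NA, 4)
sd(x)      # NA, because of the missing value
my.sd(x)   # standard deviation of the three non-missing values
```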
The Preferred Medium
Given its qualities, the S language has become the preferred
computing environment for a large part of the statistical community.
When a new statistical method is invented, chances are it will
be implemented first in the S language.
In March 1999
John Chambers --
one of the originators of S at Bell Labs -- was presented with the
ACM Software System Award.
The citation stated, "S has forever altered the way people analyze, visualize,
and manipulate data."
Previous winners of this award include Unix, TeX and the World-Wide Web.
Differences between S+ and R
Obviously one difference is that S+ costs money and R is free.
For many people this will be the deciding factor, sight unseen.
Issues for deciding between a commercial product and a free one
include quality of the product, breadth of the product,
documentation and support.
How much breadth is suitable is an individual matter, but there is a lot of breadth available.
You can explore CRAN -- the set of mirrors around the world
that holds R and over a thousand packages -- for solutions to your needs.
CRAN also contains "Task Views" that state what R packages pertain to a particular task.
See the Insightful website plus
StatLib
for code that runs in S+.
There is a wide variety of documents for R and/or S+.
Several books are available, and a selection of instructive items is
available on the web.
While some computations are faster in S+, R is in general faster.
R is at times dramatically faster.
In my personal experience, I have found R to be strikingly bug-free.
During more than six years of intensive use,
I have found only a couple of esoteric
bugs (one of which is clearly not down to R), and a few other
minor bugs which have been fixed.
This is from a person who attracts bugs like a light on a summer night.
The high quality of R is not just my opinion.
A mail message from Roger Bos on the R development
mailing list in August 2005 says:
This thread should provide credit to the R Core Team for bringing R
to such a level of perfection that these are [the] types of bug reports
submitted nowadays.
Microsoft is still playing constant catch up with major security fixes and
R is debating about 'of' or 'on' in the documentation. In all seriousness,
this does show the level of quality of the product.
In terms of support, both are well equipped with user mailing lists.
S+ has S-news, which you can sign up to at
S-news.
R has R-help, which you can sign up to through the R Project.
Both lists have archives that can be searched.
S+ has a team of support people,
while R has a core team of developers plus a large
community of users who are able and willing to dig through the source code.
In summary, support is roughly equivalent.
However, of the people I have heard expressing a preference in
terms of support, the majority think R has the better support system.
Additionally there is a mailing list for those interested both
in finance and in R.
You can sign up to it via:
https://www.stat.math.ethz.ch/mailman/listinfo/r-sig-finance
R mailing lists for a large number of other special interest groups also exist.
A lot of code will run under both S+ and R; however, there
are differences.
Most of the differences are in user interaction -- saving your data
and so on.
These differences come up often, but they are trivial.
However, there are a few differences that run deeper.
The most glaring difference is in scoping (the search path for objects).
The R FAQ at the R Project contains a fairly detailed
listing of the differences in the two languages.
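The scoping difference can be sketched with a small function along the lines of the example given in the R FAQ. In R, a function defined inside another function can see the variables of the function that encloses it. As I understand the FAQ, S+ would instead look for n among the global objects, so the same code would fail there unless a global n happened to exist.

```r
cube <- function(n) {
  # 'n' is not an argument of sq; R finds it in the enclosing function cube.
  sq <- function() n * n
  n * sq()
}

cube(2)   # 8 in R
```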
Some Links
An introduction to using R is
Some hints for the R beginner.
The piece
R Relative to Statistical Packages
discusses R in relation to Stata, SAS and SPSS.
The links page includes a number
of links regarding R.
Direct access to this article is
http://www.burns-stat.com/pages/Tutor/slanguage.html