An Introduction to the S Language


Patrick Burns -- August 2002, revised March 2010


What is S?

S started as a research project at Bell Labs a few decades ago. It is a language that was developed for data analysis, statistical modeling, simulation and graphics. However, it is a general purpose language with some powerful features -- it could (and does) have uses far removed from data analysis.

In particular, it should be used for many of the tasks that spreadsheets are currently used for. If a task is non-trivial to do in a spreadsheet, then almost always it would more productively (and safely) be done in the S language. Spreadsheet Addiction talks about problems with spreadsheets and how the S language is often a better tool.

Flavors of S

There are (at least) three ways to get S:

• S+ -- a commercial version sold by TIBCO

• R -- a free, open source version from the R Project

• One of the supported versions of R.
This can be a suitable option if you are in an enterprise that forbids or severely restricts unsupported software. See the list on the links page.

A brief comparison is given later in this document.

Why the S Language?

• S is not just a statistics package, it's a language.

• S is designed to operate the way that problems are thought about.

• S is both flexible and powerful.

The Importance of Being a Language

Though the distinction between a package and a language is subtle, that subtle difference has a massive impact. With a package you can perform some set number of tasks -- often with some options that can be varied. A language allows you to specify the performance of new tasks.

Your retort may be, "But I won't want to create a new form of regression." Yes, S does allow you to create new forms of regression (and many people have), but S also allows you to easily perform the same sort of standard regression on your 5 datasets (or maybe it is 500 datasets).

The key is abstraction. You easily see that your 5 regressions are really the same -- there is merely different data involved with each. In your mind you have abstracted the specific tasks so that they all look similar. Once you've seen the abstraction, it is simple to teach S the abstraction. Languages are all about abstraction.
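
As a rough sketch of what teaching S the abstraction can look like in R: the regression is written once as a function and then applied to every dataset. The data here are simulated, and the names make.data, fit.one, datasets and fits are invented for this illustration.

```r
# Simulated stand-ins for "your 5 datasets": each is a data frame
# with a response y and a predictor x (hypothetical data).
set.seed(1)
make.data <- function(n) {
  x <- rnorm(n)
  y <- 2 + 3 * x + rnorm(n)
  data.frame(x = x, y = y)
}
datasets <- lapply(rep(50, 5), make.data)

# The abstraction: one function that fits the same regression
# to whatever data it is handed.
fit.one <- function(dat) lm(y ~ x, data = dat)

# Apply the same task to every dataset.
fits <- lapply(datasets, fit.one)

# Collect the coefficients: one column per dataset.
coefs <- sapply(fits, coef)
coefs
```

Nothing in the last three statements depends on there being 5 datasets; a list of 500 would be handled by the same code.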

The Way We Think

One of the goals of S, and one that I think has largely been successful, is that the language should mirror the way that people think. A simple example: suppose we think that weight is a function of (dependent on) height and girth. The S formula to express this is:

weight ~ height + girth
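
As a small illustration, a formula like this can be handed directly to a modeling function. The sketch below uses R's lm function for linear regression; the data frame people and its values are invented for the example.

```r
# Hypothetical measurements, invented for illustration.
people <- data.frame(
  height = c(150, 160, 165, 172, 180, 185),
  girth  = c(70, 78, 82, 85, 95, 99),
  weight = c(50, 58, 63, 69, 82, 88)
)

# The formula reads as "weight is modeled by height and girth".
fit <- lm(weight ~ height + girth, data = people)

# An intercept plus one coefficient per explanatory variable.
coef(fit)
```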

Another feature of S is that it is vector-oriented -- meaning that objects are generally treated as a whole -- as humans tend to think of the situation -- rather than as a collection of individual numbers. Suppose that we want to change the heights from inches to centimeters. In S the command could be:

height.cm <- 2.54 * height.inches

Here height.inches is an object that contains some number -- one or millions -- of heights. S hides from the user that this is a series of multiplications, but acts more like we think -- whatever is in inches multiply by 2.54 to get centimeters.

Experience with C or Fortran can ironically make it harder to use S efficiently. The C-before-S gang tend to translate the problem into "programming" rather than thinking about the problem in the "natural" way.
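
To make the contrast concrete, here is a sketch in R of the height conversion written both ways. The values in height.inches are made up, and height.cm.loop is a name used only for this comparison.

```r
# A few hypothetical heights in inches.
height.inches <- c(60, 65.5, 70, 72)

# The "programming" translation: visit each element in turn.
height.cm.loop <- numeric(length(height.inches))
for (i in seq_along(height.inches)) {
  height.cm.loop[i] <- 2.54 * height.inches[i]
}

# The "natural" way: treat the object as a whole.
height.cm <- 2.54 * height.inches

# Both approaches give the same values.
all.equal(height.cm, height.cm.loop)
```

The loop is not wrong, but it is longer, harder to read, and for large objects typically slower than the vectorized statement.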

A Moveable Feast

Flexibility and power abound in S. For instance, it is easy to call C and Fortran functionality from S. S does not insist that everything is done in its language, so you can mix tools -- picking the best tool for each particular task.

The pieces of code that are written in the S language are always available to the user, so a minor change to the task usually requires only a minor change to the code -- a change that can be carried out in a minor amount of time.
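
A minimal sketch of this in R: typing the name of a function written in the S language prints its definition, and a copy can be altered to suit a slightly different task. The name my.sd is hypothetical; the example changes the default of the na.rm argument of the standard sd function so that missing values are dropped.

```r
# Typing a function's name without parentheses prints its definition.
sd

# Make a copy and change one default: drop missing values unless told otherwise.
my.sd <- sd
formals(my.sd)$na.rm <- TRUE

# The original returns NA when a value is missing; the variant does not.
sd(c(1, 2, 3, NA))
my.sd(c(1, 2, 3, NA))
```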

The Preferred Medium

Given its qualities, the S language has become the preferred computing environment for a large part of the statistical community. When a new statistical method is invented, chances are it will be implemented first in the S language.

In March 1999 John Chambers -- one of the originators of S at Bell Labs -- was presented with the ACM Software System Award. The citation stated, "S has forever altered the way people analyze, visualize, and manipulate data." Previous winners of this award include Unix, TeX and the World-Wide Web.

Differences between S+ and R

Obviously one difference is that S+ costs money and R is free. For many people this will be the deciding factor, sight unseen.

Issues for deciding between a commercial product and a free one include quality of the product, breadth of the product, documentation and support.

Suitable breadth is an individual matter, but there is a lot of breadth. You can explore CRAN -- the set of mirrors around the world that holds R and over a thousand packages -- for solutions to your needs. CRAN also contains "Task Views" that state what R packages pertain to a particular task. See the Insightful website plus StatLib for code that runs in S+.

There is a wide variety of documents for R and/or S+. Several books are available, and a selection of instructive items is available on the web.

While some computations are faster in S+, R is in general faster. R is at times dramatically faster.

In my personal experience, I have found R to be strikingly bug-free. During more than six years of intensive use, I have found only a couple of esoteric bugs (one of which is clearly not down to R), and a few other minor bugs which have been fixed. This is from a person who attracts bugs like a light on a summer night.

The high quality of R is not just my opinion. A mail message from Roger Bos on the R development mailing list in August 2005 says: “This thread should provide credit to the R Core Team for bringing R to such a level of perfection that these are [the] types of bug reports submitted nowadays. Microsoft is still playing constant catch up with major security fixes and R is debating about 'of' or 'on' in the documentation. In all seriousness, this does show the level of quality of the product.”

In terms of support, both are well equipped with user mailing lists. S+ has S-news, which you can sign up to at S-news; R has R-help, which you can sign up to through the R Project. Both lists have archives that can be searched. S+ has a team of support people, while R has a core team of developers plus a large community of users who are able and willing to dig through the source code. In summary, support is roughly equivalent. However, of the people I have heard expressing a preference in terms of support, the majority think R has the better support system.

Additionally there is a mailing list for those interested both in finance and in R. You can sign up to it via:
https://www.stat.math.ethz.ch/mailman/listinfo/r-sig-finance

R mailing lists for a large number of other special interest groups also exist.

A lot of code will run under both S+ and R; however, there are differences. Most of the differences are in user interaction -- saving your data and so on. These differences are quite common but trivial. A few differences run deeper, though. The most glaring is in scoping (the search path for objects). The R FAQ at the R Project contains a fairly detailed listing of the differences in the two languages.
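
The scoping difference can be seen in a small example. In R a function finds its free variables in the environment where the function was created, whereas the R FAQ describes S as resolving them through global variables. The sketch below shows the R behavior only; the names make.scaler, mult and to.cm are invented for the illustration.

```r
# A function that builds and returns another function.
make.scaler <- function(mult) {
  # 'mult' is a free variable inside the returned function.
  function(x) mult * x
}

# In R the returned function remembers the value of 'mult'
# from the call that created it.
to.cm <- make.scaler(2.54)
to.cm(c(60, 70))
```

Code that relies on this behavior is one of the places where a program written for R may need changes before it runs under S+.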

Some Links

An introduction to using R is Some hints for the R beginner.

The piece R Relative to Statistical Packages discusses R in relation to Stata, SAS and SPSS.

The links page includes a number of links regarding R.



Direct access to this article is
http://www.burns-stat.com/pages/Tutor/slanguage.html