Translating between R and SQL: the basics

2013/11/08

An introductory comparison of using the two languages.

Background

R was made especially for data analysis and graphics. SQL was made especially for databases. They are allies.

The data structure in R that most closely matches a SQL table is a data frame. The terms rows and columns are used in both.

A mashup

There is an R package called sqldf that allows you to use SQL commands to extract data from an R data frame. We will use this package in the examples. There are two basic steps to using an R package:

it must be installed on your machine (once)
it must be available in each R session where it is used

You can use sqldf by doing (one time only, and assuming an internet connection):

install.packages("sqldf")

Then in each R session where you want to use it:

require(sqldf)

To simplify the examples, we’ll slightly modify one of the inbuilt data frames:

myCO2 <- CO2
attributes(myCO2) <- attributes(CO2)[
  c("names", "row.names", "class")]
class(myCO2) <- "data.frame"

Note that the character between C and 2 is a capital-O and not a zero. The CO2 object has a complicated value for its class but the result of the sqldf function has only "data.frame" in its class. We want to cleanly see if two objects are the same, and hence we want the classes to match.

Column names

In R the colnames function returns the names of the columns:

> colnames(myCO2)
[1] "Plant"     "Type"      "Treatment" "conc" 
[5] "uptake"

The result is a vector of character strings.

Subsetting columns

Columns in SQL are also called “fields”. In R it is rather common for columns to be called “variables”.

In SQL the subset of columns is determined by select. Here we want to get the Type and conc columns:

s01 <- sqldf("select Type, conc from myCO2")

Subsetting in R (commonly called “subscripting”) is done with square brackets. When subscripting a data frame there will be two places inside the square brackets separated by a comma. The R equivalent of the command above is:

r01 <- myCO2[, c("Type", "conc")]

The first part inside the square brackets (corresponding to rows) is empty. The second part (corresponding to columns) has a character vector with the names of the two columns we want.

We can test that r01 and s01 are the same:

> all.equal(s01, r01)
[1] TRUE

In R the vector of column names could be created as an object and then used in the subscripting:

someCols <- c("Type", "conc")
r01b <- myCO2[, someCols]

The r01 and r01b objects are the same.

All columns

An asterisk is used in SQL to indicate that you want all columns:

s02 <- sqldf("select * from myCO2")

When you want all items in a dimension in R, you leave it blank:

r02 <- myCO2[ , ]

You might have been able to guess that because we’ve seen that done for rows already. Note that spaces almost never matter in R — the command above has spaces either side of the comma, but would be exactly the same with no spaces.

Only one column

How to select only a single column is no surprise in either language:

s03 <- sqldf("select Type from myCO2")
r03 <- myCO2[ , "Type"]

But there is a surprise when you test if these two objects are equal:

all.equal(s03, r03)

The command above results in a bunch of stuff, indicating they are quite different.

The r03 object is not a data frame, it is an object of the type of the column. While surprising to those used to SQL, this is quite natural for R’s purposes. For example, we give the mean function a vector of numbers, not a data frame:

> mean(myCO2[, "uptake"])
[1] 27.2131

You can get a one-column data frame by slightly modifying the command:

r03d <- myCO2[ , "Type", drop=FALSE]

The s03 and r03d objects are the same.

Data frames are not natural inputs to some functions:

> mean(myCO2[, "uptake", drop=FALSE])
[1] NA
Warning message:
In mean.default(myCO2[, "uptake", drop = FALSE]) :
 argument is not numeric or logical: returning NA

Case sensitivity

SQL is not case-sensitive:

s04 <- sqldf("select type, coNC from myCO2")

s04 is the same as s01.

On the other hand, R is case-sensitive:

> r04 <- myCO2[, c("type", "coNC")]
Error in `[.data.frame`(myCO2, , c("type", "coNC")) : 
  undefined columns selected

R extensions

We’ve seen how to select columns of an R data frame with the names of the columns. There are other ways of selecting columns as well.

The order of the columns in an R data frame is of significance. You can select columns by number. For example, you can select column 5 and then column 2:

myCO2[, c(5, 2)]

You can use negative numbers to exclude columns. Here you are asking for all columns except the first and the fourth:

myCO2[, c(-1, -4)]

Column selection in R can also be done with logical values:

myCO2[, c(TRUE, FALSE, FALSE, TRUE, FALSE)]

Those logical values can be created by a command:

myCO2[, colnames(myCO2) > "d"]

Subsetting rows

In SQL a common synonym for “row” is “record”. In R a common synonym is “observation”.

Conditions

The common way of getting a subset of rows in SQL is with the where command:

s05 <- sqldf("select * from myCO2 where uptake < 20")

In R the equivalent of the where is put in the first position inside the square brackets:

r05 <- myCO2[ myCO2[, "uptake"] < 20, ]

s05 and r05 are in most respects the same. The difference is that the row names are different. r05 has the row names from the original data frame while s05 has new ones that are sequential from 1.

with

The command that created r05 is a little convoluted (but logical once you stare at it long enough). The with function allows a command that is more in the spirit of what is done in SQL:

r05w <- with(myCO2, myCO2[uptake < 20, ]) # same as r05

Inside the with call the columns of the data frame named in the first argument can be used as objects. In this example uptake is used directly instead of pulling that column out of the data frame.

Logical operators

Logical comparisons in SQL are combined with AND and OR:

s06 <- sqldf("select * from myCO2 where uptake < 20 and Type='Quebec'")

Also note that testing equality is with =.

In R this type of ‘and’ operation is done with & and the ‘or’ is |:

r06 <- with(myCO2, myCO2[uptake < 20 & Type == 'Quebec', ])

A possible trouble spot is that equality in R is tested with == (while = is an assignment operator).

The s06 and r06 objects are the same except for their row names.

First few

The limit command in SQL limits the number of rows that are given:

s07 <- sqldf("select * from myC02 limit 6")

One way to see just the column names is to limit the number of rows to zero.

You can get the first few rows in R with head:

r07 <- head(myCO2)

The tail function gives you the last few rows, and the corner function is a logical extension of head and tail.

Row names versus numbers

A source of possible confusion is that row names are character even though they are, by default, representations of numbers. Let’s experiment with r06:

> r06
   Plant   Type  Treatment conc uptake
1    Qn1 Quebec nonchilled   95   16.0
8    Qn2 Quebec nonchilled   95   13.6
15   Qn3 Quebec nonchilled   95   16.2
22   Qc1 Quebec    chilled   95   14.2
29   Qc2 Quebec    chilled   95    9.3
36   Qc3 Quebec    chilled   95   15.1

Select the first three rows:

> r06[1:3,]
   Plant   Type  Treatment conc uptake
1    Qn1 Quebec nonchilled   95   16.0
8    Qn2 Quebec nonchilled   95   13.6
15   Qn3 Quebec nonchilled   95   16.2

Now let’s select the characters one through three:

> r06[c("1", "2", "3"), ]
   Plant   Type  Treatment conc uptake
1    Qn1 Quebec nonchilled   95   16.0
NA  <NA>   <NA>       <NA>   NA     NA
36   Qc3 Quebec    chilled   95   15.1

What happened? The first row is correct — the first row name is 1. In the second row it looked for a row name called “2” and didn’t find one, so it put in missing values. The third row is even weirder: it looked for a row name called “3”; there was a single row name starting with “3” so it did a partial match and gave us that row.

Trying to give numbers instead of the actual names doesn’t necessarily work either:

> r06[c(1, 8, 15), ]
     Plant   Type  Treatment conc uptake
1      Qn1 Quebec nonchilled   95     16
NA    <NA>   <NA>       <NA>   NA     NA
NA.1  <NA>   <NA>       <NA>   NA     NA

Additional details

NULL

In SQL NULL means missing value. Confusingly R also has NULL but the equivalent of SQL NULL is NA in R.

Let’s create some data to play with:

r08 <- r06
r08[2:4, 1] <- NA
r08[5, 4] <- NA

This looks like:

> r08
   Plant   Type  Treatment conc uptake
1    Qn1 Quebec nonchilled   95   16.0
8   <NA> Quebec nonchilled   95   13.6
15  <NA> Quebec nonchilled   95   16.2
22  <NA> Quebec    chilled   95   14.2
29   Qc2 Quebec    chilled   NA    9.3
36   Qc3 Quebec    chilled   95   15.1

Get the rows where Plant is not missing:

s09 <- sqldf("select * from r08 where plant is not null")
r09 <- with(r08, r08[!is.na(Plant), ])

We can also get the rows where Plant is missing:

s10 <- sqldf("select * from r08 where plant is null")
r10 <- with(r08, r08[is.na(Plant), ])

To get the rows that have no missing values in R, you can do:

> na.omit(r08)
   Plant   Type  Treatment conc uptake
1    Qn1 Quebec nonchilled   95   16.0
36   Qc3 Quebec    chilled   95   15.1

Quotes

In SQL single quotes are used to delimit character strings. A single quote inside a string is given with two single quotes in a row. Some implementations allow you to specify the delimiter.

In R either single quotes or double quotes can be used. You can use whichever you find more convenient but R always prints using double quotes. The backslash is used to escape a quote character that is the same as the delimiting quote:

> c("he's", 'he\'s', "she has \"it\"")
[1] "he's"           "he's"           "she has \"it\""

Semicolons

Semicolons are sometimes used at the end of statements in both SQL and R.

Some SQL implementations require a semicolon at the end of a statement.

Semicolons are used to separate R commands on the same line. They can be used after all R commands, but probably shouldn’t be.

Single subscript

Note that both

r06[1:3, ]

and

r06[1:3]

are legal R commands, but they do different things. The first gives the first three rows and all of the columns; the second gives all of the rows and the first three columns (for a reason you need not be concerned about initially).

Efficiency

If a data frame is large and the manipulation is complex, then R can be inefficient. Why should someone with access to a database put up with such inefficiency? One reason is the flexibility that R gives you. It might surprise some people that not all data naturally fit into a structure of rows and columns. Besides it usually doesn’t matter. As has been asked facetiously by a certain someone: “What are you going to do with that extra millisecond?”

But if there are millions of those milliseconds, then you might start to care. The data.table package provides an alternative form of data frames that is highly efficient.

Resources to learn R

“Impatient R” is a minimal set of things to learn about R.

“Some hints for the R beginner” suggests additional resources.

Resources to learn SQL

I’m certainly no expert at learning materials for SQL — please make suggestions. But here are things I’ve found that seem at least okay:

SQLZoo provides quite a nice interactive set of exercises.

Tutorialspoint has information easily arranged for learning and refreshing.

Translating between R and SQL: the basics

Background

A mashup

Column names

Subsetting columns

All columns

Only one column

Case sensitivity

R extensions

Subsetting rows

Conditions

with

Logical operators

First few

Row names versus numbers

Additional details

NULL

Quotes

Semicolons

Single subscript

Efficiency

Resources to learn R

Resources to learn SQL

See also

Related Posts

Trackbacks & Pingbacks

Leave a Reply

Leave a Reply Cancel reply

Books

Search

Blog

Blogroll

Latest Tweets

Categories

Archive