Living it up with computational errors

13 May 2013

How to have a better chance of a good outcome.

Making mistakes

There’s been a lot of talk recently about data analysis problems with spreadsheets.  If you’ve not stuck your head out of your cave lately, then you can catch some of the discussion by doing an internet search for:

Reinhart Rogoff

There are several points at issue, but one thing that has received a lot of airplay is a mistake in Excel.  Now, I’m known in some circles as being not so keen on spreadsheets.

A lot of the criticism implies that if you use a more appropriate tool than a spreadsheet, then there won’t be any problems.  Unfortunately, that isn’t the case.

As I’ve said before, the issue is not that mistakes don’t happen outside spreadsheets, it is that it is nearly impossible to eliminate mistakes in spreadsheets.

It is very easy to make mistakes in any computing environment.  There are, for example, over a hundred pages of proof that mistakes are possible in R.  But functions can be debugged and subjected to testing so that bugs are eliminated.

QA for data analysis

Fran Bennett, in a talk at LondonR, wondered how data analyses might be put into a testing framework the way software is. Markus Gesmann gives a partial answer in his “Test Driven Analysis?” post.

But there is a key difference between data analysis and software.  When we test software we know what the answer should be.  Well, sometimes we don’t really know the exact answer, but we will almost certainly know important characteristics of the answer.
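
For instance, here is a quick sketch: we rarely know in advance what sort() will return for random input, but we do know characteristics the answer must have, and we can test for those.

x <- rnorm(20)
s <- sort(x)
stopifnot(
    !is.unsorted(s),           # the result is in order
    length(s) == length(x),    # nothing lost or added
    all(s %in% x)              # only values from the input
)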

In contrast, the whole point of data analysis is that we are ignorant of the answer.  Some things to do are:

  • keep a record of your commands, so they can be reviewed
  • check if the results are sensible

Keeping a record and checking it is easy in an environment like R.  It is pretty much impossible with a spreadsheet.
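
For instance, a minimal sketch (the file names and the myData object here are hypothetical):

savehistory("analysis-commands.R")   # save the commands typed so far
sink("analysis-transcript.txt")      # divert subsequent output to a file
summary(myData)                      # myData is a stand-in for real data
sink()                               # stop diverting output

# a sanity check that complains loudly if the results are not sensible
stopifnot(all(myData$price > 0), !any(is.na(myData$price)))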

Checking if your results are sensible is actually rather problematic.  Often when doing data analysis we want a particular result.  Suppose we are studying the niceness of girls, and we’d like girls in coffee shops to be nicer.  If our results show that they are nicer, we have no motivation to scrutinize the analysis.  But if the results say that we don’t meet nice girls in coffee shops, then we will carefully look through the analysis for any mistakes.

This is efficient in terms of mistakes found per unit effort, but it is inefficient in terms of scientific results.  An unexpected result is much more likely to be due to an error than an expected result is.  But ideally we should be more motivated to disprove our pet theories than to confirm them.

QA for R

In a comment to “Interview with a forced convert from Matlab to R” Louis Scott talks about the lack of testing in R packages on CRAN.  I think that is a valid and important concern.  Base R is well-tested and well-controlled.  But the typical use of R is a mixture of functionality from R Core and functionality from some number of CRAN packages.  A user may not even be aware of all the packages on which their analysis depends.

One of the best things that could happen for R is for CRAN packages to be better tested.  Tao Te Programming was written to be language-independent, but contributors to CRAN were most definitely in my target audience.  There are a number of suggestions in the book about testing.  One is:

  • make the testing status of each function apparent

One place to put this information is in a “Testing status” section of the help file.
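
For instance, a sketch of how that might look if the help files are generated with roxygen2 (one possible mechanism; a hand-written Rd file would use a \section{} block instead), with a hypothetical function:

#' Summarize a portfolio
#'
#' @section Testing status:
#' Unit tests cover ordinary named inputs; zero-length input
#' is not yet tested.
#'
#' @param weights a named numeric vector of portfolio weights
portfolioSummary <- function(weights) {
    summary(weights)
}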

Another issue discussed in the book is that R has an unfortunate confounding of examples and testing.  The examples in the help files are evaluated and used as a test of the software.  A really good thing about R is that it has a culture of examples in the help files.  But tests and examples have very different uses.  When you confound them, you are likely to get commands that are not very good for either use.

The confounding has another downside in that CRAN limits the time allowed for examples to run.  This is quite a reasonable rule for CRAN, which deals with thousands of packages.  Testing the examples should be analogous to the pprobe.verify function in Portfolio Probe, which quickly tests that all the functions are present and basic functionality is intact.  That is not a replacement for the test suite, which (depending on some settings) takes hours to days to complete.
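
A sketch of that separation (the package name, function name, and environment variable here are invented for illustration): the part that always runs stays quick, and the slow suite only runs on request.

# in a file under the package's tests/ directory
library(myPackage)
stopifnot(is.function(coreFunction))   # quick check that basics are intact

if (nzchar(Sys.getenv("MYPACKAGE_FULL_TESTS"))) {
    # the hours-to-days test suite runs only when explicitly requested
    source("full-test-suite.R")
}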

Prevention

Testing isn’t all there is.

Just because a function is bug-free doesn’t mean it is safe.  The object is to have the entire process error-free.  If I write something that lots of people use incorrectly, I’m not doing them a favor.

Consider an example from fund management.  We want to get the value of a portfolio.  The inputs are the number of units the portfolio holds for each asset, and the prices per unit for each asset.  Here is an R function to do that:

> value
function (unitsInPortfolio, pricePerUnit) 
{
        sum(unitsInPortfolio * pricePerUnit)
}

Simple, easy, no bugs.  The arguments are even descriptive of what they should contain.  Let’s use it:

> value(c(A=100, B=250), c(A=12.63, B=17.29))
[1] 5585.5

Grand.  Let’s use it again:

> value(c(B=250, A=100), c(A=12.63, B=17.29))
[1] 4886.5
> value(c(A=100, B=250), c(A=12.63, B=17.29, C=21.34, D=16.77))
[1] 11912

Not so grand.  It is exceedingly easy to get the wrong answer without any indication that something is wrong.

This is another theme in Tao Te Programming.
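
One form of prevention, as a sketch (my invention, not necessarily the book’s prescription), is to match prices to holdings by name and refuse to guess:

value2 <- function(unitsInPortfolio, pricePerUnit) {
    if (is.null(names(unitsInPortfolio)) || is.null(names(pricePerUnit)))
        stop("both arguments must have asset names")
    unpriced <- setdiff(names(unitsInPortfolio), names(pricePerUnit))
    if (length(unpriced))
        stop("no price for: ", paste(unpriced, collapse = ", "))
    sum(unitsInPortfolio * pricePerUnit[names(unitsInPortfolio)])
}

Both of the problem calls above now return 5585.5, the same answer as the correctly aligned call.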

Epilogue

They hung a sign up in our town
“if you live it up, you won’t
live it down”

from “Hold On” by Tom Waits

4 replies
  1. Carl Witthoft says:

    OTOH, all software languages (and all verbal languages for that matter) suffer this same risk. As an astute keyboard jock said to me once, “Computers are very dumb. They do *exactly* what you tell them to do.”

  2. Liviu says:

    “As I’ve said before, the issue is not that mistakes don’t happen outside spreadsheets, it is that it is nearly impossible to eliminate mistakes in spreadsheets.

    It is very easy to make mistakes in any computing environment. There is, for example, over a hundred pages of proof that mistakes are possible in R. But functions can be debugged and subjected to testing so that bugs are eliminated.”

    One thing that irks me in R, and I grandly assume that I’m not alone in this, is the absence of a debugging environment suitable for non-programmers. Although I understand developers’ knack for trace(), browser(), etc., these don’t work for non-whizz, human R users. If you are developing a function, R is a pain. Moreover, until recently R didn’t have a useful cross-platform user interface (the hodge-podge of ESS, Vim, Eclipse, Geany, Win GUI, Mac GUI, R terminal, JGR, RKWard, etc., doesn’t count), and now the RStudio IDE is filling that void. But we still lack a useful debugger UI. (Recently I’ve discovered the ‘restorepoint’ package, which looks useful for debugging.)

    What do you think of that?

    • Patrick Burns says:

      Liviu,

      I agree that RStudio is good but still has room for improvement. Do you have a vision of what a debugging environment for non-programmers would look like? Or even a feature or two?

      • Liviu says:

        “Do you have a vision of what a debugging environment for non-programmers would look like?”

        Tricky question... Recently I’ve been thinking of what a useful debugging environment for R would look like, but I don’t yet have a clear vision. (One idea would be to look at what Microsoft did in its VBA code-editing facility; this is arguably one of the most popular IDE and debugger environments for non-programmers.) A couple of disparate ideas below.

        I think that what blocks me most when developing a function is the absence of an interactive display of the values taken by intermediate objects. Basically, it’s the fear of walking in the dark, with your eyes closed and hands tied behind your back. (Sorry, couldn’t resist.) The other fear is that of overwriting objects in the global environment. To continue the metaphor, it’s the fear of stumbling onto something in the dark and breaking it.

        If we’re talking in terms of features, maybe the following design would work:
        – development of the function would happen in a “sandbox”, in the environment of the function, so as to avoid overwriting any of the objects in the global environment
        – given some user-supplied arguments, it should be possible to execute the contents of the function line-by-line, so as to inspect the output
        – it should be possible to define break- or restore-points in the IDE (RStudio) by right-clicking on the line number, and to easily enter and inspect the environment of the function at that break-/restore-point; in other words, it should be elegantly integrated into the IDE
        – interactive display of the intermediate objects; when executing the function contents line-by-line, there should be some sort of interactive display of how the objects in the environment of the function evolve. This way, it is easy to see what needs to be done next and you don’t need to imagine what the object might look like. Going through the ls(), print(x), etc. routine breaks the workflow.
        – to speed up line-by-line execution of the function contents, the debugger environment could take the head() of all (or most, or the biggest) objects; developing a function on a small subsample of a vector, data frame, or whatever is conceptually much simpler than imagining the full 15000-row data frame.
        – clear display of grammar and syntax mistakes in the code; with its automatic formatting of code, RStudio is only one small step from this, I think.

        As you can see, the features described above are a bit confusing. If I try to sum it up, when developing a function:
        – it should be in a “sandbox”
        – execute the function contents line-by-line
        – some sort of interactive display of intermediate objects, their dimensions, structure, etc.
        – some sort of head()-ing of supplied function arguments, so as to work on small subsamples of the actual data
        – easy GUI method to define break-/restore-points

        As I mentioned earlier, the ‘restorepoint’ package seems to neatly bring some of these features to reality (see its vignette):
        – it allows you to restore a function in its last known environment
        – it allows you to develop and execute code line-by-line, in the function’s environment (the “sandbox”)
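
        A minimal sketch of that workflow, from my reading of the package (details of the interface may differ):

        library(restorepoint)

        badMean <- function(x) {
            restore.point("badMean")   # when run normally, stores the locals
            total <- sum(x)
            total / length(x)
        }
        badMean(c(1, 2, NA))

        # Later, pasting the function body at the console: the same call
        # now restores the stored locals into the global environment,
        # so each line can be run and inspected one at a time.
        restore.point("badMean")
        total <- sum(x)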

        I am currently considering coming up with some sort of gWidgets interface to the restorepoint package, a poor man’s debugger of sorts, but as I said it’s tricky to envision a useful GUI that is not tightly integrated into the IDE. I wouldn’t want the user to need to copy-paste code from the debugger to the IDE.

        Sorry if I’m being confusing with all this,
        Liviu

