Software Carpentry

Helping scientists make better software since 1997

Archive for April 2008


Regarding the idea of reproducible research, I stumbled over I-SPOC while looking through Google Summer of Code stuff. From their pitch:

The overall goal of this project is to build computational and social infrastructure to support the use of a new form of scientific communication called a SPOC (Scientific Paper with Open Communication). A SPOC combines a standard academic paper with open source computational models written in any publicly accessible computer language. SPOCs will (i) link computational results with the models that produce them, allowing independent verification and validation (ii) create incentives for cleaner, more transparent code and for the sharing of code (iii) enable others to extend and improve existing computational models and to verify model robustness (iv) bring computational models to life allowing faculty, students, and other scholars to see dynamic phenomena emerge and (v) have an enormous effect on the teaching of science.

The reality isn’t (yet) as impressive as the vision, but it’s still intriguing.  I think there’s some great work in requirements engineering waiting to be done here: is reproducibility both necessary and sufficient for scientists to regard their peers’ computational work as science? If so, what must a tool do or provide in order to satisfy that need? If not, what are the requirements, and why?


Written by Greg Wilson

2008/04/14 at 14:21

Posted in Community, Opinion

Three Studies (Maybe Four)

We’re in the thick of picking students and projects for Google Summer of Code, which has inspired some less-random-than-usual thoughts. Here are two studies I’d like to do (or see done):

  1. What has happened to previous students? How many are still involved in open source? How many have gone on to {start a company, grad school, prison}? What do they think they learned from the program? How much of the software they wrote is still in use? Etc.
  2. Every one of the 175 organizations blessed by Google this year is using the same web application for collecting and voting on projects. From what I can tell, they’re all using it in different ways: +4 means something very different to the Python Software Foundation than it does to Eclipse or SWIG. They’re also using a bewildering variety of other channels for communication: wikis, IRC, Skype chat sessions, mailing lists (the most popular), and so on. Why? Is this another reflection of Jorge Aranda’s finding that every small development group evolves a different process, but all those processes “work” in some sense, or is it—actually, I don’t have any competing hypotheses right now, but I’m sure there are some.

And while we’re on the subject of studies, I just read Hochstein et al.’s paper “Experiments to Understand HPC Time to Development” (CT Watch Quarterly, 2(4A), November 2006). They watched a bunch of grad students at different universities develop some simple parallel applications using a variety of tools, and measured productivity as (relative speedup)/(relative effort), where relative speedup is (reference execution time)/(parallel execution time), and relative effort is (parallel effort)/(reference effort). The speedup measure is unproblematic, but as far as I can tell, they don’t explain where their “reference effort” measure comes from. I suspect it’s the effort required to build a serial solution to the problem, and that “parallel effort” is then the additional time required to parallelize; I’ve mailed the authors to ask, but haven’t heard back yet.
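For concreteness, the paper’s metric is easy to sketch in a few lines of Python. The function name and all the numbers below are mine, invented purely for illustration:

```python
def productivity(ref_time, par_time, ref_effort, par_effort):
    """Hochstein et al.'s metric: (relative speedup) / (relative effort)."""
    relative_speedup = ref_time / par_time     # reference vs. parallel execution time
    relative_effort = par_effort / ref_effort  # parallel vs. reference development effort
    return relative_speedup / relative_effort

# Made-up numbers: the serial run takes 400 s, the parallel run 50 s (8x speedup);
# the serial version took 10 hours to write, and parallelizing it took another 20.
print(productivity(400.0, 50.0, 10.0, 20.0))  # 8.0 / 2.0 = 4.0
```

On this reading, a productivity above 1.0 means the speedup more than paid for the extra development effort.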

I wasn’t surprised when I realized that the authors hadn’t done the other half of the study, i.e., they hadn’t benchmarked the productivity of a QDE (quantitative development environment) like MATLAB—many people talk and think as if scientific computing and high-performance computing were the same thing. At first glance, it doesn’t seem like it would be hard to do—you could use the performance of the MATLAB or NumPy code over the performance of a functionally equivalent C or Fortran program for the numerator. You have to be careful about the denominator, though: if my guess is right, then if things were done in real-world order, you’d be comparing:

(time to write parallel code after writing serial code) ÷ (time to write serial code from scratch)

with

(time to write MATLAB from scratch) ÷ (time to write serial code having written MATLAB)

Even with that, I strongly suspect that MATLAB (or any other full-featured QDE) would come out well ahead of any parallel programming environment currently in existence on problems of this size. Yes, you need big iron to simulate global climate change over the course of centuries, but that’s not what most scientists do, and the needs of that minority shouldn’t dominate the needs of the desktop majority.

I’d also be interested in re-doing this study using MATLAB parallelized with Interactive Supercomputing’s tools. I have no idea what the performance would be, but the parallelization effort would be so low that I suspect it would once again leave today’s mainstream HPC tools in the dust.

And now let’s double back for a moment. I used the phrase “desktop majority” a couple of paragraphs ago, but is that really the case? What do most computational scientists use? What if we include scientists who don’t think of themselves as computationalists, but find themselves doing a lot of programming anyway, just because they have to? If you plotted rank vs. frequency, would you get a power law distribution, i.e., does Zipf’s Law hold in scientific computing? Last term, I calculated a Gini coefficient for each team in my undergraduate software engineering class using lines of code instead of income as a raw metric; what’s the Gini coefficient for the distribution of computing cycles used by scientists (i.e., how evenly or unevenly is computing power distributed)? And how should the answers to these questions shape research directions, the development of new tools, and what we teach in courses like Software Carpentry?
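The Gini calculation I did for those class teams is simple to reproduce for any raw metric—lines of code, compute cycles, whatever. Here’s a minimal sketch using the standard rank-weighted formula; the function name and the lines-of-code counts at the bottom are invented for illustration:

```python
def gini(values):
    """Gini coefficient of a list of non-negative numbers:
    0.0 means perfectly even; values near 1.0 mean one contributor dominates."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum of sorted values: G = 2*sum(i*x_i)/(n*total) - (n+1)/n
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Invented lines-of-code counts for a four-person team.
print(gini([1200, 300, 250, 50]))
```

A perfectly even team ([1, 1, 1, 1]) scores 0.0; a team where one person wrote everything scores close to 1.0. The same function would apply unchanged to computing cycles per scientist.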

Written by Greg Wilson

2008/04/10 at 12:47

Posted in Community, Opinion, Research

Summer Plans for Software Carpentry

The Software Carpentry site is still getting a lot of traffic, despite my neglect:

[Site usage graph]

This summer, I’d like to:

  1. Convert the site from static HTML pages to a wiki to make it easier for people to contribute content and fix bugs.
  2. Translate the examples into MATLAB to make them accessible to a larger audience. (Yes, Python is still my favorite language, and yes, the Python versions will remain—I just want it to be possible for the average mechanical engineer to follow the discussion of testing without first having to learn a new programming language).
  3. Add some of the material that I developed for CSC301: Introduction to Software Engineering, and some of what Titus Brown wrote for Intermediate and Advanced Software Carpentry.

The odds of all three happening are close to zero: my grad students are going to be in the middle of real research, we’re hoping to have half a dozen or ten undergraduate interns as well, we have a “CS-1 in Python” book to finish for Pragmatic, and oh yeah, I’m getting married twice.  If you want to help out, now would be a good time to raise your hand… 🙂

Written by Greg Wilson

2008/04/02 at 15:37

Posted in Lectures, Version 3

The Retractions Just Keep Coming In

Via Titus Brown, yet another published result retracted because of a bug in code.

Written by Greg Wilson

2008/04/02 at 14:37

Posted in Community, Content, Noticed