Software Carpentry

Helping scientists make better software since 1997

Archive for the ‘Research’ Category

What Sciences Are There?

Over 1900 people have already responded to our survey of how scientists use computers, and it still has two weeks left to run. Our next task will be to analyze the data we’ve collected, which (among other things) means coding people’s free-form descriptions of their specialties so that we can talk about physicists and chemists as opposed to “this one person who’s doing N-brane quantum foam approximations to multiversal steady-state thingummies”.

Except: are “physics” and “chemistry” too broad?  At that level, there are only a handful of sciences: astronomy, geology, biology, mathematics, psychology, um, computing, er, Curly, Larry, and Moe.  Or maybe you’d distinguish “ecology” from “biology”.  Or “oceanography” from something else, or — you see the problem.  Rather than making up our own classification scheme, I’d like to adopt one that’s widely used and generally intelligible, but I’m having trouble finding one.  Yahoo!, Wikipedia, and other web sites have incompatible (and idiosyncratic) divisions; the Dewey Decimal System and other library schemes have a very 19th-century view of science; and the ACM/IEEE publication codes are domain-specific.

If anyone can point me at something else (ideally, something with about two dozen categories — that feels like it ought to be about right, just from eyeballing the data we have so far), I’d be grateful.


Written by Greg Wilson

2008/11/16 at 21:34

Posted in Community, Research

One Good Survey Deserves Another

While we’re running our survey of how scientists use computers, the folks at MathWorks are asking their users a few questions too.  If you use any MathWorks products, and have a few minutes, they’d be grateful for your help.

Written by Greg Wilson

2008/11/04 at 08:16


1731 people have completed our survey of how scientists use computers since it went online three weeks ago.  That’s pretty cool, but I’d like to double the number (at least).  If you consider yourself a working scientist, and haven’t taken the survey yet, please take a moment and do so.  If you aren’t a scientist, but know some, please pass on the link:


Written by Greg Wilson

2008/11/02 at 12:58

Posted in Community, Content, Research

Surveying Scientists’ Use of Computers

Computers are as important to modern scientists as test tubes, but we know surprisingly little about how scientists develop and use software in their research. To find out, the University of Toronto, Simula Research Laboratory, and the National Research Council of Canada have launched an online survey in conjunction with American Scientist magazine. If you have 20 minutes to take part, please go to:

We’d also be grateful if you’d spread the word through any mailing lists, blogs, or bulletin boards you have access to. Thanks for your help!

Jo Hannay (Simula Research Laboratory)
Hans Petter Langtangen (Simula Research Laboratory)
Dietmar Pfahl (Simula Research Laboratory)
Janice Singer (National Research Council of Canada)
Greg Wilson (University of Toronto)

Written by Greg Wilson

2008/10/15 at 08:48

Posted in Community, Content, Research

Quick Quiz to Measure What Scientists Know

Suppose you have a room full of scientists—hundreds of ’em—and want to find out how they actually use computers in their work. There isn’t time to interview them individually, or to record their desktops during a typical working week, so you’ve decided to ask them to self-assess their understanding of some key terms on a scale of:

  1. No idea what it is.
  2. Use it/have used it infrequently.
  3. Use it regularly.
  4. Couldn’t get through the day without it.

My list is below; what have I forgotten, and (more importantly) how would you criticize this assessment method?

  1. A command-line shell
  2. Shell scripts
  3. Version control system (e.g., CVS, Subversion)
  4. Bug tracker
  5. Build system (e.g., Make, Ant)
  6. Debugger (e.g., GDB)
  7. Integrated Development Environment (e.g., Eclipse, Visual Studio)
  8. Numerical Computing Environment (e.g., MATLAB, Mathematica)
  9. Inverse analyzer (e.g., Inane)
  10. Spreadsheet (e.g., Excel)
  11. Relational database (e.g., SQLite, MySQL, Oracle)
  12. Layout-based document formatting (e.g., LaTeX, HTML)
  13. WYSIWYG document formatting (e.g., Word, PowerPoint, OpenOffice)

Now, you have the same room full of scientists, and you want to find out how much they know about software development. There still isn’t time to interview them or have them solve some programming problems, so again you’re falling back on self-assessment. This time, the scale is:

  1. No idea what it means.
  2. Have heard the term but couldn’t explain it.
  3. Could explain it correctly to a junior colleague.
  4. Expert-level understanding.

and the terms themselves are:

  • Nested loop
  • Switch statement
  • Stable sort
  • Depth-first traversal
  • Polymorphism
  • Singleton
  • Regular expression
  • Inner join
  • Version control
  • Branch and merge
  • Unit test
  • Variant digression
  • Build and smoke test
  • Code coverage
  • Breakpoint
  • Defensive programming
  • Test-driven development
  • Release manifest
  • Agile development
  • UML
  • Traceability matrix
  • User story

Once again, my questions are (a) what have I forgotten, and (b) how “fair” is this as an assessment method?

Written by Greg Wilson

2008/07/23 at 07:41

Posted in Content, Research

SE-CSE Workshop

One of the downsides of being in Texas is that I couldn’t attend the First International Workshop on Software Engineering for Computational Science and Engineering, which was held at ICSE’08 in Leipzig this week.  Papers are here (I’ll be reading them on the flight home); they look interesting, but the biggest thing for me is the change in the workshop’s name—it used to be “Software Engineering for High-Performance Computing Applications” (2004, 2005, 2006, 2007), and I’m hoping the change of name reflects a genuine broadening of focus.

Written by Greg Wilson

2008/05/15 at 09:40

Posted in Community, Content, Research

Three Studies (Maybe Four)

We’re in the thick of picking students and projects for Google Summer of Code, which has inspired some less-random-than-usual thoughts. Here are two studies I’d like to do (or see done):

  1. What has happened to previous students? How many are still involved in open source? How many have gone on to {start a company, grad school, prison}? What do they think they learned from the program? How much of the software they wrote is still in use? Etc.
  2. Every one of the 175 organizations blessed by Google this year is using the same web application for collecting and voting on projects. From what I can tell, they’re all using it in different ways: +4 means something very different to the Python Software Foundation than it does to Eclipse or SWIG. They’re also using a bewildering variety of other channels for communication: wikis, IRC, Skype chat sessions, mailing lists (the most popular), and so on. Why? Is this another reflection of Jorge Aranda’s finding that every small development group evolves a different process, but all those processes “work” in some sense, or is it—actually, I don’t have any competing hypotheses right now, but I’m sure there are some.

And while we’re on the subject of studies, I just read Hochstein et al.’s paper “Experiments to Understand HPC Time to Development” (CT Watch Quarterly, 2(4A), November 2006). They watched a bunch of grad students at different universities develop some simple parallel applications using a variety of tools, and measured productivity as (relative speedup)/(relative effort), where relative speedup is (reference execution time)/(parallel execution time), and relative effort is (parallel effort)/(reference effort). The speedup measure is unproblematic, but as far as I can tell, they don’t explain where their “reference effort” measure comes from. I suspect it’s the effort required to build a serial solution to the problem, and that “parallel effort” is then the additional time required to parallelize; I’ve mailed the authors to ask, but haven’t heard back yet.
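Spelling the paper’s metric out as code makes the definitions easier to check. This is a minimal sketch of my reading of it; the function and variable names are mine, not the paper’s:

```python
def relative_speedup(reference_time, parallel_time):
    """Speedup of the parallel version over the serial reference run."""
    return reference_time / parallel_time

def relative_effort(parallel_effort, reference_effort):
    """Effort to build the parallel version, relative to the reference effort."""
    return parallel_effort / reference_effort

def productivity(reference_time, parallel_time, parallel_effort, reference_effort):
    """Productivity as defined in the paper: relative speedup per unit of relative effort."""
    return (relative_speedup(reference_time, parallel_time)
            / relative_effort(parallel_effort, reference_effort))

# Hypothetical numbers: an 8x speedup bought with twice the reference effort
# yields a productivity of 4.0.
print(productivity(reference_time=80.0, parallel_time=10.0,
                   parallel_effort=20.0, reference_effort=10.0))
```

Note that the ambiguity I mention above lives entirely in how `reference_effort` is measured; the arithmetic itself is uncontroversial.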

I wasn’t surprised when I realized that the authors hadn’t done the other half of the study, i.e., they hadn’t benchmarked the productivity of a QDE (quantitative development environment) like MATLAB—many people talk and think as if scientific computing and high-performance computing were the same thing. At first glance, it doesn’t seem like it would be hard to do—you could use the performance of the MATLAB or NumPy code over the performance of a functionally equivalent C or Fortran program for the numerator. You have to be careful about the denominator, though: if my guess is right, then if things were done in real-world order, you’d be comparing:

(time to write parallel code after writing serial code) / (time to write serial code from scratch)

versus

(time to write MATLAB from scratch) / (time to write serial code having written MATLAB)

Even with that, I strongly suspect that MATLAB (or any other full-featured QDE) would come out well ahead of any parallel programming environment currently in existence on problems of this size. Yes, you need big iron to simulate global climate change over the course of centuries, but that’s not what most scientists do, and the needs of that minority shouldn’t dominate the needs of the desktop majority.

I’d also be interested in re-doing this study using MATLAB parallelized with Interactive Supercomputing’s tools. I have no idea what the performance would be, but the parallelization effort would be so low that I suspect it would once again leave today’s mainstream HPC tools in the dust.

And now let’s double back for a moment. I used the phrase “desktop majority” a couple of paragraphs ago, but is that really the case? What do most computational scientists use? What if we include scientists who don’t think of themselves as computationalists, but find themselves doing a lot of programming anyway, just because they have to? If you plotted rank vs. frequency, would you get a power law distribution, i.e., does Zipf’s Law hold in scientific computing?

Last term, I calculated a Gini coefficient for each team in my undergraduate software engineering class using lines of code instead of income as a raw metric; what’s the Gini coefficient for the distribution of computing cycles used by scientists (i.e., how evenly or unevenly is computing power distributed)? And how should the answers to these questions shape research directions, the development of new tools, and what we teach in courses like Software Carpentry?
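For anyone who wants to try the same calculation on their own team’s data, the Gini coefficient takes only a few lines. This sketch uses the standard mean-absolute-difference definition with lines of code standing in for income; it is an illustration, not necessarily the exact procedure I used for my class:

```python
def gini(values):
    """Gini coefficient via the mean absolute difference.

    Returns 0.0 for a perfectly even distribution; values approach 1.0
    as the distribution becomes more concentrated in a few hands.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Sum of |x - y| over all ordered pairs, normalized by 2 * n * sum(x).
    diff_sum = sum(abs(x - y) for x in values for y in values)
    return diff_sum / (2 * n * total)

# Hypothetical lines-of-code counts for two four-person teams:
print(gini([100, 100, 100, 100]))  # 0.0  -- everyone contributed equally
print(gini([400, 0, 0, 0]))        # 0.75 -- one person wrote everything
```

The same function would work unchanged on per-scientist computing-cycle counts, which is exactly the measurement I’d like to see someone make.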

Written by Greg Wilson

2008/04/10 at 12:47

Posted in Community, Opinion, Research