Software Carpentry

Helping scientists make better software since 1997

How Much Of This Should Scientists Understand?

Let’s start with the problem description:

All of the Software Carpentry course material (including lecture notes, code samples, data files, and images) is stored in a Subversion repository. That’s currently hosted at the University of Toronto, but I’d like to move it to a different domain (along with this blog). However, that domain is hosted with Site5, who only provide one shell account per domain for cheap accounts like the one I bought.

Why is this a problem? Because when someone wants to commit to the repository, they have to authenticate themselves. I could let everyone who’s writing material for the course share a single user ID and password, but that would be an administration nightmare (as well as a security risk). Site5 does have a workaround based on public/private keys, but it’s fairly complicated, i.e., it could break in lots of hard-to-diagnose ways. Another option would be to use the mod_dav_svn plugin for Apache, but Site5 doesn’t support per-domain Apache modules either. A different hosting company does, so I may be switching hosts in a few weeks.

So: how much of this should the average research scientist be expected to understand? If the answer is “none”, then how are they supposed to make sensible decisions about moving their work online? If the answer is “all”, where does the time come from? (It takes me 30 seconds to read the two paragraphs above; it would take many hours of instruction to teach people enough to do the analysis themselves.) And if the answer is “some”, then which parts? To what depth? And who takes care of the rest on scientists’ behalf?


Written by Greg Wilson

2010/03/11 at 19:45

Posted in Content, Opinion, Version 4

12 Responses

  1. Isn’t one of the underlying questions here to what extent a research scientist gets IT support? Maybe research labs need to stop hiring so many techs to do titrations (or whatever they do) and start hiring someone to take care of niggly issues like this.


    2010/03/11 at 20:04

  2. @Neil Those “titrations” are as important to science as domain hosting: if scientists don’t understand the latter, how can they make sensible decisions about trade-offs w.r.t. which one to invest in?

    Greg Wilson

    2010/03/11 at 20:08

  3. Why should they host their own Subversion repository? For that matter, why should you?

    For Git, the answer is easy: GitHub is free if you want to share, and costs if you want to be closed or have more than 300 megs of stuff.

    Is there a similar Subversion site?

    Peter Boothe

    2010/03/11 at 20:15

  4. @Peter Not all science is open source, so services like GitHub/Google Code/whatever may not be feasible. There are also often legal obstacles, for example, if scientists need to share data that may contain personal information.

    Greg Wilson

    2010/03/11 at 20:21

  5. Get a paid github account. It’s where my private research code is.

    In general, some of the questions you have above are easily answered by “pay to make the problem go away”. A github account starts at 7 USD per month. Similar services exist for most of what you want.

    Exception: large amounts of data (if you generate >1GB/day, say) get very expensive very fast.


    2010/03/11 at 20:44

  6. @luispedro That addresses the code issue, but not the data issue: until the US repeals the so-called “PATRIOT” act, it’s illegal for people in jurisdictions like Ontario to store sensitive data on servers run by American companies, even if they’re not public, and even if those servers are physically located outside the US. (No, I’m not making this up: privacy legislation here says that we’re not allowed to store sensitive information in jurisdictions with weaker privacy protection than Ontario itself has, and the “secret warrant” provisions in the PATRIOT Act mean that the US counts as weaker. Those provisions apply to subsidiaries of American companies operating outside US borders…)

    Greg Wilson

    2010/03/11 at 20:48

  7. If you can’t deal with American companies, you’re going to have a hard time 🙂

    I did not just mean github. I meant online storage services, for example. Or, if you’re looking for more, amazon EC2+S3.


    2010/03/11 at 20:56

  8. To answer your question from an ecologist’s perspective, I’ll say that I’d hope that eventually we could expect ecologists to understand the first paragraph. That’s going to take some effort, but I think it’s attainable.

    The second paragraph I think needs to be dealt with either by local IT or by paying someone to deal with it (as others have suggested; we use Beanstalk for SVN hosting). The details of authentication and web servers don’t contribute directly to the science itself (whereas version control and the shell do).


    2010/03/13 at 23:02

    • I think the future of in-house hosting is in specialized appliances with services similar to hosted services (github, google groups). Tools like Puppet (a declarative configuration management system) support the creation of such appliances by making software configuration repeatable and allowing deployment details to be hidden behind high-level interfaces. We are working on a framework for the creation of such appliances, with our initial targets being appliances for math/science classes and research groups, and neighborhood associations. These appliances can be hosted in a VPS or on a local server, and automatically have smart backup strategies and built-in service monitoring/testing.

      Eric Drechsel

      2010/03/20 at 21:32

      • @Eric Possibly, but that leaves the original question begging: how much of this should an ecologist be expected to understand? If the answer is “enough to do it themselves”, what do we take *out* of their education to make room for that learning? If the answer is “enough to judge someone else’s solution”, same question, and if the answer is, “Not much,” then who *is* going to make decisions, and how will *they* learn enough about ecologists and their needs to make the right ones?

        Greg Wilson

        2010/03/20 at 21:40

      • Standards emerge (github, google groups), so you can do pretty well by just using what the slightly-more-technical scientist next to you is using. So I guess the collective body of peers makes the technology choices, led by some of the more technically adventurous among them.

        – If a tool is able to present an interface which is a good match for the user’s mental model of the problem they are trying to solve, then no training should be needed (as long as the user understands the problem well). This is where software is headed, IMO. Systems should be configurable using the domain languages of their users (which is my goal with these auto-configuring appliances).

        This is all looking rosy-eyed into the future, but it seems to be happening.

        Eric Drechsel

        2010/03/20 at 21:48

  9. “what do we take *out* of their education to make room for that learning?”

    To return to this salient and seldom recognized part of the question, I think that we need to provide more flexibility in the scientific breadth courses that we require. In biology, most undergraduate programs require 3-4 semesters of chemistry, 2 semesters of physics, and 2-3 semesters of calculus. These are the right broadening courses for some biologists, but certainly not for others. This could be changed to either 2 semesters of each of the above plus 2 semesters of computing or just be turned into 8 semesters of non-major department science courses to allow students to guide their own development. See a slightly more detailed discussion here.

    Ethan White

    2010/03/21 at 15:39

Comments are closed.
