Archive for January 2010
Ars Technica isn’t primarily a science site, but even they are now worried about reproducibility in computational science. I think it no longer matters how important this “crisis” actually is—sooner or later, major funding agencies are going to mandate adoption of something like the Open Provenance Model. Problem is, given the current skill set of the average scientist, that will almost certainly translate into burden without benefit.
Over on Nature News, Eric Hand’s article “‘Big science’ spurs collaborative trend” is subtitled, “Complicated projects mean that science is becoming more globalized.” It talks about the benefits of international collaboration, but what it doesn’t say is that sharing ideas, results, procedures, and software requires skills that aren’t part of the standard curriculum. One of the main goals of the rewrite of Software Carpentry is to teach scientists some of what they need to know in order to do what Hand describes without heroic effort. I’d be grateful for suggestions about topics and tools that ought to be on the list, but aren’t.
Michael Clarke has written a thoughtful post exploring why the web hasn’t disrupted scientific publishing, even though it was designed to do exactly that.
One of the founders of the controversial ‘Baby Einstein’ range of products is taking the University of Washington to court in an attempt to force the institution’s scientists to release their raw data to him…William Clark…wants records relating to two studies published in 2004 and 2007. The latter found an “association between early viewing of baby DVDs/videos and poor language development” while the former suggested “efforts to limit television viewing in early childhood may be warranted”.
If someone challenged your results, could you reassemble the programs and data you’d used to produce them? And what would happen if you couldn’t? Software Carpentry isn’t just about making scientists more productive; the skills that will help them do more, faster, will also make their work more traceable and reproducible.
I gave a talk in Santa Fe early in 1997 describing a set of articles I’d organized for the Summer 1996 and Fall 1996 issues of IEEE Computational Science and Engineering (now Computing in Science & Engineering) on the subject, “What should computer scientists teach physical scientists and engineers?” After the talk, John Reynders (then the director of the Advanced Computing Lab at Los Alamos National Laboratory) challenged me to put my money where my mouth was and actually teach basic software development skills to working scientists.
Brent Gorda and I ran the course for the first time in July 1997. We used Perl as a programming language, and covered topics such as CVS, regular expressions, and a little bit of web client programming. Our part of the course was three days long, and was followed by a two-day consulting visit from Steve McConnell (whose books Code Complete and Rapid Development were at the top of the charts). We ran the course in various forms another five or six times in the next three years, during which time we switched to Python and expanded it to five days. All told, about 120 LANL technical staff went through the course, most of them under 35.
In 2004, after I’d taught the course for the Space Telescope Science Institute and the US Navy, the Python Software Foundation gave me a grant to reorganize, update, and expand the material. That version is the core of what’s now online; when I last checked, the site was getting 10,000-12,000 distinct visitors a month, and the material was being used in whole or in part at Caltech, Indiana, several schools in the UK and Germany, Chile, South Korea, and of course here in Toronto.
Based on follow-ups with alumni, I’d guess that it has no effect at all on 20-25%, who take the course because their manager or supervisor told them they had to, and get little out of it. The rest routinely describe it as game-changing: a PhD candidate in Psychology who did the course with us in July 2009 told me a few days ago that what she learned probably saved her six months on her current project, and that without it, a second project would simply not have occurred to her to try. As another data point, one of the other alumni of that offering came to me early in October to say that several of her labmates wanted to take the course, and was I planning to offer it again any time soon? I told her that I wasn’t, but that I could arrange for a CS grad student to teach it. Three weeks later, 65 students from Psych and Linguistics had signed up to do it as a non-credit course, roughly 45 of whom have stuck with it so far.
While I don’t have data to back this up, I believe very strongly that what most students get out of the course isn’t specific knowledge about relational databases, regular expressions, or object-oriented programming, but rather a mental map of the computing landscape, so that they know what’s supposed to be easy, what else is supposed to be possible, and where to go looking for more information. Another student from the July 2009 offering said that the biggest thing the course did for him was turn “unknown unknowns” into “known unknowns”. I’m supposed to conduct a follow-up survey with those students later this month to see how much they’re using what they learned, and what impact it has had; I hope to have results up on the web by Easter.
And as regular readers will know, I’m presently trying to raise money to update the material: this post explains the background, while this plan incorporates what I’ve learned from students and instructors on four continents about what material, sequence, and presentation will actually “reach” scientists. Sadly, though, funding agencies and companies mostly still seem to think that only HPC-related training is worth funding, which I feel is asking scientists to run before they can walk. This CiSE paper talks about this particular frustration, while our survey results put weight behind the claim that the overwhelming majority of scientists will benefit much more from being helped with development issues than from anything to do with big iron.
As some of you already know, my contract with the University of Toronto runs out this spring, and I have decided not to seek renewal. I’ve learned a lot in this job, and had a chance to work with some great people, but it’s time for new challenges.
What I’d most like to do next is spend a year working full-time on the Software Carpentry course—of all the things I’ve done, it’s the one that I think has the most potential to make scientists’ lives better. My goal is to raise approximately CDN$25,000 from each of half a dozen sponsors so that I can reorganize and revamp the content, add screencasts and video lectures, and generally drag it into the 21st Century. An abbreviated proposal is included below the cut—if you or anyone you know would be interested in discussing possibilities, please give me a shout.