Archive for the ‘Opinion’ Category
Over at opensource.com, Red Hat’s Greg DeKoenigsberg has a post about a new collaboratively-authored textbook on open source software aimed squarely at undergrad courses. As Máirín Duffy points out in the first comment, it’s very code-centric, but in my experience, that’s the right approach: students won’t be ready for discussion of design until they’re proficient in coding. I’m looking forward to borrowing lots from the book for Software Carpentry…
This is, by the way, why I believe that attempts to teach “computational thinking” without first teaching programming are doomed to fail, but that’s a rant for another time.
Julia Lane, the director of the Science of Science & Innovation Policy program at the National Science Foundation, wrote an article for Nature a couple of weeks ago titled “Let’s make science metrics more scientific”. As the summary at the start says:
- Existing metrics have known flaws
- A reliable, open, joined-up data infrastructure is needed
- Data should be collected on the full range of scientists’ work
- Social scientists and economists should be involved
The same points could be made about evaluating software developers (or any other kind of knowledge worker). The devil, as always, is in the details, and unfortunately I have to start doing evaluations before those details are worked out. Several of the sponsors for this course need me to demonstrate its impact on the productivity of the scientists who take it (so that they can in turn justify their contribution to their funders). It isn’t enough to ask students who have completed the course whether they think they know more about programming than they used to: ignoring the obvious problems of survivor bias and self-assessment, I would still have to demonstrate that making people better programmers also makes them better scientists. I believe it does, but belief is not evidence, and it says nothing about the size of the effect.
The best plan I’ve been able to come up with so far is to look at how scientists spend their time before and after taking the course, but that would require resources I don’t have. If you’re interested in studying scientists or software developers empirically, and would like some raw material, I’d like to hear from you.
Simon Singh, the science journalist who was sued for libel by the British Chiropractic Association, has won the right to rely on the defense of “fair comment”. (Full ruling linked from this Index on Censorship post.) Singh had pointed out that there’s no evidence to back up BCA claims that their particular brand of pseudoscience could help with asthma and other ailments; it has taken him two years and £200,000 to get this far, and it may be another two years before the matter is finally settled, but this is an important victory for everyone who believes in rational inquiry.
Let’s start with the problem description:
All of the Software Carpentry course material (including lecture notes, code samples, data files, and images) is stored in a Subversion repository. That’s currently hosted at the University of Toronto, but I’d like to move it to the software-carpentry.org domain (along with this blog). However, software-carpentry.org is hosted with site5.com, who only provide one shell account per domain for cheap accounts like the one I bought.
Why is this a problem? Because when someone wants to commit to the repository, they have to authenticate themselves. I could let everyone who’s writing material for the course share a single user ID and password, but that would be an administration nightmare (as well as a security risk). Site5 does have a workaround based on public/private keys, but it’s fairly complicated—i.e., it could break in lots of hard-to-diagnose ways. Another option would be to use the mod_dav_svn plugin for Apache, but Site5 doesn’t support per-domain Apache modules either. Dreamhost.com does, so I may be switching hosts in a few weeks.
So: how much of this should the average research scientist be expected to understand? If the answer is “none”, then how are they supposed to make sensible decisions about moving their work online? If the answer is “all”, where does the time come from? (It takes me 30 seconds to read the two paragraphs above; it would take many hours of instruction to teach people enough to do the analysis themselves.) And if the answer is “some”, then which parts? To what depth? And who takes care of the rest on scientists’ behalf?
Ars Technica isn’t primarily a science site, but even they are now worried about reproducibility in computational science. I think it no longer matters how important this “crisis” actually is—sooner or later, major funding agencies are going to mandate adoption of something like the Open Provenance Model. Problem is, given the current skill set of the average scientist, that will almost certainly translate into burden without benefit.
Over on Nature News, Eric Hand’s article “‘Big science’ spurs collaborative trend” is subtitled, “Complicated projects mean that science is becoming more globalized.” It talks about the benefits of international collaboration, but what it doesn’t say is that sharing ideas, results, procedures, and software requires skills that aren’t part of the standard curriculum. One of the main goals of the rewrite of Software Carpentry is to teach scientists some of what they need to know in order to do what Hand describes without heroic effort. I’d be grateful for suggestions about topics and tools that ought to be on the list, but aren’t.
One of the founders of the controversial ‘Baby Einstein’ range of products is taking the University of Washington to court in an attempt to force the institution’s scientists to release their raw data to him…William Clark…wants records relating to two studies published in 2004 and 2007. The latter found an “association between early viewing of baby DVDs/videos and poor language development” while the former suggested “efforts to limit television viewing in early childhood may be warranted”.
If someone challenged your results, could you reassemble the programs and data you’d used to produce them? And what would happen if you couldn’t? Software Carpentry isn’t just about making scientists more productive; the skills that will help them do more, faster, will also make their work more traceable and reproducible.
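One low-tech habit that makes results easier to reassemble is recording exactly which inputs and which environment produced them. Here is a minimal sketch in Python; the function names and the manifest format are my own illustration, not a standard:

```python
import hashlib
import json
import platform

def file_digest(path):
    """Return the SHA-1 digest of a file's contents, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(paths, out="manifest.json"):
    """Record the interpreter version and a content hash for each input file,
    so that a result can later be matched to the exact data that produced it."""
    record = {
        "python_version": platform.python_version(),
        "inputs": {p: file_digest(p) for p in paths},
    }
    with open(out, "w") as f:
        json.dump(record, f, indent=2)
    return record
```

Dropping a manifest like this next to every set of results costs a few lines of code; it is the kind of small discipline that version control and testing build on.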
My last post linked to a PLoS paper by Dudley and Butte on developing effective bioinformatics programming skills. I asked, “How many hours do the authors think are needed to acquire these skills?” In response, Atul Butte said, “I think the ideal scenario is when one’s research projects enable one to learn these skills, so that these skills get learned in a practical way outside the classroom too, while doing science,” while Luis Pedro Coelho asked, “Does it matter over the long (or even medium) term? Isn’t improving your skills, even if you aren’t being immediately productive, what school is for?”
To which I can only respond, “Yeah, but that doesn’t work.” People have been doing computational science for almost seventy years, and have been calling it the third branch of science since (at least) the mid-1980s. If picking things up by osmosis was going to work as an educational strategy, we’d know by now. Instead, what we actually see hasn’t changed in 25 years: a small minority working wonders, and the vast majority not even knowing where they ought to start. We don’t expect grad students to pick up all the math and stats they need by osmosis, on their own, without any structured guidance—why should we expect them to become proficient computationalists that way?
Via Titus Brown, a new PLoS paper titled “A Quick Guide for Developing Effective Bioinformatics Programming Skills” by Joel Dudley and Atul Butte. Their recommendations are:
- Programming languages
- Embracing open source
- Unix command-line skills
- Keeping projects documented and manageable
- Preserving source code with version control
- Embracing parallel computing paradigms
- Structuring data for speed and scalability
- Understanding the capabilities of hardware
- Embracing standards and interoperability
- Put a high value on your time
I think all these things matter, but:
- How many hours do the authors think are needed to acquire these skills? We’ve tried very hard to fit Software Carpentry into 25 hours of lecture and 50-100 hours of practical work because we recognize that every one of those hours is time students aren’t spending doing science.
- Shouldn’t testing be in the top 10? Or the top 5, or 3? These days, I care a lot more about how (and how well) someone tests than I do about their mastery of any particular programming language.
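To make concrete what I mean by “how (and how well) someone tests”: a good tester checks the typical case, the boundary case, and the error case, in that order of habit. A minimal sketch in Python, where `mean` is a stand-in for any small piece of scientific code:

```python
def mean(values):
    """Average of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean of empty sequence")
    return sum(values) / len(values)

# One test per behaviour, including the failure mode --
# the error case is the one most people forget to check.

def test_typical():
    assert mean([1, 2, 3]) == 2

def test_single_value():
    assert mean([7]) == 7

def test_empty_raises():
    try:
        mean([])
        assert False, "expected ValueError for empty input"
    except ValueError:
        pass
```

The specific framework matters far less than the habit: three short tests like these tell me more about someone’s engineering skill than which language they wrote them in.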
Nicola Scafetta is refusing to release the software on which he bases his claims that the sun is responsible for much of terrestrial warming during the last century. I obviously think that scientists should be required to do this as a condition of publication; coming as this does on the heels of Climategate, it will be interesting to see if journals finally start pushing in that direction. It also highlights the need to add more material to this course to cover packaging for release and data provenance.