Software Carpentry

Helping scientists make better software since 1997

Archive for March 2009

User Stories

One of the tricks I teach my undergraduates is to create fictional personas to describe the intended users of a system—or in this case, a course.  Here are three of the “people” I’ve had in mind while developing Software Carpentry to date; my goal is to update these stories to better reflect how scientists work today.

Bhargan Basepair

Bhargan Basepair received a B.Sc. in biochemistry five years ago. He has been working since then for Genes’R’Us, a biotech firm with labs in four countries. He did a Java programming course as a freshman, and a bioinformatics course using Perl as a senior.

Bhargan and his colleagues are developing fuzzy pattern-matching algorithms for finding similarities between DNA records in standard databases. To help other Genes’R’Us researchers, and to test his group’s heuristics, Bhargan runs an overnight sequence query service. Researchers email sequences in a variety of formats (in-line, attachments, URLs to pages behind the company firewall, etc.). Bhargan saves them in files called search/, search/, and so on, then edits them to add query directives. He is very conscientious, and almost never accidentally overwrites one query with another.

Before leaving at night, he runs a Perl script that processes these inputs to create output files with matching names like search/a.out. When Bhargan comes in the next morning, he pages through his mail again, sending .out files to the appropriate people. (He almost never sends the wrong file to the wrong person.) He then uses another Perl script to copy all the input and output files to a directory with a name corresponding to the date, such as 2009-07-23. He and his colleagues would like to do statistics on these saved queries and results to see how well their algorithms are doing, but have never found the time.

This course will teach Bhargan how to automate his overnight service by writing simple scripts to retrieve, process, and reply to email queries. Those scripts will automatically record queries, results, and other data, and produce a daily summary of the performance of the pattern-matching algorithms.
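As a rough illustration of the kind of script this story has in mind, here is a minimal Python sketch of two pieces of Bhargan's routine: saving each incoming query under a name that cannot collide with an earlier one, and copying the day's files into a date-named directory. The function names and the query-0001.txt naming scheme are hypothetical, chosen only for this example; retrieving and replying to email is left out.

```python
import shutil
from datetime import date
from pathlib import Path


def save_query(search_dir, text):
    """Store one query in a new, uniquely numbered file (hypothetical naming)."""
    search_dir = Path(search_dir)
    search_dir.mkdir(parents=True, exist_ok=True)
    number = 1
    while True:
        path = search_dir / f"query-{number:04d}.txt"
        try:
            # Mode "x" fails if the file already exists, so an earlier
            # query can never be overwritten by accident.
            with open(path, "x") as handle:
                handle.write(text)
            return path
        except FileExistsError:
            number += 1


def archive(search_dir, archive_root, day):
    """Copy every file in search_dir into a directory named for the date."""
    target = Path(archive_root) / day.isoformat()  # e.g. 2009-07-23
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(search_dir).iterdir()):
        if path.is_file():
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

Because the file is opened in exclusive-creation mode, "almost never overwrites" becomes "cannot overwrite", which is the point of automating the step.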

Helen Helmet

Helen Helmet, a Ph.D. student in mechanical engineering, is currently doing a six-month internship at an engineering firm designing carbon-fiber helmets for firefighters and other emergency service personnel. Her undergraduate courses included an introduction to scientific computing using MATLAB, a robotics course using C, and a numerical methods course that also used MATLAB. She taught herself Fortran during a co-op placement between her junior and senior years, and used it again in a graduate course on finite elements.

Helen’s task is to model the non-combustive thermal degradation (otherwise known as “melting”) of candidate materials. Her starting point is a 14,000-line program her supervisor wrote a decade ago. After deciding that there isn’t time to re-write it in C++ (which she would like to learn), she comments out the calls to the mesh deformation routine in the main loop and begins to write a replacement. She sometimes deletes what she has written and starts over three or four times before she is satisfied.

Helen tests her program by writing the total heat content of the mesh at each time step to a file. She then loads this data into MATLAB to graph the percentage differences between these values and the ones produced by the original program for six sample problems. In one case, the difference grew as large as 30% by the end of the simulation. Helen added write statements to her program to display values until she managed to convince herself that the difference was due to a bug in the original subroutines.

Helen keeps a to-do list on her home page. Every two or three days, she updates this list to show the progress she has made. She keeps completed tasks on the page until the end of the month, when she writes a short status report for her supervisor.

This course will teach Helen to design software before she starts typing, and show her that there are better ways to manage code evolution than commenting out one section and replacing it with another. She will also learn more effective testing and debugging procedures, and how to use a version control system to ensure that she can roll back to an old version of code when she needs to. Finally, she will be shown how to use an issue-tracking system to manage her to-do list, and how to write a small script to generate her monthly progress report automatically.
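To make "more effective testing" concrete, here is a minimal sketch of the comparison Helen does by eye, written as a check a script could run after every change. It is in Python rather than the MATLAB and Fortran she actually uses, the function names are hypothetical, and it assumes the reference heat-content values are never zero.

```python
def percent_differences(reference, candidate):
    """Percentage difference at each time step, relative to the reference run."""
    if len(reference) != len(candidate):
        raise ValueError("runs have different numbers of time steps")
    # Assumes every reference value is non-zero (total heat content).
    return [100.0 * abs(c - r) / abs(r) for r, c in zip(reference, candidate)]


def first_step_over(reference, candidate, tolerance_percent):
    """Index of the first time step whose difference exceeds the tolerance."""
    diffs = percent_differences(reference, candidate)
    for step, diff in enumerate(diffs):
        if diff > tolerance_percent:
            return step
    return None  # the two runs agree within tolerance everywhere
```

Knowing the first time step at which the two programs diverge narrows the search for a bug far more than a single end-of-run figure does.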

Stefan Synthesis

Stefan Synthesis is a graduate student in chemistry who is working as a lab technician to help cover his costs. His only programming experience is a general first-year introduction to computational science using Python.

Stefan’s supervisor is studying the production of fullerenes (also known as “buckyballs”). Each set of experiments involves 100 different reactant mixtures, 20 different temperature regimes, and 5 different pressures. Using a machine built by a collaborating lab, Stefan can run all the mixture and temperature combinations at once, so that the output of each experiment is five files containing 2000 lines of data each.

The controller for the experimental machine writes these files to Stefan’s workstation approximately an hour after the experiment begins. To analyze them, Stefan opens them with Excel, copies and pastes to merge the data into one spreadsheet, then creates a chart using the chart wizard. He saves the chart as a PNG file on the group’s web site, along with the original data file.

Two or three times a week, Stefan receives results from his supervisor’s collaborators. He creates charts for each, which he uploads to the web site, then merges summary statistics into a master spreadsheet.

This course will teach Stefan how to automate the process described above. More importantly, it will teach him how to track the provenance of the data he is working with, so that scientists in his group and others can trace backward from the final charts to the raw data they represent.
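One way the merging and provenance steps might look is sketched below. It assumes, purely for illustration, that the controller's output files are headerless CSV; the function names and the fields of the provenance record are likewise hypothetical. Each merged row is tagged with the file it came from, and a SHA-256 digest of every raw file is recorded so that a chart can later be matched to the exact data behind it.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex digest that identifies the exact contents of a raw data file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def merge_with_provenance(input_paths, merged_path):
    """Merge CSV files into one, tagging each row with its source file.

    Returns one provenance record per input file.
    """
    provenance = []
    with open(merged_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in input_paths:
            path = Path(path)
            rows = 0
            with open(path, newline="") as handle:
                for row in csv.reader(handle):
                    # The first column names the raw file the row came from.
                    writer.writerow([path.name] + row)
                    rows += 1
            provenance.append(
                {"file": path.name, "sha256": sha256_of(path), "rows": rows}
            )
    return provenance
```

Saving the returned records next to each chart would give readers a path back from the picture to the raw files.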

Written by Greg Wilson

2009/03/30 at 20:00

Posted in Content, Lectures, Version 3

Inference for R

Inference for R lets users nest the R statistical language in Word and Excel. It’s a neat idea, and another example of the kind of bottom-up innovation that I predict will eventually lead to fully-fledged extensible programming systems. (If Bespin made it easier to do things like this, I might understand all the excitement…)

Written by Greg Wilson

2009/03/25 at 20:34

Posted in Noticed

Open Notebook Science Badges

I blogged last summer about creating a badging scheme for open science. Turns out it’s been done: ONS Claims has badges for four flavors of open science. Two sets are available in various sizes, all variations on the themes below:

All content, immediate release
All content, delayed release
Selected content, immediate release
Selected content, delayed release

Here’s hoping they’re widely adopted.

Written by Greg Wilson

2009/03/25 at 18:17

Posted in Community

Legal Frameworks for Reproducible Research

My grad students and I met Victoria Stodden for the first time yesterday, and had a great time talking about reproducible research, science 2.0, and most particularly the kind of legal/copyright frameworks needed to move science forward.  She has two papers up that anyone interested in the subject should read: a short one that appeared in Computing in Science and Engineering titled The Legal Framework for Reproducible Research in the Sciences: Licensing and Copyright, and a longer one due out soon called Enabling Reproducible Research: Open Licensing For Scientific Innovation. If you have thoughts on the subject, I’m sure she’d enjoy hearing from you.

Written by Greg Wilson

2009/03/17 at 08:15

Posted in Community, Content

Open Science and Autism’s False Prophets

Paul Offit has a new book out called Autism’s False Prophets, which looks at how the myth about vaccination causing autism arose and spread.  As his condensed essay on the subject says, it’s a far from simple story, but one that has echoes in areas such as climate change where science and public policy must dine at the same table.  I still haven’t decided whether open science will make a difference to this or not: making data, calculations, and papers freely available online in real time will help scientists talk to one another, but I don’t think it will help them communicate with the other 99.9% of our species.  Thoughts?

Written by Greg Wilson

2009/03/02 at 17:17

Posted in Opinion

