Software Carpentry

Helping scientists make better software since 1997



We are planning to reorganize the Software Carpentry course starting in May 2010, and would be grateful for feedback from past, present, and potential future users. Are these users representative of the majority of scientists? Are we making the right assumptions about their prior knowledge and current interests? Who or what have we forgotten? Please send comments directly to Greg Wilson, or add comments to this page.

Bhargan Basepair

Bhargan Basepair received a B.Sc. in biochemistry five years ago. She has worked since then for Genes’R’Us, a biotech firm with labs in four countries. She did a Java programming course as a freshman, and a bioinformatics course using Perl as a senior, but has no other training in programming.

Bhargan and her colleagues are developing fuzzy pattern-matching algorithms for finding similarities between DNA records in standard databases. To help other Genes’R’Us researchers, and to test her group’s heuristics, Bhargan runs an overnight sequence query service. Researchers email sequences in a variety of formats (in-line, attachments, URLs to pages behind the company firewall, etc.). Bhargan saves them in files called search/, search/, and so on, then edits them to add query directives. She almost never accidentally overwrites one query with another.

Before leaving each night, she runs a Perl script that processes these input files to create output files with matching names like search/a.out. When Bhargan comes in the next morning, she pages through her mail again, sending .out files to the appropriate people. (She almost never sends the wrong file to the wrong person.) She uses another Perl script to copy all the input and output files to a directory named after the date, such as 2009-07-23. She and her colleagues would like to do statistics on these saved queries and results to see how well their algorithms are doing, but haven’t yet found the time.

This course will teach Bhargan how to automate her overnight service by writing simple scripts to retrieve, process, and reply to email queries. Those scripts will automatically record queries, results, and other data, and produce a daily summary of the performance of the pattern-matching algorithms.
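As one illustration of the kind of script involved, the archiving step in Bhargan’s routine (copying the day’s input and output files into a directory named after the date) could be sketched in Python roughly as follows. The function name `archive_day` and its parameters are hypothetical, and the sketch assumes all query and output files sit directly in one directory; it is not a description of her actual Perl script.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_day(search_dir, archive_root, day=None):
    """Copy every regular file in search_dir into archive_root/YYYY-MM-DD.

    Hypothetical helper: refuses to overwrite a file that has already
    been archived, so one query cannot silently replace another.
    """
    day = day or date.today()
    target = Path(archive_root) / day.isoformat()  # e.g. 2009-07-23
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for source in sorted(Path(search_dir).iterdir()):
        if not source.is_file():
            continue  # skip subdirectories
        destination = target / source.name
        if destination.exists():
            raise FileExistsError(f"{destination} is already archived")
        shutil.copy2(source, destination)  # copy2 keeps timestamps
        copied.append(destination)
    return copied
```

Raising an error instead of overwriting is a deliberate choice here: it turns "almost never" into "never" for accidental overwrites, at the cost of requiring a manual decision when a name collides.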

Fan Fullerene

Fan Fullerene is a graduate student in chemistry who is working as a lab technician to help cover his costs. His only programming experience is a general first-year introduction to computational science using Python.

Fan’s supervisor is studying the production of fullerenes (also known as “buckyballs”). Each set of experiments involves 100 different reactant mixtures, 20 different temperature regimes, and 5 different pressures. Using a machine built by a collaborating lab, Fan can run all the mixture and temperature combinations at once, so that the output of each experiment is five 2000-line files.

The controller for the experimental machine writes these files to Fan’s workstation approximately an hour after the experiment begins. Fan analyzes them by opening Excel, copying and pasting the data into one big spreadsheet, then creating a chart using the chart wizard. He then saves the chart as a PNG file on the group’s web site, along with the original data files.

Two or three times a week, Fan receives results from his supervisor’s collaborators. He creates charts for each using the same procedure, uploads them to the web site, and then merges summary statistics from the week’s runs into a master spreadsheet.

This course will teach Fan how to automate the process described above, and how to track the provenance of the data he is working with so that scientists can trace backward from the final charts to the raw data they represent. This course will also teach him how to create an RSS feed of results so that other scientists can automatically be notified as new data appears, and how to annotate the data to make it more searchable.
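One simple way to approach provenance tracking is to store, next to each chart, a small manifest listing the raw files it was built from together with a checksum of each. The sketch below is an assumption about how this might be done, not a method prescribed by the course; the names `file_digest` and `write_manifest` and the JSON layout are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_manifest(chart_path, raw_paths, manifest_path):
    """Record which raw data files a chart was built from.

    The manifest layout is illustrative only: one entry per source
    file, each with its path and a checksum of its contents.
    """
    manifest = {
        "chart": str(chart_path),
        "sources": [
            {"file": str(p), "sha256": file_digest(p)} for p in raw_paths
        ],
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

With a manifest like this, anyone holding a chart can check whether the raw files on the web site are still the ones the chart was generated from by recomputing the checksums.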

Helen Helmet

Helen Helmet, a Ph.D. student in mechanical engineering, is currently doing a six-month internship at an engineering firm that makes carbon-fiber helmets for firefighters and other emergency service personnel. Her undergraduate courses included an introduction to scientific computing using MATLAB, a robotics course using C, and a numerical methods course that also used MATLAB. She taught herself Fortran during a co-op placement between her junior and senior years, and used it again in a graduate course on finite elements.

Helen’s task is to model the non-combustive thermal degradation (otherwise known as “melting”) of candidate materials. Her starting point is a 14,000-line program her supervisor wrote a decade ago. She copies the existing mesh deformation function to create a new one called MDFORM2, replaces all the calls to the original, and starts tinkering with the new function. She sometimes writes, runs, and deletes sections of code three or four times before she is satisfied.

Helen tests her program by writing the total heat content of the mesh at each time step to a file. She then loads this data into MATLAB to graph the percentage differences between these values and the ones produced by the original program for six sample problems. In one case, the difference grew as large as 30% by the end of the simulation. Helen added write statements to her program to display values until she managed to convince herself that the difference was due to a bug in the original subroutines, not in her new code.
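The comparison Helen graphs can be written down in a few lines. The sketch below uses Python rather than her MATLAB, and the function name `percent_differences` is hypothetical; it assumes both programs write one total heat content value per time step and that the reference values are nonzero.

```python
def percent_differences(new_values, reference_values):
    """Percentage difference of each new value relative to the reference.

    Assumes one value per time step in each series and nonzero
    reference values.
    """
    if len(new_values) != len(reference_values):
        raise ValueError("series must have the same number of time steps")
    return [
        100.0 * (new - ref) / ref
        for new, ref in zip(new_values, reference_values)
    ]
```

For example, a new value of 130.0 against a reference of 100.0 gives a difference of 30%, the size of the discrepancy described above.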

This course will teach Helen how to do exploratory programming in a disciplined way, and show her that version control systems are a better way to manage code than copying, pasting, and commenting out. She will also learn more effective testing and debugging procedures.

Mehrdad Mapping

Mehrdad Mapping is a graduate student in forestry studying bark beetle infestations in the Canadian taiga. He has never taken a programming course, but used SPSS in an undergraduate statistics course.

For the last three years, Mehrdad has spent six weeks every autumn counting beetle bores in pine trees in the Yukon and Alaska. He now has a spreadsheet with 15,000 entries, each recording the location and time of a measurement, the number of bores found, the moisture and acidity of the soil, and so on. He also has two hundred text files containing 17,500 measurements that his supervisor made in the same regions in the 1970s and 1980s. His task now is to put both sets of measurements on a map so that he can start to correlate changes in bark beetle distribution with changes in climate.

This course is probably too advanced for Mehrdad, as it assumes familiarity with basic programming concepts like loops, conditionals, arrays, and functions. He should probably audit a first-year introduction to programming or find an intensive two-week summer school course before tackling this one.


Written by Greg Wilson

2010/03/23 at 17:00
