Software Carpentry

Helping scientists make better software since 1997

Course Outline

with 2 comments

We are planning to reorganize the Software Carpentry course starting in May 2010, and would be grateful for feedback from past, present, and potential future users. Are the topics below the right ones? Is the problem-based approach accessible? What have we forgotten (and what should we drop to make room for it)? Please send comments directly to Greg Wilson, or add comments to this page.

  1. Introduction
  2. Coding Tricks 1: Associative Data Structures
  3. Track Down a Bug
  4. Read Text Files
  5. Test Some Software
  6. Read More Complicated Text Files
  7. Coding Tricks 2: First-Class Functions
  8. Clean Up This Code
  9. Find Duplicate Files
  10. Share Work With Colleagues
  11. Analyze an Image
  12. Vectors and Matrices
  13. Make a Program Go Faster
  14. Generate Results Automatically
  15. Get Data Out of a Database
  16. XML
  17. Object-Oriented Programming
  18. Represent Information
  19. Build a Desktop User Interface
  20. Create a Web Service
  21. Scaling Up
  22. Organize a Team 1: Empirical Results in Software Engineering
  23. Organize a Team 2: Software Development Lifecycles
  24. Discarded Topics


  • Course design has been driven to date by what we think students ought to know (“push”)
  • Turn that around for V4 and define the course in terms of problems they can solve (“pull”)
  • Each slide set starts with a problem and builds up to solution
    • Suitable for a 45-minute lecture with questions (approx. 20 slides)
  • Hope that this will clarify reorganization of material
  • While showing students how to evolve solutions piecemeal (development process in the small)
  • Note: in prerequisite lists below, “basic programming” means “strings, lists, file I/O, control flow, functions”


Coding Tricks 1: Associative Data Structures

Requires: basic programming

Introduces: sets; dictionaries

Problem: What Molecules Can We Make?

Molecules’R’Us has hired you to write a simple inventory management program for them. Your program is supposed to read in a file describing how many atoms of various kinds are required to make different molecules, and another file describing how many atoms the company actually has in its teeny tiny warehouse, and then print out a list of the molecules the company could make. The first file (describing molecules) is formatted like this:

# Comments start with '#' and go to the end of the line.
# Blank lines (like the one below) are allowed.

helium : He 1               # molecule name, colon, atom type, number of atoms
ammonia : N 1 H 3           # molecules may contain many types of atoms
salt : Na 1 Cl 1            # atom names may be one or two characters long
lithium hydride : Li 1 H 1  # molecule names may be several words long

The file that specifies how many atoms the company has on hand is formatted like this:

# Once again, comments and blank lines are allowed.

He 3                        # atom name, number available
H 2
N 5
Li 1

For these two input files, the output would be:

lithium hydride

Note that molecules’ names appear in alphabetical order, and that the system doesn’t have to say how many molecules of any type could be made.


  1. Sets
  2. Set operations
  3. How set values are stored
  4. Immutability
  5. Frozen sets
  6. Tradeoffs in language design
  7. Motivating dictionaries
  8. Creating and indexing
  9. Updating
  10. Membership
  11. Loops
  12. Other operations
  13. Example: counting frequency
  14. Imposing order


Track Down a Bug

Requires: basic programming

Introduces: debugging, IDEs

Problem: Fixing an Intermittent Bug

You are using a program from Euphoric State University that simulates diffusion-limited aggregation. It seems to work correctly most of the time, but every once in a while the fractals it produces are so improbable that you’re convinced there must be a bug lurking somewhere in the code. The original author hasn’t responded to your emails, so it looks like you have to track it down and fix it yourself.


  1. What’s wrong with print statements
  2. Symbolic debuggers
  3. Command-line debuggers
  4. Inspecting values
  5. Controlling execution
  6. Conditional breakpoints
  7. Watchpoints
  8. Rule 0: Get It Right the First Time
  9. Rule 1: What Is It Supposed to Do?
  10. Rule 2: Is It Plugged In?
  11. Rule 3: Make It Fail
  12. Rule 4: Divide and Conquer
  13. Rule 5: Change One Thing at a Time, For a Reason
  14. Rule 6: Write It Down
  15. Rule 7: Be Humble


Read Text Files

Requires: basic programming

Introduces: data processing patterns; incremental development

Problem: Converting Legacy Data Files to a New Format

Your supervisor has just acquired 300 files containing descriptions of approximately 15,000 organic molecules in a variant of an old-fashioned text format called PDB. She would like you to compare the coordinates of the atoms in 700 of these descriptions to the “standard” coordinates contained in an off-the-shelf molecular dynamics program. Your task is to extract the data describing those 700 molecules from the files and write it out in the required format for further processing.


  1. Motivation: describe PDB files
    • Defer intra-line complexities to next lecture to concentrate on parsing patterns for now
  2. As in Data Crunching, build up to multi-molecule PDB parser in stages
  3. Emphasis is coding idioms (small-scale design patterns)
    • Single-line records
    • Skipping blanks and comments
    • Multi-line records with explicit terminator
    • Multi-line records without terminators (read-ahead by one)


Test Some Software

Requires: basic programming

Introduces: structured unit testing; frameworks

Problem: Testing a Data Conversion Tool

Your supervisor was quite happy with the program you wrote to extract data from old PDB files. She’s less happy about the fact that every time her other students modify it, it takes hours or days to find and fix their mistakes. Your task now is to find a way to prevent bugs getting into the version of the program that’s being used to generate results for publication.


  1. Why test?
  2. Why use a framework for testing?
  3. A basic example
  4. What’s going on under the hood
  5. Testing exceptions
  6. Testing I/O
  7. Stubs and mock objects
  8. Testing performance
  9. Test-driven development


Read More Complicated Text Files

Requires: data processing patterns

Introduces: regular expressions

Problem: Converting More Complex Data Files

You did such a good job converting the PDB files in the previous problem that your supervisor has offered to loan you to a colleague in geology, who wants to convert a dozen files of rock temperature observations from the slopes of dormant volcanoes to a new format. The problem is, those files were typed in by hand by several different graduate students, and are in several slightly different formats. For example, some files use commas as field separators, while others use spaces, and still others use a mix of both. Similarly, some record the dates of observations as “2003-08-05”, while others use “Aug 5, 2008”, and so on.


  1. Motivation: extract complex/varying data from within strings (real PDB format)
  2. Basics of regular expressions
  3. Also cover coding patterns: match, extract, extract-all, extract-and-remainder
  4. Wrap up by rewriting earlier parser using REs
  5. Brief mention of parser generators (Yacc, Antlr), which do for REs what REs do for strings


Coding Tricks 2: First-Class Functions

Requires: basic programming

Introduces: functional programming

Problem: Slicing and Dicing Data

Perhaps you shouldn’t have done such a good job converting those data files: your supervisor now wants to reanalyze the data in some old spreadsheets, and thinks you’d be perfect for the job. At first glance, you’ll have to write a dozen or more programs, all very similar, to do everything she wants; you’d like to write just one.


  1. How programs are represented in C or Python
  2. How a call stack works
  3. Passing functions as arguments
  4. Storing functions in data structures
  5. Map, reduce, filter, etc.
  6. Default parameter values
  7. Extra arguments and extra keyword arguments
  8. Closures (or is this too advanced?)


Clean Up This Code

Requires: basic programming

Introduces: coding style; refactoring

Problem: This Code Is Unreadable

You finally got your hands on the software used to generate results for a paper that you’ve cited twice, and you can’t make heads or tails of it. One function is 750 lines long; another called i_hope_this_works seems like an exact duplicate of the one above it, and a typical variable name is A (which is either the matrix of regression coefficients or the temperature gradient—it’s hard to tell). It’s a mess: you need to clean it up before you can even start thinking about adapting it.


  1. Reading is learning
  2. A little bit of cognitive psychology
  3. What does this have to do with programming?
  4. Naming things
  5. Scope and size
  6. Function length
  7. Code metrics and their shortcomings
  8. Style tools
  9. Documentation
  10. Embedding documentation in code
  11. Refactoring
  12. Common refactorings
  13. Refactoring tools
  14. Working with legacy code


Find Duplicate Files

Requires: basic programming; dictionaries; first-class functions; recursion

Introduces: manipulating files and directories; checksums

Problem: Find Duplicate Data Files

One of the grad students in your lab recently moved to Euphoric State University to start a post-doc. She didn’t organize the data on her computer before she left, so your supervisor has asked you to. After poking around for an hour, you have realized that every time she started a new experiment, she copied entire folders of data, then overwrote some of the copies without changing their names. Before you can start making sense of the data itself, you have to figure out which files are redundant, and which are unique.


  1. Naive solution: compare each file against each other
  2. Better: only compare files that are the same size
  3. Better still: store by size so that finding potential matches is very fast
  4. Write and test the checking function
  5. Double back and write the function that finds all the files and their sizes
  6. What are filesystems and what do they contain?
  7. Getting properties of filesystem objects


  1. Remove duplicates
    • Manipulating files and directories (delete, create, etc.)
    • Why moving is better than deleting
    • And why the original directory structure should be preserved
    • The problem of atomicity
    • Solution
  2. Making file-matching even faster with checksums
    • Data file sizes often aren’t distributed randomly
    • What a checksum is
    • Calculating the probability of collision
    • Solution


Share Work With Colleagues

Requires: nothing

Introduces: version control

Problem: Writing a Paper With Several Other People

Your lab has been working with groups at two other universities to analyze anomalies in the trajectories of deep space probes, and the time has come to write a paper summarizing your findings. The last one produced by this collaboration was 35 pages long, and had 15 authors, 20 figures, and 400 references. It took six weeks to write, half of which was spent tracking down and reconciling bits and pieces that had gone astray in email or been overwritten accidentally. Everyone would like to find a less painful way to get this one written; since you weren’t able to attend the organizational meeting, you’ve been put in charge of figuring out how.


  1. Problem #1: collaboration
  2. Solution: version control
  3. Problem #2: undoing changes
  4. Solution: version control (again)
  5. Which version control system?
  6. The update/edit/commit cycle
  7. Reading Subversion output
  8. Resolving conflicts
  9. Starvation
  10. Binary files
  11. Reverting
  12. Creating a repository

Note: this lecture will use a GUI like SmartSVN so that students don’t need to know how to use a shell in order to use version control.


Analyze an Image

Requires: basic programming

Introduces: images; recursion

Problem: Tracking Cometary Impacts

As a side project to keep yourself from going crazy while you’re revising your thesis, you’ve started modeling what happens when comets impact gas giant planets. To validate your models, you’d like to compare them with data from eleven real impacts. For each impact, you have a sequence of between twenty and sixty photographs; what you’d like to do is measure how quickly the shock wave from the impact propagated and then dissipated. Instead of going through the photographs by hand and trace the perimeter of the shock wave using a drawing tool, you would like to write a program to calculate this boundary automatically.


  1. Raster vs. vector representations
  2. How raster images are stored (colors per pixel)
  3. Basic operations
  4. Flood fill algorithms
  5. Edge detection algorithms


Vectors and Matrices

Requires: basic programming

Introduces: data parallelism

Problem: Simulating Two-Dimensional Cellular Automata

A cellular automaton (CA) is a simple model of some phenomenon that divides the world into cells, then updates the state of each cell simultaneously based solely on its present state and the states of its neighbors. The most famous CA is Conway’s Game of Life, which runs on a two-dimensional grid in which each cell is either “alive” or “dead”. Your task is to write a program to simulate the evolution of a grid of cells according to its rules.


  1. Mathematicians don’t iterate—why should computers?
  2. The idea of data parallelism
  3. Basic operations on vectors
  4. Basic operations on matrices
  5. Reshaping data
  6. Conditional operations with masks
  7. A one-dimensional CA
  8. Replacing conditionals with lookups
  9. Linear algebra
  10. Other applications (e.g., list and set comprehensions)
  11. Why some problems aren’t data parallel


Make a Program Go Faster

Requires: basic programming

Introduces: profiling; algorithmic efficiency

Problem: Speed Up Simulation of Invasion Percolation

A student intern in your lab wrote a small program last summer to study invasion percolation, a simple model of the way fluid forces its way into fractured rock. It’s useful, but it’s also very slow, particularly for large simulations. Your supervisor would like you to make it run faster so that you can do more and bigger simulations in time for an upcoming paper deadline.


  1. The invasion percolation problem
  2. Ways to make it faster (bounding box)
  3. Using a profiler to find out where the time actually goes
  4. How much work has to be done to add a point?
  5. Better ways: use priority queue to track neighboring points
  6. Why this is so much better


Generate Results Automatically

Requires: basic programming

Introduces: build tools; provenance

Problem: Run a Program on Multiple Inputs Automatically Every Night

You are developing new algorithms for finding similarities between DNA records in standard databases. To test them out, you want to set up an overnight sequence query service. Scientists will send sequences to you in email, which you will save in a temporary directory. Each night, your program will process all of the new input files to create matching output files. After checking these in the morning, you will return them to the scientists, then compare them to the results from eariler versions of your program to see if your algorithms are getting better or not.


  1. Why to automate routine tasks
  2. Make pros and cons
  3. SCons pros and cons
  4. Simple rules (single target, single action)
  5. Multiple actions
  6. Don’t repeat yourself (macros in Make, and why)
  7. Dependencies
  8. Dependency graphs
  9. Shorthand notation (automatic variables in Make)
  10. Pattern-based rules
  11. Functions


  1. Sending email programmatically
  2. Reading mailboxes programmatically
  3. Cron to run the process automatically


Get Data Out of a Database

Requires: basic programming

Introduces: relational databases

Problem: Keeping Track of Experiments

After slaving away in a poorly-lit basement office for 19 years, an archivist has created a detailed database of experiments done in the late 1800s and early 1900s. You have a hobbyist’s interest in the history of science, and would like to discover how involved your personal heroes were in certain key discoveries. You are certain this information is in the database; you just need to figure out how to get it.


  1. History
  2. When to use a database
  3. Using SQL
  4. Creating a table
  5. Simple selects
  6. Sorting
  7. Joins
  8. Keys and constraints
  9. Eliminating duplicates
  10. Aggregation
  11. Grouping
  12. Self joins
  13. Null
  14. Using SQL from other languages
  15. Concurrency
  16. Transactions
  17. Nested queries
  18. Negation



Requires: recursion

Introduces: hierarchical data

Problem: Importing Another Bunch of Molecules

Your supervisor just acquired more molecule descriptions for you to check. Unfortunately, these aren’t in the same format as the ones you worked with earlier, but are instead stored in XML. Your job is once again to extract information about 700 specific molecules from those files and write it out in a different format for further processing.


  1. Why markup?
  2. Document structure
  3. Formatting rules
  4. Basic HTML
  5. Attributes
  6. Attributes vs. elements
  7. External data: images
  8. Links
  9. DOM
  10. Finding nodes and extracting immediate information
  11. Creating a tree
  12. Walking a tree
  13. Modifying the tree


Object-Oriented Programming

Requires: basic programming

Introduces: classes and objects; abstraction, polymorphism, aggregation, and inheritance

Note: it’s obviously impossible to introduce object-oriented programming in just a couple of hours, but it’s equally impossible to omit this topic—everyone asks for it.

Problem: Simulating a Laboratory

Euphoric State University has published new guidelines on scheduling the use of heavy laboratory equipment. Like many professors, your supervisor is deeply suspicious of change, and wants to prove that the new rules will actually make labs less efficient. Your task is to come up with a way to simulate the interactions between experiments that need several machines.


  1. Abstract data types
  2. Classes and instances
  3. Defining a class
  4. Creating an instance
  5. Defining methods
  6. Constructors
  7. Creating members
  8. Encapsulation
  9. Polymorphism
  10. Inheritance
  11. Overriding methods
  12. The Liskov Substitution Principle
  13. Overloading operators
  14. Discrete event simulation example
  15. The Template Method design pattern
  16. The Strategy design pattern


Represent Information

Requires: databases; object-oriented programming; XML

Introduces: data modeling; entity-relationship diagrams; class diagrams; robustness diagrams

Problem: Store Information About Extra-Solar Systems

You just got a job with the Euphoric State Science Center, and your first assignment is to help create a program that grade school students can use to explore newly-discovered extra-solar planetary systems. Someone else is going to build the graphical interface; your task is to figure out how to store information about planetary systems in a database, and how to represent that information inside a program.


  1. Entity-relationship diagrams
  2. Database design
  3. First normal form
  4. Second normal form
  5. Object-oriented analysis & design using nouns and verbs or CRC cards
  6. UML class diagrams
  7. Aggregation vs. inheritance
  8. Robustness diagrams
  9. Object/relational mapping (just the concept, not an implementation, or is this too much?)
  10. XML/RDF (or is this too much too?)


Build a Desktop User Interface

Requires: object-oriented programming

Introduces: event-driven programming; GUI design

Note: the real focus of this lecture is event-driven programming, rather than GUI design—the latter is primarily a motivation for the former.

Problem: Build a Custom Interface for Configuring a Simulation Program

The invasion percolation program you sped up has become very popular—so popular that people are asking use to add new features. By far the most frequent request is some sort of graphical interface for setting parameters and displaying results. You have therefore decided to build a simple desktop GUI to let people set up and run simulations interactively.


  1. What makes user interfaces different
  2. How event-driven programming works
  3. A single-button dialog
  4. Separating models from views
  5. The observer/observable pattern
  6. Connecting multiple views
  7. Layout
  8. Top 10 graphic design rules


Create a Web Service

Requires: basic programming; XML

Introduces: HTTP; CGI; RSS

Problem: Mix and Share Data on the Web

The recent discovery at Euphoric State University of a reliable way to produce magnetic monopoles has physicists the world over in a frenzy. Many of them are posting data on their blogs almost hourly, and you’re finding it hard to keep up. You would like to build something that will aggregate this information, weed out duplicates, identify contradictions or inconsistencies, and then publish the results for others to use.


  1. Small pieces, loosely joined
  2. Why distributed is different (partial failure, race conditions, deadlock)
  3. The Hypertext Transfer Protocol (HTTP)
  4. HTTP request line
  5. Headers
  6. Body
  7. HTTP responses
  8. Example: building a simple spider
  9. Passing parameters
  10. Special characters
  11. Handling requests
  12. Web services
  13. Maintaining state in files and databases
  14. Locking
  15. Templating
  16. Aggregating RSS feeds
  17. Publishing RSS feeds
  18. From CGI to 3G web frameworks
  19. Security


  • TBD.

Scaling Up

Requires: first-class functions, systems programming, vectors and matrices, performance optimization.

Introduces: parallel programming

Problem: Run a Program Thousands of Times

There’s good news, and there’s bad news… The good news is that your laser jello simulator finally does everything it’s supposed to The bad news is that despite all your optimizations, each run still takes 20 minutes, and you need data from at least 50,000 runs to finish your thesis. That works out to 694.5 days, which is longer than your supervisor is willing to wait. Your challenge now is to find some way to gather the data you need more quickly.


  1. Moore’s Law
  2. The promise of parallelism
  3. Amdahl’s Law
  4. Shared memory vs. distributed memory
  5. Data parallelism (and its limitations)
  6. Task farming
  7. Map/reduce
  8. Cloud computing
  9. Graphical workflow builders (and their limitations)


Organize a Team 1: Empirical Results in Software Engineering

Requires: nothing.

Introduces: empirical software engineering; key results

Problem: What Do We Actually Know About Software Development?

As late as the 1950s, many doctors rejected the results of clinical studies, saying that what happened “on average” was of no help when they were faced with a specific patient. By the time Sackett coined the term “evidence-based medicine” in 1992, however, most accepted that decisions about patient care should be based on conscientious, explicit, and judicious use of best evidence. This idea is still foreign to most professional software developers, but any academic who claims that a particular tool or practice makes software development faster, cheaper, or more reliable is now expected to back up that claim with some sort of empirical study. Understanding the results of the best of these studies will help you understand why certain practices and tools should be adopted, and what you can expect from using them.


  1. What kind of science is computer science?
  2. Why choices should be driven by science (and why they’re not)
  3. Productivity
  4. Complexity
  5. The mythical man-month
  6. Glass’s Law
  7. Runaway projects1: poor estimation
  8. Runaway projects: unstable requirements
  9. Boehm’s Curve
  10. Reuse
  11. Design vs. coding
  12. Conway’s Law
  13. Effect of language choice
  14. Where the time goes
  15. Testing
  16. Code inspections
  17. Bugs


Organize a Team 2: Software Development Lifecycles

Requires: empirical software engineering

Introduces: plan-driven lifecycles; agile lifecycles

Problem: Run a Small Team of Software Developers

Congratulations! You now have a tenure-track position, three graduate students, and enough grant money that you can spend the next year rewriting your simulation and data analysis programs. You have seen several academic software development projects turn sour, so you’d like to give your students a bit of structure—the question is, what kind, and where to start?


  1. What’s a lifecycle?
  2. Boehm’s Curve
  3. Sturdy (plan-driven) vs. agile
  4. Which is better for scientists?
  5. Common practices: elevator pitches
  6. Common practices: version control, bug tracking, continuous integration, post mortems
  7. Overview of agile development
  8. Extreme Programming
  9. Scrum
  10. Scrum roles
  11. Scrum practices
  12. Scrum artifacts
  13. What goes wrong
  14. Overview of sturdy development
  15. Gathering requirements
  16. From requirements to features
  17. Building a schedule
  18. Development
  19. Tracking progress
  20. Managing failure
  21. What goes wrong
  22. How to choose


Discarded Topics

These topics have all been in at least one earlier version of this course, but will probably be left out of the next one:

  • Design patterns
  • Refactoring
  • Packaging code for release
  • Reproducible research
  • Static and dynamic code analysis tools
  • Geographic information systems
  • Visualization
  • Parallel programming
  • Functional languages
  • Using the shell: still extremely useful (especially when doing systems administration or almost anything with Linux), but experience shows that it’s very hard to persuade students who are used to GUIs that it’s worth learning.
  • Computational complexity: this is a fundamental topic in computer science, but the little bit that all scientists should know can be tucked into the lecture on making programs go faster.
  • Handling binary data (with reading data directly from hardware as a motivating example). This is also fundamental computer science, but only a small fraction of course participants will ever need it. One piece of the existing lecture that is more broadly useful is the notion of self-describing data, but that can be taught as part of parsing text files.
  • Data structures: stacks and trees naturally come up when processing XML; trees will also be discussed (but not directly manipulated) in the lecture on associative data structures. That leaves queues and graphs; the latter is small enough that it can probably be worked in somewhere, while the latter requires at least one lecture of its own. As with binary data handling, though, graphs are only relevant to s a small fraction of students.
  • Integrating with C and Fortran: one previous version of the course showed how to do this both by wrapping legacy code, and by running that code in a subprocess. The former requires a solid grounding in either C or Fortran (which we don’t otherwise assume), while the latter is too specialized to be of general interest.
  • Using spreadsheets: most life scientists do most of their data crunching with spreadsheets, as do a large minority of scientists in other disciplines. This topic is omitted primarily because of shortcomings in today’s spreadsheets: Microsoft Excel isn’t freely available (or in the case of Linux, available at all), and support for scripting OpenOffice Calc and Gnumeric is weak.
  • Data visualization: frequently requested, but experience shows that industrial-strength libraries like VTK are far too complex for this level of course.

Written by Greg Wilson

2010/03/23 at 16:57

2 Responses

Subscribe to comments with RSS.

  1. […] Course Outline […]

  2. […] Course Outline […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: