Archive for August 2009
Cameron Neylon has been playing with Google Wave, and he likes it. His presentation at Science Online in London in August explains why (you can also watch video, though sadly there’s no soundtrack). He’s even writing robots to automate some scientifically interesting tasks. Nature News liked Wave too, which (a) reminds me yet again of how prescient Jon Udell’s “Internet Groupware for Scientific Collaboration” was ten years ago, and (b) makes me wonder (also again) how much this course should be re-thought.
The Unix shell philosophy of creating lots of simple single-purpose tools and then combining them in rich ways has clearly found its second wind on the web. Just look at the options:
- Ad hoc services using something like Django or Rails
- Drag-and-drop GUIs like Yahoo! Pipes
- Special-purpose frameworks like Galaxy
- Workflow tools like Taverna
- Next-generation scripting with something like PowerShell
and on and on and on. Each has its own opinion on what the problem to be solved actually is; each requires different skills, and with the exception of Taverna and Galaxy, they regard scientific computing as one niche interest among many.
The problem, of course, is that with so many different ways to do it, no matter which one(s) the course covers, students will probably be faced with something else when they go back to the lab.
Software Carpentry currently teaches students how to manipulate text (using regular expressions), XML (using DOM), relational data (with SQL), and binary data. A decade ago, when we first put the course together, that covered everything I’d ever seen more than one or two scientists use. Today, though, an increasing number are using geospatial (map) data as well. How important is this to your work? If the answer is “very”, what data do you work with, what do you do with it, and what would you like to be able to do?
The bad news is, retractions of scientific papers have risen tenfold since 1990. The good news is, the rate has gone from 0.0007% to 0.007%. Going back to bad news, though, some estimates of how many papers ought to be retracted are around 1%, so we still have a long way to go.
This thoughtful article from the New York Times asks whether the current grant system for funding research discourages researchers from taking risks. My personal experience undoubtedly biases me, but I tend to agree—the problem is coming up with something better.
Nature’s “Great Beyond” blog reports another attempt to force climate scientists to release their data. I’m of at least two minds on this: I believe openness is absolutely crucial to science (and society as a whole), but I understand scientists’ concern about being scooped, and equally their concern about having their work misrepresented or quoted out of context. Mostly, I come down on the side of openness—how about you?
They say the devil is in the details, but so’s the delight, because it’s the details that determine whether something works or doesn’t. So let’s take a look at how to translate the last post’s “big picture” into actual course content.
Every competent developer uses some kind of tool to automate tasks that involve dependencies. The best known is still Make, which compares the last-modified times on files to see which ones are stale, and runs shell commands to bring them up to date. Ant, Rake, and whatever’s built into your IDE all work basically the same way, and can all be used (with greater or lesser ease) to recompile software, re-run tests, prepare distributions, update web sites, and what have you.
Dependency managers are an example of the kind of tool scientists are willing to spend an hour learning (more if they’re working with C++ or Java, less if they’re working with a scripting language). Understanding how they work, though, requires at least some familiarity with:
- automation (obviously)
- declarative programming (the user declares the “what”, the computer figures out the “how”)
- graphs (which is how these tools figure out what order to do things in)
- queries (since rules are often best expressed using pattern matching)
- programs as data (since dependency managers are programs that run other programs)
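The “graphs” point above can be made concrete: a dependency manager topologically sorts tasks so that each one runs only after its prerequisites. A minimal sketch (the task names are invented for illustration, and real tools would also detect cycles):

```python
def topological_order(dependencies):
    """Given {task: [prerequisites]}, return the tasks in an
    order where every task comes after its prerequisites."""
    ordered, seen = [], set()
    def visit(task):
        if task in seen:
            return
        seen.add(task)
        for prereq in dependencies.get(task, []):
            visit(prereq)          # prerequisites first
        ordered.append(task)
    for task in dependencies:
        visit(task)
    return ordered

# 'paper' needs 'figures', which needs 'data'
rules = {'paper': ['figures'], 'figures': ['data'], 'data': []}
# topological_order(rules) → ['data', 'figures', 'paper']
```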
So, can we use Make to teach these concepts? Or teach these concepts using Make as an example? I thought so back in 2003 when I put together the first version of “CSC207: Software Design” for the University of Toronto. In their first two programming exercises, students worked with graphs and wrote simple text parsers using regular expressions. They then had to put the two together to create a very (very) simple version of Make.
I thought it worked well, but over the years the exercises were cut back until eventually this one disappeared entirely. There was just too much material in the course, and the various bits weren’t connected strongly enough. While it might work in theory, it didn’t in practice, and would probably fare even less well if crammed into two days of an intensive two-week course. It’s still a good example of how I’d like to tie the practical and conceptual parts of the course together, though; the trick is finding a way to make it work.
One of the lessons we learned at Los Alamos National Laboratory in the 1990s and early 2000s is that most scientists don’t actually want to learn how to program—they want to solve scientific problems. To many, programming is a tax they have to pay in order to do their research. To the rest, it’s something they really would find interesting, but they have a grant deadline coming up and a paper to finish.
Getting scientists to make time to learn fundamental ideas that aren’t directly relevant to the problems in front of them is an even harder sell. Partly it’s those pesky deadlines again, but it’s also often the case that the big picture doesn’t make sense until you have first-hand experience with the details. Take abstraction, for instance, or the difference between interface and implementation: if you haven’t written or modified software where those ideas saved you time and heartache, no amount of handwaving is going to get the idea across. The problem, of course, is that it’s impossible to program well without understanding those bigger concepts.
Software Carpentry therefore has to:
- Give scientists programming skills that have a high likelihood of paying large dividends in the short term.
- Convey the fundamental ideas needed to make sensible decisions about software without explicitly appearing to do so.
Based on our experiences in the last 12 years, the skills that students need are fairly settled:
- Clean coding (both micro-level readability and macro-level modularity)
- Version control
- Process automation for building, testing, and deploying software
- How to package software for distribution and deployment
- Managing information and workflow (from bug trackers to blogs)
- Consuming data:
- Text (line-oriented parsing with regular expressions)
- Hierarchical (XML)
- Building desktop GUIs and visualizing data
- Basic security: public/private keys, digital signatures, identity management
- Publishing data and providing services on the web
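As an example of the “text” item in the list above, line-oriented parsing with regular expressions is typically just a loop, a pattern, and a decision about malformed lines. The log format here is invented purely for illustration:

```python
import re

# Invented format: lines like "2009-08-01 temp=21.5"
PATTERN = re.compile(r'^(\d{4}-\d{2}-\d{2})\s+temp=([\d.]+)$')

def parse_readings(lines):
    """Return (date, temperature) pairs, skipping lines
    that don't match the expected format."""
    readings = []
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            readings.append((match.group(1), float(match.group(2))))
    return readings

data = ["2009-08-01 temp=21.5", "not a reading", "2009-08-02 temp=19.0"]
# parse_readings(data) → [('2009-08-01', 21.5), ('2009-08-02', 19.0)]
```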
As Karen Reid and others have pointed out, doing all of that properly would earn you at least a minor in Computer Science at most universities. Cramming it into two weeks is simply not possible.
The bigger picture stuff isn’t as clear yet, but is starting to come into focus. The buzzword du jour, computational thinking, means different things to different people, but Jon Udell’s definition is a good starting point. For him, computational thinking includes:
- Abstraction: ignoring details in order to take advantage of similarities
- A key concept is the difference between interface and implementation
- Querying: understanding how fuzzy matching, Boolean operations, and aggregate/filter dataflow works
- This depends somewhat on understanding how to think in sets
- Structured data: including hierarchical structure, the notion of meta-data (such as tagging and schemas), and so on
- Equally important is understanding that programs work best with structured data, so structure improves findability and automation
- Automation: having the computer do routine tasks so that people don’t have to
- Indirection: giving someone a reference to data, rather than a copy of the data, so their view of it is always fresh
- Syndication: publishing data for general use, rather than sending it directly to a restricted set of people
- The inverse is provenance: where did this data come from, and what was done to it?
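The “thinking in sets” point under querying can be made concrete with Python’s built-in sets; the sample data is invented, but the Boolean operations are exactly the ones a query language exposes:

```python
# Invented sample: which samples passed which quality checks?
passed_purity = {'s1', 's2', 's4'}
passed_yield  = {'s2', 's3', 's4'}

both     = passed_purity & passed_yield   # intersection (AND)
either   = passed_purity | passed_yield   # union (OR)
only_one = passed_purity ^ passed_yield   # symmetric difference (XOR)

# both     → {'s2', 's4'}
# only_one → {'s1', 's3'}
```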
I would like to add all of the following, though I realize that doing so gets us back into “B.Sc. in a week” problems:
- Name spaces, call stacks, and recursion
- Computational complexity: why some algorithms are intrinsically faster than others
- How data is organized:
- Values vs. references and the notion of aliasing
- By-location structures (lists, vectors, and arrays)
- By-name structures (dictionaries and records)
- By-containment structures (trees)
- By-traversal structures (graphs)
- Programming models:
- Aggregate (whole-array, whole-list, etc.)
- Event-driven (which brings in the difference between frameworks and libraries)
- Programs as data
- Functions as objects (another form of abstraction)
- Programs that operate on programs (Make, drivers for legacy programs)
- Quality, including:
- What makes good code better than bad code (psychological underpinnings)
- Testing (including the economics of testing)
- Debugging (the scientific method applied to software)
- The difference between verification (“have we done the thing right?”) and validation (“have we done the right thing?”)
- Continuous improvement via reflection on root causes of errors
- Basic concurrency:
- Transactions vs. race conditions
- Deadlock (much less important in practice)
- Handling failures
- Bricolage: how to find/adapt/combine odds and ends (these days, on the web) to solve a problem
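To show why the values-vs-references distinction in the list above matters in practice, here is aliasing in a few lines of Python:

```python
# Aliasing: two names, one list object
original = [1, 2, 3]
alias = original            # a reference, not a copy
alias.append(4)
print(original)             # → [1, 2, 3, 4]: changed through the alias

independent = list(original)  # an actual copy
independent.append(5)
print(original)             # → [1, 2, 3, 4]: the copy is independent
```

Bugs caused by unintended aliasing are among the hardest for novices to diagnose, which is why the concept earns a place on the list.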
I call on all of this knowledge routinely even when solving trivial problems. This morning, for example, I:
- did a search to find a wiki markup processor I could run from the command line,
- downloaded and installed it,
- changed five lines in the main routine to insert some extra text in its output,
- added a ten-line filter function to overwrite the inserted text with some command-line parameter values, and
- added fourteen lines to a Makefile to turn the wiki text into HTML whenever it’s stale.
It took roughly 15 minutes, and will save me hours in the weeks to come. However, it only took 15 minutes because I’ve spent 29 years mastering the skills and ideas listed earlier. The challenge in creating Version 4.0 of this course will be to figure out how to convey as many of those skills and ideas as can be squeezed into two weeks.
Carl Zimmer, one of my favorite science writers, recently posted about three new books aimed at scientists: Unscientific America, Am I Making Myself Clear?, and Don’t Be Such a Scientist. All three are aimed squarely at the biggest problem modern science faces—the inability of most scientists to explain themselves to non-specialists—and all three are now on my read-soon list. I don’t think communication skills will ever be part of this course, but given the problems our planet faces, they damn well need to be part of every scientist’s education.
From Cameron Neylon, a short video showing how to embed molecules from the ChemSpider service into a wiki page. I was surprised and impressed to discover during his visit to Toronto just how little programming Cameron does: mostly, he leverages his understanding of how information moves around the Internet to plumb existing tools and services together. This is part of (or dependent on) what Jeannette Wing calls “computational thinking”, and one of the goals for the next revision of this course is to focus more on those kinds of skills.