Archive for December 2005
Research proposals funded under the SciDAC program will help create a comprehensive, scientific computing software infrastructure that integrates applied mathematics, computer science and computational science in the physical, biological and environmental sciences for scientific discovery on petascale computers.
My bet is that, once again, most projects will depend on heroic effort, rather than good development techniques, to reach their goals. I’m also willing to bet that anyone who wants to use most of the software these projects create will have to put in heroic effort of their own to get it built and deployed. I (obviously) believe that a little bit of training would go a long way, but I’m not optimistic that the people who need it most will listen: as is so often the case, those who know they need it are already halfway home, while those who need it most don’t even know what they’re missing.
I’m teaching a cut-down version of Software Carpentry at the IASSE in two and a half weeks. I’ll have students half days for the weeks of January 16 and 23, and full days for the week of February 6. That’s only 20 lectures (rather than 26), so the question is, what to cut? The answer has wider implications, since this will be the version of the course I take to the AAAS workshop.
My plan is:
| Month | Day | Lecture | Notes |
| Jan | 16 | Introduction | Revised to be a forward summary of the whole course. |
|     | 20 | Make | Revised so that it doesn’t depend on Python. |
| Jan | 23 | Python 1 | Basic features. |
|     | 24 | Python 2 | Strings and lists. |
|     | 25 | Python 3 | Functions and Libraries. |
|     | 27 | Mini-Project 1 | Build something useful with Python. |
| Feb | 06 | Python 4 | Dictionaries and exceptions. |
|     |    | Debugging | Deepened to include material from Zeller. |
|     | 07 | Python 5 | Object-oriented programming. |
|     |    | Unit Testing | Use the unit test framework to show what good OO design looks like. |
|     | 08 | Coding Style | Update to include an actual Python style guide. |
|     |    | Reflection | Complete rewrite: exec, eval, sub-processes, etc. |
|     |    | XML and DOM |  |
|     | 10 | Development Process | Describe how a good shop actually works (with nods to XP and RUP). |
|     |    | Teamware | Based on Trac. |
Client-side and CGI web programming, security, and databases have disappeared completely; the three lectures on process have been folded into one; and there’s no end-of-course summary. I’m comfortable with those changes; what I don’t like is the amount of time spent teaching Python-the-language. I’d rather spend those hours showing them how to use Python to automate development activities, but you can’t cut trees ’til you have an ax.
Second, there’s no place in this new scheme for a lecture based on Paul Dubois’s CiSE article on maintaining correctness. There really ought to be: it shows the jigsaw puzzle of which many good practices are pieces.
Third, I’d like a second project lecture, showing students part of the build system for the course notes. This would let them see regular expressions and DOM in action, and would tie together many of the earlier ideas on automation. It’s this or teamware, though, and I think the latter is more important. Having made that decision, I’m wavering on whether to pull out the material on regular expressions and DOM.
Finally, everything I have to say about the development process is now squeezed into a single hour. It makes sense in this case, since IASSE students will get several more courses on the subject, but it’s definitely underweight for the AAAS workshop.
So: in order to pull this off, I’m going to have to revise one lecture per day from January 2 onward (including diagrams). I’ll post the new materials here until they’re polished, at which point I’ll swap them into the standard location. I’ll blog each time a lecture goes up: timely feedback would be greatly appreciated.
I finished rewriting the build system for the Software Carpentry course notes yesterday. Doing so was an extended form of procrastination: the system I built over the summer and used through the fall was adequate, but I wanted to clean a few things up, and then, well, I might as well make it easier for other instructors to add site-specific content, and make tables inclusions instead of inlining them, and mumble mumble mumble type type type…
Of course, none of this has actually advanced the content of the course one whit. I have over seventy tickets to close, ranging in size from making sure that a particular Make example does what I claim to rewriting the lecture on security. And diagrams: no one was happy with the isometric ones created this term (not least because they’re kind of fuzzy), so I have over a hundred diagrams to re-do. In a perfect world, they’d be ready before I teach at the IASSE in mid-January. In this universe, I’ll be happy if they’re in place for the Essential Software Skills for Research Scientists workshop at the AAAS Annual Meeting on February 17.
We all do this. We all fold laundry instead of paying bills, or invent an antigravity drive when we’re supposed to be studying for an Economics final. (OK, maybe that was just me.) But it seems particularly common among software developers, many of whom would rather spend two hours creating a new (not better, just new) serialization class hierarchy than take five minutes to center-align the titles at the top of the product’s help page. One of the characters in Mark Costello’s Big If (reviewed here) is a prime example: his company desperately needs him to add some new monsters to a video game, so he spends a week adding shadows to clouds.
But back to the build system… What I have is a set of XML files marked up with a homegrown tag set, and what I want is some HTML pages. The files are organized into several directories: the main page is in the root, while all of the lectures are in lec/, and site-specific content is in sub-directories underneath sites/. Each directory that contains source XML files may also contain tbl/ sub-directories; in turn, each of those has one sub-directory for each of the source files, which holds images, sample code inclusions, and tables.
The build system consists of the following tools:
- A 500-line Makefile in the root directory that drives everything else. Roughly half of those lines are comments (which can be extracted and formatted as a wiki page to create on-line documentation). This Makefile includes another file called config.mk, in which users must specify the lectures they want to include in the course.
- A Python script called linkages.py that scans the source files and builds a data structure that records such things as the order of lectures, where glossary terms are defined, the two-part numerical IDs of figures and tables, and so on. linkages.py writes this data structure directly to a file called tmp/linkages.tmp.py, which other tools then import. Persisting the data structure directly saved me from having to mess around with parsers or serializers. The clever bit (ahem) is that I only write it out if (a) the file doesn't already exist, or (b) the contents have changed. That way, if I change a source file in a way that doesn't affect cross-linkages, Make doesn't do a lot of unnecessary rebuilding.
- Once the linkages file is up to date, preprocess.py kicks in. This script creates copies of the source files under the tmp/ directory (preserving the directory structure), and adds information to those copies to make XSLT's job easier. Among other things, it:
- adds a unique file ID, and the path to the root of the build, to the lecture's root element;
- copies content from table files into the lectures;
- adds citation information to bibliography references;
- does multi-column layout of lengthy tables;
- inserts figure and table counter values (the "4.2" in "Figure 4.2");
- fills in cross-references between source files;
- replaces the <lecturelist/> element with a point-form list of links to lectures;
- fills in the <tbllist> tags with lists of figures and tables respectively;
- links terms in the glossary back to their first uses;
- inserts included program source files;
- links to external references;
- adds "previous" and "next" linkage information to lectures;
- generates a syllabus; and
- adds tracing information, such as file version numbers and the time the files were processed.
Each stage ought to be a filter of its own, and in fact I wrote them all that way to begin with. However, launching fifteen or more copies of the Python interpreter for each source file made the build rather slow; doing the piping internally reduced the time per source file from eight or nine seconds to less than a second.
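The idea of running the filters inside one interpreter can be sketched as below. This is only an illustration, not the real preprocess.py: the two stages, the FIGNUM marker, and the version string are hypothetical stand-ins for the actual transformations, and the real script works on XML rather than plain strings.

```python
import re
from itertools import count


def number_figures(text, lecture_number=4):
    """Hypothetical stage: replace each FIGNUM marker with 4.1, 4.2, ..."""
    counter = count(1)
    return re.sub(
        r"FIGNUM",
        lambda match: f"{lecture_number}.{next(counter)}",
        text,
    )


def add_trace(text, version="r1"):
    """Hypothetical stage: append tracing information as a comment."""
    return text + f"\n<!-- version {version} -->"


def run_pipeline(text, stages):
    """Feed the output of each stage to the next, all in one process."""
    for stage in stages:
        text = stage(text)
    return text


# The order of this list is the order of the "pipe".
STAGES = [number_figures, add_trace]

if __name__ == "__main__":
    print(run_pipeline("See Figure FIGNUM and Figure FIGNUM.", STAGES))
```

Each stage is still written as a self-contained filter, so it can be tested on its own; the only thing that changes is that the interpreter is started once per file rather than once per stage.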
- util/individual.xsl is an XSL script that translates the filled-in XML lecture file into HTML. This script handles the outer skeleton directly, handing specific tasks like the bibliography and special lists to other XSL files that it includes.
- A Python script called util/unify.py and an XSL script called util/unified.xsl work together to create a single-page version of the whole course. unify.py stitches the filled-in lecture files together; unified.xsl then applies the same transformations as individual.xsl, but formats hyperlinks differently (since they're all in-file).
- I use another Python script called validate.py to check the internal consistency of the source files. Do any of them contain tabs or unprintable characters? Do all the required images, source files, and tables exist? I run this before checking in changes; it catches something about one time in five.
- And then there are the minor tools:
- util/fixentities.py replaces character entities with character codes (to work around a problem with Expat);
- util/wiki.py extracts specially-formatted comments from Makefiles and XSL files, and docstrings from Python, to create wiki documentation pages; and
- util/revdtd.py reverse engineers the actual DTD of either the source files, their filled-in counterparts, or the generated HTML files.
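The write-only-if-changed trick described for linkages.py above might look something like the following. It is a minimal sketch under assumptions: the function name write_if_changed, the variable name linkages inside the generated file, and the use of pprint to produce the text are all mine, not taken from the real script.

```python
import os
import pprint


def write_if_changed(path, data):
    """Persist `data` as an importable Python file, but only touch the
    file when its text would actually differ. Returns True if written."""
    # pprint sorts dictionary keys, so the same data always gives the same text.
    new_text = "linkages = " + pprint.pformat(data) + "\n"

    if os.path.exists(path):
        with open(path, encoding="utf-8") as reader:
            if reader.read() == new_text:
                # Leave the timestamp alone so Make sees nothing new.
                return False

    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as writer:
        writer.write(new_text)
    return True


if __name__ == "__main__":
    # Hypothetical data; the real structure records much more.
    example = {"order": ["intro", "make"], "figures": {"make": ["2.1"]}}
    print(write_if_changed("tmp/linkages.tmp.py", example))
```

Because the file's modification time only moves when the cross-linkage data really changes, any Make rule that lists the file as a prerequisite is left alone after edits that don't affect linkages.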
It's a lot of code; it was a lot of work; I'm pleased with how smoothly it all runs; and most of the time I spent building it should probably have gone into upgrading the actual content of the course. But small(ish) tasks are seductive: you can start work at 8:30, confident that you'll have something to show (even if only to yourself) by noon. Editing course notes, well, the payoff is usually a long way away, and may not come at all: people who read through the first, flawed, version of the notes probably aren't going to come back and tell you how much better the second version is.
That last observation is the key ingredient of my cure for procrastination: find some partners. I am always more productive when I'm working with people than I am on my own. Not only does a small team wander down fewer blind alleys than someone working alone, team members can keep each other honest, and give each other feedback and encouragement. They can also appreciate just how big an accomplishment it is to have replaced all the a's and b's in twenty-eight short examples of list manipulation with the names of minerals, beetles, and mathematicians.
It's now ten to eleven, and I've managed to fend off productivity for almost an hour. Should I look on eBay for a WACOM Cintiq 17SX that I can afford? It'd make drawing diagrams much more fun. Or maybe I should try Nose: Miles Thibault says it's much friendlier than the unit testing framework in the Python standard library. Hm... A cup of tea will probably help me decide. A cup of tea, and a slice of toast with strawberry jam...
I’m re-thinking the lectures in the Software Carpentry course based on feedback from this term’s students. I’m going to merge the three lectures on different development processes into one, and use the space that frees up to talk in more detail about programming style and software design—assuming, of course, I can think of something to say that isn’t banal.
I also want to talk about the material in an article by Paul Dubois in the May/June 2005 issue of Computing in Science & Engineering called “Maintaining Correctness in Scientific Programs”. Here are a few key lines from the introduction:
The more frequently a program is changed, the more difficult it is to maintain its correctness… Most programmers can reasonably tell when their programs are incorrect, but for scientific programmers, this is not the case. A bug that doesn’t cause the program to fail in an obvious way could be indistinguishable from an error in modeling the real world with equations… Solving this problem must be the focus of our methodology, be it for a single person writing a 10,000-line program [or] a team of 20 or more writing half a million lines.
Paul then outlines a strategy based on defense in depth which has the following layers:
- a protocol for source control;
- use of language-specific safety tools;
- design by contract;
- reusing reliable components;
- automating testing;
- unit testing (which requires automation to be effective);
- to-main testing policy (i.e., code must be tested before being integrated from a branch into the main line);
- regression testing;
- release management; and
- bug tracking.
This immediately struck me as an excellent way to organize and motivate several important parts of the course. It also points out some holes that I’ll need to fill. Oh, to have more hours, and more hands…
I’ll be teaching a shortened version of the Software Carpentry course at the Institute for Advanced Studies in Software Engineering in Toronto early next year. For more information, or if you are interested in taking part, please see the IASSE’s web site. The dates are:
- January 16-20: one lecture per day, plus practical, with a quiz on Friday.
- January 23-27: ditto.
- February 6-10: two lectures and practical per day, with a short final exam on Friday afternoon.
I’ve got a lot to do between now and then…