Software Carpentry

Helping scientists make better software since 1997

Comments on Course Reorganization

I’m grateful to Lorin Hochstein for sending detailed feedback on my proposal to reorganize the course. His comments are below, with my replies and his counter-replies interspersed; more comments would be very welcome.

Content I think you could drop if you wanted to save time:

Read Data Directly From Hardware. I suspect that this would be relevant to only a small minority of your audience. Especially if you’re teaching the course mostly in Python, because this is the sort of thing you should really do in C.

Greg: Agreed; it’s mostly to motivate a discussion of binary data handling, which I guess isn’t that important to most people either.

Vectorization: I think you could drop this, especially since you have the general Make a Program Go Faster section. (Then again, I don’t know that much about vectorization…).

Greg: Would a title change make it clearer? This is where I wanted to introduce whole-array manipulations (MATLAB-style operations), which I think many scientists do care about.

Lorin: Ah, I didn’t realize this was about MATLAB vectorization (I thought it was related to using an optimizing compiler to take advantage of SIMD instructions). You’re right, this is worth teaching. Back when I was a grad student, I was amazed at the orders of magnitude performance improvement you can get in MATLAB by getting rid of loops and recasting your problems as linear algebra operations. There was a grad student I knew at Boston University who was amazing at turning loops into matrix multiplications.
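
To make the loops-versus-linear-algebra point concrete, here is a minimal sketch. It uses Python with NumPy rather than MATLAB, which is an assumption on my part, since the discussion above only mentions MATLAB. The function names and the pairwise-distance problem are hypothetical, chosen for illustration.

```python
import numpy as np


def pairwise_sq_dists_loop(points):
    """Squared Euclidean distances between all rows, using explicit loops."""
    n = len(points)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            diff = points[i] - points[j]
            out[i, j] = np.dot(diff, diff)
    return out


def pairwise_sq_dists_vectorized(points):
    """Same result via ||a||^2 + ||b||^2 - 2 a.b, with one matrix product."""
    sq_norms = np.sum(points * points, axis=1)  # shape (n,)
    gram = points @ points.T                    # shape (n, n), all dot products
    # Broadcasting adds the row norms and column norms to every entry.
    return sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram


# Illustrative data only: 50 random points in 3 dimensions.
rng = np.random.default_rng(0)
points = rng.random((50, 3))
assert np.allclose(pairwise_sq_dists_loop(points),
                   pairwise_sq_dists_vectorized(points))
```

Both functions return the same matrix, but the second pushes the double loop into a single matrix multiplication. How much faster it runs depends on the problem size and the machine, so it is worth timing on your own data.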

Content-specific comments:

Clean Up This Code. Great idea for a topic. I’m not sure “cyclomatic complexity” is really that important. I vaguely recall a paper that demonstrated that all complexity metrics correlated very closely with function size, so that “size” is really the most important complexity metric there is.

Greg: The paper is El Emam et al.’s “The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics”, and yeah, that’s something I want to add to the lecture.

Test Some Software. I was surprised to see this so late in the curriculum. One of the hardest things I’ve found about unit testing is writing code so that it’s testable. I would have put it up earlier and used unit tests throughout the problems, which would also illustrate how to use unit tests in the different contexts (e.g., unit testing with image analysis). It would also be nice to see some SE testing concepts like category partition testing, code coverage, and fuzz testing.

Greg: I’ve tried that, but given most people’s instinctive aversion to testing, I found that I had to move it later so that I’d built up enough credibility that they’d listen to me 🙂 You’re right, though, I should move it earlier.

Lorin: I think that if you could do nothing else but reduce people’s aversion to testing, the course would still be worth it. 😉 An astounding development (to me, anyways), is how “cool” testing has become in the (agile) software engineering community, unit testing in particular. There are all sorts of testing tools and frameworks everywhere, and many TDD advocates. I don’t have a clue how to transfer this interest to the scientific community, though.
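
For readers who have not met the testing ideas Lorin mentions, here is a rough sketch of two of them using Python's standard `unittest` module: category partition testing (one test per class of input) and a simple randomized, fuzz-style test that checks an invariant. The `clamp` function is hypothetical and exists only to give the tests something to exercise.

```python
import random
import unittest


def clamp(value, low, high):
    """Hypothetical function under test: restrict value to [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


class TestClampPartitions(unittest.TestCase):
    """Category partition: one test per class of input."""

    def test_below_range(self):
        self.assertEqual(clamp(-5, 0, 10), 0)

    def test_inside_range(self):
        self.assertEqual(clamp(7, 0, 10), 7)

    def test_above_range(self):
        self.assertEqual(clamp(15, 0, 10), 10)

    def test_boundaries(self):
        self.assertEqual(clamp(0, 0, 10), 0)
        self.assertEqual(clamp(10, 0, 10), 10)

    def test_invalid_range(self):
        with self.assertRaises(ValueError):
            clamp(1, 10, 0)


class TestClampFuzz(unittest.TestCase):
    """Fuzz-style test: random inputs, checked against an invariant."""

    def test_random_inputs_stay_in_range(self):
        rng = random.Random(42)  # fixed seed so failures are reproducible
        for _ in range(1000):
            low = rng.uniform(-100, 100)
            high = low + rng.uniform(0, 100)
            value = rng.uniform(-300, 300)
            result = clamp(value, low, high)
            self.assertTrue(low <= result <= high)
```

Saved in a file, these tests can be run with `python -m unittest`. Code coverage needs a separate tool and is not shown here.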

Share Work With Colleagues. In the version control lecture, you note that “this lecture will use a GUI like SmartSVN so that students don’t need to know how to use a shell in order to use version control.” But, don’t the students really need to learn how to use the shell to use many of their tools effectively? You have “Using the Unix Shell” as a topic in the course announcement, but I don’t see it show up as its own topic.

Greg: I’m planning to take the shell out—while I use it all the time, and think most power users do likewise, it didn’t make the cut when the number of lectures was restricted. (And it’s hard to convince someone who’s used to GUIs that the shell is worth learning: the payoff takes a long time to arrive…) If I cut binary data handling and/or vectorization, this is a strong candidate to go back in.

Lorin: That makes sense… It does take a long time before you’re more productive in the shell than the GUI. It’s a shame, though.

XML. You could probably drop XHTML safely. I don’t think it’s that popular in practice, and since most HTML out there is not valid XML, if they tried to use XML-based approaches to do HTML scraping, it would fail pretty quickly. (You really need something like Beautiful Soup to do HTML parsing, but I wouldn’t use that to teach XML!).

Greg: Agreed.
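
Lorin's point about scraping can be seen in a few lines. The sketch below feeds a made-up but typical fragment of sloppy HTML to a strict XML parser, which rejects it, and then to a lenient HTML parser, which copes. It uses the standard library's `html.parser` as a stand-in for Beautiful Soup so that nothing extra needs installing; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Unclosed <br> and <p> tags: fine for browsers, not well-formed XML.
SLOPPY_HTML = "<html><body><p>First<br>Second<p>Third</body></html>"


def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Lenient parser that gathers the non-blank text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Collect text from HTML without requiring it to be well-formed."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks


print(parses_as_xml(SLOPPY_HTML))  # False
print(extract_text(SLOPPY_HTML))   # ['First', 'Second', 'Third']
```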

Some of the topics are what I would call “paradigms”, and these are going to be hard to fit into a single lecture, such as:

Object-Oriented Programming. I’m torn about this. It’s hard for me to imagine teaching the OOP concepts in a single lecture. I think the Liskov Substitution Principle could probably be dropped (how often does it really come up in practice?). I’m also a little fearful because inheritance tends to be overused in practice. I’d also drop the design patterns (I don’t think they’ll understand OO well enough to observe that at this point), and possibly even operator overloading.

Greg: I agree that it’s impossible, but everyone asks for it every time the course is taught.

Represent Information. This is a lot of concepts to squeeze into a lecture. If you were to prioritize this, I think database design (and ERD) are more important in practice than some of the UML stuff. RDF can be safely dropped.

Greg: Good points.

Build a Desktop User Interface. Event-driven programming is a big conceptual leap. I’d probably put state diagrams or statecharts in here. Plus, it’s always very tough to pick a GUI toolkit.

Greg: I was going to use Tkinter—yes, it’s broken, but if the main goal is to teach event-driven programming, it’ll get the idea across without students having to install anything else.

Lorin: Yeah, that sounds reasonable. Tkinter is nice and simple, and it’s a great example of the application of first-class functions. It’s too bad Python doesn’t come with a drag-and-drop GUI builder. When you’re starting out with GUI building, it’s hard to see the advantage of programmatically defining a GUI layout.
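
As a rough illustration of Lorin's remark about first-class functions, here is a minimal Tkinter sketch in which the button's behaviour is just a function handed to the toolkit. The `make_counter` helper and the click-counting example are hypothetical; the point is only that the callback is an ordinary value that can be built, passed around, and tested without a window.

```python
def make_counter(on_change):
    """Return a callback that counts its calls and reports each new total."""
    count = 0

    def handle_click():
        nonlocal count
        count += 1
        on_change(count)

    return handle_click


def main():
    # Imported here so make_counter can be used and tested without a display.
    import tkinter as tk

    root = tk.Tk()
    root.title("Click counter")

    label = tk.Label(root, text="Clicks: 0")
    label.pack()

    def show(total):
        # Update the label each time the counter reports a new total.
        label.config(text=f"Clicks: {total}")

    # The callback is passed as a value; Tkinter calls it on each click.
    button = tk.Button(root, text="Click me", command=make_counter(show))
    button.pack()

    root.mainloop()  # hand control to the event loop
```

Calling `main()` opens the window. Nothing runs top to bottom after that: the event loop decides when `handle_click` is invoked, which is the conceptual leap mentioned above.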

Other comments:

Maybe have some content about online resources: where to go to ask a question when you try to apply these and get stuck. StackOverflow, IRC channels, “How to ask questions the smart way”, etc. (This really wouldn’t be a full lecture, maybe just a web page on this?)

Personally, I’m bored to tears sitting in a lecture when there’s source code in the slides. I think your ultimate idea of having a self-paced web-based course is a good one. There’s lots of reference material out there on these concepts, but worked-out examples are rarer. I think the biggest challenge for someone trying these things will be when their personal problem diverges from the example problem in some way and they don’t know how to proceed.

Final question: Have you followed up on previous SC students to see what techniques/practices they adopt after attending the course?

Greg: I did once, but can’t use the data (long story); I’ll be following up with the students from this past July at Christmas to see what’s stuck and what hasn’t. Wish I’d been more systematic in the past, but 20/20 hindsight…


Written by Greg Wilson

2009/10/06 at 15:35

Posted in Content, Lectures, Version 4

2 Responses


  1. The overall redesign looks good to me. I’m about halfway through the old version right now and the focus on questions in the new version should definitely help draw biologists (for whom I can speak) into the material more readily.

    The one thing that I would suggest reconsidering is getting rid of the shell:

    I’m planning to take the shell out—while I use it all the time, and think most power users do likewise, it didn’t make the cut when the number of lectures was restricted. (And it’s hard to convince someone who’s used to GUIs that the shell is worth learning: the payoff takes a long time to arrive…) If I cut binary data handling and/or vectorization, this is a strong candidate to go back in.

    The main reason for this (besides the fact that the shell section of the current course is really nicely done) is that once folks learn all of this stuff they are going to want to start using it on lab servers and high-performance resources. It is increasingly common for academic research groups that do even a small amount of computation to have one or a few servers in the lab for doing the work. In addition, more and more universities have some sort of high-performance cluster. A brief introduction to the shell removes a barrier to running code on lab servers and larger university clusters, where almost all interaction happens through the shell over ssh. I have seen with folks in my group that there is something intimidating about logging in remotely and using the shell – even when the student/postdoc is otherwise computationally savvy.

    If I had to pick something to drop in order to keep the shell portion of the course it would be the binary data handling (but that’s just personal preference).


    2009/10/07 at 20:41

  2. It looks like the blog stripped the ‘quote’ tags. The 3rd paragraph is just a quote from the above discussion.


    2009/10/07 at 20:42

