Software Carpentry

Helping scientists make better software since 1997

Archive for October 2009

By Popular Request…

…I have added a lecture on high performance computing to the revised outline for the course. Several people suggested it, and what’s the point of asking for feedback if I don’t listen?

Written by Greg Wilson

2009/10/30 at 13:37

Posted in Content, Lectures, Version 4

Cryptography Isn’t Security

One topic that I’ve tried to include in this course a couple of times, without success, is security. I feel irresponsible not saying something about how to share safely, but I’ve never found anything that (a) would fit into one hour, (b) wasn’t platitudes, and (c) gave listeners something they could act on.

One reviewer suggested talking about public/private key pairs (to help people set up SSH), signing things digitally, and the like. I’m leery of going down that road, though, since it could easily leave people with a misplaced faith in technical solutions to security problems. As always, suggestions would be welcome…

Written by Greg Wilson

2009/10/23 at 16:38

Posted in Content, Version 4

Should Modeling Be Part of This Course?

Jon Pipitone has a good description on his blog of work the grad students in our department are doing to translate work in climate change into software engineering terms. Their first step is to represent the ideas in MacKay’s excellent Sustainable Energy Without the Hot Air in two of the graphical notations that computer scientists use for system design. I was initially very skeptical, but looking at their work so far, I’m quite impressed. My question is, would it be useful for scientists to know how to do this themselves? More specifically, is the lecture on data modeling that I’ve planned to include in Version 4.0 worthwhile or not?

Written by Greg Wilson

2009/10/21 at 13:20

Posted in Content, Version 4

Creating New Niches

“Publish or perish” is the central credo of academic life: despite all the hoopla about the blogosphere and online what-not, the reality for most of us is that if our work doesn’t get into a respected journal or conference, it doesn’t count.

But what do you do if there isn’t a home for your kind of work? People working in scientific computing have been struggling with this for at least a quarter century: while there are many places to submit the results of programs, there are very few places where you can publish a description of the program itself, even if building it took years and required one intellectual breakthrough after another. In contrast, if you design a new telescope, there are at least half a dozen places you could turn.

(This isn’t just a problem in scientific computing, by the way: Software: Practice & Experience and The Journal of Systems & Software are the only academic venues I know for descriptions of real systems, which may be one of the reasons why so much of the software written in academia is crap—there’s just no payoff for doing it right.)

I don’t know if this situation is going to change, but one hopeful sign is a new journal called Geoscientific Model Development (which I found via Jon Pipitone). It’s still early days, but I hope that giving people some kind of credit for talking about how they do things will encourage them to do those things better, and allow newcomers (like us) to get up to speed more quickly.

Written by Greg Wilson

2009/10/21 at 13:15

Posted in Community, Noticed

Revised Plan

I’ve posted updates to the revised course outline. In particular, I have:

  1. Moved testing earlier.
  2. Clarified intent in a couple of places.
  3. Made a list of things we’re leaving out.

As always, feedback would be welcome. I’d also be grateful for pointers to places that might fund this work: as I’ve found in the past, many people think the course is a good idea, but it doesn’t quite fit into their funding mandates :-(

Written by Greg Wilson

2009/10/16 at 18:19

Posted in Lectures, Version 4

Videos from Symposium Are Now Online

I have put video recordings of the guest talks given at our July 29 symposium on Science 2.0 online. Please click the titles of the talks on the symposium page to view them, or go to the (very simple) index page on my departmental site. (I’ll figure out a better hosting solution after Canadian Thanksgiving is out of the way, but didn’t want to delay posting the talks any longer.) Thanks again to all of our speakers, to the MaRS Centre technicians for the raw recordings, and to Tanya Murdoch for editing.

Written by Greg Wilson

2009/10/08 at 18:57

Posted in Content, Version 3

Comments on Course Reorganization

I’m grateful to Lorin Hochstein for sending detailed feedback on my proposal to reorganize the course. His comments are below, with my replies and his counter-replies interspersed; more comments would be very welcome.

Content I think you could drop if you wanted to save time:

Read Data Directly From Hardware. I suspect that this would be relevant to only a small minority of your audience. Especially if you’re teaching the course mostly in Python, because this is the sort of thing you should really do in C.

Greg: Agreed; it’s mostly to motivate a discussion of binary data handling, which I guess isn’t that important to most people either.

Vectorization: I think you could drop this, especially since you have the general Make a Program Go Faster section. (Then again, I don’t know that much about vectorization…).

Greg: Would a title change make it clearer? This is where I wanted to introduce whole-array manipulations (MATLAB-style operations), which I think many scientists do care about.

Lorin: Ah, I didn’t realize this was about MATLAB vectorization (I thought it was related to using an optimizing compiler to take advantage of SIMD instructions). You’re right, this is worth teaching. Back when I was a grad student, I was amazed at the orders of magnitude performance improvement you can get in MATLAB by getting rid of loops and recasting your problems as linear algebra operations. There was a grad student I knew at Boston University who was amazing at turning loops into matrix multiplications.
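To make the loop-versus-whole-array idea concrete, here is a minimal sketch in Python. It assumes NumPy is installed (the discussion above is about MATLAB, so this is a translation of the idea rather than anything from the course), and the function names and sample numbers are made up for illustration. Both functions compute the same weighted sum; the second recasts the loop as a single dot product.

```python
import numpy as np


def weighted_sum_loop(values, weights):
    """Weighted sum computed one element at a time."""
    total = 0.0
    for i in range(len(values)):
        total += values[i] * weights[i]
    return total


def weighted_sum_vectorized(values, weights):
    """Same weighted sum expressed as one whole-array operation."""
    return float(np.dot(values, weights))


# Illustrative data only.
values = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.array([0.5, 0.25, 0.125, 0.125])

loop_result = weighted_sum_loop(values, weights)
vector_result = weighted_sum_vectorized(values, weights)
```

On four elements the two are indistinguishable; the difference Lorin describes only shows up once the arrays are large enough that the per-element interpreter overhead of the loop dominates.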

Content-specific comments:

Clean Up This Code. Great idea for a topic. I’m not sure “cyclomatic complexity” is really that important. I vaguely recall a paper that demonstrated that all complexity metrics correlated very closely with function size, so that “size” is really the most important complexity metric there is.

Greg: The paper is El Emam et al’s “The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics”, and yeah, that’s something I want to add to the lecture.
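For readers who have not met these metrics, the following is a rough sketch of how one might measure both size and an approximation of cyclomatic complexity for Python functions using the standard `ast` module. It is not the method used in the paper: the set of node types counted as branches is a simplifying assumption (boolean operators are ignored, for example), and `rough_metrics` and `SAMPLE` are hypothetical names.

```python
import ast

# Node types treated as decision points (a deliberate simplification).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def rough_metrics(source):
    """Return (name, line_count, approx_complexity) for each function."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            branches = sum(
                isinstance(child, BRANCH_NODES) for child in ast.walk(node)
            )
            lines = node.end_lineno - node.lineno + 1
            # Complexity is approximated as decision points plus one.
            results.append((node.name, lines, branches + 1))
    return results


SAMPLE = '''
def short(x):
    return x + 1

def longer(x):
    if x < 0:
        return -x
    for i in range(3):
        if i == x:
            return i
    return x
'''

metrics = rough_metrics(SAMPLE)
```

Even in this toy sample the longer function is also the more "complex" one, which is the kind of overlap between size and complexity that Lorin is recalling.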

Test Some Software. I was surprised to see this so late in the curriculum. One of the hardest things I’ve found about unit testing is writing code so that it’s testable. I would have put it up earlier and used unit tests throughout the problems, which would also illustrate how to use unit tests in the different contexts (e.g., unit testing with image analysis). It would also be nice to see some SE testing concepts like category partition testing, code coverage, and fuzz testing.

Greg: I’ve tried that, but given most people’s instinctive aversion to testing, I found that I had to move it later so that I’d built up enough credibility that they’d listen to me :-) You’re right, though, I should move it earlier.

Lorin: I think that if you could do nothing else but reduce people’s aversion to testing, the course would still be worth it. ;) An astounding development (to me, anyways) is how “cool” testing has become in the (agile) software engineering community, unit testing in particular. There are all sorts of testing tools and frameworks everywhere, and many TDD advocates. I don’t have a clue how to transfer this interest to the scientific community, though.
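As a small illustration of Lorin's point about writing code so that it is testable, here is a sketch using Python's standard `unittest` module. The function `mean_intensity` is a hypothetical stand-in for a piece of image analysis; the key design choice is that it takes its data as an argument rather than reading a file itself, so a test can hand it a tiny, known input.

```python
import unittest


def mean_intensity(pixels):
    """Return the mean of a non-empty list of pixel values."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    return sum(pixels) / len(pixels)


class TestMeanIntensity(unittest.TestCase):
    def test_uniform_image(self):
        self.assertEqual(mean_intensity([7, 7, 7]), 7)

    def test_mixed_values(self):
        self.assertAlmostEqual(mean_intensity([0, 255]), 127.5)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            mean_intensity([])


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMeanIntensity)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` runs the three checks and reports any failures; in a real project the same tests would more usually be run from the command line with `python -m unittest`.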

Share Work With Colleagues. In the version control lecture, you note that “this lecture will use a GUI like SmartSVN so that students don’t need to know how to use a shell in order to use version control.” But, don’t the students really need to learn how to use the shell to use many of their tools effectively? You have “Using the Unix Shell” as a topic in the course announcement, but I don’t see it show up as its own topic.

Greg: I’m planning to take the shell out—while I use it all the time, and think most power users do likewise, it didn’t make the cut when the number of lectures was restricted. (And it’s hard to convince someone who’s used to GUIs that the shell is worth learning: the payoff takes a long time to arrive…) If I cut binary data handling and/or vectorization, this is a strong candidate to go back in.

Lorin: That makes sense… It does take a long time before you’re more productive in the shell than the GUI. It’s a shame, though.

XML. You could probably drop XHTML safely. I don’t think it’s that popular in practice, and since most HTML out there is not valid XML, if they tried to use XML-based approaches to do HTML scraping, it would fail pretty quickly. (You really need something like Beautiful Soup to do HTML parsing, but I wouldn’t use that to teach XML!).

Greg: Agreed.
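The failure Lorin describes is easy to reproduce with the XML parser in Python's standard library. The two markup strings below are invented for illustration: the first is well-formed, while the second has an unclosed `<br>` and `<p>` of the sort browsers tolerate in ordinary HTML. `try_parse` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

well_formed = "<html><body><p>Hello</p><br/></body></html>"
tag_soup = "<html><body><p>Hello<br></body></html>"


def try_parse(markup):
    """Return the root tag if markup is well-formed XML, else None."""
    try:
        return ET.fromstring(markup).tag
    except ET.ParseError:
        return None


xhtml_result = try_parse(well_formed)  # parses; the root element is "html"
soup_result = try_parse(tag_soup)      # rejected by the XML parser
```

An XML parser has to reject the second string outright, which is why scraping real-world pages needs a more forgiving HTML parser.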

Some of the topics I would call “paradigms”, and these are going to be hard to fit into a single lecture, such as:

Object-Oriented Programming. I’m torn about this. It’s hard for me to imagine teaching the OOP concepts in a single lecture. I think the Liskov Substitution Principle could probably be dropped (how often does it really come up in practice?) I’m also a little fearful because inheritance tends to be overused in practice. I’d also drop the design patterns (I don’t think they’ll understand OO well enough to observe that at this point), and possibly even operator overloading.

Greg: I agree that it’s impossible, but everyone asks for it every time the course is taught.

Represent Information. This is a lot of concepts to squeeze into a lecture. If you were to prioritize this, I think database design (and ERD) are more important in practice than some of the UML stuff. RDF can be safely dropped.

Greg: Good points.

Build a Desktop User Interface. Event-driven programming is a big conceptual leap. I’d probably put state diagrams or statecharts in here. Plus, it’s always very tough to pick a GUI toolkit.

Greg: I was going to use Tkinter—yes, it’s broken, but if the main goal is to teach event-driven programming, it’ll get the idea across without students having to install anything else.

Lorin: Yeah, that sounds reasonable. Tkinter is nice and simple, and it’s a great example of the application of first-class functions. It’s too bad Python doesn’t come with a drag-and-drop GUI builder. When you’re starting out with GUI building, it’s hard to see the advantage of programmatically defining a GUI layout.
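The conceptual leap in event-driven programming, and the role first-class functions play in it, can be sketched without any GUI toolkit at all. This is not Tkinter code: `EventDispatcher`, its `bind` and `fire` methods, and the event names are all hypothetical, chosen only to show handlers being passed around as values and called later when something happens.

```python
class EventDispatcher:
    """Maps event names to lists of handler functions."""

    def __init__(self):
        self._handlers = {}

    def bind(self, event_name, handler):
        """Register a function to call when event_name fires."""
        self._handlers.setdefault(event_name, []).append(handler)

    def fire(self, event_name, payload):
        """Call every handler bound to event_name; ignore unknown events."""
        for handler in self._handlers.get(event_name, []):
            handler(payload)


log = []


def on_click(position):
    log.append(("clicked", position))


dispatcher = EventDispatcher()
dispatcher.bind("click", on_click)  # a named function as a value
dispatcher.bind("key", lambda key: log.append(("typed", key)))  # a lambda

# Stand-in for the toolkit's event loop: events arrive in some order,
# and the program only reacts to the ones it has handlers for.
for event_name, payload in [("click", (10, 20)), ("key", "q"), ("scroll", 3)]:
    dispatcher.fire(event_name, payload)
```

The program never calls `on_click` directly; it hands the function to the dispatcher and waits, which is the inversion of control that newcomers to GUI programming tend to find surprising.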

Other comments:

Maybe have some content about online resources: where to go to ask a question when you try to apply these and get stuck. StackOverflow, IRC channels, “How to ask questions the smart way”, pastebin.com/pastie.com, showmedo.com, etc. (This really wouldn’t be a full lecture, maybe just a web page on this?)

Personally, I’m bored to tears sitting in a lecture when there’s source code in the slides. I think your ultimate idea of having a self-paced web-based course is a good one. There’s lots of reference material out there on these concepts, but finding worked out examples is rarer. I think the biggest challenge for someone trying these things will be when their personal problem diverges from the example problem in some way and they don’t know how to proceed.

Final question: Have you followed up on previous SC students to see what techniques/practices they adopt after attending the course?

Greg: I did once, but can’t use the data (long story); I’ll be following up with the students from this past July at Christmas to see what’s stuck and what hasn’t. Wish I’d been more systematic in the past, but 20/20 hindsight…

Written by Greg Wilson

2009/10/06 at 15:35

Posted in Content, Lectures, Version 4
