Software Carpentry

Helping scientists make better software since 1997

Next Lecture?

The Software Carpentry course currently contains the following lectures:

  1. Introduction
  2. The Unix Shell (2 lectures)
  3. Version Control
  4. Automated Builds
  5. Basic Scripting (bool/int/float, for/while/if)
  6. Strings, Lists, and Files
  7. Functions and Libraries
  8. Programming Style
  9. Quality Assurance (basic testing)
  10. Sets, Dictionaries, and Complexity
  11. Debugging
  12. Object-Oriented Programming (2 lectures)
  13. Unit Testing (unittest — should switch this to nose)
  14. Regular Expressions
  15. Binary Data
  16. XML
  17. Relational Databases
  18. Spreadsheets
  19. Numerical Programming (the basics of NumPy)
  20. Integration (subprocess+pipes and wrapping C functions)
  21. Web Client Programming (HTTP request/response, URL encoding)
  22. Web Server Programming (basic CGI processing)
  23. Security (the weakest lecture of the bunch)
  24. The Development Process (a mish-mash of sturdy and agile)
  25. Teamware (introduces portals like DrProject)
  26. Conclusion (various “where to look next” suggestions)

Between now and Christmas, I want to tidy them up, duplicate the examples in MATLAB, and add some of the content I wrote for “CSC301: Introduction to Software Engineering”. Since I won’t have time to do everything, I’d like your help prioritizing. Which of the following topics do you think is most important to add? And what have I forgotten entirely?

  • Lifecycle: should I split the existing “Development Process” lecture into two, one covering agile methods (focusing on Scrum) and one covering sturdy methods (i.e., longer release cycles, more up-front planning, legacy code)? Neither exactly fits scientists’ “exploratory programming” paradigm, but they’re all we’ve got…
  • Quality: this would expand the “Programming Style” lecture with material from Spinellis’s Code Reading and Code Quality to describe what makes good software good.
  • Deployment
    • Currently based on the patterns in Nygard’s Release It!, which focus on designing scalable fault-tolerant applications.
    • Should I instead cover the creation and distribution of packages (e.g., RPMs, Distutils, Ruby Gems, etc.)?
  • Refactoring: a combination of Fowler’s original Refactoring and Feathers’ Working Effectively with Legacy Code.
  • UML: I devote three lectures to this in CSC301; I don’t see any reason to inflict it on scientists.
  • Reproducible Research: it’s already important, and likely to become more so; it also ties in with “open science”, though I’m not sure what I could say about either that wouldn’t just be rah-rah and handwaving—tools like Sweave are interesting, but I don’t think people would be willing to learn R just to use it, and there don’t seem to be equivalents (yet) in other languages. The same goes for data lineage: it’s an important idea, and there are plenty of research prototypes, but nothing has reached the “used by default” level of (for example) Subversion.
  • GUI Construction: people still use desktop GUIs, and it’s worth learning how to build them (if only because it forces you to come to grips with MVC and event-driven programming), but what everyone really wants these days is a rich browser-based interface, and I don’t think it’d be possible to fit that into this course.
  • High Performance Blah Blah Blah: this one keeps coming up, but (a) one of the motivations for Software Carpentry is the belief that there’s too much emphasis on this in scientific computing anyway, and (b) what would it include?  GPU programming?  MPI?  Grid computing? Some other flavor-of-the-week distraction from the hard grind of creating trustable code and reproducible results without heroic effort?  Oh, wait, are my biases showing?

Written by Greg Wilson

2008/07/28 at 13:38

Posted in Content, Lectures, Version 3

15 Responses

  1. Greg,

    I think the one thing that is missing, and is missing from a lot of software training, is some guidance on software design. I mean design in the sense of how to take a science or engineering problem and express it in a form that is amenable to being turned into a program. In the old days we used to try teaching flow diagrams and problem decomposition into a form that was amenable to the languages of the time. I am not sure how you would go about the problem now – the problem is that people start programming before they have even worked out what the problem is that they are trying to solve.

    The exploratory programming paradigm is a good one, but it still needs some framework to work within. As you say, agile programming methods are probably the best there is, but even they still leave a lot to be desired.

    I like the suggestion of providing some exposure to “code reading”, as I think it is one of the best ways to improve the overall quality of code and to raise programmers’ awareness of the need to make one’s code clear, and not just to write “fancy” code that does things in a less than intuitive way (and if you do use such code, ensure it is well documented to explain what is happening).

    I don’t know if this helps, but I think the emphasis should be on increasing the up-front thinking before just programming – I know it goes against the grain and much of the current trend, but it is based on a bit of experience.


    2008/07/28 at 19:55

  2. Since your biases are clear I will not outright advocate including HPC. However, I think students (definitely the current crop, who learn to program with Java) have insufficient awareness of the fact that there is a machine executing their program.

    It may not fit your course, but I definitely include in my teaching something about stack and heap, allocation, memory leaks, the efficiency of this and that. Programming with this in mind is definitely part of hpc, and I don’t consider it a fad of the day.

    Here is a link from someone else arguing the need for such awareness


    Victor Eijkhout

    2008/07/28 at 22:13

  3. Some HCI/Design content would be good and it could be integrated in GUI content.


    Stephen De Gabrielle

    2008/07/29 at 03:12

  4. Reproducible research is important – I think this should go into a code of scientific programming ethics that is covered briefly and then emphasized when various parts of the course reference it.

    High performance – it would be nice for students to learn the basics of the ideas – the difference between modifying your code for use with a framework and simply adding start/end parameters for highly parallelizable tasks that could be run by a script on a cluster job queue (e.g., Grid Engine).
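    The start/end split described above can be sketched in a few lines of Python (the `process_item` function is a made-up stand-in for the real per-item work; each cluster job would run the script with its own range):

```python
import sys

def process_item(i):
    # stand-in for the real per-item computation
    return i * i

def run(start, end):
    # each cluster job handles items [start, end); no parallel
    # framework is needed, just different arguments per job
    return [process_item(i) for i in range(start, end)]

if __name__ == "__main__":
    start, end = int(sys.argv[1]), int(sys.argv[2])
    for value in run(start, end):
        print(value)
```

    A job queue like Grid Engine would then launch the same script many times with non-overlapping ranges.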

    Gary Bader

    2008/07/29 at 08:13

  5. I have a suggestion and a comment on one of your bullet points. The suggestion is to include a lecture on the importance of standards. Many fields – and I’m thinking of my experience in the Virtual Observatory here – are beginning to develop formal standards for things like data formats, metadata, web services, etc. Other disciplines, such as bioinformatics, are seeing de facto standards emerge through common use of a single tool (e.g., Taverna and Scufl). Admittedly, this probably doesn’t rate its own lecture; instead it could be a theme that is used throughout. It may help to counteract the “Not Invented Here” syndrome, especially as it applies to data formats.

    Now, my comment is on the High Performance Computing bullet – yes, the topic should be included, and it does rate its own lecture, preferably with the title “Parallel Computing”. I believe part of the audience will be expecting it, or at least will be thinking about it, so it’s better to talk about it than to ignore it. Plus, some of us do work hard to produce good code for HPC, and we shouldn’t assume that others can’t learn to do the same.

    I think the lecture should answer the following questions:
    * What is parallel computing?
    * When is it necessary?
    * What is shared memory? What is distributed memory?
    * High performance, or high throughput?
    * What tools or libraries are available?

    The lecture should be focused on readily available resources, and I guess I’m translating that directly to MPI and OpenMP. These are things that will work on a departmental cluster or a multi-core workstation. I would steer clear of GPU programming and Grid computing; these topics are too specialized, and the term Grid computing is only defined in the reference frame of the person using it (myself included).
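    As a minimal illustration of the high-throughput end of this spectrum, Python’s standard-library multiprocessing module shows the “many independent runs” pattern without any MPI installation (the `simulate` function below is purely a placeholder):

```python
from multiprocessing import Pool

def simulate(seed):
    # placeholder for one independent simulation run
    return seed * seed

if __name__ == "__main__":
    # farm eight independent runs out to four worker processes
    with Pool(processes=4) as pool:
        results = pool.map(simulate, range(8))
    print(results)  # each entry computed in a separate process
```

    Shared-memory (OpenMP-style) and distributed-memory (MPI-style) parallelism both need more machinery than this, which is exactly the distinction the lecture would have to draw.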

    A Physicist

    2008/07/29 at 12:15

  6. I had a brief thought about your software carpentry course, and what I think would be a good addition would be something like “maintaining code”. Ideas like refactoring, the DRY principle, programming style, etc. would be related. The way that most computer science and programming courses are taught today focuses on how to write code that works, and never touch it again. What I’ve found very difficult is how to rewrite code. In scientific research, one often does not have a well defined idea about how something should work, the design of the software evolves over the lifetime of the project, and often one wants slightly different versions of some tool. The idea of “write once, throw away, then write the real code” often doesn’t work due to deadline pressures.

    As for some stuff to give lower priority, I suggest reducing the time given to interfacing parts like spreadsheets, XML, databases, etc. The reason is that I believe such recommendations would often be out of date very soon, as APIs change.

    A Scientist

    2008/07/29 at 12:16

  7. Re: deployment and redistribution

    It does happen that people want to use the tools I am accumulating. So far, I give them a .py module containing functions, and a tutorial on writing python code to call those functions. Most people in my field have a little experience with R, so that approach works reasonably well, but knowing more about deployment would be helpful. Especially package creation?
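    For the package-creation case, a Distutils setup script (the standard Python mechanism at the moment) can be very small; the module name and metadata below are of course placeholders:

```python
# setup.py -- "python setup.py sdist" builds a distributable archive
from distutils.core import setup

setup(
    name="mytools",           # placeholder package name
    version="0.1",
    py_modules=["mytools"],   # the single .py module of functions
    description="Analysis functions shared with colleagues",
)
```

    Recipients then run "python setup.py install" instead of being handed a bare .py file and a tutorial.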

    re: high performance

    I don’t know what GPU programming is, I may not care, and I want the simplest, easiest solution that will work. However, it does happen that running python code on my desktop is sometimes not enough, and it is useful to know what the options/tradeoffs are in that case. Writing parts of the code in C? Finding and using extra computers? An introduction to estimating required resources might also be helpful (e.g., if an O(n^c) algorithm requires X RAM and takes Y minutes to process a 100×100 grid on my desktop, how many years need I wait for a) a computer big enough to process all of Canada, or b) someone to write a more efficient algorithm?).
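    That kind of back-of-envelope estimate is itself only a few lines of code (all the constants below are invented for illustration):

```python
def scaled_time(base_minutes, base_size, target_size, exponent):
    # If an O(n^c) algorithm takes base_minutes at problem size base_size,
    # estimate its running time at target_size as t * (target/base)^c.
    return base_minutes * (target_size / base_size) ** exponent

# 10 minutes for a 100-unit-wide grid, with cost growing as n^2:
# a 10,000-unit-wide grid would take 10 * 100^2 = 100,000 minutes,
# i.e. roughly 69 days -- time to look for a better algorithm.
print(scaled_time(10, 100, 10_000, 2) / (60 * 24), "days")
```

    The point is not the arithmetic but the habit: measure once at a small size, then extrapolate before committing to the big run.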

    A Biologist

    2008/07/29 at 12:18

  8. My vote would be for refactoring, because it will help with understanding good style, working with legacy code (which I imagine scientific programmers run into a lot), and writing better code day to day. I did a refactoring course a few months ago and found the techniques to be really helpful, and Feathers’ Working Effectively with Legacy Code is gold.


    2008/07/29 at 14:16

  9. For quality, the ‘what makes software good’ (and/or useful) is a great add. Some of the thoughts I would include:
    • Having modular components – both Kev and I don’t re-use entire code, we re-use the modules. Similarly, I started off by writing the scaffolding, then have written modules that all run off of the same framework to do quite varied analyses.
    • Writing your own legacy code – when you’re gone, can your supervisor add to it? Maintain it? How do you design something such that they can? (This goes beyond documentation.)
    • Build versions, and keeping track of what version did your analysis. It’s a naive approach, but I currently keep a text file for each dataset that gets appended each time a dataset is run – so I know what version of the code generated those numbers.
    A side-note for Matlab and object oriented programming (which I found very useful – undergrad was eons ago, and at that time we weren’t taught OOP) – DICOM headers in Matlab are essentially a form of object.
    >> head = dicominfo('file.dcm');
    >> SlLoc = head.SliceLocation;

    Yes – an intro to GUIs would be brilliant. Honestly, that’s one of the limitations for creating legacy code – or, ‘tools that people actually use’ – if your supervisor or lab isn’t command-line savvy, no matter how wonderful your stuff is, no one is going to use it. They want a button. My stuff has been passed over many times because people are afraid of the ‘>>’ (even if I give/write a detailed manual – people rely on intuition, not instructions…).

    Some intro into parallelization might be useful. I don’t know if there’s a standard approach that can be taught/suggested for how to break your problem down, but – if so… It’s not something I do, but as computing power improves, it makes sense to look forward at how we could actually harness it…

    And a small/silly one – but strategies for dealing with LARGE datasets are useful for scientists, i.e., do you really have to read in the entire volumetric dataset at one time? How might you recognize the ‘set’ you need? (Example – I read the slice locations for the full 3D x time dataset, which is small, then only read in images for a particular slice, versus the full volume. Matlab dies if you try to do the full thing…) Then, I actually write a file with the slice indices, so that when I do it again, I don’t have to read the dataset again (it’s time-consuming).
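    The cache-the-index trick in the last paragraph generalizes well; here is a sketch in Python (the file name and the scanning function are made up for illustration):

```python
import json
import os

def scan_slice_locations(dataset_dir):
    # placeholder for the slow pass that reads every file's header
    return {"slice_00": [0, 1, 2]}

def slice_index(dataset_dir, cache_path):
    # reuse the saved index if this dataset was scanned before;
    # otherwise do the slow scan once and write the result out
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    index = scan_slice_locations(dataset_dir)
    with open(cache_path, "w") as f:
        json.dump(index, f)
    return index
```

    The first call pays the full scanning cost; every later call just reads one small file.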


    2008/07/29 at 20:03

  10. What about reverse engineering code? Or at least understanding legacy code? What about software architecture, particularly conceptual architecture?

    Architecture Astronaut

    2008/07/30 at 14:14

  11. You mention Sweave, but why not go ahead and teach literate programming? noweb is simple enough to make everyone happy. You could probably get the same effect as Sweave’s continual updating by writing a few TeX macros to call the external program (is this possible in TeX? I’ve never tried). It also means you can keep program derivations next to the code itself (you do derive your loops from invariance conditions, right?).

    Instead of rich browser interfaces, a lecture on interface design might be better. What do you really have to expose? The vast majority of scientific applications with rich interfaces would be much more useful with simple, fast interfaces that let you do 90% of what users want to do, and a hidden ability to do the other 10%, à la Google’s options in the search string. Similarly, command-line interfaces can be made markedly easier to use with a little thought. Compare cp with seqret in EMBOSS. seqret at its most basic is just cp with format conversion, but it doesn’t reduce to that in the simplest case.

    Frederick Ross

    2008/08/02 at 06:18

  12. @Frederick Ross: I was a literate programming zealot for about eight months back in 1990 or thereabouts, but I find that a mix of LaTeX and [name of your favorite programming language goes here] is harder to read in source than either on its own. More importantly, nobody has ever built a debugger for literate programs: you can extract the code, build it, and debug that, but then you have to map what you find back to the actual source. I’ve also never seen any hard data showing that literate programs actually take less time to write, debug, or maintain than “conventional” ones. If anyone knows of such data, I’d welcome a pointer.

    Greg Wilson

    2008/08/02 at 20:31

  13. On the subject of literate programming, I’ve recently started using Idiopidae:

    It’s a world away from noweb/Sweave. It can be used to mix any sort of code with any sort of text file, I’ve used it very happily with HTML/textile and LaTeX. Code lives in its own files and is pulled into a document (or multiple documents) based on delimiters specified in code comments, so the code is fully executable, testable and debuggable. It’s written in Python and incorporates a very robust syntax highlighter.

    I have found it to be a very useful tool with lots of potential applications. I only wish I had had it back when I wrote my thesis using Sweave! 🙂

    Ana Nelson

    2008/08/03 at 09:09

  14. @Ana Nelson:

    Maybe I’m old-fashioned, but the title graphic on that page was so off-putting that it left me completely uninterested in its contents. I’m also still trying to wrap my head around the equivalence (or otherwise) of “reproducible research” and “literate programming”. I understand it if people want to deliver papers that allow readers to play with parameters, but there are lots of cases where that’s impractical (large data sets, large volumes of computation), so is some sort of data lineage system part of the picture as well?

    Greg Wilson

    2008/08/03 at 12:34

  15. Greg, (sorry, I just can’t bring myself to use the @ convention on a name with a space, and @”Greg Wilson” doesn’t appeal to my inner compiler either) 😉

    First, my Sweave story: when I started working on my thesis in agent-based computational economics, I quickly found that I was writing chapters and developing models in parallel. The process of writing about my results would suggest refinements, corrections and new ideas to explore. Hence code would be updated, and new simulation results obtained, then new graphs and tables would need to be generated. Very quickly I realized that updating my document manually would be untenable, as well as being error-prone.

    This led me to Sweave, which is a tool for incorporating R code and R output into a LaTeX document. With this, I was able to develop a workflow in which I could modify my simulation code (in Java) or even just change my random seed, run a fresh simulation and regenerate my entire thesis with all data, calculations, tables and graphs automatically refreshed based on the latest simulation output.

    So, I used Sweave very successfully in this case, and I was happy to have found a solution to my problem of incorporating data into my document in a reliable way. However, with the benefit of hindsight and the availability of new tools, I would not use Sweave again. In brief, Sweave was limited to incorporating R source, I couldn’t easily bring in my Java code for discussion. The R code used for computations and graphs was embedded in my document, making it difficult to debug, test or reuse (yes, you can put some of your R code in external files and source() it but that gets very inconvenient too). Also, I could not render my LaTeX source without first running Sweave. This had two major disadvantages. Firstly, towards the end of preparing a complete document the number crunching got to take a lot of time. Secondly, by not being able to run just the LaTeX I had written myself it was not immediately obvious whether a LaTeX compiler error was due to a problem in something I had typed or in the Sweave-generated LaTeX. And, not least, I found installing noweb/Sweave to be a rather unpleasant experience.

    Anyway, I was motivated by pure self-interest in my use of Sweave, but as you rightly pointed out, there are other considerations. I do think it’s very important to have readers be able to validate and experiment with code and at least a decent subset of your data set (if relevant). This means not only releasing your code, but releasing it in such a way that it is feasible and practical for readers to figure out how to get it up and running. So, standard good practices for software apply here: use open source frameworks where possible/appropriate (don’t reinvent the wheel by writing your own simulation framework, database interface or web framework unless you really know what you are doing and have a compelling reason to do so), have a public source code repository (preferably, in my opinion, using a distributed version control system), have a comprehensive test suite where all the tests pass, write readable code (as little of it as possible) with appropriate amounts of commenting, and if you really want to make life easy for readers (and hence increase the number of people using your code, and perhaps *cough* the number of citations of your paper) a comprehensive user guide… By the way, the person who will benefit most from this comprehensive user guide is YOU when you come back to this line of research after a 6 month break and can’t even remember how to switch the thing on.

    Now, as to the distinction between “reproducible research” and “literate programming”, Knuth’s original idea (as I understand it – someone please correct me if I’m wrong… oh yeah, this is the internet, so someone assuredly will 🙂 ) was to have documentation and code in 1 document. Thus single-purpose documentation and only one instance of documentation per file of code. This is more or less the approach that Sweave takes too. However, Idiopidae takes a very different approach which I have found much more useful, which is to keep code separate but make it very easy to write documents incorporating code and the output obtained from running such code. Hence, in contrast to “literate programming”, Idiopidae’s approach is for multi-purpose, multi-instance content based on one or many files of code.

    For example, let’s say you write a model in some simulation framework. You write a detailed user guide in HTML with a getting-started tutorial chapter, followed by a comprehensive overview of the software, incorporating live code snippets throughout. In the course of writing this up, you spot several areas where your code could be cleaner and realize a nice way of completely refactoring one of your classes, which your benchmarking shows results in a 10% reduction in the time it takes to run an iteration of your model. (Writing about code has always proven a great boon to me in improving my code. Partly it’s the spectre of public scrutiny, partly it’s the reflection inherent in the process of writing.) You make the changes and within seconds your HTML has been updated with the new code. You then go on to write 3 brilliant papers based on results obtained using this model. Each paper, written in LaTeX, incorporates both code snippets and simulation results, and you easily reuse and publish your R (or Matlab, or SciPy) analysis routines for all 3 papers. If you also write 20 blog posts about this model, some talking about the programming aspects, some describing your statistical methodology and others discussing the science behind the model, incorporating code snippets, simulation results and graphs, you can see how investment in the right tools really pays off in terms of productivity. (And then there’s the book, several conference presentations, a commencement address or two…) And, this is where this approach shines, not in the speed of writing your original code, but in the volume of benefit you and others can derive from that code due to the lower cost of creating and maintaining documentation and derivative works.

    The setup I use now (Idiopidae + Rake + LaTeX or Idiopidae + Webby) allows me to incorporate source code from any language, beautifully syntax highlighted, the output from running any script, be it an R computation, a shell script or a Ruby programme, graphics generated with Graphviz, R or any command-line utility, and to integrate any or all of these into any sort of document I want: a HTML blog post, a LaTeX chapter, a Beamer presentation.

    I haven’t touched on your data lineage system question, my comment is far too long already :-). I will only say briefly that there are several excellent open source databases out there along with several excellent open source ORM frameworks to help you interface with those databases. I am currently using Ruby’s DataMapper to manage and analyze simulation data in a PostgreSQL database (and of course I can pull these results directly into any document). Python has SQLAlchemy which I have heard good things about, but not had a chance to try yet.

    Ana Nelson

    2008/08/04 at 05:53

Comments are closed.
