Software Carpentry

Helping scientists make better software since 1997

Archive for July 2008

Next Lecture?

The Software Carpentry course currently contains the following lectures:

  1. Introduction
  2. The Unix Shell (2 lectures)
  3. Version Control
  4. Automated Builds
  5. Basic Scripting (bool/int/float, for/while/if)
  6. Strings, Lists, and Files
  7. Functions and Libraries
  8. Programming Style
  9. Quality Assurance (basic testing)
  10. Sets, Dictionaries, and Complexity
  11. Debugging
  12. Object-Oriented Programming (2 lectures)
  13. Unit Testing (currently uses unittest; should switch this to nose — see the sketch after this list)
  14. Regular Expressions
  15. Binary Data
  16. XML
  17. Relational Databases
  18. Spreadsheets
  19. Numerical Programming (the basics of NumPy)
  20. Integration (subprocess+pipes and wrapping C functions)
  21. Web Client Programming (HTTP request/response, URL encoding)
  22. Web Server Programming (basic CGI processing)
  23. Security (the weakest lecture of the bunch)
  24. The Development Process (a mish-mash of sturdy and agile)
  25. Teamware (introduces portals like DrProject)
  26. Conclusion (various “where to look next” suggestions)
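
Since I flag switching the unit-testing lecture from unittest to nose above, here is a rough sketch of the difference in practice. The running_mean function and its test values are invented for illustration; the point is just how much boilerplate each framework needs.

    import unittest

    # Hypothetical function under test (made up for this example).
    def running_mean(values, width):
        return [sum(values[i:i + width]) / float(width)
                for i in range(len(values) - width + 1)]

    # unittest style: a TestCase subclass plus self.assert* methods.
    class TestRunningMean(unittest.TestCase):
        def test_width_two(self):
            self.assertEqual(running_mean([1, 3, 5, 7], 2), [2.0, 4.0, 6.0])

    # nose style: any plain function named test_* is collected by the
    # nosetests command, so the class and assertion methods go away.
    def test_running_mean_width_two():
        assert running_mean([1, 3, 5, 7], 2) == [2.0, 4.0, 6.0]

    if __name__ == "__main__":
        unittest.main()

Running the file directly exercises the unittest version; running nosetests on it picks up both tests.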

Between now and Christmas, I want to tidy them up, duplicate the examples in MATLAB, and add some of the content I wrote for “CSC301: Introduction to Software Engineering”. Since I won’t have time to do everything, I’d like your help prioritizing. Which of the following topics do you think is most important to add? And what have I forgotten entirely?

  • Lifecycle: should I split the existing “Development Process” lecture into two, covering agile methods (focusing on Scrum) in one and sturdy methods (longer release cycles, more up-front planning, legacy code) in the other? Neither exactly fits scientists’ “exploratory programming” paradigm, but they’re all we’ve got…
  • Quality: this would expand the “Programming Style” lecture with material from Spinellis’s Code Reading and Code Quality to describe what makes good software good.
  • Deployment
    • Currently based on the patterns in Nygard’s Release It!, which focus on designing scalable fault-tolerant applications.
    • Should I instead cover the creation and distribution of packages (e.g., RPMs, Distutils, Ruby Gems)? A minimal Distutils sketch follows this list.
  • Refactoring: a combination of Fowler’s original Refactoring and Feathers’ Working Effectively with Legacy Code.
  • UML: I devote three lectures to this in CSC301; I don’t see any reason to inflict it on scientists.
  • Reproducible Research: it’s already important, and likely to become more so; it also ties in with “open science”, though I’m not sure what I could say about either that wouldn’t just be rah-rah and handwaving. Tools like Sweave are interesting, but I don’t think people would be willing to learn R just to use it, and there don’t seem to be equivalents (yet) in other languages. The same goes for data lineage: it’s an important idea, and there are plenty of research prototypes, but nothing has reached the “used by default” level of (for example) Subversion.
  • GUI Construction: people still use desktop GUIs, and it’s worth learning how to build them (if only because it forces you to come to grips with MVC and event-driven programming), but what everyone really wants these days is a rich browser-based interface, and I don’t think it’d be possible to fit that into this course.
  • High Performance Blah Blah Blah: this one keeps coming up, but (a) one of the motivations for Software Carpentry is the belief that there’s too much emphasis on this in scientific computing anyway, and (b) what would it include?  GPU programming?  MPI?  Grid computing? Some other flavor-of-the-week distraction from the hard grind of creating trustable code and reproducible results without heroic effort?  Oh, wait, are my biases showing?
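
To make the packaging option under “Deployment” concrete, here is a minimal Distutils setup.py; the project name, metadata, and module are placeholders rather than anything that actually exists.

    from distutils.core import setup

    setup(
        name="labtools",                 # placeholder project name
        version="0.1",
        description="Helper routines for a lab's data analysis",
        author="A. Scientist",
        py_modules=["labtools"],         # ships a single labtools.py module
    )

Running “python setup.py sdist” builds a source archive that colleagues can unpack and install with “python setup.py install”.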

Written by Greg Wilson

2008/07/28 at 13:38

Posted in Content, Lectures, Version 3

Badge of Reproducibility

Coming back to the badge meme from earlier this week, John Cook’s new Reproducible Research blog pointed me at this page on the EPFL site advertising a paper called “What, Why and How of Reproducible Research in Signal Processing”. Notice the “Reproducible Research” badge? The “add your evaluation” link takes you to a formlet that lets you choose between:

  1. I have tested this code and it works
  2. I have tested this code and it does not work (on my computer)
  3. I have tested this code and was able to reproduce the results from the paper
  4. I have tested this code and was unable to reproduce the results from the paper

It’s a good start…

Written by Greg Wilson

2008/07/23 at 19:02

Posted in Community, Noticed, Opinion

Quick Quiz to Measure What Scientists Know

Suppose you have a room full of scientists—hundreds of ’em—and want to find out how they actually use computers in their work. There isn’t time to interview them individually, or to record their desktops during a typical working week, so you’ve decided to ask them to self-assess their understanding of some key terms on a scale of:

  1. No idea what it is.
  2. Use it/have used it infrequently.
  3. Use it regularly.
  4. Couldn’t get through the day without it.

My list is below; what have I forgotten, and (more importantly) how would you criticize this assessment method?

  1. A command-line shell
  2. Shell scripts
  3. Version control system (e.g., CVS, Subversion)
  4. Bug tracker
  5. Build system (e.g., Make, Ant)
  6. Debugger (e.g., GDB)
  7. Integrated Development Environment (e.g., Eclipse, Visual Studio)
  8. Numerical Computing Environment (e.g., MATLAB, Mathematica)
  9. Inverse analyzer (e.g., Inane)
  10. Spreadsheet (e.g., Excel)
  11. Relational database (e.g., SQLite, MySQL, Oracle)
  12. Markup-based document formatting (e.g., LaTeX, HTML)
  13. WYSIWYG document formatting (e.g., Word, PowerPoint, OpenOffice)

Now, you have the same room full of scientists, and you want to find out how much they know about software development. There still isn’t time to interview them or have them solve some programming problems, so again you’re falling back on self-assessment. This time, the scale is:

  1. No idea what it means.
  2. Have heard the term but couldn’t explain it.
  3. Could explain it correctly to a junior colleague.
  4. Expert-level understanding.

and the terms themselves are:

  • Nested loop
  • Switch statement
  • Stable sort
  • Depth-first traversal
  • Polymorphism
  • Singleton
  • Regular expression
  • Inner join
  • Version control
  • Branch and merge
  • Unit test
  • Variant digression
  • Build and smoke test
  • Code coverage
  • Breakpoint
  • Defensive programming
  • Test-driven development
  • Release manifest
  • Agile development
  • UML
  • Traceability matrix
  • User story

Once again, my questions are (a) what have I forgotten, and (b) how “fair” is this as an assessment method?

Written by Greg Wilson

2008/07/23 at 07:41

Posted in Content, Research

Reviving the Software Carpentry Mailing List

Luke Petrolekas and I are thiiiiis close to having the Software Carpentry notes converted to a wiki. Once they are, I’m going to be working with Tina Yee to update them, do the examples in MATLAB as well as Python, and fix some longstanding bugs.  I’m also going to resurrect the project’s two mailing lists (one for occasional announcements, the other for people interested in developing new material and/or teaching the course).  If you’d like to be on either or both, please let me know.

Written by Greg Wilson

2008/07/22 at 18:35

Posted in Lectures, Version 3

Badge of Honor?

I met up with Shirley Wu, Michael Nielsen, and a few other ISMB attendees yesterday to talk about what’s variously called Science 2.0 or Open Science. It was pretty rushed (and not helped by the bar we wound up in), but it got me thinking about creating an “open science” badge that scientists could apply to their work. Right now, people are using a variety of terms in inconsistent ways; it sometimes takes a very close reading to figure out exactly what they mean. I’d really like to see the PSB workshop (or some other meeting like it) put a peg in the ground and say, “If you do the following things, you can put this ‘open science’ badge on your lab’s web site, and put ‘This research is certified open’ in your papers.” The W3C’s familiar badges and the Open Source Initiative’s certification of software licenses have done a lot to clarify discussion, and have given people standards to aspire to. Nine years after the “Open Source/Open Science” workshop at Brookhaven National Laboratory, maybe it’s time to borrow those ideas and put them into practice.

Written by Greg Wilson

2008/07/19 at 19:57

Posted in Community

Kevin’s Been Busy

Kevin Brown has been busy — he’s been coordinating, installing, maintaining, fixing, and figuring out how to use a new $20 million supercomputer for cancer research. No word on how much money will be spent training people how to use it effectively, but hey, I’m easy to reach… 🙂

Written by Greg Wilson

2008/07/01 at 05:20

Posted in Noticed