Archive for March 2010
David Bradley has created a periodic table of science bloggers that regular readers might enjoy:
As I said in last week’s announcement, and mentioned again in a later post, one of the main goals of this rewrite is to make it possible for students to do the course when and where they want to. That means recording audio and video, but much of the material will probably still be textual: code samples (obviously), lecture notes (for those who prefer skimming to viewing, or who want to teach the material locally), and exercises will still be words on a virtual page. And even the AV material will (probably) be accompanied by scripts or transcripts, depending on what turns out to work best.
Which brings up a question everyone working with computers eventually faces: what format(s) should material be stored in? For images, audio, and video, the choices are straightforward: SVG for line drawings, PNG for images, MP3 for audio, and MP4, MPEG, or FLV or video (I don’t know enough yet to choose). But there’s a bewildering variety of options for text, each with its pros and cons.
I mentioned yesterday that I maintain a list of books that haven’t been written yet. Partly it’s an exercise in sympathetic magic—if the reviews exist, maybe the books will follow—but it’s also useful for organizing my thoughts about what a programmer’s education should look like. Looking at the books I’ve matched to various topics in the Software Carpentry course outline, there are some distressing gaps:
- Given that programmers spend upwards of 40% of their time debugging, there are very few books about it, and only one collection of exercises (Barr’s Find the Bug).
- There’s a lot on higher-level programming techniques, but it’s scattered across dozens of books as disparate as The Seasoned Schemer, Effective C++, and The Practice of Programming. I haven’t read Perrotta’s Metaprogramming Ruby yet, but it looks like it will be another rich source of ideas.
- Material on systems programming—manipulating files and directories, running sub-processes, etc.—is equally scattered. The Art of Unix Programming includes all the right topics, but covers too much, in too much detail, at too low a level. Gift & Jones’ Python for Unix and Linux System Administration has the same two faults (from Software Carpentry’s point of view—I think both are excellent books in general), but uses a scripting language for examples, so it made the list.
- Mark Guzdial and others have done excellent research showing the benefits of teaching programming using multimedia, i.e., showing students how to manipulate images, sound, and video as a way of explaining loops and conditionals. That’s half of why the revised course outline includes image processing early on (the other halves being “it’s fun” and “it’s useful”). Once again, most of what I’m familiar with is either documentation for specific libraries, or textbooks on the theory of computer vision, but there are some promising titles in the MATLAB world that I need to explore further.
- Performance. It’s been 15 years since I first grumbled about this, and the situation hasn’t improved. Most books on computer systems performance are really textbooks on queueing theory; of that family, Jain’s Art of Computer Systems Performance Analysis is still head and shoulders above the crowd. Souders’ High Performance Web Sites is the closest modern equivalent I’ve found to Bentley’s classic Writing Efficient Programs, but neither is really appropriate for scientists, who need to think about disk I/O (biologists and their databases), pipelining and caching (climatologists with their differential equations), and garbage collection (everybody using a VM-based language). I had hoped that High Performance Python would fill this gap, but it seems to have been delayed indefinitely. (And yes, I’ve looked at Writing Efficient Ruby Code; it has some of what our students want, but not nearly enough.)
- There are lots of books about data modeling, but all the ones I know focus exclusively on either the relational approach or object-oriented design, with a smattering that talk about XML, RDF, and so on. I haven’t yet found something that compares and contrasts the three approaches; pointers would be welcome.
- Web programming. There are (literally) thousands of books on the subject, but that’s the problem: almost all treatments are book-length, and this course only has room for one or two lectures. It is possible to build a simple web service in that time, but only by (a) using a cookbook approach, rather than teaching students how things actually work, and (b) ignoring security issues completely. I’m not comfortable with the first, and flat-out refuse to do the second: if this course shows people how to write a simple CGI script that’s vulnerable to SQL injection and cross-site scripting, then it’s our fault when the students’ machines are hacked. This gap is as much in the available libraries as in the books, but that doesn’t make it any less pressing.
Given these gaps, I may drop one or two topics (such as performance and web programming) and either swap in one of the discarded topics or spend more time on some of the core material. I’m hoping neither will be necessary; as I said above, pointers to books in any language that are at the right level, and cover the right areas, would be very welcome.
I’m slightly obsessed with reading lists. (I even maintain a list of books that haven’t been written yet, in the hope that it will inspire people to turn some of the entries from fantasy into reality.) Partly to give credit to all the people whose work inspired Software Carpentry, and partly to guide students who want to learn more than we can fit into a double dozen lectures, I have started a bibliography, and added links to relevant books to the lecture descriptions in the course outline. Pointers to other material would be very welcome; I will blog soon about areas that I feel are particularly lacking.
As well as deciding on the format of the course, I have to re-shape its content. In contrast to e-learning, there seems to be a lot of solid material available on instructional design. The most useful guide I’ve found so far is Wiggins & McTighe’s Understanding by Design. I was initially a bit put off by the micro-industry the authors have built around the book, but its step-by-step approach immediately felt right:
- What are students supposed to understand at the end of the lesson?
- How is that going to be determined, i.e., what questions will they be answer that they couldn’t answer before, or what will they be able to do that they couldn’t do before?
- What lessons and activities are going to help them acquire that knowledge and those skills?
The whole thing is a lot more detailed than that, but you get the gist. And note that the last point says “help them acquire”, not “teach them”: while the latter focuses on what the instructor says, the former focuses on helping students construct understanding, which is both more accurate and a better fit for the level of students this course targets.
I’ve already used their ideas in reshaping the course outline. If the right way to deliver the course turns out to be 200 vignettes rather than 25 lectures, I will need to do some chopping and rearranging, but I think that what I have is a good starting point. Once I know what format I’m going to choose, I will rework the outline in accordance with the three-step approach summarized above and ask for feedback.
As the announcement of Version 4 said, Software Carpentry is being redesigned so that it can be delivered in several ways. I want to support:
- traditional classroom lectures, with someone at the front of the room talking over a series of slides and/or coding sessions to a captive audience;
- students reading/viewing material on their own time, at their own pace, when and as they need it; and
- hybrid models, in which students work through as much as they can on their own, then get help (face-to-face or over the web) when they hit roadblocks.
#1 isn’t easy to do well, but the challenges involved are well understood. #2 and #3 are going to be a lot harder: it’s new ground for me, and despite the fact that the Internet is older than many of my students, most of the educational establishment still thinks of it as “new” as well.
There are hundreds of books and web sites devoted to e-learning, but the majority just recycle the same handful of inactionable truisms. (“When designing online material, try to make it as engaging as possible.” Well, duh.) Most of the high-quality material focuses on research about e-learning, rather than instructional design itself. For example, Richard Mayer’s Multimedia Learning says a lot of interesting things about whether people learn more deeply when ideas are expressed in words and pictures rather than in words alone, and the principles he derives from his research are good general guidelines, but again, there’s little help offered in translating the general into the specific.
If there isn’t much explicit guidance available, what about prior art? MIT’s Open Courseware got a lot of attention when it was launched, but its “talking heads” approach reminds me of early automobiles that looked like horse-drawn carriages with motors bolted on. Carnegie-Mellon’s Open Learning Initiative (which advertises itself as “open courses backed by learning research”) is more interesting, but what has really caught my eye is Saleem Khan’s Khan Academy, which I first encountered through one of Jon Udell’s interviews. Khan has created hundreds of short videos on topics ranging from basic addition to mitosis and Laplace transforms by recording himself sketching on a tablet. The results are just as digestible as Hollywood-quality material I’ve viewed elsewhere, and with 25 lectures to do in less than 50 weeks, his low-ceremony approach appeals to me for practical reasons as well.
Of course, any believer in agile development would tell me that there’s only one right way to tackle this problem (and in fact, one did just an hour ago). By the end of May, I plan to put one lecture—probably the intro to relational databases and SQL—up on the web in two or three formats, and then ask for feedback. Is one 50-minute video better or worse than five 10-minute vignettes? Do people prefer PowerPoint slides with voiceover, live sketching/coding sessions (complete with erasures and typos), or some mix of the two? How important is it to close-caption the videos? If classroom-style slides are available as well as the video, how many people look at each? I know how to do these kinds of usability studies, and hopefully enough people will volunteer opinions to help me choose the right path.
I am very excited to announce that I am going to work full-time on revising the Software Carpentry course from May 2010 to May 2011. This work has been made possible by the generosity of our sponsors:
- The School of Informatics and Computing at Indiana University
- The Gene Expression in Disease and Development Focus Group at Michigan State University
- The Centre for Digital Music at Queen Mary University of London
- The UK Met Office
I would also like to thank The MathWorks, the University of Toronto, the Python Software Foundation, and Los Alamos National Laboratory, whose support over the past 13 years has allowed us to help scientists use computers more productively.
Version 4 of Software Carpentry will improve on its predecessors in three significant ways. First, the content will be reorganized and updated to better meet scientists’ needs. As with Version 3, a typical graduate student or research scientist should be able to cover all of the material in a regular one-term course with approximately 25 hours of lecture and 100-150 hours of exercises.
Second, we intend to provide parallel versions of the material in MATLAB and Python, so that scientists who already know numerical computing’s most popular scripting language can dive right into the parts that interest them most. We have been testing a MATLAB translation of the Version 3 notes this winter with good results, and are grateful to the students at the University of Toronto who have tried them out and given us feedback.
Third, and most importantly, Version 4 of the course will be designed so that students can work through most or all of the material on their own, at their own pace, when they need it. To do this, we will make video recordings of the lectures available, along with screencasts and interactive examples, and provide over-the-web support via email, Skype, and desktop sharing to help early adopters when they run into roadblocks. We hope that this will allow us to reach, and help, many more people that would otherwise be possible.
Software Carpentry is an open project: all of the materials are available under the Creative Commons Attribution license, and can be freely shared and remixed provided you include a citation. If you would like to help us help scientists be more productive, please contact Greg Wilson by email at email@example.com or as gregory_v_wilson on Skype.
Analyzing Next-Generation Sequencing Data
May 31 – June 11th, 2010
Kellogg Biological Station, Michigan State University
CSE 891 s431 / MMG 890 s433, 2 cr
Applications are due by midnight EST, April 9th, 2010.
Course sponsor: Gene Expression in Disease and Development Focus Group at Michigan State University.
Instructors: Dr. C. Titus Brown and Dr. Gregory V. Wilson
This intensive two week summer course will introduce students with a strong biology background to the practice of analyzing short-read sequencing data from the Illumina GA2 and other next-gen platforms. The first week will introduce students to computational thinking and large-scale data analysis on UNIX platforms. The second week will focus on mapping, assembly, and analysis of short-read data for resequencing, ChIP-seq, and RNAseq.
No prior programming experience is required, although familiarity with some programming concepts is suggested, and bravery in the face of the unknown is necessary. 2 years or more of graduate school in a biological science is strongly suggested.
You can now follow our progress at ‘swcarpentry’ on Twitter.
Let’s start with the problem description:
All of the Software Carpentry course material (including lecture notes, code samples, data files, and images) is stored in a Subversion repository. That’s currently hosted at the University of Toronto, but I’d like to move it to the software-carpentry.org domain (along with this blog). However, software-carpentry.org is hosted with site5.com, who only provide one shell account per domain for cheap accounts like the one I bought.
Why is this a problem? Because when someone wants to commit to the repository, they have to authenticate themselves. I could let everyone who’s writing material for the course share a single user ID and password, but that would be an administration nightmare (as well as a security risk). Site5 does have a workaround based on public/private keys, but it’s fairly complicated—i.e., it could break in lots of hard-to-diagnose ways. Another option would be to use the mod_dav_svn plugin for Apache, but Site5 doesn’t support per-domain Apache modules either. Dreamhost.com does, so I may be switching hosts in a few weeks.
So: how much of this should the average research scientist be expected to understand? If the answer is “none”, then how are they supposed to make sensible decisions about moving their work online? If the answer is “all”, where does the time come from? (It takes me 30 seconds to read the two paragraphs above; it would take many hours of instruction to teach people enough to do the analysis themselves.) And if the answer is “some”, then which parts? To what depth? And who takes care of the rest on scientists’ behalf?