A scientist I recently met in Toronto had a problem: how to share large files with colleagues. Each file is a couple of hundred megabytes; dozens are produced each week, but each is only interesting for a couple of months; and there are confidentiality issues, so some kind of password protection is needed. Conventional file-sharing services like Dropbox aren’t designed for data of that size, so in the end she bought a domain and set up secure FTP.
But now there’s this:
The transfer of scientific data has emerged as a significant challenge, as datasets continue to grow in size and demand for open access sharing increases. Current methods for file transfer do not scale well for large files and can cause long transfer times. In this study we present BioTorrents, a website that allows open access sharing of scientific data and uses the popular BitTorrent peer-to-peer file sharing technology. BioTorrents allows files to be transferred rapidly due to the sharing of bandwidth across multiple institutions and provides more reliable file transfers due to the built-in error checking of the file sharing technology. BioTorrents contains multiple features, including keyword searching, category browsing, RSS feeds, torrent comments, and a discussion forum. BioTorrents is available at http://www.biotorrents.net.
It’s a neat idea, and will become neater once scientists routinely put DOIs on data as well as papers. I’d be very interested in a usability study to see how easy or hard it is for the average grad student in botany to get this plugged in and turned on.
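The “built-in error checking” mentioned in the abstract comes from the way BitTorrent splits a file into fixed-size pieces and records a SHA-1 hash of each piece in the torrent metadata, so a downloader can verify every piece on its own and re-fetch only the bad ones. Here is a minimal sketch of that idea using only Python's standard library. The function names and the tiny piece size are my own assumptions for illustration; this is not BioTorrents' code or any real client's implementation.

```python
import hashlib

# Assumed, deliberately tiny piece size so the example is easy to follow;
# real torrents use pieces of hundreds of kilobytes or more.
PIECE_SIZE = 4


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> list:
    """Return the SHA-1 digest of each fixed-size piece of `data`."""
    return [
        hashlib.sha1(data[i:i + piece_size]).digest()
        for i in range(0, len(data), piece_size)
    ]


def corrupted_pieces(received: bytes, expected: list,
                     piece_size: int = PIECE_SIZE) -> list:
    """Return the indices of pieces that fail verification.

    `expected` plays the role of the hash list published with the torrent
    (hypothetical here: we compute it ourselves from a known-good copy).
    """
    actual = piece_hashes(received, piece_size)
    bad = [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
    # Pieces that are missing entirely (or unexpected extras) also count as bad.
    bad.extend(range(min(len(actual), len(expected)),
                     max(len(actual), len(expected))))
    return bad


if __name__ == "__main__":
    original = b"ACGTACGTACGT"              # three 4-byte pieces
    published = piece_hashes(original)
    damaged = b"ACGTACGAACGT"               # one byte flipped in piece 1
    print(corrupted_pieces(damaged, published))
```

Because each piece is checked separately, a single flipped byte costs one piece's worth of re-download rather than the whole multi-hundred-megabyte file.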
I mentioned yesterday that I maintain a list of books that haven’t been written yet. Partly it’s an exercise in sympathetic magic—if the reviews exist, maybe the books will follow—but it’s also useful for organizing my thoughts about what a programmer’s education should look like. Looking at the books I’ve matched to various topics in the Software Carpentry course outline, there are some distressing gaps:
- Given that programmers spend upwards of 40% of their time debugging, there are very few books about it, and only one collection of exercises (Barr’s Find the Bug).
- There’s a lot on higher-level programming techniques, but it’s scattered across dozens of books as disparate as The Seasoned Schemer, Effective C++, and The Practice of Programming. I haven’t read Perrotta’s Metaprogramming Ruby yet, but it looks like it will be another rich source of ideas.
- Material on systems programming—manipulating files and directories, running sub-processes, etc.—is equally scattered. The Art of Unix Programming includes all the right topics, but covers too much, in too much detail, at too low a level. Gift & Jones’ Python for Unix and Linux System Administration has the same two faults (from Software Carpentry’s point of view—I think both are excellent books in general), but uses a scripting language for examples, so it made the list.
- Mark Guzdial and others have done excellent research showing the benefits of teaching programming using multimedia, i.e., showing students how to manipulate images, sound, and video as a way of explaining loops and conditionals. That’s half of why the revised course outline includes image processing early on (the other halves being “it’s fun” and “it’s useful”). Once again, most of what I’m familiar with is either documentation for specific libraries, or textbooks on the theory of computer vision, but there are some promising titles in the MATLAB world that I need to explore further.
- Performance. It’s been 15 years since I first grumbled about this, and the situation hasn’t improved. Most books on computer systems performance are really textbooks on queueing theory; of that family, Jain’s Art of Computer Systems Performance Analysis is still head and shoulders above the crowd. Souders’ High Performance Web Sites is the closest modern equivalent I’ve found to Bentley’s classic Writing Efficient Programs, but neither is really appropriate for scientists, who need to think about disk I/O (biologists and their databases), pipelining and caching (climatologists with their differential equations), and garbage collection (everybody using a VM-based language). I had hoped that High Performance Python would fill this gap, but it seems to have been delayed indefinitely. (And yes, I’ve looked at Writing Efficient Ruby Code; it has some of what our students want, but not nearly enough.)
- There are lots of books about data modeling, but all the ones I know focus exclusively on either the relational approach or object-oriented design, with a smattering that talk about XML, RDF, and so on. I haven’t yet found something that compares and contrasts the three approaches; pointers would be welcome.
- Web programming. There are (literally) thousands of books on the subject, but that’s the problem: almost all treatments are book-length, and this course only has room for one or two lectures. It is possible to build a simple web service in that time, but only by (a) using a cookbook approach, rather than teaching students how things actually work, and (b) ignoring security issues completely. I’m not comfortable with the first, and flat-out refuse to do the second: if this course shows people how to write a simple CGI script that’s vulnerable to SQL injection and cross-site scripting, then it’s our fault when the students’ machines are hacked. This gap is as much in the available libraries as in the books, but that doesn’t make it any less pressing.
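To make the SQL injection worry in the last item concrete, here is a small sketch of the kind of query a simple CGI script might run, written once the vulnerable way and once with bound parameters. It uses Python's standard `sqlite3` module; the `samples` table, its contents, and the function names are hypothetical and exist only for this example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `samples` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (owner TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?)",
        [("alice", "run-01"), ("bob", "run-02")],
    )
    return conn


def find_unsafe(conn: sqlite3.Connection, owner: str) -> list:
    """VULNERABLE: pastes user input directly into the SQL text."""
    query = "SELECT name FROM samples WHERE owner = '" + owner + "'"
    return sorted(row[0] for row in conn.execute(query))


def find_safe(conn: sqlite3.Connection, owner: str) -> list:
    """Safer: user input is passed as a bound parameter, never as SQL text."""
    rows = conn.execute("SELECT name FROM samples WHERE owner = ?", (owner,))
    return sorted(row[0] for row in rows)


if __name__ == "__main__":
    conn = make_db()
    attack = "nobody' OR '1'='1"     # input crafted to rewrite the WHERE clause
    print(find_unsafe(conn, attack))  # leaks every owner's rows
    print(find_safe(conn, attack))    # treated as a literal name: no match
```

Bound parameters only address SQL injection. Cross-site scripting is a separate problem that needs output escaping (for example with `html.escape` from the standard library) wherever user-supplied text is written back into a page.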
Given these gaps, I may drop one or two topics (such as performance and web programming) and either swap in one of the discarded topics or spend more time on some of the core material. I’m hoping neither will be necessary; as I said above, pointers to books in any language that are at the right level, and cover the right areas, would be very welcome.
I’m slightly obsessed with reading lists. (I even maintain a list of books that haven’t been written yet, in the hope that it will inspire people to turn some of the entries from fantasy into reality.) Partly to give credit to all the people whose work inspired Software Carpentry, and partly to guide students who want to learn more than we can fit into a double dozen lectures, I have started a bibliography, and added links to relevant books to the lecture descriptions in the course outline. Pointers to other material would be very welcome; I will blog soon about areas that I feel are particularly lacking.
Let’s start with the problem description:
All of the Software Carpentry course material (including lecture notes, code samples, data files, and images) is stored in a Subversion repository. That’s currently hosted at the University of Toronto, but I’d like to move it to the software-carpentry.org domain (along with this blog). However, software-carpentry.org is hosted with site5.com, who only provide one shell account per domain for cheap accounts like the one I bought.
Why is this a problem? Because when someone wants to commit to the repository, they have to authenticate themselves. I could let everyone who’s writing material for the course share a single user ID and password, but that would be an administration nightmare (as well as a security risk). Site5 does have a workaround based on public/private keys, but it’s fairly complicated—i.e., it could break in lots of hard-to-diagnose ways. Another option would be to use the mod_dav_svn plugin for Apache, but Site5 doesn’t support per-domain Apache modules either. Dreamhost.com does, so I may be switching hosts in a few weeks.
So: how much of this should the average research scientist be expected to understand? If the answer is “none”, then how are they supposed to make sensible decisions about moving their work online? If the answer is “all”, where does the time come from? (It takes me 30 seconds to read the two paragraphs above; it would take many hours of instruction to teach people enough to do the analysis themselves.) And if the answer is “some”, then which parts? To what depth? And who takes care of the rest on scientists’ behalf?
Via Titus Brown, a new PLoS paper titled “A Quick Guide for Developing Effective Bioinformatics Programming Skills” by Joel Dudley and Atul Butte. Their recommendations are:
- Programming languages
- Embracing open source
- Unix command-line skills
- Keeping projects documented and manageable
- Preserving source code with version control
- Embracing parallel computing paradigms
- Structuring data for speed and scalability
- Understanding the capabilities of hardware
- Embracing standards and interoperability
- Put a high value on your time
I think all these things matter, but:
- How many hours do the authors think are needed to acquire these skills? We’ve tried very hard to fit Software Carpentry into 25 hours of lecture and 50-100 hours of practical work because we recognize that every one of those hours is time students aren’t spending doing science.
- Shouldn’t testing be in the top 10? Or the top 5, or 3? These days, I care a lot more about how (and how well) someone tests than I do about their mastery of any particular programming language.
Ana Nelson has posted step-by-step instructions showing how to use Amazon’s EC2 cloud computing platform to run simulations. There are still a lot of fiddly details, but the barriers to entry are getting lower all the time…
Martijn Faassen has posted a nice history of packaging and distribution tools for Python. Yes, it’s a topic only a geek could love, but anyone who wants to distribute software to other people needs to grapple with these issues. The question is, should these tools, the problems that motivate them, and the technology underlying them be part of this course? Or should something equivalent (and if so, what)?
…I have added a lecture on high performance computing to the revised outline for the course. Several people suggested it, and what’s the point of asking for feedback if I don’t listen?
One topic that I’ve tried to include in this course a couple of times, without success, is security. I feel irresponsible not saying something about how to share safely, but I’ve never found anything that (a) would fit into one hour, (b) wasn’t just platitudes, and (c) gave listeners something they could act on.
One reviewer suggested talking about public/private key pairs (to help people set up SSH), signing things digitally, and the like. I’m leery of going down that road, though, since it could easily leave people with a misplaced faith in technical solutions to security problems. As always, suggestions would be welcome…
Jon Pipitone has a good description on his blog of work the grad students in our department are doing to translate work in climate change into software engineering terms. Their first step is to represent the ideas in MacKay’s excellent Sustainable Energy Without the Hot Air in two of the graphical notations that computer scientists use for system design. I was initially very skeptical, but looking at their work so far, I’m quite impressed. My question is, would it be useful for scientists to know how to do this themselves? More specifically, is the lecture on data modeling that I’ve planned to include in Version 4.0 worthwhile or not?