Archive for the ‘Noticed’ Category
A scientist I recently met in Toronto had a problem: how to share large files with colleagues. Each file is a couple of hundred megabytes; dozens are produced each week, but each is only interesting for a couple of months; and there are confidentiality issues, so some kind of password protection is needed. Conventional file-sharing services like Dropbox aren’t designed for data that size, so in the end she bought a domain and set up secure FTP.
But now there’s this:
The transfer of scientific data has emerged as a significant challenge, as datasets continue to grow in size and demand for open access sharing increases. Current methods for file transfer do not scale well for large files and can cause long transfer times. In this study we present BioTorrents, a website that allows open access sharing of scientific data and uses the popular BitTorrent peer-to-peer file sharing technology. BioTorrents allows files to be transferred rapidly due to the sharing of bandwidth across multiple institutions and provides more reliable file transfers due to the built-in error checking of the file sharing technology. BioTorrents contains multiple features, including keyword searching, category browsing, RSS feeds, torrent comments, and a discussion forum. BioTorrents is available at http://www.biotorrents.net.
It’s a neat idea, and will become neater once scientists routinely put DOIs on data as well as papers. I’d be very interested in a usability study to see how easy or hard it is for the average grad student in botany to get this plugged in and turned on.
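For readers wondering what the "built-in error checking" in that abstract amounts to: BitTorrent splits a file into fixed-size pieces and publishes a SHA-1 digest for each one, so a downloader can tell which pieces arrived damaged and fetch only those again. The following is a minimal sketch of that idea, not BioTorrents' or any client's actual code. The function names and the piece size are assumptions made for illustration.

```python
import hashlib

# Assumed piece size for illustration; real torrents choose their own.
PIECE_LENGTH = 256 * 1024


def piece_hashes(data, piece_length=PIECE_LENGTH):
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[start:start + piece_length]).digest()
        for start in range(0, len(data), piece_length)
    ]


def corrupted_pieces(received, expected, piece_length=PIECE_LENGTH):
    """Return the indices of pieces that are missing or fail their digest check."""
    actual = piece_hashes(received, piece_length)
    bad = []
    for index, expected_digest in enumerate(expected):
        # A piece is bad if it never arrived or if its digest differs.
        if index >= len(actual) or actual[index] != expected_digest:
            bad.append(index)
    return bad
```

The publisher would compute `piece_hashes` once and distribute the result alongside the file. Each recipient runs `corrupted_pieces` on what they received and re-requests only the listed pieces, which matters when every file is a couple of hundred megabytes.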
Over at opensource.com, Red Hat’s Greg DeKoenigsberg has a post about a new collaboratively-authored textbook on open source software aimed squarely at undergrad courses. As Máirín Duffy points out in the first comment, it’s very code-centric, but in my experience, that’s the right approach: students won’t be ready for discussion of design until they’re proficient in coding. I’m looking forward to borrowing lots from the book for Software Carpentry…
This is, by the way, why I believe that attempts to teach “computational thinking” without first teaching programming are doomed to fail, but that’s a rant for another time.
My father once told me that a week of hard work can sometimes save you an hour of thought. In that spirit, I’ve been looking for asynchronous online courses to imitate. I previously mentioned MIT’s Open Courseware, CMU’s Open Learning Initiative, and (closer to my scale) Salman Khan’s Khan Academy. Google Code University’s lessons on programming languages are also on my radar (I’ll blog more about them once I finish the Python material), but another model that I’m looking at closely is Teaching Open Source, a collaborative effort to get more open source into college and university courses. I first encountered them through POSSE (Professors’ Open Source Summer Experience), which they describe as:
…a weeklong bootcamp that will immerse professors in open source projects. Participants spend a week of intensive participation in selected open source projects, led by professors with experience in teaching open source development, in partnership with community members who have deep experience and insight. By the end of the session, participants should have a much better understanding of the workings of open source projects, and a strong network of contacts to lean on as they begin to bring students into the open source world.
I’ve also been watching in awe (with a small ‘a’, but awe nonetheless) as half a dozen contributors have pulled together a textbook called Practical Open Source Software Exploration: How to be Productively Lost, the Open Source Way. It’s by no means complete, but I have already bookmarked it in a dozen places, and expect to add more. I always hoped that Software Carpentry would become a community project of this kind; here’s hoping that Version 4 finally manages to.
David Bradley has created a periodic table of science bloggers that regular readers might enjoy.
Analyzing Next-Generation Sequencing Data
May 31 – June 11th, 2010
Kellogg Biological Station, Michigan State University
CSE 891 s431 / MMG 890 s433, 2 cr
Applications are due by midnight EST, April 9th, 2010.
Course sponsor: Gene Expression in Disease and Development Focus Group at Michigan State University.
Instructors: Dr. C. Titus Brown and Dr. Gregory V. Wilson
This intensive two-week summer course will introduce students with a strong biology background to the practice of analyzing short-read sequencing data from the Illumina GA2 and other next-gen platforms. The first week will introduce students to computational thinking and large-scale data analysis on UNIX platforms. The second week will focus on mapping, assembly, and analysis of short-read data for resequencing, ChIP-seq, and RNAseq.
No prior programming experience is required, although familiarity with some programming concepts is suggested, and bravery in the face of the unknown is necessary. 2 years or more of graduate school in a biological science is strongly suggested.
Science is based on building on, reusing and openly criticising the published body of scientific knowledge.
For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made open.
By open data in science we mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. To this end data related to published science should be explicitly placed in the public domain.
Formally, we recommend adopting and acting on the following principles:
- Where data or collections of data are published it is critical that they be published with a clear and explicit statement of the wishes and expectations of the publishers with respect to re-use and re-purposing of individual data elements, the whole data collection, and subsets of the collection. This statement should be precise, irrevocable, and based on an appropriate and recognized legal statement in the form of a waiver or license. When publishing data make an explicit and robust statement of your wishes.
- Many widely recognized licenses are not intended for, and are not appropriate for, data or collections of data. A variety of waivers and licenses that are designed for and appropriate for the treatment of data are described here. Creative Commons licenses (apart from CCZero), GFDL, GPL, BSD, etc are NOT appropriate for data and their use is STRONGLY discouraged. Use a recognized waiver or license that is appropriate for data.
- The use of licenses which limit commercial re-use or limit the production of derivative works by excluding use for particular purposes or by specific persons or organizations is STRONGLY discouraged. These licenses make it impossible to effectively integrate and re-purpose datasets and prevent commercial activities that could be used to support data preservation. If you want your data to be effectively used and added to by others it should be open as defined by the Open Knowledge/Data Definition – in particular non-commercial and other restrictive clauses should not be used.
- Furthermore, in science it is STRONGLY recommended that data, especially where publicly funded, be explicitly placed in the public domain via the use of the Public Domain Dedication and Licence or Creative Commons Zero Waiver. This is in keeping with the public funding of much scientific research and the general ethos of sharing and re-use within the scientific community.
Explicit dedication of data underlying published science into the public domain via PDDL or CCZero is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.
Congratulations to Titus Brown and others on the NSF’s announcement that it will fund the BEACON (Bio/computational Evolution in Action Consortium) Science and Technology Center. BEACON “…is focused on studying the evolution of organization across multiple scales – from genomic and cellular, to multicellular, to inter-multicellular (a.k.a. social) – using techniques from experimental evolution, modeling, and digital life systems.” Long story short, this means that Michigan State University and its partner institutions have “…money explicitly for supporting students doing really sexy interdisciplinary work combining computation and biology.”
Ars Technica isn’t primarily a science site, but even they are now worried about reproducibility in computational science. I think it no longer matters how important this “crisis” actually is—sooner or later, major funding agencies are going to mandate adoption of something like the Open Provenance Model. Problem is, given the current skill set of the average scientist, that will almost certainly translate into burden without benefit.
Over on Nature News, Eric Hand’s article “‘Big science’ spurs collaborative trend” is subtitled, “Complicated projects mean that science is becoming more globalized.” It talks about the benefits of international collaboration, but what it doesn’t say is that sharing ideas, results, procedures, and software requires skills that aren’t part of the standard curriculum. One of the main goals of the rewrite of Software Carpentry is to teach scientists some of what they need to know in order to do what Hand describes without heroic effort. I’d be grateful for suggestions about topics and tools that ought to be on the list, but aren’t.