Software Carpentry

Helping scientists make better software since 1997

Formats

As I said in last week’s announcement, and mentioned again in a later post, one of the main goals of this rewrite is to make it possible for students to do the course when and where they want to. That means recording audio and video, but much of the material will probably still be textual: code samples (obviously), lecture notes (for those who prefer skimming to viewing, or who want to teach the material locally), and exercises will still be words on a virtual page. And even the AV material will (probably) be accompanied by scripts or transcripts, depending on what turns out to work best.

Which brings up a question everyone working with computers eventually faces: what format(s) should material be stored in? For images, audio, and video, the choices are straightforward: SVG for line drawings, PNG for images, MP3 for audio, and MP4, MPEG, or FLV for video (I don’t know enough yet to choose). But there’s a bewildering variety of options for text, each with its pros and cons.

To choose one, I have to consider:

  1. Authoring tools: do authors need to use a specialized editor? If so, is it freely available for the three major platforms (Windows, Linux, and Mac)?
  2. Composition: can authors “just type”, or do they need to spend a lot of keystrokes on markup?
  3. Diffing and merging: does the format play nicely with version control systems, i.e., if two or more people edit independently, can their changes easily be merged after the fact?
  4. Formatting: does the format allow fine-grained control over layout? (My personal test here is how easy it is to create tables with irregular arrangements of rows and columns.)
  5. Multiple output formats: can HTML pages, slides, PDFs, and what-not all be produced from a single source?
  6. Referencing: does the format take care of section and figure numbering, cross-references, and bibliographic citations automatically?
  7. WYSIWYG: does the raw content have to be compiled or transformed to produce something viewable, or is what you see what you get?

Here are the options as I see them:

Key: A = Authoring tools, C = Composition, D = Diffing and merging, F = Formatting, M = Multiple output formats, R = Referencing, W = WYSIWYG

Format            A   C   D   F   M   R   W   Minimum
Microsoft Word   -1  +1  -1  +1  -1  +1  +1     -1
OpenOffice        0  +1  -1  +1  -1  +1  +1     -1
DocBook           0  -1   0   0  +1   0  -1     -1
Other XML         0  -1   0  -1   0  -1  -1     -1
Plain Old HTML    0  -1   0  -1   0  -1  +1     -1
S5 and its kin    0  -1   0  -1   0  -1  +1     -1
Wiki text        +1  +1  +1  -1  +1   0  -1     -1
LaTeX            +1   0   0  +1   0  +1   0      0
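
The Minimum column can be recomputed mechanically from the other seven. What follows is only a minimal sketch, assuming the scores are transcribed correctly from the table; the names `SCORES`, `minimum_score`, and `rank_by_minimum` are made up for illustration and do not belong to any existing tool.

```python
# Scores transcribed from the table, in column order A C D F M R W.
# All names here are hypothetical, chosen for illustration only.
SCORES = {
    "Microsoft Word": [-1, +1, -1, +1, -1, +1, +1],
    "OpenOffice":     [0, +1, -1, +1, -1, +1, +1],
    "DocBook":        [0, -1, 0, 0, +1, 0, -1],
    "Other XML":      [0, -1, 0, -1, 0, -1, -1],
    "Plain Old HTML": [0, -1, 0, -1, 0, -1, +1],
    "S5 and its kin": [0, -1, 0, -1, 0, -1, +1],
    "Wiki text":      [+1, +1, +1, -1, +1, 0, -1],
    "LaTeX":          [+1, 0, 0, +1, 0, +1, 0],
}


def minimum_score(scores):
    """A format is rated by its single worst criterion."""
    return min(scores)


def rank_by_minimum(table):
    """Return format names ordered from best worst-case score to worst."""
    return sorted(table, key=lambda name: minimum_score(table[name]), reverse=True)


ranking = rank_by_minimum(SCORES)
for name in ranking:
    print(f"{name:15s} {minimum_score(SCORES[name]):+d}")
```

Rating by the minimum rather than the sum means one bad criterion cannot be bought back by several good ones, which is why a format with a single -1 ties with a format that has five of them.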

I use the minimum in evaluation, rather than the average or total score, because what you notice most when you’re working with something is usually what’s most annoying about it. Or maybe that’s just me… But what do these numbers actually mean? In no particular order:

  • Binary file formats don’t work well with version control systems, since the latter use textual differencing to reconcile changes between versions or by concurrent editors. This rules out the default formats used by Microsoft and OpenOffice.
  • Machine-generated XML doesn’t fare any better, since the differencing tools used in version control systems ignore the semantics (“element inserted”) and become confused by the representation (“18 lines changed”). This rules out various XML-based options for Word and OO.
  • In contrast, XML or HTML that has been written using a plain old text editor usually has line breaks in useful places (i.e., more of the semantics is reflected in the representation) so diff and merge work much better. On the other hand, if you’re using a plain old text editor, 20-40% of your keystrokes go into markup (all those angle brackets and attributes) rather than content. WYSIWYG XML/HTML editors help a bit (I’m using the one built into WordPress right now), but most generate the same tangled diff-hostile output as the options dismissed above. With respect to particular formats:
    • “Real” DocBook is a lot of work to produce. O’Reilly‘s DocBook Lite (a subset of the official format) is less effort, but there are still a lot of angle brackets to type in—I haven’t yet found an editor that will let me type Ctrl-B and switch to DocBook-compliant bolding, for example.
    • Homebrew XML markups, like the one used by Pragmatic, all seem to converge on the features of DocBook Lite. There’s also the problem of finding (or building), tweaking, and maintaining tools to produce the end result. (I created my own format, and built my own tools, for Version 2 of the course; I won’t make that mistake again.)
    • Plain old HTML has all the disadvantages of homebrew XML markup, but it can be viewed without a compilation step, so long as you don’t care about numbering, cross-references, etc. For those, you need tools, which need to be created, maintained, and tweaked.
    • Various HTML-based slideshow formats, like S5, add some semantic information to plain old HTML that a bit of in-browser Javascript can use to produce PowerPoint-style effects. Numbering and cross-referencing still need tools, though, and S5 and various follow-ons are mostly orphaned these days.
  • Wiki text: easy to type in (that’s the whole point), and plays well with version control, but (a) processing tools (again), and (b) the degree of control over markup is usually fairly limited. That said, Wiki Creole and reStructured Text are appealing: there are lots of compilation/conversion tools for both. The downside is that both actually require compilation: so far as I can tell, there isn’t a WYSIWYG editor for either that is still being maintained. (Update: there may be one for reST: I’d welcome input from anyone who has used it.)
  • LaTeX: ah, LaTeX, my old nemesis—it has been a while, hasn’t it? It plays nicely with version control; it handles cross-referencing, gives users fine control over layout—very fine control, if you want it—and there is even a WYSIWYG editor. On the downside, its syntax is complicated, but I’ve already mastered it, and so have many other scientists. More importantly, though, my past attempts to produce pretty HTML from LaTeX using Latex2Html and Plastex have been frustrating.
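
The difference between hand-written and machine-generated markup can be seen with a line-based diff. This is a rough sketch using Python's standard `difflib` module, not the behavior of any particular version control system. The tiny XML document, the `squash` function (a stand-in for an editor that writes everything on one line), and `changed_lines` are all assumptions invented for the example.

```python
import difflib

# A hand-written document: one element per line (hypothetical example data).
HAND_BEFORE = "\n".join([
    "<list>",
    "  <item>apples</item>",
    "  <item>pears</item>",
    "</list>",
])
HAND_AFTER = "\n".join([
    "<list>",
    "  <item>apples</item>",
    "  <item>plums</item>",
    "  <item>pears</item>",
    "</list>",
])


def squash(text):
    """Stand-in for a tool that serializes the whole document on one line."""
    return "".join(line.strip() for line in text.splitlines())


def changed_lines(before, after):
    """Count the lines a line-based diff reports as added or removed."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    )
    return sum(
        1
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


hand = changed_lines(HAND_BEFORE, HAND_AFTER)
machine = changed_lines(squash(HAND_BEFORE), squash(HAND_AFTER))
print("hand-written:", hand, "changed line(s)")
print("machine-generated:", machine, "changed line(s), i.e. the whole file")
```

In both cases the edit is the same single inserted element. In the hand-written version the diff points at exactly that line. In the squashed version the diff can only say that the one line making up the entire file was replaced, so two people editing different elements at the same time would always conflict.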

So, does that mean LaTeX is the right answer? My scoring says it is. What do you think?

Written by Greg Wilson

2010/03/30 at 19:32

Posted in Tooling, Version 4

7 Responses

  1. In regards to Enthought’s reST editor, I’ve used it on a few occasions (I was the original feature requester). The visual aspect isn’t always exactly what you’ll get once you run it through a processor because of CSS, and it’s a little slow, but it works. We’re working on replacing some of the components it uses with faster/better variations, so the tool should improve in the coming months even if we’re not specifically addressing features for it.

    bryce hendrix

    2010/03/30 at 19:58

  2. A nice comparison and a balanced summary. I’d choose a tool like this:

    Can’t diff/merge well? Pah, ignore those. Binary format? Poof, ignore those as well.
    Then accept that anything more than plain text requires some extra typing for the markup and that your fingers will quickly learn to ignore it. Then go with the tool that has the most helpful user groups, a reasonably active development community but first and foremost, an existing processing pipeline that produces something like what you want. I.e. don’t assume it can do what you want.

    For me, that would point to LaTeX. This time, you can conquer it.

    Matt Doar

    2010/03/30 at 20:02

  3. Have you used LyX for importing/exporting LaTeX? I haven’t done it much, but I remember pain every time I tried…

    Also, and again in my experience, reST gives you a fair amount of control over the output. Not perfect, but much (MUCH) more control than, say, Creole gives you.

    Finally, I’d be biased towards giving a lower priority to WYSIWYG. First, if I understand anything about publishing, you’re going to be reformatting everything anyway… And, second, I’m going to suppose that anyone who’s writing for SWC has used one of HTML, LaTeX or wikitext before… So they will be at least fairly familiar with compiling markup.

    Just my $0.02.

    David Wolever

    2010/03/30 at 20:15

  4. Have you considered MultiMarkdown? It handles equations and tables, and there are several export formats (XHTML, LaTeX, RTF). It is also recognized (mostly) by GitHub.

    http://fletcherpenney.net/multimarkdown/

    Neil

    2010/03/30 at 22:06

  5. pandoc may allow you to move between some of these competing formats, at least as long as you’re willing to work with the lowest common denominator.

    http://johnmacfarlane.net/pandoc/

    Rhys Ulerich

    2010/04/01 at 03:19

  6. Greg, I have read most of your books (just bought Practical Programming and I’m really enjoying it as I finally dig into Python) and have followed the Software Carpentry project for a while, knowing long ago it would be my first choice when I had time to dive in. Your time and the Python Software Foundation’s money have been well spent. Thank you.

    I sincerely wish you all the best for the next version, but I think it high time to radically open up this resource in terms of enabling easy re-ordering, re-purposing, extension, and reuse of this very valuable content. These capabilities would address many important concerns expressed in feedback and help ensure the long-term longevity of your work.

    I think the biggest bang for the painful buck (or the ~150,000 bucks you are asking for) of re-formatting and revising the Python Software Foundation’s previous investment and copyright on Software Carpentry would be to move it to an online, collaborative editing and re-mixing environment, and the most mature platform for those purposes by far is Connexions, developed over the last decade by Rice University, with support of the William and Flora Hewlett Foundation, the Maxfield Foundation, and the Connexions Consortium. (I am not affiliated with any of the above).

    http://cnx.org/

    For customization and mashup reasons alone, sooner or later, someone will post equivalent material on Connexions. If not you, someone less experienced and qualified, no doubt.

    Such liberal licensing (attribution only — CC-BY — allowing both derivative and commercial use) is very important, given the nature of the likely employers of scientific and engineering Python programmers.

    See Using Free and Open Online Resources: Licensing and Collections by Norm Friesen at
    http://learningspaces.org/n/files/CC_Resources.pdf

    The finely-grained modularity of the ‘collections’ that a canonical work represents fits well with current Software Carpentry content and will allow ALL potential teachers and/or independent learners and, importantly, interested scientific & industrial concerns, to easily customize, extend, and re-order precisely the course they need, in whatever format is most convenient to them. This would be a huge win.

    All seven of your criteria can be answered positively. (On #7, see http://cnx.org/help/authoring/editmodule — WYSIWYG, or import from Word, OpenOffice, or LaTeX, or edit raw markup).

    Of course, the editing and hosting software (Rhaptos) is open source and you can host it yourself and even re-brand it, if desired. Perhaps the Python Software Foundation would like their own instance? (Written in Python – Zope)
    http://rhaptos.org/

    Alternatively, branding of a canonical collection of modules hosted freely on the Connexions server is easy.
    http://cnx.org/help/authoring/brand-lens

    Importantly for academia, peer review is supported, so one need not fear the admixture of your carefully edited collection of modules with the various alternate modules (e.g., R in place of MatLab, or alternative version control systems) that may be created by others. Different institutions, and even individuals like myself, could have their own ‘lenses’ (collections of modules, some from one canonical source, others picked from elsewhere, or self-created).

    This would be the current 21st-century solution of choice, IMHO.

    Check it out: http://cnx.org/

    Let a million flowers bloom!

    Paul Wilson

    2010/04/04 at 21:13

  7. ERRATA: whoops, I hadn’t noticed the license is already CC-BY — http://software-carpentry.org/license.html — mea culpa for preaching to the converted!

    ADDENDA:

    1. Some high-quality ‘featured content’ on Connexions:
    http://cnx.org/featureContent

    2. Adding multimedia (ANY valid MIME type) to collections:
    http://cnx.org/content/m12660/latest/

    Paul Wilson

    2010/04/04 at 22:46

