Course Outline
We are planning to reorganize the Software Carpentry course starting in May 2010, and would be grateful for feedback from past, present, and potential future users. Are the topics below the right ones? Is the problem-based approach accessible? What have we forgotten (and what should we drop to make room for it)? Please send comments directly to Greg Wilson, or add comments to this page.
- Introduction
- Coding Tricks 1: Associative Data Structures
- Track Down a Bug
- Read Text Files
- Test Some Software
- Read More Complicated Text Files
- Coding Tricks 2: First-Class Functions
- Clean Up This Code
- Find Duplicate Files
- Share Work With Colleagues
- Analyze an Image
- Vectors and Matrices
- Make a Program Go Faster
- Generate Results Automatically
- Get Data Out of a Database
- XML
- Object-Oriented Programming
- Represent Information
- Build a Desktop User Interface
- Create a Web Service
- Scaling Up
- Organize a Team 1: Empirical Results in Software Engineering
- Organize a Team 2: Software Development Lifecycles
- Discarded Topics
Introduction
- Course design has been driven to date by what we think students ought to know (“push”)
- Turn that around for V4 and define the course in terms of problems they can solve (“pull”)
- Each slide set starts with a problem and builds up to solution
- Suitable for a 45-minute lecture with questions (approx. 20 slides)
- Hope that this will clarify reorganization of material
- While showing students how to evolve solutions piecemeal (development process in the small)
- Note: in prerequisite lists below, “basic programming” means “strings, lists, file I/O, control flow, functions”
Reading
Coding Tricks 1: Associative Data Structures
Requires: basic programming
Introduces: sets; dictionaries
Problem: What Molecules Can We Make?
Molecules’R'Us has hired you to write a simple inventory management program for them. Your program is supposed to read in a file describing how many atoms of various kinds are required to make different molecules, and another file describing how many atoms the company actually has in its teeny tiny warehouse, and then print out a list of the molecules the company could make. The first file (describing molecules) is formatted like this:
# Comments start with '#' and go to the end of the line. # Blank lines (like the one below) are allowed. helium : He 1 # molecule name, colon, atom type, number of atoms ammonia : N 1 H 3 # molecules may contain many types of atoms salt : Na 1 Cl 1 # atom names may be one or two characters long lithium hydride : Li 1 H 1 # molecule names may be several words long
The file that specifies how many atoms the company has on hand is formatted like this:
# Once again, comments and blank lines are allowed. He 3 # atom name, number available H 2 N 5 Li 1
For these two input files, the output would be:
helium lithium hydride
Note that molecules’ names appear in alphabetical order, and that the system doesn’t have to say how many molecules of any type could be made.
Topics
- Sets
- Set operations
- How set values are stored
- Immutability
- Frozen sets
- Tradeoffs in language design
- Motivating dictionaries
- Creating and indexing
- Updating
- Membership
- Loops
- Other operations
- Example: counting frequency
- Imposing order
Reading
Track Down a Bug
Requires: basic programming
Introduces: debugging, IDEs
Problem: Fixing an Intermittent Bug
You are using a program from Euphoric State University that simulates diffusion-limited aggregation. It seems to work correctly most of the time, but every once in a while the fractals it produces are so improbable that you’re convinced there must be a bug lurking somewhere in the code. The original author hasn’t responded to your emails, so it looks like you have to track it down and fix it yourself.
Topics
- What’s wrong with print statements
- Symbolic debuggers
- Command-line debuggers
- Inspecting values
- Controlling execution
- Conditional breakpoints
- Watchpoints
- Rule 0: Get It Right the First Time
- Rule 1: What Is It Supposed to Do?
- Rule 2: Is It Plugged In?
- Rule 3: Make It Fail
- Rule 4: Divide and Conquer
- Rule 5: Change One Thing at a Time, For a Reason
- Rule 6: Write It Down
- Rule 7: Be Humble
Reading
Read Text Files
Requires: basic programming
Introduces: data processing patterns; incremental development
Problem: Converting Legacy Data Files to a New Format
Your supervisor has just acquired 300 files containing descriptions of approximately 15,000 organic molecules in a variant of an old-fashioned text format called PDB. She would like you to compare the coordinates of the atoms in 700 of these descriptions to the “standard” coordinates contained in an off-the-shelf molecular dynamics program. Your task is to extract the data describing those 700 molecules from the files and write it out in the required format for further processing.
Topics
- Motivation: describe PDB files
- Defer intra-line complexities to next lecture to concentrate on parsing patterns for now
- As in Data Crunching, build up to multi-molecule PDB parser in stages
- Emphasis is coding idioms (small-scale design patterns)
- Single-line records
- Skipping blanks and comments
- Multi-line records with explicit terminator
- Multi-line records without terminators (read-ahead by one)
Reading
Test Some Software
Requires: basic programming
Introduces: structured unit testing; frameworks
Problem: Testing a Data Conversion Tool
Your supervisor was quite happy with the program you wrote to extract data from old PDB files. She’s less happy about the fact that every time her other students modify it, it takes hours or days to find and fix their mistakes. Your task now is to find a way to prevent bugs getting into the version of the program that’s being used to generate results for publication.
Topics
- Why test?
- Why use a framework for testing?
- A basic example
- What’s going on under the hood
- Testing exceptions
- Testing I/O
- Stubs and mock objects
- Testing performance
- Test-driven development
Reading
Read More Complicated Text Files
Requires: data processing patterns
Introduces: regular expressions
Problem: Converting More Complex Data Files
You did such a good job converting the PDB files in the previous problem that your supervisor has offered to loan you to a colleague in geology, who wants to convert a dozen files of rock temperature observations from the slopes of dormant volcanoes to a new format. The problem is, those files were typed in by hand by several different graduate students, and are in several slightly different formats. For example, some files use commas as field separators, while others use spaces, and still others use a mix of both. Similarly, some record the dates of observations as “2003-08-05″, while others use “Aug 5, 2008″, and so on.
Topics
- Motivation: extract complex/varying data from within strings (real PDB format)
- Basics of regular expressions
- Also cover coding patterns: match, extract, extract-all, extract-and-remainder
- Wrap up by rewriting earlier parser using REs
- Brief mention of parser generators (Yacc, Antlr), which do for REs what REs do for strings
Reading
Coding Tricks 2: First-Class Functions
Requires: basic programming
Introduces: functional programming
Problem: Slicing and Dicing Data
Perhaps you shouldn’t have done such a good job converting those data files: your supervisor now wants to reanalyze the data in some old spreadsheets, and thinks you’d be perfect for the job. At first glance, you’ll have to write a dozen or more programs, all very similar, to do everything she wants; you’d like to write just one.
Topics
- How programs are represented in C or Python
- How a call stack works
- Passing functions as arguments
- Storing functions in data structures
- Map, reduce, filter, etc.
- Default parameter values
- Extra arguments and extra keyword arguments
- Closures (or is this too advanced?)
Reading
Clean Up This Code
Requires: basic programming
Introduces: coding style; refactoring
Problem: This Code Is Unreadable
You finally got your hands on the software used to generate results for a paper that you’ve cited twice, and you can’t make heads or tails of it. One function is 750 lines long; another called i_hope_this_works seems like an exact duplicate of the one above it, and a typical variable name is A (which is either the matrix of regression coefficients or the temperature gradient—it’s hard to tell). It’s a mess: you need to clean it up before you can even start thinking about adapting it.
Topics
- Reading is learning
- A little bit of cognitive psychology
- What does this have to do with programming?
- Naming things
- Scope and size
- Function length
- Code metrics and their shortcomings
- Style tools
- Documentation
- Embedding documentation in code
- Refactoring
- Common refactorings
- Refactoring tools
- Working with legacy code
Reading
Find Duplicate Files
Requires: basic programming; dictionaries; first-class functions; recursion
Introduces: manipulating files and directories; checksums
Problem: Find Duplicate Data Files
One of the grad students in your lab recently moved to Euphoric State University to start a post-doc. She didn’t organize the data on her computer before she left, so your supervisor has asked you to. After poking around for an hour, you have realized that every time she started a new experiment, she copied entire folders of data, then overwrote some of the copies without changing their names. Before you can start making sense of the data itself, you have to figure out which files are redundant, and which are unique.
Topics
- Naive solution: compare each file against each other
- Better: only compare files that are the same size
- Better still: store by size so that finding potential matches is very fast
- Write and test the checking function
- Double back and write the function that finds all the files and their sizes
- What are filesystems and what do they contain?
- Getting properties of filesystem objects
Extensions
- Remove duplicates
- Manipulating files and directories (delete, create, etc.)
- Why moving is better than deleting
- And why the original directory structure should be preserved
- The problem of atomicity
- Solution
- Making file-matching even faster with checksums
- Data file sizes often aren’t distributed randomly
- What a checksum is
- Calculating the probability of collision
- Solution
Reading
Share Work With Colleagues
Requires: nothing
Introduces: version control
Problem: Writing a Paper With Several Other People
Your lab has been working with groups at two other universities to analyze anomalies in the trajectories of deep space probes, and the time has come to write a paper summarizing your findings. The last one produced by this collaboration was 35 pages long, and had 15 authors, 20 figures, and 400 references. It took six weeks to write, half of which was spent tracking down and reconciling bits and pieces that had gone astray in email or been overwritten accidentally. Everyone would like to find a less painful way to get this one written; since you weren’t able to attend the organizational meeting, you’ve been put in charge of figuring out how.
Topics
- Problem #1: collaboration
- Solution: version control
- Problem #2: undoing changes
- Solution: version control (again)
- Which version control system?
- The update/edit/commit cycle
- Reading Subversion output
- Resolving conflicts
- Starvation
- Binary files
- Reverting
- Creating a repository
Note: this lecture will use a GUI like SmartSVN so that students don’t need to know how to use a shell in order to use version control.
Reading
Analyze an Image
Requires: basic programming
Introduces: images; recursion
Problem: Tracking Cometary Impacts
As a side project to keep yourself from going crazy while you’re revising your thesis, you’ve started modeling what happens when comets impact gas giant planets. To validate your models, you’d like to compare them with data from eleven real impacts. For each impact, you have a sequence of between twenty and sixty photographs; what you’d like to do is measure how quickly the shock wave from the impact propagated and then dissipated. Instead of going through the photographs by hand and trace the perimeter of the shock wave using a drawing tool, you would like to write a program to calculate this boundary automatically.
Topics
- Raster vs. vector representations
- How raster images are stored (colors per pixel)
- Basic operations
- Flood fill algorithms
- Edge detection algorithms
Reading
Vectors and Matrices
Requires: basic programming
Introduces: data parallelism
Problem: Simulating Two-Dimensional Cellular Automata
A cellular automaton (CA) is a simple model of some phenomenon that divides the world into cells, then updates the state of each cell simultaneously based solely on its present state and the states of its neighbors. The most famous CA is Conway’s Game of Life, which runs on a two-dimensional grid in which each cell is either “alive” or “dead”. Your task is to write a program to simulate the evolution of a grid of cells according to its rules.
Topics
- Mathematicians don’t iterate—why should computers?
- The idea of data parallelism
- Basic operations on vectors
- Basic operations on matrices
- Reshaping data
- Conditional operations with masks
- A one-dimensional CA
- Replacing conditionals with lookups
- Linear algebra
- Other applications (e.g., list and set comprehensions)
- Why some problems aren’t data parallel
Reading
Make a Program Go Faster
Requires: basic programming
Introduces: profiling; algorithmic efficiency
Problem: Speed Up Simulation of Invasion Percolation
A student intern in your lab wrote a small program last summer to study invasion percolation, a simple model of the way fluid forces its way into fractured rock. It’s useful, but it’s also very slow, particularly for large simulations. Your supervisor would like you to make it run faster so that you can do more and bigger simulations in time for an upcoming paper deadline.
Topics
- The invasion percolation problem
- Ways to make it faster (bounding box)
- Using a profiler to find out where the time actually goes
- How much work has to be done to add a point?
- Better ways: use priority queue to track neighboring points
- Why this is so much better
Reading
- There’s much less on this topic than you’d hope for.
- Writing Efficient Programs
- Algorithms in C
- High Performance Web Sites
Generate Results Automatically
Requires: basic programming
Introduces: build tools; provenance
Problem: Run a Program on Multiple Inputs Automatically Every Night
You are developing new algorithms for finding similarities between DNA records in standard databases. To test them out, you want to set up an overnight sequence query service. Scientists will send sequences to you in email, which you will save in a temporary directory. Each night, your program will process all of the new input files to create matching output files. After checking these in the morning, you will return them to the scientists, then compare them to the results from eariler versions of your program to see if your algorithms are getting better or not.
Topics
- Why to automate routine tasks
- Make pros and cons
- SCons pros and cons
- Simple rules (single target, single action)
- Multiple actions
- Don’t repeat yourself (macros in Make, and why)
- Dependencies
- Dependency graphs
- Shorthand notation (automatic variables in Make)
- Pattern-based rules
- Functions
Extensions
- Sending email programmatically
- Reading mailboxes programmatically
- Cron to run the process automatically
Reading
Get Data Out of a Database
Requires: basic programming
Introduces: relational databases
Problem: Keeping Track of Experiments
After slaving away in a poorly-lit basement office for 19 years, an archivist has created a detailed database of experiments done in the late 1800s and early 1900s. You have a hobbyist’s interest in the history of science, and would like to discover how involved your personal heroes were in certain key discoveries. You are certain this information is in the database; you just need to figure out how to get it.
Topics
- History
- When to use a database
- Using SQL
- Creating a table
- Simple selects
- Sorting
- Joins
- Keys and constraints
- Eliminating duplicates
- Aggregation
- Grouping
- Self joins
- Null
- Using SQL from other languages
- Concurrency
- Transactions
- Nested queries
- Negation
Reading
XML
Requires: recursion
Introduces: hierarchical data
Problem: Importing Another Bunch of Molecules
Your supervisor just acquired more molecule descriptions for you to check. Unfortunately, these aren’t in the same format as the ones you worked with earlier, but are instead stored in XML. Your job is once again to extract information about 700 specific molecules from those files and write it out in a different format for further processing.
Topics
- Why markup?
- Document structure
- Formatting rules
- Basic HTML
- Attributes
- Attributes vs. elements
- External data: images
- Links
- DOM
- Finding nodes and extracting immediate information
- Creating a tree
- Walking a tree
- Modifying the tree
Reading
Object-Oriented Programming
Requires: basic programming
Introduces: classes and objects; abstraction, polymorphism, aggregation, and inheritance
Note: it’s obviously impossible to introduce object-oriented programming in just a couple of hours, but it’s equally impossible to omit this topic—everyone asks for it.
Problem: Simulating a Laboratory
Euphoric State University has published new guidelines on scheduling the use of heavy laboratory equipment. Like many professors, your supervisor is deeply suspicious of change, and wants to prove that the new rules will actually make labs less efficient. Your task is to come up with a way to simulate the interactions between experiments that need several machines.
Topics
- Abstract data types
- Classes and instances
- Defining a class
- Creating an instance
- Defining methods
- Constructors
- Creating members
- Encapsulation
- Polymorphism
- Inheritance
- Overriding methods
- The Liskov Substitution Principle
- Overloading operators
- Discrete event simulation example
- The Template Method design pattern
- The Strategy design pattern
Reading
Represent Information
Requires: databases; object-oriented programming; XML
Introduces: data modeling; entity-relationship diagrams; class diagrams; robustness diagrams
Problem: Store Information About Extra-Solar Systems
You just got a job with the Euphoric State Science Center, and your first assignment is to help create a program that grade school students can use to explore newly-discovered extra-solar planetary systems. Someone else is going to build the graphical interface; your task is to figure out how to store information about planetary systems in a database, and how to represent that information inside a program.
Topics
- Entity-relationship diagrams
- Database design
- First normal form
- Second normal form
- Object-oriented analysis & design using nouns and verbs or CRC cards
- UML class diagrams
- Aggregation vs. inheritance
- Robustness diagrams
- Object/relational mapping (just the concept, not an implementation, or is this too much?)
- XML/RDF (or is this too much too?)
Reading
Build a Desktop User Interface
Requires: object-oriented programming
Introduces: event-driven programming; GUI design
Note: the real focus of this lecture is event-driven programming, rather than GUI design—the latter is primarily a motivation for the former.
Problem: Build a Custom Interface for Configuring a Simulation Program
The invasion percolation program you sped up has become very popular—so popular that people are asking use to add new features. By far the most frequent request is some sort of graphical interface for setting parameters and displaying results. You have therefore decided to build a simple desktop GUI to let people set up and run simulations interactively.
Topics
- What makes user interfaces different
- How event-driven programming works
- A single-button dialog
- Separating models from views
- The observer/observable pattern
- Connecting multiple views
- Layout
- Top 10 graphic design rules
Reading
Create a Web Service
Requires: basic programming; XML
Introduces: HTTP; CGI; RSS
Problem: Mix and Share Data on the Web
The recent discovery at Euphoric State University of a reliable way to produce magnetic monopoles has physicists the world over in a frenzy. Many of them are posting data on their blogs almost hourly, and you’re finding it hard to keep up. You would like to build something that will aggregate this information, weed out duplicates, identify contradictions or inconsistencies, and then publish the results for others to use.
Topics
- Small pieces, loosely joined
- Why distributed is different (partial failure, race conditions, deadlock)
- The Hypertext Transfer Protocol (HTTP)
- HTTP request line
- Headers
- Body
- HTTP responses
- Example: building a simple spider
- Passing parameters
- Special characters
- Handling requests
- Web services
- Maintaining state in files and databases
- Locking
- Templating
- Aggregating RSS feeds
- Publishing RSS feeds
- From CGI to 3G web frameworks
- Security
Reading
- TBD.
Scaling Up
Requires: first-class functions, systems programming, vectors and matrices, performance optimization.
Introduces: parallel programming
Problem: Run a Program Thousands of Times
There’s good news, and there’s bad news… The good news is that your laser jello simulator finally does everything it’s supposed to The bad news is that despite all your optimizations, each run still takes 20 minutes, and you need data from at least 50,000 runs to finish your thesis. That works out to 694.5 days, which is longer than your supervisor is willing to wait. Your challenge now is to find some way to gather the data you need more quickly.
Topics
- Moore’s Law
- The promise of parallelism
- Amdahl’s Law
- Shared memory vs. distributed memory
- Data parallelism (and its limitations)
- Task farming
- Map/reduce
- Cloud computing
- Graphical workflow builders (and their limitations)
Reading
Organize a Team 1: Empirical Results in Software Engineering
Requires: nothing.
Introduces: empirical software engineering; key results
Problem: What Do We Actually Know About Software Development?
As late as the 1950s, many doctors rejected the results of clinical studies, saying that what happened “on average” was of no help when they were faced with a specific patient. By the time Sackett coined the term “evidence-based medicine” in 1992, however, most accepted that decisions about patient care should be based on conscientious, explicit, and judicious use of best evidence. This idea is still foreign to most professional software developers, but any academic who claims that a particular tool or practice makes software development faster, cheaper, or more reliable is now expected to back up that claim with some sort of empirical study. Understanding the results of the best of these studies will help you understand why certain practices and tools should be adopted, and what you can expect from using them.
Topics
- What kind of science is computer science?
- Why choices should be driven by science (and why they’re not)
- Productivity
- Complexity
- The mythical man-month
- Glass’s Law
- Runaway projects1: poor estimation
- Runaway projects: unstable requirements
- Boehm’s Curve
- Reuse
- Design vs. coding
- Conway’s Law
- Effect of language choice
- Where the time goes
- Testing
- Code inspections
- Bugs
Reading
Organize a Team 2: Software Development Lifecycles
Requires: empirical software engineering
Introduces: plan-driven lifecycles; agile lifecycles
Problem: Run a Small Team of Software Developers
Congratulations! You now have a tenure-track position, three graduate students, and enough grant money that you can spend the next year rewriting your simulation and data analysis programs. You have seen several academic software development projects turn sour, so you’d like to give your students a bit of structure—the question is, what kind, and where to start?
Topics
- What’s a lifecycle?
- Boehm’s Curve
- Sturdy (plan-driven) vs. agile
- Which is better for scientists?
- Common practices: elevator pitches
- Common practices: version control, bug tracking, continuous integration, post mortems
- Overview of agile development
- Extreme Programming
- Scrum
- Scrum roles
- Scrum practices
- Scrum artifacts
- What goes wrong
- Overview of sturdy development
- Gathering requirements
- From requirements to features
- Building a schedule
- Development
- Tracking progress
- Managing failure
- What goes wrong
- How to choose
Reading
Discarded Topics
These topics have all been in at least one earlier version of this course, but will probably be left out of the next one:
- Design patterns
- Refactoring
- Packaging code for release
- Reproducible research
- Static and dynamic code analysis tools
- Geographic information systems
- Visualization
- Parallel programming
- Functional languages
- Using the shell: still extremely useful (especially when doing systems administration or almost anything with Linux), but experience shows that it’s very hard to persuade students who are used to GUIs that it’s worth learning.
- Computational complexity: this is a fundamental topic in computer science, but the little bit that all scientists should know can be tucked into the lecture on making programs go faster.
- Handling binary data (with reading data directly from hardware as a motivating example). This is also fundamental computer science, but only a small fraction of course participants will ever need it. One piece of the existing lecture that is more broadly useful is the notion of self-describing data, but that can be taught as part of parsing text files.
- Data structures: stacks and trees naturally come up when processing XML; trees will also be discussed (but not directly manipulated) in the lecture on associative data structures. That leaves queues and graphs; the latter is small enough that it can probably be worked in somewhere, while the latter requires at least one lecture of its own. As with binary data handling, though, graphs are only relevant to s a small fraction of students.
- Integrating with C and Fortran: one previous version of the course showed how to do this both by wrapping legacy code, and by running that code in a subprocess. The former requires a solid grounding in either C or Fortran (which we don’t otherwise assume), while the latter is too specialized to be of general interest.
- Using spreadsheets: most life scientists do most of their data crunching with spreadsheets, as do a large minority of scientists in other disciplines. This topic is omitted primarily because of shortcomings in today’s spreadsheets: Microsoft Excel isn’t freely available (or in the case of Linux, available at all), and support for scripting OpenOffice Calc and Gnumeric is weak.
- Data visualization: frequently requested, but experience shows that industrial-strength libraries like VTK are far too complex for this level of course.
[...] Course Outline [...]
Software Carpentry Version 4 is a Go! « Software Carpentry
2010/03/25 at 17:21
[...] Course Outline [...]
Instructional Design « Software Carpentry
2010/03/26 at 20:37