Bleeding Edge Biotech

Bioinformatics and Big Iron

A Pipeline Is a Rakefile

Update: Mike over at Bioinformatics Zen has written a more thorough post about organised bioinformatics experiments with examples using Rake and DataMapper. Definitely check that out.

Make and its descendants tackle the challenging problem of dependency resolution: working out what has to be rebuilt, and in what order, when something changes. Make is a tried and true Unix utility that does the heavy lifting each time you type “./configure; make && make install” inside a large chunk of open source goodness. Make became such a popular tool because it drastically reduced compilation times for large programs. In compiled languages such as C, each time a source file is changed it needs to be recompiled. Rather than rebuild the entire project every time the source code changes, an expert (a C programmer in this case) can specify dependencies so that make rebuilds only the files that changed and whatever depends on them. It’s easy to take for granted how powerful a Makefile actually is. Make is an expert system that’s ubiquitous in the Unix world.

A makefile has the basic structure:

    target: dependencies  
        command 1  
        command 2  
              .  
              .  
              .  
        command n  
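To make the rebuild decision concrete, here is a rough Ruby sketch of the check make performs for each target: rebuild if the target is missing or older than any of its dependencies. The helper names `out_of_date?` and `build` and the file names are made up for illustration, and real make layers pattern rules, phony targets and much more on top of this.

```ruby
require "tmpdir"

# A target is stale if it is missing, or if any dependency
# was modified more recently than the target itself.
def out_of_date?(target, dependencies)
  return true unless File.exist?(target)
  target_time = File.mtime(target)
  dependencies.any? { |dep| File.mtime(dep) > target_time }
end

# Run the block only when the target needs rebuilding.
# Returns true if the block ran, false if the step was skipped.
def build(target, dependencies)
  return false unless out_of_date?(target, dependencies)
  yield
  true
end

Dir.mktmpdir do |dir|
  source = File.join(dir, "main.c")   # hypothetical file names
  object = File.join(dir, "main.o")
  File.write(source, "int main(void) { return 0; }\n")

  # Stand-in for a real compiler call.
  compile = -> { File.write(object, "placeholder object code") }

  puts build(object, [source], &compile)  # target missing, so the step runs
  puts build(object, [source], &compile)  # target is current, so it is skipped
end
```

Everything else in a build tool is bookkeeping around this one comparison of modification times.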

Which brings us to the actual point of this post: how to use Makefiles in bioinformatics. There’s a discussion on nodalpoint from 2007 that calls for using make more often when programming pipelines. This made perfect sense. In bioinformatics we build pipelines all the time.

Sequence analysis: BLAST search –> Multiple sequence alignment –> Phylogenetic analysis

Homology modeling: Find template –> Align target-template –> Build model

Molecular dynamics: Solvate –> Equilibrate –> Simulate –> Analyze

Those aren’t the most detailed examples but hopefully you get the idea. Each step depends on the previous step. If a single step takes a lot of computation time, it would be nice to skip that step if it’s already been done. There’s also a benefit to encoding expert knowledge. For example, how do you convert a .fasta sequence file to a .pir sequence file? By specifying a rule, a build system will know what to do every time it sees a ‘*.fasta’ file in your project.

    %.pir: %.fasta  
    	./fasta2pir $< $@  

But Makefile syntax can be tricky (is that a tab or a space?), and it’s not a full-blown programming language by itself. Which is why I fell in love with Rake.

Anyone who has tried out Ruby on Rails has probably typed something like “rake db:migrate” without realizing what rake is all about. Rake is Ruby Make. Rake was designed to be just like make, but with all the power and flexibility of the Ruby programming language. A Rakefile is simply a set of tasks, each of which can have one or more dependencies. Unlike make, rake is an internal DSL: it morphs Ruby into a build language without losing its utility as a general purpose language.

A simple Rakefile in your bioinformatics project could do something like this:

    task :queryDatabase do  
      puts "Fetched Records"  
    end  

    task :formatData => :queryDatabase do  
      puts "Converted to XXX format"  
    end  

    task :createPlot => :formatData do  
      puts "Generated a Figure"  
    end   

This says “before I formatData I must queryDatabase”, and “before I createPlot I must formatData”. So, as you might expect, running any task first runs everything it depends on:

    $ rake queryDatabase  
    Fetched Records  
    $ rake formatData  
    Fetched Records  
    Converted to XXX format  
    $ rake createPlot  
    Fetched Records  
    Converted to XXX format  
    Generated a Figure  
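One caveat: plain `task`s like the ones above run every time they are requested. To get make's habit of skipping finished work, Rake also provides `file` tasks, which only run when the target file is missing or older than one of its prerequisites. Below is a small sketch under a few assumptions: the file names and the `Pipeline` module are hypothetical, the steps write tiny files instead of doing real work, and the module wrapper is only there so the snippet runs as a plain Ruby script rather than a Rakefile.

```ruby
require "rake"
require "tmpdir"

module Pipeline
  extend Rake::DSL   # makes `file` usable outside a Rakefile

  # Defines a two-step pipeline inside `dir` and returns the final target.
  # `log` collects the names of the steps that actually execute.
  def self.define(dir, log)
    records   = File.join(dir, "records.txt")     # hypothetical file names
    formatted = File.join(dir, "formatted.txt")

    # Step 1: stand-in for an expensive database query.
    file records do |t|
      log << :query
      File.write(t.name, "ACGT\n")
    end

    # Step 2: depends on the output file of step 1.
    file formatted => records do |t|
      log << :format
      File.write(t.name, File.read(t.prerequisites.first).downcase)
    end

    formatted
  end
end

Dir.mktmpdir do |dir|
  log = []
  target = Pipeline.define(dir, log)

  Rake::Task[target].invoke     # nothing exists yet, so both steps run
  Rake::Task[target].reenable   # allow a second invoke in the same process
  Rake::Task[target].invoke     # output is up to date, so the step is skipped
  p log
end
```

In a real Rakefile you would drop the module and the temporary directory, write the `file` tasks at the top level, and let the `rake` command decide what is stale on each run.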

And our FASTA rule in Rake would look like:

    rule '.pir' => ['.fasta'] do |t|  
      sh "./fasta2pir #{t.source} #{t.name}"  
    end   
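To see the rule above do something without the `fasta2pir` script, here is a self-contained sketch that swaps the shell call for a few lines of Ruby. Treat the details as assumptions: `fasta_to_pir` is a deliberately simplified converter for a single protein record, the `Conversion` module and the sample record are invented for the example, and the module wrapper only exists so this runs as a plain script instead of a Rakefile.

```ruby
require "rake"
require "tmpdir"

# Simplified FASTA -> PIR conversion for one protein record.
# A PIR entry starts with ">P1;ID", followed by a description line,
# and the sequence is terminated by "*".
def fasta_to_pir(fasta_text)
  header, *sequence_lines = fasta_text.lines.map(&:strip).reject(&:empty?)
  id, description = header.delete_prefix(">").split(" ", 2)
  ">P1;#{id}\n#{description}\n#{sequence_lines.join}*\n"
end

module Conversion
  extend Rake::DSL   # makes `rule` usable outside a Rakefile

  # Any "X.pir" can be built from the "X.fasta" sitting next to it.
  rule ".pir" => ".fasta" do |t|
    File.write(t.name, fasta_to_pir(File.read(t.source)))
  end

  # One .pir target for every .fasta file found in `dir`.
  def self.targets(dir)
    Rake::FileList[File.join(dir, "*.fasta")].ext(".pir")
  end
end

Dir.mktmpdir do |dir|
  # Toy input record, not real sequence data.
  File.write(File.join(dir, "seq1.fasta"),
             ">seq1 example protein (toy record)\nMKVLAAGI\nTNVKAAWG\n")

  # Asking for a .pir file makes Rake match the rule and run the block.
  Conversion.targets(dir).each { |pir| Rake::Task[pir].invoke }
  puts File.read(File.join(dir, "seq1.pir"))
end
```

The `FileList` call is what turns a single rule into "convert every FASTA file in the project": it lists the inputs, `ext` renames them to the outputs you want, and the rule fills in how to get from one to the other.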

Pretty cool? Obviously these tasks don’t actually do much other than show how rake resolves dependencies for you, which can be a pretty powerful thing for hacking together a pipeline.

Rake resources: