Author Archive for Adam

A Pipeline is a Rakefile

Update: Mike over at Bioinformatics Zen has written a more thorough post about organised bioinformatics experiments with examples using Rake and DataMapper. Definitely check that out.


image credit: railsenvy.com

Make and its descendants tackle the challenging problem of dependency resolution: working out what has to be rebuilt, and in what order, when something changes. Make is a tried and true Unix utility that does the heavy lifting each time you type “./configure; make && make install” inside a large chunk of open source goodness. Make became such a popular tool because it drastically reduced compilation times for large programs. In compiled languages such as C, each time a source file is changed it needs to be recompiled. Rather than rebuild the entire project every time the source code is changed, an expert (a C programmer in this case) can specify dependencies so that make will rebuild only the files that changed and whatever depends on them. In that sense, it’s easy to take for granted how powerful a Makefile actually is. Make is an expert system that’s ubiquitous in the Unix world.

A makefile has the basic structure:


	target: dependencies
		command 1
		command 2
		...
		command n
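
The decision make takes for each target can be sketched in a few lines of Ruby (the language used for the Rake examples later in this post). This is only an illustration of the timestamp comparison, not make's actual implementation; `out_of_date?` and the file names are made up for the example.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: a target needs rebuilding when it is missing
# or when any dependency was modified more recently than the target.
def out_of_date?(target, dependencies)
  return true unless File.exist?(target)

  target_time = File.mtime(target)
  dependencies.any? { |dep| File.mtime(dep) > target_time }
end

# Demo in a throwaway directory with made-up file names.
Dir.mktmpdir do |dir|
  source = File.join(dir, "main.c")
  object = File.join(dir, "main.o")
  now = Time.now

  FileUtils.touch(source, mtime: now - 60)
  puts out_of_date?(object, [source])  # true: main.o does not exist yet

  FileUtils.touch(object, mtime: now)
  puts out_of_date?(object, [source])  # false: main.o is newer than main.c

  FileUtils.touch(source, mtime: now + 60)
  puts out_of_date?(object, [source])  # true: main.c changed after the build
end
```

Everything else a build tool does is bookkeeping around this one comparison: which targets exist, what they depend on, and which commands bring them up to date.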

Which brings us to the actual point of this post: how to use Makefiles in bioinformatics. There’s a discussion on nodalpoint from 2007 that calls for using `make` more often when programming pipelines. This makes perfect sense: in bioinformatics we build pipelines all the time.

Sequence analysis

Blast search -> Multiple sequence alignment -> Phylogenetic analysis

Homology Modeling

Find Template -> Align target-template -> Build model

Molecular Dynamics

Solvate -> Equilibrate -> Simulate -> Analyze

Those aren’t the most detailed examples but hopefully you get the idea. Each step is dependent on the previous step. If one single step takes a lot of computation time, it would be nice to skip that step if it’s already been done. There’s also a benefit to encoding expert knowledge. For example, how do you convert a .fasta sequence file to a .pir sequence file? By specifying a rule, a build system will know what to do every time it sees a ‘*.fasta’ file in your project.


	%.pir: %.fasta
		./fasta2pir $< $@

But Makefile syntax can be tricky (is that a tab or a space?), and it's not a full-blown programming language by itself. Which is why I fell in love with Rake.

Anyone who has tried out Ruby on Rails has probably typed something like "rake db:migrate" without realizing what rake is all about. Rake is Ruby Make. Rake was designed to be just like make, but with all the power and flexibility of the Ruby programming language. A Rakefile is simply a set of tasks, each of which can have one or more dependencies. Unlike make, rake is an internal DSL: it morphs Ruby into a build language without losing its utility as a general purpose language.

A simple Rakefile in your bioinformatics project could do something like this:


	task :queryDatabase do
	  puts "Fetched Records"
	end

	task :formatData => :queryDatabase do
	  puts "Converted to XXX format"
	end

	task :createPlot => :formatData do
	  puts "Generated a Figure"
	end

This says "before I formatData I must queryDatabase", and "before I createPlot I must formatData". So as you might expect, when you type:


	$ rake queryDatabase
	Fetched Records
	$ rake formatData
	Fetched Records
	Converted to XXX format
	$ rake createPlot
	Fetched Records
	Converted to XXX format
	Generated a Figure
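
If you're curious how that ordering falls out, here is a toy task runner in plain Ruby. It is not Rake's implementation (the `MiniRake` class is made up for this post, and it has no cycle detection); it just shows the idea of running each prerequisite first and never running a task twice in one session.

```ruby
# A toy task runner: NOT Rake's code, just the idea behind it.
class MiniRake
  def initialize
    @tasks = {}  # task name => { deps: [...], action: block }
    @done  = {}  # task names already run in this session
  end

  # Register a task with optional prerequisites and an action block.
  def task(name, deps = [], &action)
    @tasks[name] = { deps: Array(deps), action: action }
  end

  # Run prerequisites depth-first, then the task itself.
  # Returns the messages produced, in execution order.
  def invoke(name, log = [])
    return log if @done[name]

    spec = @tasks.fetch(name) { raise ArgumentError, "unknown task: #{name}" }
    spec[:deps].each { |dep| invoke(dep, log) }
    log << spec[:action].call if spec[:action]
    @done[name] = true
    log
  end
end

runner = MiniRake.new
runner.task(:queryDatabase)                { "Fetched Records" }
runner.task(:formatData, [:queryDatabase]) { "Converted to XXX format" }
runner.task(:createPlot, [:formatData])    { "Generated a Figure" }

# Prints the three messages in dependency order.
puts runner.invoke(:createPlot)
```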

And our Fasta rule in Rake would look like:


	rule '.pir' => ['.fasta'] do |t|
	  sh "./fasta2pir #{t.source} #{t.name}"
	end

Pretty cool? Obviously these tasks don't actually do much other than show how rake resolves dependencies for you, which can be a pretty powerful thing for hacking together a pipeline.
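
One caveat: plain `task` blocks run every time they are invoked. To get the "skip it if it's already been done" behaviour from earlier, Rake has `file` tasks, which compare timestamps the way make does. Here is a minimal sketch, assuming each step writes a file named after its task; the file names and contents are placeholders I made up.

```ruby
require "rake"

# Hypothetical file names; each step writes the file it is named after.
file "records.txt" do |t|
  File.write(t.name, "acgt\n")  # stand-in for an expensive database query
  puts "Fetched Records"
end

file "formatted.txt" => "records.txt" do |t|
  File.write(t.name, File.read(t.prerequisites.first).upcase)
  puts "Converted to XXX format"
end

file "figure.txt" => "formatted.txt" do |t|
  File.write(t.name, "plot of #{File.read(t.prerequisites.first)}")
  puts "Generated a Figure"
end

task default: "figure.txt"
```

Saved as a Rakefile, the first `rake` should run all three steps and a second `rake` should print nothing, because every output is newer than its input. Touch `formatted.txt` and only the figure step should run again.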

Rake resources:

Dynameomics: Mass annotation of protein dynamics

Just in case you need another -omics in your biotech vocabulary. Dynameomics is an effort by the Daggett group at the University of Washington to

characterize the native-state dynamics and folding/unfolding pathways of representatives of all known protein folds by way of molecular dynamics simulations

Three successive articles have been published in Protein Engineering Design & Selection to describe over 3000 long molecular dynamics simulations, the computational workflow, and data mining capabilities of Dynameomics. Dynameomics has applets for visual analysis and even high-quality movies of their MD trajectories!

Papers:

Video: 4PGA unfolding movie

High-performance data appliances (Netezza)

This afternoon I sat through a presentation from a few guys at Netezza. They were here to discuss their system for high-performance data analytics. What they’ve effectively done is build a large database machine with some special hardware to accelerate database queries via parallel processing nodes. These are some notes I jotted down:

Architecture:

  • SMP Host
  • 100+ specialized processing units per cabinet (they named them SPU’s for “snippet processing units”)
  • SPU’s have their own PPC CPU, commodity disk, memory, and an FPGA
  • GigE networks between SPU’s
  • SMP Host partitions queries and brokers activity to the processing nodes
  • Hardware fault-tolerant (SPU’s can be hotswapped)
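
To make the "host partitions queries, nodes do the work" idea concrete, here is a generic scatter/gather sketch in Ruby. It is purely my own illustration and says nothing about how Netezza actually implements it: the `Row` struct, `sum_by_region`, and the sample data are all made up, and Ruby threads merely stand in for separate processing nodes.

```ruby
# Illustration only: a generic scatter/gather pattern, not Netezza's design.
Row = Struct.new(:region, :amount)

# "Host" side: split the rows, hand each slice to a worker, merge the partial sums.
def sum_by_region(rows, workers: 4)
  slice_size = [(rows.size.to_f / workers).ceil, 1].max

  threads = rows.each_slice(slice_size).map do |slice|
    # "Worker" side: aggregate only the local slice.
    # (Threads are a stand-in for nodes; they do not add CPU parallelism in MRI.)
    Thread.new do
      slice.each_with_object(Hash.new(0)) { |row, sums| sums[row.region] += row.amount }
    end
  end

  threads.map(&:value).each_with_object(Hash.new(0)) do |partial, total|
    partial.each { |region, sum| total[region] += sum }
  end
end

rows = [
  Row.new("east", 10), Row.new("west", 5), Row.new("east", 7),
  Row.new("west", 1),  Row.new("north", 2)
]

# Totals per region: east 17, west 6, north 2.
p sum_by_region(rows, workers: 2)
```

The point is that each worker only ever touches its own slice, so the expensive scan stays local and only small partial results travel back to the host.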

I’ll admit my skepticism tends to mount against any speaker that spends a lot of time at the outset with a marketing pitch when the audience is full of scientists. Do scientists need to be reminded that data sizes are growing? Or that enterprise X, Y, and Z are already using your product? Just show me how it works.

I did a quick search across my feeds to see if anyone has written about Netezza and (not surprisingly) there is a post over at Computing at Scale. It appears there are similar efforts from Teradata, Greenplum, and DATAllegro in this space. I can imagine how a system like Netezza’s might complement more traditional supercomputing. There’s certainly a big effort to commercialize the “new era of HPC” but the technologies that come out of it are business-driven and not science-driven.

Around the web 3/21/08

Around the web, week of March 21, 2008

Journals

Big science from Andrej Sali and David Baker:

  • The molecular architecture of the nuclear pore complex
  • De Novo Computational Design of Retro-Aldol Enzymes

Blogs

  • Nature archive visualized – a Processing sketch to visualize the keywords from Nature over the last 30 years. Some of the more spurious terms could probably be cleaned up but even as a draft the effect is pretty neat.
  • Research streaming is born. Mike from Bioinformatics Zen is auto-publishing his svn commit messages and uploading figures he generates to Flickr. This would be well suited to someone like me who has too many projects going on to stop and dedicate time to blog about them here.
  • Universal Parallel Computing Research Centers are being heavily funded by Microsoft and Intel. One is at the University of Illinois at Urbana-Champaign, well known for the Charm++ parallel library and the super-scalable NAMD molecular dynamics package built on top of it. The other will be located at UC Berkeley.
  • The End of the Relational era, is SQL dying? Bill McColl of Computing at Scale says it is. I would argue that relational databases have received the golden hammer treatment over the years. But I totally agree with his prediction that SQL will ultimately be replaced by DSLs having implicit data-parallelism.
  • The YouTube API has been updated with some significant improvements for developers. Uploads, comments, and video playlists can all be manipulated outside of YouTube. This makes a convincing case to leverage the massive YouTube user base if your site deals with video content.

Tech

  • I’ve finally moved most of my projects from SVN to Git. I’m now a ‘branch-a-holic’ and git definitely fits my workflow better than subversion now that I’m used to it.
  • Capistrano is typically used for Rails deployment, but I’m finding it’s good for just about anything you want to run across multiple remote hosts. This is a great mini-language for cluster admins who don’t want to struggle with something like mpirun.

Biorobotics: Snake Robots! [Video]


These things are being developed by the Robotics Institute on our campus. I’m partially amazed and partially terrified. I’ve heard they work wirelessly and they want to have snakes where each module has a camera so they can break apart into independent pieces, spread, and reassemble automatically. Some of the climbing behavior is pretty impressive…

Read more about this technology at the Modsnake website.

Around the web 3/7/08

Around the web, week of March 7, 2008

A domain specific language for screencasting

Two topics that I have been reading about a lot lately, Domain Specific Languages in Ruby and screencasting, have converged to create a very cool little project called Castanaut. I found Castanaut via Peter Cooper of Ruby Inside. Castanaut is essentially a programming language for screencasts. So in Castanaut you can write things like this:


launch "Safari", at(10, 10, 800, 600)
type "http://www.inventivelabs.com.au"
hit Enter
pause 2
move to(100, 100)
move to(200, 100)
move to(200, 200)
move to(100, 200)
move to(100, 100)
say "I drew a square!"

Thanks to the flexibility of Ruby, you can write your screenplay as a script and run it to automatically create a screencast. How cool is that? While this might take some of the personal touch away from screencasts, it could also be a powerful tool for those who need to create them in a more systematic way.
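
For the curious, the trick that makes a script like that valid Ruby is evaluating a block against an object that defines `move`, `to`, `say` and friends. Below is a toy version that only records the steps instead of driving the screen. It is not Castanaut's implementation; the `ToyScreenplay` class is made up here to show the `instance_eval` idea.

```ruby
# A toy internal DSL in the same spirit. This is NOT Castanaut's implementation.
class ToyScreenplay
  attr_reader :steps

  def initialize
    @steps = []
  end

  # Evaluate the block with a fresh screenplay as `self`, so bare words like
  # `move` and `say` resolve to the methods below.
  def self.record(&script)
    screenplay = new
    screenplay.instance_eval(&script)
    screenplay.steps
  end

  # Reads nicely as `move to(x, y)`; it just packages the coordinates.
  def to(x, y)
    [x, y]
  end

  def move(point)
    @steps << [:move, *point]
  end

  def pause(seconds)
    @steps << [:pause, seconds]
  end

  def say(text)
    @steps << [:say, text]
  end
end

steps = ToyScreenplay.record do
  move to(100, 100)
  move to(200, 100)
  pause 2
  say "I drew a square!"
end

# Each recorded step is an array such as [:move, 100, 100].
steps.each { |step| p step }
```

This is the same "internal DSL" idea that Rake relies on: the script is ordinary Ruby, so loops, variables, and methods all come for free.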

What would you do with a million CPU’s?

There’s a new podcast on Futures in Biotech with Dr. Pande from Folding@Home. Macresearch summarized it well:

  • How a bunch of Sony PS3s have become the largest component of the world’s fastest computer
  • The challenges of distributed computing, and in particular how data storage and CPU usage can actually complement each other
  • After the hype in the 80s around computational modeling of protein structure, the computational power available today could finally make that hype a reality
  • How to take a non-parallel task and transform it into a series of computational chunks (a.k.a. how to make a baby in 1 day with 270 women)
  • How modeling of protein structure will be able to get more into the dynamics of protein conformational changes
  • What would you do if you had 250,000 CPUs?

I really like the final point, “What would you do with 250,000 CPU’s”, because it’s an important question. Petascale computing has arrived but most applications aren’t ready to scale to thousands or millions of cores. Folding@Home is as much a distributed computing project as it is a biomedical one. What they’ve been able to do is treat simulations as data and use Bayesian data mining techniques to put together the whole picture with surprising efficiency. It’s a clever workaround for Folding@Home’s “supercomputer”, which is severely limited by network latencies and individual agents with slow hardware compared to ‘real’ supercomputers. Finally, he reports that PS3’s and GPU’s are achieving 20-30x acceleration. Exciting stuff!

    image taken from Flickr, CC licence

    The Low-Information Diet

    Learning to ignore things is one of the great paths to inner peace.
    -Robert J. Sawyer, Calculating God

    Over the holidays I used the time off to finally read the excellent book by tech entrepreneur Timothy Ferriss entitled The 4-Hour Workweek. Among his many techniques for increasing effectiveness and lifestyle design, Tim prescribes a “Low-Information Diet”. Being away from the lab was a perfect opportunity to test out an immediate one-week media fast. The rules are pretty simple:

    • No newspapers, magazines, or nonmusic radio
    • No news at all
    • No television
    • No reading except one hour of fiction
    • No Web surfing

    This really exposed a bad habit of mine: unnecessary reading. My attention is almost constantly consumed by Google Reader as I unenthusiastically scour blogs, news, forums, and journals for several hours per day, rendering me much less effective at the most important tasks. After following the rules above for over a week, I feel rejuvenated. There’s a 9-day information gap in my Google Reader stats that I am quite proud of.

    Most Innovative Use of HPC in Life Sciences

    The PSC and NRBSC have made the news again, this time in HPCwire. They’ve posted the Readers’ and Editors’ Choice Awards for SC07, and the WiiMD demo earned us “Most Innovative Use of HPC in Life Sciences”.

    Further Reading:
    WiiMD: Bowling on Big Ben
    Engadget: wiimote used in buckyball bowling and other educational simulations