Wednesday, January 31, 2007

Proxies and Delegation vs. Existential Types

Haskell may not be especially object-oriented, but all of the core concepts are still there. They just look a little different.

Consider the following problem: you want to manage multiple kinds of "archive" files (.zip, .tar, .tar.gz, .tar.bz2, and so on). You want to write one function that returns the most appropriate kind of archive, and functions that work on generic archive objects.

The problem is that the type checker forces you to return a concrete type. Something like this isn't allowed past the type checker:
makeArchive :: (Archive a) => String -> a
Why not? Because with that signature, the caller gets to choose which Archive instance comes back, while the body of makeArchive wants to decide that for itself at run time. Sure, it looks similar to read:

read :: (Read a) => String -> a
Except that the type checker can infer which version of read you're trying to use. If you're expecting to read an integer, the type checker will infer that you want the version of read that returns an integer. If you're expecting to read some indeterminate type, the type checker will give you an error ("Ambiguous type variable..."):
works = do s <- getLine
           let i = read s
           print (i + 1) -- the arithmetic pins down what read must return

doesntWork = do s <- getLine
                let i = read s
                print i -- nothing constrains what read returns
The problem is that a function like makeArchive cannot return just any instance of a typeclass like Archive; it needs to return one specific type. Another way of looking at it is that you cannot write a version of makeArchive that returns one instance of Archive in one place, and a different instance of Archive in another place. That would be like writing a function that sometimes returns strings, sometimes returns integers, and sometimes returns lists of tuples -- such a function is simply not allowed by the type system.

In object oriented programs, a common solution would be to return a proxy object that delegates all of its operations to some specific object. Fortunately, that's exactly what we need to do in Haskell. All of the necessary types, including the proxy type, will be related to each other because they will be members of the same typeclass:
class Archive a where
    getContents :: a -> [String]
    ...

data ZipArchive = ...
instance Archive ZipArchive where
    getContents = ...

data TarArchive = ...
instance Archive TarArchive where
    getContents = ...

data TarGzArchive = ...
instance Archive TarGzArchive where
    getContents = ...

data TarBz2Archive = ...
instance Archive TarBz2Archive where
    getContents = ...
Next, we need to define a proxy type that will wrap any type that is an Archive. To do this, we need to use existential types:
{-# OPTIONS -fglasgow-exts #-}
data ArchiveProxy = forall a. (Archive a) => ArchiveProxy a

instance Archive ArchiveProxy where
    getContents (ArchiveProxy a) = getContents a
Now, we can write makeArchive to create a specific concrete type (like ZipArchive, TarArchive, or whatnot), and wrap that in the ArchiveProxy constructor to return something of type ArchiveProxy:
-- isSuffixOf comes from Data.List
makeArchive :: String -> ArchiveProxy
makeArchive fn
    | ".zip" `isSuffixOf` fn = ArchiveProxy (ZipArchive ...)
    | ".tar" `isSuffixOf` fn = ArchiveProxy (TarArchive ...)
    | ...
The value that's returned by makeArchive is opaque, and can perform any methods that are defined by the Archive typeclass. The benefit is that functions can be written in terms of this generic proxy, but always execute the most specific operation for the archive under consideration.
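
For example, here is a minimal sketch of a function written purely against the proxy (my own illustration, not part of the original code; summarize is a hypothetical name, and the snippet assumes the module hides the Prelude's getContents so the class method is unambiguous):
import Prelude hiding (getContents)

-- A generic function that only knows about ArchiveProxy, yet always
-- dispatches to the concrete archive wrapped inside the proxy.
summarize :: ArchiveProxy -> String
summarize a = show (length (getContents a)) ++ " entries"

main :: IO ()
main = putStrLn (summarize (makeArchive "backup.tar"))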

All of this is explained in the Haskell Wiki under existential types, but that discussion approaches the topic from an academic and mathematical perspective. I suspect many professional programmers would be more comfortable with this object oriented view of the world, using proxies and delegates. The terms are different, but the concept is the same.

Ruby vs. Haskell: Choose what works

I'm working on a Ruby on Rails project for $WORK at the moment. The bulk of the work involves modeling a large corpus of XML documents, extracting metadata, and providing a browse-and-edit interface for that metadata. Standard webapp. Nothing special.

Except that the work I'm focused on is the workflow system, the part that actually reads documents, parses the metadata, and shoves it into the database.

For the front end, it makes sense that the application uses the full Rails stack: ActiveRecord, ActionController, and the like. It's just a webapp, and the CPU overhead of running the application code in Ruby doesn't matter when compared to the overhead of talking to the database and the client. The advantages of using Ruby on Rails far outweigh the meager amount of hardware it requires.

However, the backend is all about throughput, and the added overhead of running Ruby is quite noticeable. Doubly so because it uses Rails' ActiveRecord models, which spend a lot of time doing reflection on the fly. Yet none of that is really necessary; all the backend does is read the filesystem, parse XML, and stuff it into a database. The logic for all three parts is pretty static, and there's no real benefit to using a highly dynamic runtime, especially when the cost in CPU overhead is so noticeable.

In the bad old days, problems like this had one solution: rewrite the code in C.

Today, there's an alternative: rewrite the code in Haskell.

The goal then, as now, is to remove the interpreter overhead from a long running, high throughput process. In the bad old days, that meant leaving features like polymorphism and garbage collection behind, and dealing with segfaults and memory leaks just to get something close to acceptable performance.

Today, none of those tradeoffs are necessary. Haskell compiles down to machine code and eliminates the interpreter overhead (Thank You, Thank You, Thank You, Simon Peyton Jones!). But it also provides garbage collection and polymorphism. It's the best of both worlds: a high level language that compiles down to a native executable. Additionally, once a program gets past the type checker, the compiler pretty much guarantees that segfaults go away. Time and space leaks can still occur, but the profiler can help isolate the misbehaving code with a little effort.

The last time I came across a problem like this, nearly a decade ago, there really was no way out. Rewriting a slow Perl program in C was generally seen as distasteful, tedious, error-prone, but necessary. Today, rewriting a slow Ruby program in Haskell is one of many options available, and certainly isn't that distasteful or tedious anymore. Perhaps in another decade, there will be more alternatives. Perhaps this kind of problem will just go away. Who knows.

While at Rails Edge, I mentioned this strategy to a few people, all of whom were working on Rails apps. No one disagreed that Rails can be a poor choice for backend processing, because of the high CPU overhead and low throughput. Rewriting a Ruby app in Haskell certainly seemed to make sense to everyone I talked to, at least in this one particular scenario.

Haskell: the open secret

It seems like everyone is turning onto Haskell these days.

At Rails Edge last week, I saw a few telltale signs that some of the speakers (including a few members of the Rails core team) were playing with Haskell. In one case, a speaker was flipping through TextMate projects, and briefly displayed one project named "Haskell". In another case, the presenter's browser had a link to All About Monads front and center in the bookmarks bar.

Of course, I had to take the opportunity to see why these speakers were interested in Haskell. One speaker was looking into Parsec for some insights into language design (for DSLs, probably), while another was revisiting the language after he tried to learn it a few years ago.

It turns out that a few members of the Rails team have chosen Haskell as their language of the year. Nothing formal, just a bunch of folks who hang out on IRC, periodically trading bits and pieces of Haskell.

Somehow, I think this bodes well for both Rails and Haskell. The more people who actively look into Haskell, the easier it gets for others to follow. And the more people who take ideas from Haskell in order to apply them to other projects, the stronger those projects get.

(It's interesting that the Pragmatic Programmers put forth the idea of language-of-the-year in 2002, dubbed Haskell the language of the year, and haven't updated it in 5 years. ;-)

Tuesday, January 30, 2007

Is eval really necessary?

Last week, I went to The Rails Edge mini-conference for $WORK. The event started at a pretty high tempo, with Dave Thomas’ talk on Metaprogramming Ruby. It’s a common topic in the Ruby community, and hyper-relevant at any Rails event, since Rails is all about metaprogramming.

Ruby provides a small but complete set of tools for metaprogramming: blocks (closures), lambda, define_method, and various forms of eval. However, Dave was quick to mention that eval is generally considered to be bad form, and should be used sparingly, when no alternative exists. Ruby also has a few nifty features that simplify metaprogramming, like an extensive set of hook methods and open classes.

Perhaps the biggest example of Ruby metaprogramming in action is the Rails ActiveRecord class. ActiveRecord works through inheritance, where the base class intuits the name of the table it models (based on the database used in the current project’s environment), and figures out the rest as needed.

Here’s a rudimentary ActiveRecord model:

class User < ActiveRecord::Base
end

If there are additional relationships in the database, those relationships can be added by invoking (inherited) class methods on the model, in a rather declarative fashion:

class User < ActiveRecord::Base
  has_many :purchases
  belongs_to :group
end

All the details are filled in through inheritance, method_missing, and/or direct modification of the subclass. Only through a strict discipline of metaprogramming is this magic possible.

But that’s not the big insight. The big insight is that all of this deep magic can be done without eval.

The big win here is metaprogramming, and Ruby and Rails demonstrate together that you don’t (always) need eval or macros to do sophisticated metaprogramming. Ruby’s tools boil down to various kinds of closures: blocks and lambda produce proc-flavored closures, while define_method creates a closure that is explicitly attached to a receiver that will provide a self reference when invoked.

It wasn’t until this point that I realized that Haskell’s support for metaprogramming was as complete and full featured as Ruby’s, and that eval isn’t strictly necessary for doing complex code generation. Even though Haskell doesn’t have eval, it does have lambdas, type classes, higher order functions, combinators, sections, partial application, and many more primitive operations that support metaprogramming. Not only do these features make metaprogramming possible, they make it pervasive all the way down to the prelude.

And for those times when Haskell’s various flavors of closures don’t do the trick, there’s always Template Haskell (which, I have to admit, I’ve never needed to use).

Wednesday, January 10, 2007

My Journey to Haskell

I started learning Haskell about two years ago, when I was working on a high visibility project that simply had to ship. My subproject was the linchpin of a significant multi-year effort to bring a big and important product to the web. Dozens of people spent years designing, refining, and building this application. Yet if my small contribution, written in Haskell, didn't work, then the entire effort would have been worthless.

If that wasn't enough pressure, this was my first project in Haskell. I was working alone, learning the language as I went, with no one to help me.

But before I talk about how Haskell saved the day, I need to talk about this application and the overall platform to set the story in its proper context.



As I mentioned, this project was a web application. It built on an internal platform of Perl, Tcl, C, SQL, and Javascript, and used copious amounts of HTML, XML and CSS for good measure. The platform was about five years old at the time, and it powered multiple live products. The platform did a good job satisfying its design criteria: supporting a family of products that let paying customers search and browse databases full of complex documents.

By this time, upper management realized they had invested five years building this platform, and now it was time to start seeing some returns. All of the building blocks were ready, so it shouldn't be too difficult to add new content, make some tweaks, launch a new product, and watch the revenue pour in. Nothing out of the ordinary here.

Of course, it's never that simple. The new product stretched the platform in every conceivable direction: more documents, more kinds of documents, richer documents, and more ways to search, browse and view those documents. For the most part, each of these changes was quantifiable. Lots of work, but easy enough to put into a project plan.

All except for one small, insignificant piece: the query language.



Each product on this platform exposed a simplified full text query language for searching within that product. These query languages may have varied slightly from product to product, but they were all more complex than simply stringing words together to make a search phrase, and less complex than, say, SQL. Users could write a query in this language directly, or use a query builder to formulate one more easily. The browser then sends the query back to the application on the web server, where it gets translated into something the database understands. The translated query goes to the database, the database returns some results, and the results get formatted and sent back to the browser.

This setup is very common. People started building systems like this decades ago. The only difference between applications such as these and your average web application is that the query language is exposed as part of the user interface. Searching the database is a key part of the system, not something hidden behind a bunch of point-and-click forms.

The new product under development differed from others on the platform in a few important ways. Documents were larger, there were more of them, and they had more precise markup. Searching these documents demanded a query language that would allow users to take advantage of the added markup. Without a rich query language, searches would be so imprecise that they (and by extension, the entire product) would be worthless.

The problem was that none of the existing query languages on this platform sufficed. Furthermore, none of the code to parse the existing query language(s) sufficed. And it wasn't really understood if this brand new, rich query language could be supported.

Now, I want to be clear that a lot of work went into this product, which did ship (more or less) on time. The amount of work the entire team did to build the user interface, to extend the database, and augment the core platform was substantial. The point I'm making is that all of that work was quantifiable. Even if the estimates were off by a factor of 2 or 4, it was possible to list all the tasks that needed to be done.

In contrast, there was no way to estimate the effort to write a program that translated the new rich query syntax into the database's query syntax. To make matters worse, the requirements for this language were quite fuzzy in some places and non-existent in others. For example, the specifications gave a few example queries and some desired results, but no grammar. When requirements were available, they were often contradictory.

The project architect was deeply concerned about this. Every other unknown on the project could be handled as a simple matter of programming. Everything except this brand new parser that translated an under-specified query language into database queries.

This was a big gaping hole in the project plan, and it threatened to derail the entire project.



As the deadline loomed, the architect and I started discussing how to write this query translator. The application code running on the web server was mostly Tcl, but writing a parser in Tcl seemed foolish. We could try, but there was a big risk that (a) it might not work, (b) it would work, but it would be slow, or (c) it would work but be difficult to maintain as requirements changed. Given all the possible downsides, we looked for alternatives.

The next path we examined was writing a parser in C. First, it's hard to imagine a C-based parser not working, because it's been done soooo many times before. And it's hard to imagine a C-based parser being slow. There was still a risk that writing a parser in C would lead to a brittle codebase, and that opening Pandora's Box of manual memory management would lead to the odd core dump, memory leak or obscure bug. Nevertheless, we were confident that those risks, although real, were relatively minor.

Writing a parser from scratch in C would be tedious and cumbersome, but at least it filled the big gaping hole in the project plan. Even with all the unknowns, it was likely that the product would support this new query language.



Next, the project architect and I started looking at parser toolkits. Every few years, I peek into lex and yacc again, and I stop reading about the time it makes my eyes bleed. (I honestly can't tell the difference between LL, LR and LALR. It all seems, well, academic.) Instead of spending months evaluating the relative strengths and weaknesses of various tools, we quickly decided that a brute-force recursive descent parser would be the best bet. Yes, it would be a lot of work, but it was hard to imagine not getting something working.

As the deadline loomed, I was the one person on the team assigned to parsing this as-yet-undefined query language and translating these expressions into database queries.



Fortunately, this was early 2005, when Audrey Tang got tired of 4 1/2 years of stagnant development with Perl 6 and started the pugs project. Audrey's early progress was something to behold. Every other day or so, she posted that she managed to double the number of features in pugs, double the performance, or more likely, both. Her secret weapons? Haskell and Parsec.

Encouraged by Audrey's amazing and consistent progress, I installed ghc and started playing around with Haskell. I learned enough of the syntax to get a few programs past the type checker, and with a dog-eared copy of the Parsec paper on my desk, I managed to write a functioning roman numeral parser within about a week.

After a long discussion with the project architect, I tried parsing this query language with Haskell + Parsec. If it didn't look promising after a few experiments, there was still plenty of time to go back to the original plan.

Of course, you can guess the result. Haskell worked. The requirements changed wildly many times during development. Eventually they stabilized, and the bottleneck turned out to be waiting for requirements, not implementing them.



What's the secret? It turns out that Parsec is a toolkit for writing recursive descent parsers! Even better, the code reads like the specification for the grammar being parsed.

Here's an example, the standard textbook BNF(-ish) grammar for simple algebraic expressions:

expression ::= term (('+' | '-') term)*
term ::= factor (('*' | '/') factor)*
factor ::= '(' expression ')' | number
number ::= digit digit*
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Note how the grammar nests: parentheses are handled first, then multiplication and division, then addition and subtraction. It's the same order of operations you learned in algebra all those years ago.

Here is that same grammar translated into a parser in Haskell using Parsec:

import Text.ParserCombinators.Parsec

-- the skipMany calls mirror the (...)* repetition in the grammar
expression = do term
                skipMany (sequence [choice [string "+", string "-"], term])
                return ()

term = do factor
          skipMany (sequence [choice [string "*", string "/"], factor])
          return ""

open = string "("
close = string ")"

factor = do between open close expression <|> number
            return ""

number = do many1 digit
            return ()

Once you get comfortable with Haskell syntax and Parsec primitives, the code actually reads like a faithful translation of the grammar. The parser above is obviously an expression of the desired grammar because there's very little unnecessary fluff distracting you from the job at hand. No loops, factory classes, iterators, counter variables or pointer arithmetic. No little languages to run through a pre-processor. This code defines a parser and focuses on exactly that -- parsing. What a concept!

This particular program defines a parser for expressions that doesn't actually do anything, but it passes when the input adheres to the grammar and fails when the input deviates from it:

parseExpr s = case parse (sequence [expression, eof]) "" s of
                  Left _  -> "Fail"
                  Right _ -> "Pass"

parseExpr "3" -- Pass
parseExpr "3+4" -- Pass
parseExpr "3 + 4" -- Fail (no spaces allowed)
parseExpr "3+4+" -- Fail

(I'll write more about this simple little parser in a future post.)



Unfortunately, I can't go into the details of the query language I developed for this project. But I can say that the parser read like a simple translation of a specification and not like a sea of code. On multiple occasions, I received subtle requirements changes that inverted the meaning of a term in the query language. Once I understood the old behavior and the new requirement, updating the code seldom took more than a few minutes.

That's why Haskell is the language of choice for discriminating hackers.