Wednesday, January 10, 2007

My Journey to Haskell

I started learning Haskell about two years ago, when I was working on a high visibility project that simply had to ship. My subproject was the linchpin of a significant multi-year effort to bring a big and important product to the web. Dozens of people spent years designing, refining, and building this application. Yet if my small contribution, written in Haskell, didn't work, then the entire effort would have been worthless.

If that wasn't enough pressure, this was my first project in Haskell. I was working alone, learning the language as I went, with no one to help me.

But before I talk about how Haskell saved the day, I need to talk about this application and the overall platform to set the story in its proper context.

As I mentioned, this project was a web application. It built on an internal platform of Perl, Tcl, C, SQL, and Javascript, and used copious amounts of HTML, XML and CSS for good measure. The platform was about five years old at the time, and it powered multiple live products. The platform did a good job satisfying its design criteria: supporting a family of products that let paying customers search and browse databases full of complex documents.

By this time, upper management realized they had invested five years building this platform, and now it was time to start seeing some returns. All of the building blocks were ready, so shouldn't be too difficult to add new content, make some tweaks, launch a new product, and watch the revenue pour in. Nothing out of the ordinary here.

Of course, it's never that simple. The new product stretched the platform in every conceivable direction: more documents, more kinds of documents, richer documents, and more ways to search, browse and view those documents. For the most part, each of these changes were quantifiable. Lots of work, but easy enough to put into a project plan.

All except for one small, insignificant piece: the query language.

Each product on this platform exposed a simplified full text query language to search within that product. These query languages may have varied slightly from product to product, but they were all more complex than simply stringing words together to make a search phrase, and less complex than, say, SQL. Users could use this language to construct a a query, or they could use a query builder to formulate a query more easily. Then, the browser sends a query back to the application on the web server where it gets translated into something the database understands. The translated query goes to the database, the database returns some results, and the results get formatted and sent back to the browser.

This setup is very common. People started building systems like this decades ago. The only difference between applications such as these and your average web application is that the query language is exposed as part of the user interface. Searching the database is an key part of the system, not something hidden behind a bunch of point-and-click forms.

The new product under development differed from others on the platform in a few important ways. Documents were larger, there were more of them, and they had more precise markup. Searching these documents demanded a query language would allow users to take advantage of the added markup. Without a rich query language, searches would be so imprecise that they (and by extension, the entire product) would be worthless.

The problem was that none of the existing query languages on this platform sufficed. Furthermore, none of the code to parse the existing query language(s) sufficed. And it wasn't really understood if this brand new, rich query language could be supported.

Now, I want to be clear that a lot of work went into this product, which did ship (more or less) on time. The amount of work the entire team did to build the user interface, to extend the database, and augment the core platform was substantial. The point I'm making is that all of that work was quantifiable. Even if the estimates were off by a factor of 2 or 4, it was possible to list all the tasks that needed to be done.

In contrast, there was no way to estimate the effort to write a program that translated the new rich query syntax into the database's query syntax. To make matters worse, the requirements for this language were quite fuzzy in some places and non-existent in others. For example, the specifications gave a few example queries, some desired results, but no grammar. When requirements were available, they were often contradictory.

The project architect was deeply concerned about this. Every other unknown on the project could be handled with a simple matter of programming. Everything except this brand new parser that translated an under-specified query language into database queries.

This was a big gaping hole in the project plan, and it threatened to derail the entire project.

As the deadline loomed, the architect and I started discussing how to write this query translator. The application code running on the web server was mostly Tcl, but writing a parser in Tcl seemed foolish. We could try, but we there was a big risk that (a) it might not work, (b) it would work, but it would be slow, or (c) it would work but be difficult to manage as requirements changed. Given all the possible downsides, we looked for alternatives.

The next path we examined was writing a parser in C. First, it's hard imagining a C-based parser not working, because it's been done soooo many times before. And it's hard imagining a C-based parser being slow. There was still a risk that writing a parser in C would lead to a brittle codebase, and opening Pandora's Box of manual memory management would lead to the odd core dump, memory leak or obscure bug. Nevertheless, we were confident that those risks, although real, were relatively minor.

Writing a parser from scratch in C would be tedious and cumbersome, but at least it filled the big gaping hole in the project plan. Even with all the unknowns, it was likely that the product would support this new query language.

Next, the project architect and I started looking at parser toolkits. Every few years, I peek into lex and yacc again, and I stop reading about the time it makes my eyes bleed. (I honestly can't tell the difference between LL, LR and LALR. It all seems, well, academic.) Instead of spending months evaluating the relative strengths and weaknesses of various tools, we quickly decided that brute-force recursive descent parser would be the best bet. Yes, it would be a lot of work, but it was hard imagining it not getting something working.

As the deadline loomed, I was the one person on the team assigned to parsing this as-yet-undefined query language and translating these expressions into database queries.

Fortunately, this was early 2005, when Audrey Tang got tired of 4 1/2 years of stagnant development with Perl 6 and started the pugs project. Audrey's early progress was something to behold. Every other day or so, she posted that she managed to double the number of features in pugs, double the performance, or more likely, both. Her secret weapons? Haskell and Parsec.

Encouraged by Audrey's amazing and consistent progress, I installed ghc and started playing around with Haskell. I learned enough of the syntax to get a few programs past the type checker, and with a dog-eared copy of the Parsec paper on my desk, I managed to write a functioning roman numeral parser within about a week.

After a long discussion with the project architect, I tried parsing this query language with Haskell + Parsec. If it didn't look promising after a few experiments, there was still plenty of time to go back to the original plan.

Of course, you can guess the result. Haskell worked. The requirements changed wildly many times during development. Eventually they stabilized, and the bottleneck turned out to be waiting for requirements, not implementing them.

What's the secret? It turns out that Parsec is a toolkit for writing recursive descent parsers! Even better, the code reads like the specification for the grammar being parsed.

Here's an example, the standard textbook BNF(-ish) grammar for simple algebraic expressions:

expression ::= term (('+' | '-') term)*
term ::= factor (('*' | '/') factor)*
factor ::= '(' expression ')' | number
number ::= digit digit*
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Note that the way terms parse, parentheses are handled first, then multiplication and division, then addition and subtraction. It's the same order of operations you learned in Algebra all those years ago.

Here is that same grammar translated into a parser in Haskell using Parsec:

import Text.ParserCombinators.Parsec

expression = do term
optional (sequence [choice [string "+", string "-"],
return ()

term = do factor
optional (sequence [choice [string "*", string "/"],
return ""

open = string "("
close = string ")"

factor = do (between open close expression
<|> number)
return ""

number = do many1 digit
return ()

Once you get comfortable with Haskell syntax and Parsec primitives, the code actually reads like a faithful translation of the grammar. The parser above is obviously an expression of the desired grammar because there's not much unnecessary fluff distracting you away from the job at hand. No loops, factory classes, iterators, counter variables or pointer arithmetic. No little languages to run through a pre-processor. This code defines a parser and focuses on the job at hand -- parsing. What a concept!

This particular program defines a parser for expressions that doesn't actually do anything, but it does pass when the input adheres to the grammar, and fails when input deviates from the grammar:

parseExpr s = case (parse (sequence [expression, eof]) "" s) of
Left _ -> "Fail"
Right _ -> "Pass"

parseExpr "3" -- Pass
parseExpr "3+4" -- Pass
parseExpr "3 + 4" -- Fail (no spaces allowed)
parseExpr "3+4+" -- Fail

(I'll write more about this simple little parser in a future post.)

Unfortunately, I can't go into the details of the query language I developed for this project. But I can say that the parser read like simple translation of a specification and not like a sea of code. On multiple occasions, I received subtle requirements changes that inverted the meaning of a term in the query language. Once I understood the old behavior and the new requirement, updating the code seldom took more than a few minutes.

That's why Haskell is the language of choice for discriminating hackers.

No comments: