Wednesday, October 3, 2007

Defanging the Multi-Core Werewolf

Many developers are have a nagging sense of fear about the upcoming “multi-core apocalypse”. Most of the software we write and use is written in imperative languages, which are fundamentally serial in nature. Parallelizing a serialized program is painful, and usually involves semaphores, locks, and comes with problems like deadlock, livelock, starvation and irreproducible bugs.

If we’re doomed to live in a future where nearly every machine uses a multi-core architecture, then either (a) we have a lot of work ahead of us to convert our serialized programs into parallelized programs, or (b) we’re going to be wasting a lot of CPU power when our programs only use a single processor element on a 4-, 8-, or even 64-core CPU.

At least that’s the received wisdom. I don’t buy it.

Yes, software that knows how to exploit parallelism will be different. I don’t know that it will be much harder to write, given decent tools. And I certainly don’t expect it to be the norm.

Here is some evidence that the multi-core monster is more of a dust bunny than a werewolf.

First, there’s an offhand remark that Rob Pike made during his Google Tech Talk on Newsqueak, a programming language he created 20 years ago at Bell Labs to explore concurrent programming. It’s based on Tony Hoare’s CSP, the same model used in Erlang. During the talk, Rob mentioned that the fundamental concurrency primitives are semaphores and locks, which are necessary when adding concurrency to an operating system, but horrible to deal with in application code. A concurrent application really needs a better set of primitives that hide these low level details. Newsqueak and Erlang improve upon these troublesome primitives by offering channels and mailboxes, which make most of the pain of concurrent programming go away.

Then, there’s Timothy Mattson of Intel, who says that there are just too many languages, libraries and environments available today for writing parallelized software. Timothy is a researcher in the field of parallel computing, and when someone with such a deep background in the field says the tools are too complicated, I’ll take his word for it. The good news is that very few programmers work on the kinds of embarrassingly parallel problems that require these tools. Working on parallel machines isn’t going to change that for us, either. In the future, shell scripts will continue to execute one statement at a time, on a single CPU, regardless of how many CPUs are available, with or without libraries like Pthreads, PVM or MPI. Parallel programmers are still in a world of hurt, but at least most of us will continue to be spared that pain.

Then there’s Kevin Farnham, who posted an idea of wrapping existing computationally intensive libraries with Intel’s Thread Building Blocks, and loading those wrapped libraries into Parrot. If all goes well and the stars are properly aligned, this would allow computationally intensive libraries to be used from languages like Perl/Python/Ruby/etc. without the need to port M libraries to N languages. (Tim O’Reilly thought it was an important enough meme that he drew attention to it on the O’Reilly Radar.)

This sounds like a hard problem, but adding Parrot to the equation feels like replacing one bad problem with five worse problems. If we’re going to live in a world where CPUs are cheap and parallelism is the norm, then we need to think in those terms. If we need Perl/Python/Ruby/etc. programs to interact with parallelized libraries written in C/C++/Fortran, where’s the harm in spawning another process? Let the two halves of the program communicate over some IPC mechanism (sockets, or perhaps HTTP + REST). That model is well known, well tested, well-understood, widely deployed and has been shipping for decades. Plus, it is at least as language-agnostic as Parrot hopes to become. (+2 points if the solution uses JSON instead of XML.)

Fourth, there’s Patrick Logan, who rightly points out the issue simply isn’t about a multi-core future, but also a multi-node future. Some applications will run in parallel on a single machine, others will run across multiple nodes on a network, and still others will be a hybrid of both approaches. Running programs across a network of nodes is done today, with tools like MapReduce, Hadoop and their kin.

If you have a grid of dual-core machines today, and need to plan out how to best use the network of 64-core machines you will have a decade from now, here’s a very simple migration plan for you: run 32x as many processes on each node!

With that said, here is my recipe for taming the multi-core dust bunny:

  • Determine what kind of parallelism makes sense for you: none, flyweight, fine-grained or coarse grained.
  • Avoid troublesome low-level concurrency primitives wherever possible.
  • Use tools like GHC’s Nested Data Parallelism for flyweight concurrency (one program, lots of data, spread out over multiple CPUs on a single machine).
  • Use tools like GHC’s Software Transactional Memory for lightweight concurrency (many interoperating processes managing shared data on a single machine).
  • Use tools like MapReduce and friends for heavyweight concurrency (work spread out across multiple cooperating processes, running on one or many machines).

As Timothy Mattson points out, parallel programming is fundamentally hard, and no one language, tool or environment is going to slay the dragon. I cite NDP here not as a perfect solution, but as a placeholder for a whole class of tools that exhibit of SIMD parallelism. Similarly, STM is a placeholder for a whole class of tools that exhibit MIMD parallelism. Sometimes you need one, sometimes you need the other, sometimes you need both, and sometimes you need neither.

And then there is the issue of virtualization. Perhaps the best use of a multi-core system isn't to use it as a single multiprocessing computer, but as a host for a series of virtual machines. Such a usage sidesteps all of the thorny issues around parallelism entirely, focusing instead on cost savings that accrue from server consolidation and simplified management. This is a very old idea that becomes more and more important as power efficiency in our data centers becomes a hot button issue.

Finally, there’s a looming question about what to do about the desktop. If your laptop has 32 cores, what do you do with them? The simple answer is nothing. As CPUs get faster, they spend more and more of their time in an idle state. The only thing that changes in a multi-core world is that more CPUs are idle. Desktop programmers can spend a lot of time evenly distributing that idleness across all CPUs, or make very few changes, and use only as many CPUs as necessary. Operating systems and basic tools (emulators, compilers, VMs, database engines, web servers, etc.) will need to be multi-core aware and distribute their work across as many CPUs as are available. Some of that work is already done -- make -j has been around for years. Processing intensive applications, like audio/video codecs, image manipulation and the like, will also need to be multi-core aware. The vast majority of the programs we write will continue to be mostly serial, and rarely care about parallelism.

After all, authenticating a user doesn’t get 32x faster on a 32-core machine.

2 comments:

Henry Miller said...

After all, authenticating a user doesn’t get 32x faster on a 32-core machine.

Even if it could, that is a bad idea. Authentication is a problem subject to brute force attacks. 15 years the comments for Unix login already said something to the effect of "We know this isn't the most efficant algorithm, but we want to slow down users trying to brute force passwords across the network, please do not optimize this code".

sn said...

Quite an eye-opener of an article, at least for me. The folks over at Dr Dobbs are close to having seizures over how to handle the coming multi-core revolution. I am new to this area myself. However, I don't know if the nested data parallelism is necessarily going to make ordinary programmers sleep better at night. That's b/c much of OO programming isn't about data parallelism but object interactions. This is where we can expect to see some serious problems in the future (see Lee's article on "What's wrong with threads"). In that light, I am curious to know what do you think of O'Haskell/Timber's approach to this problem? Methods in O'Haskell are guaranteed to never block thus doing away with deadlock and starvation problems. Every object executes in its own virtual thread, so you never need fear more cores, since there will always be more objects than cores.