Computer, heal thyself
Why should humans have to do all the work? It's high time machines learned how to take care of themselves.
By Sam Williams
July 12, 2004 | In his 1992 book "To Engineer Is Human: The Role of Failure in Successful Design," Duke civil engineering professor Henry Petroski tosses out a little-known statistic from the history of bridge design: During the latter half of the 19th century, a period that introduced the locomotive train to most corners of the industrial world, roughly a quarter of all iron truss bridges failed.
The simplified reason: Bridge designers, unused to iron as a structural material and railroad trains as a service load, had yet to grasp the full impact of a minor miscalculation anywhere within their plans. It wasn't until designers started introducing a conservative fudge factor, now known as the factor of safety, that bridge designs developed enough redundancy and robustness to account for the occasional errant crossbeam or overloaded rail car.
"Basically civil engineers made bridges safe by recognizing that humans would be involved in every step of the bridge-building process," says David Patterson, a Berkeley computer science professor who has cited Petroski's statistic in numerous papers. "With human involvement comes the risk of human failure."
For Patterson, the iron-truss story is more than just a quick attention grabber; it's a hint that today's software programmers, oft derided for their failure to deliver bug-free code, have yet to grasp the full weight of their own discipline.
Coauthor of the landmark 1987 paper that laid out the low-cost storage strategy now known as RAID (the acronym stands for "redundant array of inexpensive disks"), Patterson has long been a proponent of hardware architectures that treat component failure as a given yet still find a way to get the job done. Since 2002, he's been putting forward the same strategy in the realm of software systems, banding together with Stanford counterpart Armando Fox, head of that university's Software Infrastructures Group, to launch the Recovery-Oriented Computing project.
In a June 2003 article for Scientific American, Fox and Patterson cited Petroski's observation and laid out their own project's philosophy and goals. "As digital systems have grown in complexity, their operation has become brittle and unreliable," they wrote. "Rather than trying to eliminate computer crashes -- probably an impossible task -- our team concentrates on designing systems that recover rapidly when mishaps do occur."
While somewhat fatalistic on the surface, treating failure as inevitable just might be the key to pushing software development out of its current malaise. From Berkeley to MIT and points in between, software engineers are buzzing over the prospect of "autonomic computing" -- systems built to recognize and recover from their own flaws without tying down a human administrator in the process. Such systems remain a few years over the current commercial horizon, of course, but the sense of collective mission, something akin to the mammoth World War II science projects that spawned computer science in the first place, is growing.
