A paper by Steve Muir at the Usenix Workshop on Real, Large Distributed Systems (WORLDS'04) caught my eye: The Seven Deadly Sins of Distributed Systems catalogues a number of real problems that arose during the development of PlanetLab. PlanetLab is a very large distributed computing environment (I think we're supposed to call them Grids these days) that allows large-scale distribution and virtualisation of applications. Muir's paper describes some very practical problems the PlanetLab team encountered, and what they did about them. For example:
There’s No Such Thing as “One-in-a-Million”
In a distributed system with hundreds of nodes running 24/7, even the most improbable events start to occur on a not-too-infrequent basis. It’s no longer acceptable to ignore corner cases that probably never occur—those corner cases will occur and will break your application.
Very salient stuff. I'm currently adding web-services capability to Nuin. I have some facilities for handling errors, but they're nowhere near as robust or complete as I would like. After reading Muir's paper, I think that I ought to worry about this more than I already do. Much more!
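To put rough numbers on the "one-in-a-million" point, here is a back-of-the-envelope sketch in Python. The figures are my own illustrative assumptions, not from Muir's paper: 500 nodes, 10 operations per node per second, and a failure that hits any single operation independently with probability one in a million. The function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def expected_rare_events(nodes, ops_per_node_per_sec, probability, seconds):
    """Expected number of times a rare event fires across the whole system.

    Assumes every operation is an independent trial with the same
    per-operation probability (a simplification, not a claim about PlanetLab).
    """
    trials = nodes * ops_per_node_per_sec * seconds
    return trials * probability


def prob_at_least_one(nodes, ops_per_node_per_sec, probability, seconds):
    """Chance that the rare event fires at least once in the given window."""
    trials = nodes * ops_per_node_per_sec * seconds
    return 1.0 - (1.0 - probability) ** trials


if __name__ == "__main__":
    # Hypothetical workload: 500 nodes, 10 operations/second each, one day.
    per_day = expected_rare_events(500, 10, 1e-6, SECONDS_PER_DAY)
    print(f"Expected one-in-a-million events per day: {per_day:.0f}")

    # Even a single minute of that workload is unlikely to pass cleanly.
    one_minute = prob_at_least_one(500, 10, 1e-6, 60)
    print(f"Chance of at least one event in a minute: {one_minute:.1%}")
```

Under those assumptions the system performs 432 million operations a day, so the "one-in-a-million" case turns up roughly 432 times daily. Real failures are rarely independent or evenly spread, so treat this as an order-of-magnitude illustration only.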