Self-Stabilizing Distributed Systems

October 2 - 7, 2004

Organizers: Lisa Higham (University of Calgary), Anish Arora (Ohio State University), Faith Fich (University of Toronto), Maurice Herlihy (Brown University), Ted Herman (University of Iowa).

Objectives

Fundamental synchronization and coordination primitives lie at the heart of distributed computer systems. These systems rely on algorithms for synchronization and coordination problems such as mutual exclusion, dining philosophers, leader election, token-circulation and identifier assignment to manage the use of resources and to control communication. However, as the number of components in these systems grows the likelihood of some component failure also increases, causing the traditional solutions to these problems to fail. Hence, algorithm designers seek to make systems more reliable by building fault-tolerance into their distributed protocols.

Self-stabilization is a particularly robust and appealing model of fault-tolerance for distributed computation. A distributed system is self-stabilizing if, when started from an arbitrary initial configuration, it is guaranteed to reach a legitimate configuration as execution progresses, and once a legitimate configuration is achieved, all subsequent configurations remain legitimate. Thus a self-stabilizing system will recover from a burst of transient faults (moving the system to some arbitrary configuration) without the need for manual intervention, as long as no further faults occur. The definition of self-stabilizing systems implies that they need not be initialized. This is an additional advantage particularly for systems that are physically widely dispersed. Furthermore, frequently (but not always) the techniques used to make the system converge apply even when the system can change dynamically. In this case systems need not be reset when a processing node or communication channel is added, reconfigured or removed.

The possibility of self-stabilizing distributed computation was first pointed out and explored by Edsger Dijkstra in a paper in 1974 when he asked the question: would it be possible for a set of machines to stabilize their collective behavior in spite of unpredictable initial conditions and distributed control? This brief paper went largely unnoticed until Leslie Lamport's invited talk at PODC (Principles of Distributed Computing) 1983, where he said: ``I regard this as Dijkstra's most brilliant work --- at least, his most brilliant published paper ... a milestone in work on fault tolerance ... I regard self-stabilization to be a very important concept in fault tolerance, and to be a very fertile field for research.'' Subsequently, there has been a concerted research effort producing innovative theoretical results that can be used in practice. Some of the tasks that have been discussed in these result includes topology update, clock synchronization, multiprocessor synchronization, network flow control, leader election, and many different graph algorithms.

The main challenge associated with self-stabilization is complexity. It is difficult to design and prove correct protocols for asynchronous distributed systems. It is even more challenging to provide self-stabilizing solutions. One direction has been to rely on familiar paradigms of distributed programming --- mutual exclusion, leader election, logical time, snapshots, wave algorithms, and broadcast --- all of which now have stabilizing counterparts that can be combined for application to system design. Another research direction explores complexity in terms of resource requirements, including memory needed to satisfy particular task requirements, time needed for stabilization, and size requirements on messages. A third research direction attacks complexity by automating system design, and efforts along this line include composition methods, program transformers, program refinement, and systematic proof techniques for self-stabilizing programs.

Considerable progress has been made along all three of directions discussed above. However, there is still much that is not well understood, the set of known techniques for design of self-stabilizing algorithms is limited, and there are few practical applications of self-stabilization to existing systems. The BIRS workshop would focus on progress in these three directions.

A more formal and unified theory for self-stabilization would contribute to a deeper general understanding of the major issues. The existing self-stabilizing algorithms are designed for a plethora of models. Both shared-memory and message-passing models have been considered. Some shared-memory solutions assume that a processor can atomically read the entire state of all its neighbours and update its own state in one un-interruptible step. Others assume that an atomic step consists of just one read or one write to one shared variable. Design is easier under the first assumption but the second is more realistic. Some solutions assume that the computation proceeds under the control of a centralized scheduler, which selects exactly one processor to execute its next action at each step; others, more realistically, assume the scheduler can select any non-empty subset of processors to execute in each step. In either case, the scheduler may be constrained by any one of several possible fairness assumptions. Computation may be synchronous or asynchronous; processors may have distinct identifiers or may be indistinguishable; the system configuration may be static or subject to dynamic changes; protocols may be deterministic or may exploit randomization. A self-stabilizing solution to a specific problem may exist under one set of assumption but not under another. Some work has addressed the relationships between some of these models. Also, there are now some results that compile a solution that is correct assuming a strong model to a solution for a weaker model that makes fewer assumptions about the underlying system. But the complete picture of how all of these model properties are related for self-stabilizing systems is far from complete and frequently imprecise.

The workshop would first focus on constructing a framework capable of capturing the various assumptions and using the framework to precisely define the various models. Work would proceed to investigate the relationships between models. The goal is, for a set of model assumptions A and B, to develop a general transformation that converts an arbitrary self-stabilizing algorithm for strong model A into a self-stabilizing algorithm for the corresponding problem for weak model B, or to prove that no such transformation exits. Proving impossibility or lower bounds on the costs of desired transformations will serve to highlight the essential differences between models.

A surprising proportion of the existing self-stabilizing algorithms rely on a small collection of ad hoc techniques or small variants of these techniques. Recently, however, there has been some initial work that attempts to connect self-stabilization to other well-established research areas. The hope is that we can exploit the long history of research and the more highly developed insights in these domains to develop more sophisticated techniques for self-stabilization. Control theory appears to be one possibility. So far, however, the only progress has been to recast some existing algorithms into the framework of control theory and use this reformulation to produce a new proof of correctness. The acid test is to find self-stabilizing solutions to unsolved problems with control theory techniques. As a second example, self-stabilization is closely related to attractors. One paper has explored the relationship, but the connection has yet to been exploited for gain. Randomization is a tool that is now used widely in distributed systems. Typically, when a system has some costly situations, which a deterministic algorithm may not be able avoid, randomization is used to make these bad cases highly unlikely. Thus the expected performance can be quantified and controlled. Randomization has been used much less by the self-stabilization community, perhaps because its adds another level of difficulty onto an already highly complex situation. We would seek participation at BIRS from experts in all these research domains as an efficient way to get up to speed in these potentially valuable related research areas.

Finally, there has not yet been enough progress in moving the existing self-stabilization research into practice. Now, however, interest in robust systems that have built-in fault tolerance is high. So new activity is emerging that seeks to add fault-tolerance to several applications. Particularly obvious targets include embedded systems and Internet applications. To profitably apply self-stabilization to these areas, we need to combine self-stabilization with other practical goals. For example, it may be necessary to provide conditional safety properties during the period of convergence before a system stabilizes. It may be desirable to relate the cost of convergence to the scope of transient faults, by showing how self-stabilizing systems can confine repair to greater locality in time (faster repair) and space (fewer processes) for initial system states that represent only a "small" fault. In cases where self-stabilization is too costly, there may be practical ways to weaken the self-stabilization requirement to a realistic and achievable compromise. In other cases, it may be necessary to combine self-stabilization with other notions of fault-tolerance for even more robustness. Our list of proposed attendees contains a good balance between researchers with theoretical expertise and expert practitioners to ensure the best climate for progress toward this final goal.