Self-Stabilization Bibliography: Access Guide

Ted Herman

Last revised: 27 December 2002

Abstract:

This document, a companion to the bibliography on self-stabilization, is intended as a high-level aid to searching the literature. It is not a survey of the field; rather, it classifies citations according to selected topics covered by papers included in the bibliography. For the HTML form of the Access Guide, we invite authors to supply URLs for home pages and pointers to PostScript versions of the papers; these URLs can then be integrated into the bibliography.

Introduction

Research related to self-stabilization spans areas of engineering, computer science, and the physical sciences. Searches of the scientific literature yield papers in chemical and electrical engineering, optics, and biology that mention self-stabilization either in the title or abstract, and the essential idea is similar across these disparate research fields. Methods from applied mathematics, especially robustly convergent approximation algorithms, also use ideas related to self-stabilization. In the computer science literature, the term typically has more specific connotations: research tends to be theoretical rather than implementation-based, and the setting is most often one of distributed or concurrent computing.

This document is an ``access guide'' to the bibliography on self-stabilization. It is intended to assist researchers in the field and readers interested in following a particular topic in depth. The sections that follow cite papers that are related by the applications they consider, the direction of the research, or historical connection. Work on self-stabilization investigates many questions about distributed algorithms, fault tolerance, and computational models, so the literature defies easy or definitive categorization: papers could be classified by problem domain, research goal, or even lineage in the literature. The only apology offered for the following classification is that one must begin somewhere!

Definitions

There is no universally agreed upon formal definition of self-stabilization. Some papers propose definitions tailored to specific applications and models, while other papers use less formal, high-level characterizations. The proliferation of definitions is likely to continue. The generally acknowledged initial definition for self-stabilization (``in spite of distributed control'') originates with [D74] [D73a] [D73b].

This initial definition is tied to specific protocols and an abstract machine architecture based on guarded assignments; it remained in force for many years following the initial publication (and also influenced the selection of research problems). Recent papers present more abstract definitions of self-stabilization [AG93] [G95b].

It is also interesting to observe that discussion of self-stabilization appears in the Systems Science literature [L72] prior to Dijkstra's first publication. (In fact, the paper [B62] appears ten years before [D73a]; it is likely that self-stabilization was discussed in the late 1950s, and we seek contributions to the bibliography that predate [B62].) Research on self-stabilization in the areas of Systems Science and Control Theory continues [S83] [BH93] [YG93] [WMP94].

The formal definition of self-stabilization can be state-based or based on behavior as represented by a trajectory or history of states [LL90] (see also [BGM93a], where the distinction between state-based and behavioral definitions is shown to be non-trivial). Alternatively, it is possible to characterize self-stabilization by a ``forgetful'' property [S93a]. Variants of self-stabilization have been proposed for control-theoretic applications [OWA91] and for situations where the initial state has a limited degree of arbitrariness [FMRT90]. A variant may also be needed to address self-stabilization of shared objects [LV92] [HPT95]. Hopfield nets and similar forms of neural networks may also have special definitions of the term [DWW86] [PD95].
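For concreteness, the following is a minimal sketch of one common state-based formulation; the notation is illustrative and not drawn from any particular paper in the bibliography. Let \Sigma be the set of system states, \to the step relation, and L \subseteq \Sigma the set of legitimate states. A system is self-stabilizing with respect to L if

    (closure)      \sigma \in L  and  \sigma \to \sigma'  imply  \sigma' \in L, and
    (convergence)  every execution, started from any state in \Sigma, contains a state in L.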


Advocacy, Surveys and Criticism

The most frequently cited paper advocating self-stabilization is Lamport's invited address [L84]. Advantages of self-stabilization are mentioned in a large number of papers contained in the bibliography; two recent promotions of the topic are [G95b] [D95a]. The surveys and introductory articles [S93c] [FDG94] [T95] [HW95] also advocate self-stabilization. Several authors take a critical view of self-stabilization [JT90] [BG95]; however, it should be noted that one aspect of self-stabilization, that a system ``eventually'' reaches a legitimate state or behavior, is seen with increasing frequency in the literature of distributed computing [CT91] (many recent papers could also be cited, but they are outside the scope of the bibliography).


Rings

The ring topology, useful and beloved in general for distributed computing and network protocols, is the starting point for many of the bibliography's entries. The citations in this section are not comprehensive; they are grouped here for lack of a better categorization. The example of a ring has been used to illustrate techniques for self-stabilization, to show systematic methodologies for devising self-stabilizing programs, and to discriminate between different models of computation and assumptions about asynchrony and failure modes.

Scalability is a fundamental theme of distributed systems and networking research, and one aspect of scalability is the space requirement of protocols. For instance, we generally prefer protocols with constant space over those whose memory requirements are proportional to the size of the network. The question of space requirements for self-stabilizing programs was among the earliest investigated and continues to be of interest. Following the initial paper [D74], the question of minimizing space was taken up in [HK85] and subsequently in [GH96] [FD94] [BD95]. Is uniform deterministic self-stabilizing mutual exclusion possible in rings? What about other tasks such as leader election, ring orientation, and identity assignment? Such issues are topics in [BP89] [LG91] [H93] [IJ93] [WH94a] [WH94b] [H94] [H98a] [NMT94] [TH95] [HR96] [MOY96] [IJ90a] [BDK96] [UKY98] [BGJ99].

In some cases, probabilistic algorithms provide uniform solutions to these tasks [IJ90b] [H90] [H92b] [BD94] [BCD95] [ILS95] [KY95] [KY97] [YB95], with other factors such as knowledge of the ring size, the availability of unbounded variables, primality of the ring size, and various atomicity assumptions [B87] playing a role in the algorithm design. Dijkstra's initial paper associated a ``privilege'' with the enabledness of a guarded command, but the usual interpretation of mutual exclusion motivates the notion of a token (influenced by token ring communication protocols); this was first observed in [BGW87] [BGW89]. Other problems considered in the context of ring architectures include load balancing [KB87] (see [FD93a] [AG95] [GP99] for more work on load balancing), the dining philosophers problem [HP92], and multiple token systems [FDS94].
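To make the notion of privileges concrete, the following is a small simulation sketch of Dijkstra's K-state protocol from [D74] under a central-daemon interpretation; the Python rendering and its names are illustrative only. Machine 0 is the distinguished ``bottom'' machine, each machine holds a counter modulo K, a machine is privileged exactly when its guard is enabled, and taking K no smaller than the ring size guarantees convergence to a single circulating privilege.

    import random

    def privileged(x, i):
        """Machine i's guard is enabled (it holds a privilege/token)."""
        n = len(x)
        if i == 0:
            return x[0] == x[n - 1]        # bottom machine
        return x[i] != x[i - 1]            # all other machines

    def fire(x, i, K):
        """Execute machine i's guarded assignment."""
        if i == 0:
            x[0] = (x[0] + 1) % K
        else:
            x[i] = x[i - 1]

    def run(n=5, K=7, steps=30):
        x = [random.randrange(K) for _ in range(n)]   # arbitrary initial state
        for _ in range(steps):
            holders = [i for i in range(n) if privileged(x, i)]
            print(x, "privileged:", holders)
            fire(x, random.choice(holders), K)        # central daemon picks one

    run()

After stabilization, exactly one machine is privileged at any time, so the privilege behaves as a token circulating around the ring.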

The ring topology is enriched by additional communication links in [G91a] to reduce the state space and in [BB95] to tolerate faulty nodes. The paper [GE90] presents a protocol with parameters for token and stabilization performance. In [LS92a] it is shown that an observer outside the ring can deduce legitimacy of the system state faster than the worst-case stabilization time.


Trees and Graphs

The paper [D74] presents, in addition to the ring topology, a protocol for a chain of machines (and a footnote suggests generalizations of the ring structure). In [K79] the chain topology is generalized to a tree in which token activity resembles the distributed computation found in termination detection algorithms (such as the diffusing computation paradigm). More recently, a number of papers show how various graph-theoretic problems have self-stabilizing solutions in tree networks [KPBG94] [GGKP95] [VAG95] [BDN95] [DGT95] [AS96] [AS97b] [A99] [BDPV99].

The problem of mutual exclusion on a general network was first considered in [T81]; subsequent papers on the problem are [IJ90b] (with analysis in [TW91]) and [AHOP94], with depth-first ordering provided by [HC93] [JB95] [JABD97]; breadth-first ordering is considered in [J97]; a transactional approach to the problem is presented in [MNK96].

Spanning tree construction in a general network is important for methodological reasons (see Section 11), and many self-stabilizing solutions to this problem have been published [AG94] [AKY90] [CYH91] [SS92] [AK93] [DIM93] [AS95a] [AS95b] [GGP96] [AB97] [AS97a] [AB98] [AS98]. Specialized forms of trees (depth-first, breadth-first, maximum-flow, heaps) can also be constructed with self-stabilizing algorithms [HC92] [SS92] [CD94] [GS94] [BK95] [BLB95]. Other graph-theoretic problems such as matching, coloring, shortest-path, and network flow also have self-stabilizing algorithms [HH92] [CS93a] [CS93b] [FD93b] [GK93a] [SS93] [GS94] [SRS94] [TH94] [BDDN95] [GGP95] [GS95] [BV97] [C99b] [C99c] [C99d] [KA99] [KC99]. Such problems may not have self-stabilizing solutions if processors are anonymous or memory is limited [SRR95] [DGS96]. The amount of memory used by a self-stabilizing program can be considered not only in terms of the worst case during execution but also in terms of the memory needed once the state is legitimate -- and the memory requirements at a legitimate state can be lower than those consumed during convergence to a legitimate state [AEH92]. Reducing the memory requirement at a legitimate state can also reduce the overhead for a system that continuously checks whether or not its current state is legitimate [MOOY92] [PY94].
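To convey the flavor of such constructions, the following is a hedged sketch in the spirit of the breadth-first tree constructions cited above; the sequential simulation and its names are illustrative and not taken from any cited paper. A distinguished root fixes its level at 0, and every other node repeatedly adopts a level one greater than the smallest level among its neighbors, taking such a neighbor as its parent. From arbitrary initial levels, the levels converge to breadth-first distances and the parent pointers form a spanning tree rooted at the distinguished node.

    def stabilize_bfs_tree(adj, root, level, sweeps=100):
        """adj: dict node -> list of neighbours; level: arbitrary initial levels."""
        parent = {v: None for v in adj}
        for _ in range(sweeps):
            changed = False
            for v in adj:
                if v == root:
                    new_level, new_parent = 0, None
                else:
                    u = min(adj[v], key=lambda w: level[w])   # a closest neighbour
                    new_level, new_parent = level[u] + 1, u
                if (new_level, new_parent) != (level[v], parent[v]):
                    level[v], parent[v] = new_level, new_parent
                    changed = True
            if not changed:                                   # legitimate state reached
                break
        return level, parent

    # Example: a 4-cycle with corrupted initial levels.
    adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    print(stabilize_bfs_tree(adj, root=0, level={0: 9, 1: 4, 2: 0, 3: 7}))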

Synchronization and Clocks

The problems of clock synchronization and phase synchronization are natural to consider for research on self-stabilization, since synchronization is a fundamental tool for system implementations. Self-stabilizing synchronization has been the focus of much recent research [GH90] [LZM90] [ADG91] [CFG92] [AKMPV93] [DW93a] [C94] [CS94] [PT94] [DW95] [HG95] [LS95] [ER95] [PT97] [HL98] [HL99a] [HL99b] [C99a].
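To give one concrete flavor of the area, here is a hedged sketch of a simple ``unison'' rule; the synchronous model, the unbounded clocks, and all names are assumptions of this illustration and are not taken from any cited paper. In every round each node sets its clock to one plus the minimum clock in its closed neighborhood; from arbitrary initial clocks, all clocks become equal within a number of rounds bounded by the network diameter and thereafter advance in lockstep.

    def unison_round(adj, clock):
        """One synchronous round; adj maps node -> list of neighbours."""
        return {v: 1 + min([clock[v]] + [clock[u] for u in adj[v]]) for v in adj}

    # Example: a 4-node path with arbitrary (corrupted) initial clocks.
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    clock = {0: 7, 1: 0, 2: 13, 3: 2}
    for _ in range(5):
        clock = unison_round(adj, clock)
        print(clock)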

Fault Tolerance

Dijkstra's initial work on self-stabilization [D74] mentions no application to fault tolerance. Yet self-stabilization is most often seen as a technique for fault tolerance in the area of distributed computing. As such, there has been considerable effort to incorporate the fault-handling provided by self-stabilization into the larger picture of fault tolerance. The first paper to do this (implicitly) is [D73a], and this work is furthered in [BYC88] [BYZ89] [YBL91] [BY93].

Mixed forms of tolerance (for instance Byzantine failures, crashes, and transient faults) are considered in [ZB92] [BB95] [DH95a] for specific cases; the first general theory for combined forms of tolerance is presented in [AG93], with a different approach given in [GP93].

Some specialization to the task of stabilizing failure detection is studied in [LG94a] [LG94b]. In a strictly asynchronous model, a silent processor crash may exclude the possibility of self-stabilizing algorithms for many tasks [AH93] [M95] [BDK96] [BK97a] [BK97c]. A more recent trend in research is the focus on fault-containment combined with self-stabilization [GGHP96] [GG96] [KS97] [UKMF97] [GP97].

Communication Protocols

Self-stabilization has its most obvious application in the network protocol area, since communication protocols are especially tolerant of temporarily faulty results (retrying is par for the course). There is some overlap between communication protocol research and graph-theoretic algorithms due to the natural interpretation of vertices as network nodes and edges as network links. Some of the research on self-stabilizing communication protocols considers classic problems of window control, session control, message sequencing, connection setup and sequencing, and routing [BP89a] [SG89] [S90] [AGH90] [H92a] [AB93a] [V93] [D93] [D94] [ACDKW94] [DTOF94] [APSV94] [ADW95] [APV96] [CV96] [DW97a] [CW97] [AAGMRS97] [CV99]. The paper [DIM97a] presents a communication protocol for leader election; in [DPW96] mobile processes are considered, and [ACDKW94] [ACDKW96] specialize self-stabilizing protocols to high-speed networks. Some interesting results relating to the construction (or possibility) of self-stabilizing communication protocols are [JV96] [PM96]. Work on the ring topology (see Section 4) also has clear application to LAN protocols. Higher-layer network problems, such as group membership [AC93], have also been considered.

Models

A frequent line of research is to experiment with different models of computation or types of resources within a model to determine whether or not self-stabilization is possible; this exploration also depends on the task to be self-stabilized. Starting from the basic model proposed in [D74], one of the early (and continuing) questions is how atomicity and parallel execution of assignments affect the possibilities for self-stabilization [B87] [BGM89] [HWT94]. Varying the granularity of operations has an effect [DIM93], and one may also consider synchronous execution models [H90]. A number of other questions (uniformity, ring size) are mentioned in Sections 4 and 5. Self-stabilizing algorithms and protocols can also be specified in models quite different from that of [D74]. Petri nets [T89] [G91b] [CHR95] [PM96], conventional shared memory systems [L86] [LH91] [ADHK97], wait-free objects [HPT95], cellular automata [GKL78] [G95a], iteration systems [G82] [AAEG93], rule-based or constraint systems [BEGMMR90] [CDK91] [C92] [AGV94], neural networks [DWW86], genetic algorithms [DO93], and even digital [AG92] or continuous circuits [H95] are represented in the bibliography. The paper [BDT99] stabilizes a protocol under a cut-through constraint on message passing at the nodes. Even database transactions have been viewed in terms of stabilization to consistency [ED98].


Proof Techniques

It is challenging, at least for a newcomer to the field, to verify that a program has desired properties when nothing can be assumed about the initial state in an execution -- and this verification is necessary to prove a self-stabilizing program correct. A direct verification for a program in [D74] is given in [D86], and it is interesting to compare this proof to a temporal logic verification presented in [TW86]. Temporal logic is also proposed as a verification technique in [LS97].

Self-stabilization proofs have been carried out in the UNITY formalism [R89]. Instead of a direct proof that a self-stabilizing program has the required property of convergence to legitimacy from an arbitrary initial state, the proof can be broken into stages using the ``staircase'' method proposed in [GM91]. Knuth [HK85] observes that, rather than proving convergence directly, one may show that no divergent behavior (an execution that forever avoids legitimate states) is possible, and thereby establish convergence. A more traditional approach (following standard convergence arguments for sequential program loops) is to use a variant function, as observed in [K88] and used since in many papers. For randomized self-stabilization, other techniques have been developed [DIM95] [HR96].
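To spell out the variant-function argument in its simplest form (the notation here is an illustrative sketch, not taken from any particular cited paper): let \Sigma be the state space, L \subseteq \Sigma the set of legitimate states, and \to the step relation.

    If there is a function V : \Sigma \to \mathbb{N} such that
      (i)  \sigma \notin L  and  \sigma \to \sigma'  imply  V(\sigma') < V(\sigma), and
      (ii) no execution halts in a state outside L,
    then every execution from an initial state \sigma_0 reaches L within at most V(\sigma_0) steps.

Closure of L under the step relation must then be argued separately.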

Another approach to verification is to attack the problem at an earlier stage in program development. Given the specification of a problem and some set of valid composition and refinement rules, the specification can be stepwise refined until a program, or a specification suitably close to a program, is obtained; in this way, the result's validity follows from the validity of the refinement steps. For instance, a specification of a self-stabilizing program is refined to obtain a UNITY program in [LS92b] [ATT95] [LS93a] [LS93b].

Principles underlying various kinds of program transformation steps and program composition rules are given in [GH91] [S93b] [SS94] [A94] [AP95] [V97] [DH99]. The dissertation [P95] (see also [P97]) demonstrates how mechanical theorem proving technology (using HOL) assists in the verification of self-stabilizing algorithms. Other experience with mechanical aids is reported in [SRS97] [KRS99].


General Methods

As noted in Section 10, it can be challenging to prove a self-stabilizing program correct, and this reflects the fact that it is difficult to devise self-stabilizing protocols. Having observed this, a number of researchers looked into the possibility of transforming a program that is not self-stabilizing into one that is. This can be done by superimposing new modules, modifying the original program, or using a type of compiler to produce a self-stabilizing implementation of the program. The papers [AG94] [AKY90] [AKY97] [AV91] [APV91] [KP93a] [AO94] [APVD94] [VAG95] address the general problem of adding self-stabilization to a system or program, or contribute tools for this purpose. One may also consider transforming self-stabilizing programs specified for one model so that they work on another model [GHR90] [DIM97b] [MK96] [KMN97] [GH97] [GH99b] [AG97] [MN98], which is helpful if the former model is simpler to program than the latter. Transformations of self-stabilizing programs are also useful for adding properties beyond self-stabilization [DH95a] [DH97] [AK95] [KA97] [GGHP96] [AD97a] [AD97b] [KC98] or for accelerating convergence in the case of more likely faults (see also [BGK98] [BGK99] [KS97] [KP98] [KP99]). If we recall the characterization of self-stabilization as a ``forgetful'' property [S93a], then the general principle behind many self-stabilizing programs is that they continually ``flush'' older state information; this idea appears more formally in [V94a] [CV96].


Theoretical Papers

Since very few implementations of self-stabilizing algorithms are cited in the literature, it may be unfair to single out specific papers as being ``theoretical'' or more theoretical than others. We mention here papers concerned with Turing machines, computability results, taxonomy, and fundamental theoretic models: [GHR90] [AD94] [BD93] [AD97c] [IL94] [S95].

Frequently Asked Questions

We list here some of the questions most frequently asked by researchers outside the field of self-stabilization, but unlike a traditional ``FAQ'' we offer no answers.

New Directions

What, beyond the continuation of research classified in previous sections, is likely to appear in future versions of this bibliography? We hope to see growing connections between self-stabilization and other areas of fault tolerance. As noted in Section 3, there appears to be growing appreciation of programs and services that make ``eventuality'' guarantees [CT91] [FGLLS96] (see also [BDDT98]); such work may offer new applications of self-stabilization beyond the currently recognized potential application domains (communication protocols and load balancing). The paper [AK95] shows how self-stabilization can be useful even in fault-tolerant systems with more stringent requirements than what stabilization itself provides. In addition to self-stabilization, there are other ways of dealing with transient faults or with initial states that are somewhat (but not completely) arbitrary (see [FMRT90] [AGMT95] [TKM89] [P91] for example). Indeed, weakened forms of self-stabilization may be more attractive for many applications than pure self-stabilization.