Self-Stabilization Bibliography: Access Guide
Ted Herman
Last revised: 27 December 2002
Abstract:
This document, a companion to the bibliography
on self-stabilization, is intended as a high-level
aid to searching the literature.
Not intended to be a survey of the field,
the document classifies citations according
to selected topics covered
by papers included in the bibliography.
For the HTML form of the Access Guide, we
invite authors to supply URL addresses
for home pages and pointers to PostScript
versions of the papers; these URLs
can then be integrated into the bibliography.
Introduction
Research related to self-stabilization spans
areas of engineering, computer science, and physical
sciences. Searches of scientific literature yield
papers in chemical and electrical engineering, optics,
and biology that mention self-stabilization either
in the title or abstract.
The essential idea of
self-stabilization is similar in
these disparate research fields.
Methods from applied mathematics,
especially robustly convergent approximation algorithms,
use ideas related to self-stabilization.
In computer science literature, the term typically has
specific connotations: research tends to be more
theoretical than implementation-based;
and the setting is most often one of distributed or
concurrent computing.
This document is an ``access guide'' to the bibliography
on self-stabilization. It is
intended to assist researchers in the field and readers
interested in following a particular topic in depth.
The sections that follow
cite papers that are related by the applications they
consider, the direction of the research, or historical
relation.
Work on self-stabilization is complex and investigates
many questions about distributed algorithms, fault tolerance,
and computational models. The literature therefore defies
easy or definitive categorization. Papers could be classified
by problem domains, research goals, or even lineage in the
literature. The only apology offered for the following
classification is that one must begin somewhere!
Definitions
There is no universally agreed upon formal definition
of self-stabilization. Some papers propose definitions tailored
to specific applications and models, while other papers
use less formal, high-level characterizations. The
proliferation of definitions is likely to continue.
The generally acknowledged initial definition for
self-stabilization (``in spite of distributed control'')
originates with
[D74]
[D73a]
[D73b].
This initial definition is tied to specific protocols
and an abstract machine architecture based on guarded
assignments; it remained in force
for many years following the original publication (also
influencing the selection of research problems).
Recent papers present more abstract definitions of
self-stabilization
[AG93]
[G95b].
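Though the formal definitions vary, many of the abstract formulations share a two-part shape; the following is an informal sketch in our own notation, not a quotation of any cited paper:

```latex
% Self-stabilization with respect to a set L of legitimate states,
% sketched as the usual closure/convergence pair:
\begin{align*}
\textbf{closure:}\quad
  & s \in L \;\wedge\; s \to s' \;\Longrightarrow\; s' \in L \\
\textbf{convergence:}\quad
  & \text{every maximal execution } s_0 \to s_1 \to \cdots
    \text{ contains some } s_i \in L
\end{align*}
```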
It is also interesting to observe that discussion of
self-stabilization appears in Systems Science literature
[L72]
prior to Dijkstra's first publication. (In fact,
ten years before
[D73a],
the paper
[B62]
appears;
it is likely that self-stabilization was discussed in the
late 1950's and we seek contributions to the bibliography
that predate
[B62].)
Research on
self-stabilization in the areas of Systems Science or
Control Theory continues
[S83]
[BH93]
[YG93]
[WMP94].
The formal definition of self-stabilization can be
state-based or based on behavior as represented by
a trajectory or history of states
[LL90]
(see
also
[BGM93a],
where the distinction between
state-based and behavioral definitions is shown to be
non-trivial). Alternatively,
it is possible to characterize self-stabilization by a
``forgetful'' property
[S93a].
Variants of self-stabilization have been proposed for
control theoretic application
[OWA91],
and for
situations where the initial state has a limited degree
of arbitrariness
[FMRT90].
A variant may also be
needed to address self-stabilization of shared objects
[LV92]
[HPT95].
Hopfield nets and similar forms of
neural networks may also have special definitions of
the term
[DWW86]
[PD95].
Advocacy, Surveys and Criticism
The most frequently cited paper advocating
self-stabilization is Lamport's invited address
[L84].
Advantages of self-stabilization are mentioned in a large
number of papers contained in the bibliography; two recent
promotions of the topic are
[G95b]
[D95a].
The surveys and introductory
articles
[S93c]
[FDG94]
[T95]
[HW95]
also advocate self-stabilization.
Several authors take a critical view of self-stabilization
[JT90]
[BG95];
however, it should be noted that one aspect
of self-stabilization, that a system ``eventually'' reaches
a legitimate state or behavior, is seen with increasing
frequency in the literature of distributed computing
[CT91]
(many recent papers could as well be cited, but they are outside
the scope of the bibliography).
Rings
The ring topology, useful and beloved in general for distributed
computing and network protocols, is the starting point for many
of the bibliography's entries. Citations in this section are
not comprehensive, but are referenced here for lack of
a better categorization. The example of a ring has been used
to illustrate techniques for self-stabilization, to show systematic
methodology for devising self-stabilizing programs, and to discriminate
between different models of computation and assumptions about
asynchrony and failure modes.
Scalability is a fundamental theme of distributed system and networking
research, and one aspect of scalability is the space requirement for
protocols. For instance, we generally prefer protocols with constant
space over those that have memory requirements proportional to the size
of the network. The question of space requirements for self-stabilizing
programs was one of the earliest questions and continues to be of interest.
Following the initial paper
[D74],
the question of minimizing
space was further taken up in
[HK85]
and then subsequently
in
[GH96]
[FD94]
[BD95].
Is uniform deterministic self-stabilizing mutual
exclusion possible in rings? What about other tasks such as
leader election, ring orientation, and identity assignment?
Such issues are topics in
[BP89]
[LG91]
[H93]
[IJ93]
[WH94a]
[WH94b]
[H94]
[H98a]
[NMT94]
[TH95]
[HR96]
[MOY96]
[IJ90a]
[BDK96]
[UKY98]
[BGJ99].
In some cases, probabilistic algorithms
provide uniform solutions to these tasks
[IJ90b]
[H90]
[H92b]
[BD94]
[BCD95]
[ILS95]
[KY95]
[KY97]
[YB95],
with other factors such as knowledge of the ring size,
the availability of unbounded variables, primality of the ring size,
and various atomicity assumptions
[B87]
playing a role in the algorithm design.
Dijkstra's initial paper associated a ``privilege'' with
enabledness of a guarded command, but the usual interpretation
of mutual exclusion motivates the notion of a token (influenced
by token ring communication protocols); this was first
observed in
[BGW87]
[BGW89].
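To make the ring setting concrete, here is a small runnable rendering of the K-state token-ring algorithm of [D74]; the function and variable names are ours, and the demon that always fires the lowest-indexed privilege is just one admissible schedule:

```python
# A sketch (our own rendering, not Dijkstra's notation) of the K-state
# token ring from [D74]: n machines hold counters in 0..K-1, K >= n.
# Machine 0 is privileged when its counter equals machine n-1's;
# machine i > 0 is privileged when its counter differs from machine i-1's.

def privileges(s):
    """Indices of machines currently holding a privilege (token)."""
    n = len(s)
    out = [0] if s[0] == s[n - 1] else []
    out += [i for i in range(1, n) if s[i] != s[i - 1]]
    return out

def fire(s, i, k):
    """Let privileged machine i make its move; returns the new state."""
    t = list(s)
    if i == 0:
        t[0] = (t[0] + 1) % k       # machine 0 advances its counter
    else:
        t[i] = t[i - 1]             # others copy their left neighbor
    return t

def stabilize(s, k, limit=10_000):
    """Central demon firing the lowest-indexed privilege until one remains."""
    steps = 0
    while len(privileges(s)) > 1 and steps < limit:
        s = fire(s, privileges(s)[0], k)
        steps += 1
    return s, steps
```

From an arbitrary initial state at least one machine is always privileged, and (with K at least the ring size) any fair schedule converges to states holding exactly one privilege, which then circulates forever; the legitimate states are exactly those with a single privilege.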
Other problems considered in the context of ring architectures
include load balancing
[KB87]
(see
[FD93a]
[AG95]
[GP99]
for more
work on load balancing), the dining philosopher problem
[HP92],
and multiple token systems
[FDS94].
The ring topology is enriched by additional communication
links in
[G91a]
to reduce the state space and in
[BB95]
to
tolerate faulty nodes. The paper
[GE90]
presents a protocol
with parameters for token and stabilization performance. In
[LS92a]
it is shown that an observer outside the ring can deduce legitimacy
for the system state faster than the worst-case stabilization time.
Trees and Graphs
The paper
[D74]
presents, in addition to the ring topology,
a protocol for a chain of machines (and a footnote suggests generalizations
of the ring structure). In
[K79]
the chain topology is generalized
to a tree in which token activity resembles the distributed computation
found in termination detection algorithms (such as the diffusing computation
paradigm). More recently, a number
of papers show how various graph-theoretic
problems have self-stabilizing solutions in tree networks
[KPBG94]
[GGKP95]
[VAG95]
[BDN95]
[DGT95]
[AS96]
[AS97b]
[A99]
[BDPV99].
The problem of mutual exclusion on a general network was first
considered in
[T81];
subsequent papers considering the
problem are
[IJ90b]
(with analysis in
[TW91])
and
[AHOP94],
with depth-first ordering provided by
[HC93]
[JB95]
[JABD97];
breadth-first ordering is considered
in
[J97];
a transactional approach to the problem is
presented in
[MNK96].
Spanning tree construction in a general network is important
for methodological reasons (see Section 11), and
many self-stabilizing solutions to this problem have been
published
[AG94]
[AKY90]
[CYH91]
[SS92]
[AK93]
[DIM93]
[AS95a]
[AS95b]
[GGP96]
[AB97]
[AS97a]
[AB98]
[AS98].
Specialized forms of trees (depth-first, breadth-first,
maximum-flow, heaps)
can also be constructed with
self-stabilizing algorithms
[HC92]
[SS92]
[CD94]
[GS94]
[BK95]
[BLB95].
Other graph-theoretic problems such as matching, coloring,
shortest-path, and network flow also have self-stabilizing algorithms
[HH92]
[CS93a]
[CS93b]
[FD93b]
[GK93a]
[SS93]
[GS94]
[SRS94]
[TH94]
[BDDN95]
[GGP95]
[GS95]
[BV97]
[C99b]
[C99c]
[C99d]
[KA99]
[KC99].
Such problems may not have self-stabilizing solutions if processor
names are anonymous or memory is limited
[SRR95]
[DGS96].
The amount of memory used by a self-stabilizing program can be
considered not only from the point of view
of the worst case during execution;
one can instead analyze the memory usage when the state is
legitimate -- and memory requirements at a legitimate state can be
lower than those consumed during convergence to a legitimate state
[AEH92].
Reducing the memory requirement at a legitimate state
can also reduce the overhead for a system that continuously
checks its state to determine whether or not its current state
is legitimate
[MOOY92]
[PY94].
Synchronization
The problems of clock synchronization and phase synchronization
are natural to consider for research on self-stabilization, since
synchronization is a fundamental tool for system implementations.
Self-stabilizing synchronization has been the focus of
much recent research
[GH90]
[LZM90]
[ADG91]
[CFG92]
[AKMPV93]
[DW93a]
[C94]
[CS94]
[PT94]
[DW95]
[HG95]
[LS95]
[ER95]
[PT97]
[HL98]
[HL99a]
[HL99b]
[C99a].
Fault Tolerance
Dijkstra's initial work on self-stabilization
[D74]
mentions no application to fault tolerance. Yet self-stabilization
is most often seen as a technique for
fault tolerance in the area of distributed
computing. As such, there has been considerable effort to accommodate the
fault-handling of self-stabilization
into the larger picture of fault tolerance.
The first paper to do this (implicitly) is
[D73a],
and this work is
furthered in
[BYC88]
[BYZ89]
[YBL91]
[BY93].
Mixed forms of tolerance (for instance
Byzantine failures, crashes, and transient faults)
are considered in
[ZB92]
[BB95]
[DH95a]
for specific cases; the
first general theory for combined forms of tolerance is presented
in
[AG93]
with a different approach given in
[GP93].
Some specialization to the task of stabilizing failure detection is
studied in
[LG94a]
[LG94b].
In a strictly asynchronous model,
a silent processor crash may exclude the possibility of
self-stabilizing algorithms for many tasks
[AH93]
[M95]
[BDK96]
[BK97a]
[BK97c].
A more recent trend in research is the focus on fault-containment
combined with self-stabilization
[GGHP96]
[GG96]
[KS97]
[UKMF97]
[GP97].
Communication Protocols
Self-stabilization has its most obvious application in the network
protocol area, since communication protocols are especially tolerant
of temporarily faulty results (retrying is par for the course).
There is some overlap between communication protocol research and
graph theoretic algorithms due to the natural interpretation of
vertices as network nodes and edges as network links. Some of
the research on self-stabilizing communication protocols considers classic
problems of window control, session control, message sequencing,
connection setup and sequencing, and routing
[BP89a]
[SG89]
[S90]
[AGH90]
[H92a]
[AB93a]
[V93]
[D93]
[D94]
[ACDKW94]
[DTOF94]
[APSV94]
[ADW95]
[APV96]
[CV96]
[DW97a]
[CW97]
[AAGMRS97]
[CV99].
The paper
[DIM97a]
is a communication
protocol for leader election; in
[DPW96]
mobile processes are
considered, and
[ACDKW94]
[ACDKW96]
specialize to high-speed networks
with self-stabilizing protocols.
Some interesting results relating
to construction (or possibility) of self-stabilizing communication
protocols are
[JV96]
[PM96].
Work on the ring topology (see
Section 4) also has clear application to LAN protocols.
Higher-layer network problems, such as group membership
[AC93],
have also been considered.
Models of Computation
A frequent line of research is to experiment with different models
of computation or types of resources within a model to determine
whether or not self-stabilization is possible; this exploration
also depends on the task to be self-stabilized.
Starting from the basic model proposed in
[D74],
one of the early
(and continuing) questions is how atomicity and parallel execution
of assignments affect the possibilities for self-stabilization
[B87]
[BGM89]
[HWT94].
Varying the granularity of operations has
an effect
[DIM93],
and one may also consider synchronous
execution models
[H90].
A number of other questions
(uniformity, ring size) are mentioned in Sections 4
and 5.
Self-stabilizing algorithms and protocols can also be
specified in models quite
different from that of
[D74].
Petri nets
[T89]
[G91b]
[CHR95]
[PM96],
conventional shared memory systems
[L86]
[LH91]
[ADHK97],
wait-free objects
[HPT95],
cellular automata
[GKL78]
[G95a],
iteration systems
[G82]
[AAEG93],
rule-based or constraint systems
[BEGMMR90]
[CDK91]
[C92]
[AGV94],
neural networks
[DWW86],
genetic algorithms
[DO93],
and even digital
[AG92]
or continuous circuits
[H95]
are represented in the bibliography.
The paper
[BDT99]
stabilizes a protocol under the
constraint that nodes must pass messages in
cut-through fashion. Even database transactions have been
viewed in terms of stabilization to consistency
[ED98].
Proof Techniques
It is challenging, at least for a newcomer to the field,
to verify that a program has desired properties when
nothing can be assumed about the initial state in an
execution -- and this verification is necessary to
prove a self-stabilizing program correct. A direct
verification for a program in
[D74]
is given in
[D86],
and it is interesting to compare this proof to a temporal
logic verification presented in
[TW86].
Temporal logic
is also proposed as a verification technique in
[LS97].
Self-stabilization proofs have been
carried out in the UNITY formalism
[R89].
Instead of a direct
proof that a self-stabilizing program has the required
property of convergence to legitimacy from an arbitrary initial
state, the proof can be broken into stages using the
``staircase'' method proposed in
[GM91].
Knuth
[HK85]
observes that, rather than showing convergence directly, one may show
that no divergent behavior is possible and thereby prove convergence.
A more traditional approach (following standard convergence
arguments of sequential program loops) is to use a variant
function, as observed in
[K88]
and used since in many
papers. For randomized self-stabilization, other techniques
have been developed
[DIM95]
[HR96].
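As an informal sketch (in our notation), the variant-function argument supplies a function into a well-founded order that every step taken from an illegitimate state strictly decreases:

```latex
% Variant function W over states \Sigma and legitimate set L:
W : \Sigma \to (D, \prec)\ \text{with } \prec \text{ well-founded},
\qquad
\forall s \notin L :\; s \to s' \;\Longrightarrow\; W(s') \prec W(s).
% Well-foundedness forbids infinite descent, so no execution can
% remain outside L forever; convergence to L follows.
```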
Another approach to verification is to attack the problem at
an earlier stage in program development. Given the specification
of a problem and some set of valid composition and refinement rules,
the specification can be stepwise refined until a program, or
a specification suitably close to a program, is obtained; in this
way, the result's validity follows from the validity of the
refinement steps. For instance,
a specification of a self-stabilizing program is refined to
obtain a UNITY program in
[LS92b]
[ATT95]
[LS93a]
[LS93b].
Principles underlying various kinds of program transformation
steps and program composition rules are given in
[GH91]
[S93b]
[SS94]
[A94]
[AP95]
[V97]
[DH99].
The dissertation
[P95]
(see also
[P97])
demonstrates how mechanical theorem proving
technology (using HOL)
assists in the verification of self-stabilizing algorithms. Other
experience with mechanical aids is reported in
[SRS97]
[KRS99].
General Methods
As noted in Section 10, it can be challenging to
prove a self-stabilizing program correct, and this reflects
the fact that it is difficult to devise self-stabilizing protocols.
Having observed this, a number of researchers looked into the
possibility of transforming a program that is not self-stabilizing
into one that is self-stabilizing. This can be done by superimposing
new modules, modifying the original program, or using a type of
compiler to produce a self-stabilizing implementation of the program.
The papers
[AG94]
[AKY90]
[AKY97]
[AV91]
[APV91]
[KP93a]
[AO94]
[APVD94]
[VAG95]
address the general problem of adding self-stabilization to a system
or program, or contribute tools for this purpose.
One may also consider transforming self-stabilizing programs specified
for one model so that they work
on another model
[GHR90]
[DIM97b]
[MK96]
[KMN97]
[GH97]
[GH99b]
[AG97]
[MN98],
and this is helpful if the former model is simpler for programming
than the latter. Transformations of self-stabilizing programs are
also useful for adding properties
[DH95a]
[DH97]
[AK95]
[KA97]
[GGHP96]
[AD97a]
[AD97b]
[KC98]
beyond
self-stabilization or to accelerate convergence
in the case of more likely faults
(see also
[BGK98]
[BGK99]
[KS97]
[KP98]
[KP99]).
If we recall the definition of self-stabilization
as a ``forgetful'' property
[S93a],
then the general principle
behind many self-stabilizing programs is that they continually ``flush''
older state information, and this idea can be more formally seen
in the work
[V94a]
[CV96].
Theoretical Papers
Since very few implementations of self-stabilizing algorithms
are cited in the literature, it may be unfair to single out specific
papers as being ``theoretical'' or more theoretical than others.
We mention here papers concerned with Turing machines, computability
results, taxonomy, and fundamental theoretic models:
[GHR90]
[AD94]
[BD93]
[AD97c]
[IL94]
[S95].
Frequently Asked Questions
We list here some of the more frequently asked questions
by researchers outside the field of self-stabilization,
but unlike a traditional ``FAQ'' we offer no answers.
- How does one know when a self-stabilizing system is in
a legitimate state?
- When does a self-stabilizing protocol or algorithm terminate?
- What is the relation between self-stabilization and the areas of
  - fault-tolerance
  - coding theory
  - self-checking circuits
  - self-organizing systems
  - auto-controlled systems
  - ergodic systems
- Can a system be partly self-stabilizing?
- What is self-stabilization good for?
- What are the most useful applications of self-stabilization?
- What are the big open questions in the area of self-stabilization?
- What would be a surprising result in research on self-stabilization?
The Future
What, beyond the continuance of research classified in previous sections,
is likely to appear in future versions of this bibliography? We hope to
see growing connections between self-stabilization and other areas of
fault tolerance. As noted in Section 3, there appears to be
growing appreciation of programs and services that make ``eventuality''
guarantees
[CT91]
[FGLLS96]
(see also
[BDDT98]);
such work may offer new applications
of self-stabilization beyond the currently recognized application
domains (communication protocols and load balancing). The paper
[AK95]
shows how self-stabilization can be useful even in
fault-tolerant systems with more stringent requirements than
what stabilization itself provides.
In addition to self-stabilization,
there are other ways of dealing with transient faults or initial
states that are somewhat (but not completely) arbitrary
(see
[FMRT90]
[AGMT95]
[TKM89]
[P91]
for example).
Indeed, weakened forms
of self-stabilization may be more attractive for many applications than
pure self-stabilization.