Network Working Group N. Barry
Internet-Draft Stellar Development Foundation
Intended status: Experimental D. Mazieres
Expires: October 1, 2018 Stanford University
J. McCaleb
Stellar Development Foundation
S. Polu
Stripe Inc.
March 30, 2018

The Stellar Consensus Protocol (SCP)
draft-mazieres-dinrg-scp-01

Abstract

SCP is an open Byzantine agreement protocol resistant to Sybil attacks. It allows Internet infrastructure stakeholders to reach agreement on a series of values without unanimous agreement on what constitutes the set of important stakeholders. A big differentiator from other Byzantine agreement protocols is that, in SCP, nodes determine the composition of quorums in a decentralized way: each node selects sets of nodes it considers large or important enough to speak for the whole network, and a quorum must contain such a set for each of its members.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on October 1, 2018.

Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.


Table of Contents

1. Introduction

Various aspects of Internet infrastructure depend on irreversible and transparent updates to data sets such as authenticated mappings [cite Li-Man-Watson draft]. Examples include public key certificates and revocations, transparency logs [RFC6962], preload lists for HSTS [RFC6797] and HPKP [RFC7469], and IP address delegation [I-D.paillisse-sidrops-blockchain].

The Stellar Consensus Protocol (SCP) specified in this draft allows Internet infrastructure stakeholders to collaborate in applying irreversible transactions to public state. SCP is an open Byzantine agreement protocol that resists Sybil attacks by allowing individual parties to specify minimum quorum memberships in terms of specific trusted peers. Each participant chooses combinations of peers on which to depend such that these combinations can be trusted in aggregate. The protocol guarantees safety so long as these dependency sets transitively overlap and contain sufficiently many honest nodes correctly obeying the protocol.

Though bad configurations are theoretically possible, several analogies provide an intuition for why transitive dependencies overlap in practice. For example, given multiple entirely disjoint Internet-protocol networks, people would have no trouble agreeing on the fact that the network containing the world's top web sites is the Internet. Such a consensus can hold even without unanimous agreement on what constitute the world's top web sites. Similarly, if network operators listed all the ASes from whom they would consider peering or transit worthwhile, the transitive closures of these sets would contain significant overlap, even without unanimous agreement on the "tier-1 ISP" designation. Finally, while different browsers and operating systems have slightly different lists of valid certificate authorities, there is significant overlap in the sets, so that a hypothetical system requiring validation from "all CAs" would be unlikely to diverge.

A more detailed abstract description of SCP and its rationale, including an English-language proof of safety, is available in [SCP]. In particular, that reference shows that a necessary property for safety, termed quorum intersection despite ill-behaved nodes, is sufficient to guarantee safety under SCP, making SCP optimally safe against Byzantine node failure for any given configuration.

This document specifies the end-system logic and wire format of the messages in SCP.

2. The Model

This section describes the configuration and input/output values of the consensus protocol.

2.1. Configuration

Each participant or node in the SCP protocol has a digital signature key and is named by the corresponding public key, which we term a NodeID.

Each node also selects one or more sets of nodes (each of which includes itself) called quorum slices. A quorum slice represents a large or important enough set of peers that the node selecting the quorum slice believes the slice collectively speaks for the whole network.

A quorum is a non-empty set of nodes containing at least one quorum slice of each of its members. For instance, suppose v1 has the single quorum slice {v1, v2, v3}, while each of v2, v3, and v4 has the single quorum slice {v2, v3, v4}. In this case, {v2, v3, v4} is a quorum because it contains a slice for each member. On the other hand {v1, v2, v3} is not a quorum, because it does not contain a quorum slice for v2 or v3. The smallest quorum including v1 in this example is the set of all nodes {v1, v2, v3, v4}.

Unlike traditional Byzantine agreement protocols, nodes in SCP only care about quorums to which they belong themselves (and hence that contain at least one of their quorum slices). Intuitively, this is what protects nodes from Sybil attacks. In the example above, if v3 deviates from the protocol, maliciously inventing 96 Sybils v5, v6, ..., v100, the honest nodes' quorums will all still include one another, ensuring that v1, v2, and v4 continue to agree on output values.

Every message in the SCP protocol specifies the sender's quorum slices. Hence, by collecting messages, a node dynamically learns what constitutes a quorum and can decide when a particular message has been sent by a quorum to which it belongs. (Again, nodes do not care about quorums to which they do not belong themselves.)

2.2. Input and output

SCP produces a series of output values for consecutively numbered slots. At the start of a slot, higher-layer software on each node supplies a candidate input value. SCP's goal is to ensure that non-faulty nodes agree on one or a combination of nodes' input values for the slot's output. 5 seconds after completing one slot, the protocol runs again for the next slot.

A value typically encodes a set of actions to apply to a replicated state machine. During the pause between slots, nodes to accumulate the next set of actions, thus amortizing the cost of consensus over arbitrarily many individual state machine operations.

In practice, only one or a small number of nodes' input values actually affect the output value for any given slot. As discussed in Section 3.4, which nodes' input values to use depends on a cryptographic hash of the slot number, output history, and node public keys. A node's chances of affecting the output value depend on how often it appears in other nodes' quorum slices.

From SCP's perspective, values are just opaque byte arrays whose interpretation is left to higher-layer software. However, SCP requires a combining function that reduces multiple candidate values into a single composite value. When nodes nominate multiple values for a slot, SCP nodes invoke this function to converge on a single composite value. By way of example, in an application where values consist of sets of transactions, the combining function could take the union of transaction sets. Alternatively, if values represent a timestamp and a set of transactions, the combining function might pair the highest nominated timestamp with the transaction set that has the highest hash value.

3. Protocol

The protocol consists of exchanging digitally-signed messages bound to nodes' quorum slices. The format of all messages is specified using XDR [RFC4506]. In addition to quorum slices, messages compactly convey votes on sets of conceptual statements. The core technique of voting with quorum slices is termed federated voting. We describe federated voting next, then detail protocol messages in the subsections that follow.

3.1. Federated voting

Federated voting is a process through which nodes confirm statements. Not every attempt at federated voting may succeed--an attempt to vote on some statement a may get stuck, with the result that nodes can confirm neither a nor its negation !a. However, when a node succeeds in confirming a statement a, federated voting guarantees two things:

  1. No two well-behaved nodes will confirm contradictory statements in any configuration and failure scenario in which any protocol can guarantee safety for the two nodes (i.e., quorum intersection for the two nodes holds despite ill-behaved nodes).
  2. If a node that is guaranteed safety by #1 confirms a statement a, and that node is a member of one or more quorums consisting entirely of well-behaved nodes, then eventually every member of every such a quorum will also confirm a.

Intuitively, these conditions are key to ensuring agreement among nodes as well as a weak form of liveness (absence of stuck states) that is compatible with the FLP impossibility result [FLP].

As a node v collects signed copies of a federated voting message m from peers, two thresholds trigger state transitions in v depending on the message. We define these thresholds as follows:

Each node v can send several types of message with respect to a statement a during federated voting:

Figure 1 illustrates the federated voting process. A node v votes for a valid statement a that doesn't contradict statements in past vote or accept messages sent by v. When the vote message reaches quorum threshold, the node accepts a. In fact, v accepts a if the vote-or-accept message reaches quorum threshold, as some nodes may accept a without first voting for it. Specifically, a node that cannot vote for a because it has voted for a's negation !a still accepts a when the message accept a reaches blocking threshold (meaning assertions about !a have no hope of reaching quorum threshold barring catastrophic Byzantine failure).

If and when the message accept a reaches quorum threshold, then v has confirmed a and the federated vote has succeeded. In effect, the accept messages constitute a second vote on the fact that the initial vote messages succeeded. Once v enters the confirmed state, it may issue a confirm a message to help other nodes confirm a more efficiently by pruning their quorum search at v.

                "vote-or-accept a"          "accept a"
                     reaches                 reaches
                 quorum threshold        quorum threshold
                +-----------------+     +-----------------+
                |                 |     |                 |
                |                 V     |                 V
             +-----------+     +-----------+     +-----------+
  a is +---->|  voted a  |     |accepted a |     |confirmed a|
 valid |     +-----------+     +-----------+     +-----------+
       |           |                 ^
+-----------+      |                 | "accept a" reaches
|uncommitted|------+-----------------+ blocking threshold
+-----------+      |
       |           |
       |     +-----------+
       +---->|  voted !a |
             +-----------+

Figure 1: Federated voting process

3.2. Basic types

SCP employs 32- and 64-bit integers, as defined below.

typedef unsigned int uint32;
typedef int int32;
typedef unsigned hyper uint64;
typedef hyper int64;

SCP uses the SHA-256 cryptograhpic hash function [RFC6234], and represents hash values as a simple array of 32 bytes.

typedef opaque Hash[32];

SCP employes the Ed25519 digital signature algorithm [RFC8032]. For cryptographic agility, however, public keys are represented as a union type that can later be compatibly extended with other key types.

typedef opaque uint256[32];

enum PublicKeyType
{
    PUBLIC_KEY_TYPE_ED25519 = 0
};

union PublicKey switch (PublicKeyType type)
{
case PUBLIC_KEY_TYPE_ED25519:
    uint256 ed25519;
};

// variable size as the size depends on the signature scheme used
typedef opaque Signature<64>;

Nodes are public keys, while values are simply opaque arrays of byes.

typedef PublicKey NodeID;

typedef opaque Value<>;

3.3. Quorum slices

Theoretically a quorum slice can be an arbitrary set of sets of nodes. However, arbitrary predicates on sets cannot be encoded concisely. Instead we specify quorum slices as any set of k-of-n members, where each of the n members can either be an individual node ID, or, recursively, another k-of-n set.

// supports things like: A,B,C,(D,E,F),(G,H,(I,J,K,L))
// only allows 2 levels of nesting
struct SCPQuorumSet
{
    uint32 threshold;            // the k in k-of-n
    PublicKey validators<>;
    SCPQuorumSet1 innerSets<>;
};
struct SCPQuorumSet1
{
    uint32 threshold;            // the k in k-of-n
    PublicKey validators<>;
    SCPQuorumSet2 innerSets<>;
};
struct SCPQuorumSet2
{
    uint32 threshold;            // the k in k-of-n
    PublicKey validators<>;
};

Let k be the value of threshold and n the sum of the sizes of the validators and innerSets vectors in a message sent by some node v. A message m sent by v reaches quorum threshold at v when three things hold:

  1. v itself has issued (digitally signed) the message,
  2. The number of nodes in validators who have signed m plus the number innerSets that (recursively) meet this condition is at least k, and
  3. These three conditions apply (recursively) at some combination of nodes sufficient for condition #2.

A message reaches blocking threshold at v when the number of validators making the statement plus (recursively) the number innerSets reaching blocking threshold exceeds n-k. (Blocking threshold is a local property that does not require a recursive check on other nodes like step #3 above.)

As described in Section 3.9, every protocol message is paired with a cryptographic hash of the sender's SCPQuorumSet and digitally signed. Inner protocol messages described in the next few sections should be understood to be received in alongside such a quorum slice specification and digital signature.

3.4. Nomination

For each slot, the SCP protocol begins in a NOMINATION phase whose goal is to devise one or more candidate output values for the consensus protocol. Nodes send nomination messages that contain a monotonically growing set of values in the following format:

struct SCPNomination
{
    Value votes<>;      // X
    Value accepted<>;   // Y
};

The votes and accepted sets are disjoint; any value that is eligible for for both sets is placed only in the accepted set.

votes consists of candidate values nominated by the sender. Each node progresses through a series of nomination rounds in which it may increase the set of values in its own votes field by adding the contents of the votes and accepted fields of SCPNomination messages received from a growing set of peers. In round n of slot i, each node determines an additional peer whose nominated values it should incorporate in its own SCPNomination message as follows:

For each round n until nomination has finished (see below), a node starts echoing the available peer v with the highest value of priority(n, v) from among the nodes in neighbors(n). To echo v, the node merges any valid values from v's votes and accepted sets to its own votes set. The notion of message validity is dictated by higher-layer software, but should be mostly independent of replicated state. For instance, nodes might check that values can be syntactically parsed, that signed transactions have correct digital signatures, and that timestamp fields are not in the future.

Nodes must not send an SCPNomination message until at least one of the votes or accepted fields is non-empty. When these fields are both empty, a node that has the highest priority among its neighbors in the current round (and hence should be echoing its own votes) adds the higher-layer software's input value to its votes field. Nodes that do not have the highest priority wait to hear SCPNomination messages from the nodes whose nominations they are echoing.

If a particular valid value x reaches quorum threshold in the messages sent by peers (meaning that every node in a quorum contains x either in the votes or the accepted field), then the node at which this happens moves x from its votes field to its accepted field and broadcasts a new SCPNomination message. Similarly, if x reaches blocking threshold in a node's peers' accepted field (meaning every one of a node's quorum slices contains at least one node with x in its accepted field), then the node adds x to its own accepted field (removing it from votes if applicable). These two cases correspond to the two conditions for entering the accepted state in Figure 1.

A node finishes the NOMINATION phase whenever any value x reaches quorum threshold in the accepted fields. Following the terminology of Section 3.1, this condition corresponds to when the node confirms x as nominated. A node that has finished the NOMINATION phase stops adding new values to its votes set. However, the node continues adding new values to accepted as appropriate. Doing so may lead to more values becoming confirmed nominated in the background.

Round n lasts for 2+n seconds, after which, if the NOMINATION phase has not finished, a node proceeds to round n+1. Note that a node continues to echo votes from the highest priority neighbor in prior rounds as well as the current round. In particular, a node continues expanding its votes field with values nominated by highest priority neighbors from prior rounds even when those values were added after the end of the round.

XXX - expand votes with only the 10 values with lowest SHA-256 hash in any given round to avoid blowing out the message size?

3.5. Ballots

After completing the NOMINATION phase (meaning after at least one candidate value is confirmed nominated), a node moves through three phases of balloting: PREPARE, COMMIT, and EXTERNALIZE. Balloting employs federated voting to chose between commit and abort statements for ballots. A ballot is a pair, consisting of a counter and candidate value:

// Structure representing ballot <n, x>
struct SCPBallot
{
    uint32 counter; // n
    Value value;    // x
};

We use the notation <n, x> to represent a ballot with counter == n and value == x.

Ballots are totally ordered with counter more significant than value. Hence, we write b1 < b2 to mean that either b1.counter < b2.counter or b1.counter == b2.counter && b1.value < b2.value. (Values are compared lexicographically as a strings of unsigned octets.)

The protocol moves through federated voting on successively higher ballots until nodes confirm commit b for some ballot b, at which point consensus terminates and outputs b.value for the slot. To ensure that only one value can be chosen for a slot and that the protocol cannot get stuck if individual ballots get stuck, there are two restrictions on voting:

  1. A node cannot vote for both commit b and abort b on the same ballot (the two outcomes are contradictory), and
  2. A node may not vote for commit b for any ballot b unless it has aborted every lesser ballot with a different value.

The second condition requires voting to abort large numbers of ballots before voting to commit a ballot b. We call this preparing ballot b, and introduce the following notation for the associated set of abort statements.

Using this terminology, a node must confirm prepare(b) before issuing a vote message for the statement commit b.

3.6. Prepare messages

The first phase of balloting is the PREPARE phase. During this phase, nodes send the following message:

struct SCPPrepare
{
    SCPBallot ballot;         // b
    SCPBallot *prepared;      // p
    SCPBallot *preparedPrime; // p'
    uint32 hCounter;          // h.counter or 0 if h == NULL
    uint32 cCounter;          // c.counter or 0 if c == NULL
};

This message compactly conveys the following (conceptual) federated voting messages:

Note that to be valid, an SCPPrepare message must satisfy preparedPrime < prepared <= ballot (for any non-NULL prepared and preparedPrime), and cCounter <= hCounter <= ballot.counter.

Based on the federated vote messages received, each node keeps track of what ballots have been accepted and confirmed prepared. It uses these ballots to set the following fields of its own SCPPrepare messages as follows.

prepared

The highest accepted prepared ballot or or NULL if no ballot has been accepted prepared
preparedPrime

The highest accepted prepared ballot such that preparedPrime.value != prepared.value, or NULL if there is no such ballot
hCounter

The counter field of the highest confirmed prepared ballot, or 0 if no ballot has been confirmed prepared (Only the counter is included because the value of the highest confirmed prepared ballot is always the same as ballot.value.)
ballot

The current ballot that a node is attempting to prepare and commit. The rules for setting each field are detailed below.
ballot.counter

The counter is set according to the following rules:
ballot.value

Each time the ballot counter is changed, the value is also recomputed as follows: If any ballot has been confirmed prepared, then the ballot.value is taken to to be h.value for the highest confirmed prepared ballot h. Otherwise (if hCounter == 0), the value is taken as the output of the deterministic combining function applied to all confirmed nominated values. Note the that of confirmed nominated values may continue to grow in the background during the balloting phase, so ballot.value may change even while hCounter == 0.
cCounter

The value cCounter is maintained based on an internally-maintained "commit ballot" c, initiall NULL. cCounter is 0 when while c is NULL and c.counter otherwise. c is updated as follows:

The PREPARE phase ends at a node when the statement commit b reaches the accept state in federated voting for some ballot b.

3.7. Commit messages

In the COMMIT phase, a node has accepted commit b for some ballot b, and must confirm that statement to act on the value in b.counter. A node sends the following message in this phase:

struct SCPCommit
{
    SCPBallot ballot;       // b
    uint32 preparedCounter; // prepared.counter
    uint32 hCounter;        // h.counter
    uint32 cCounter;        // c.counter
};

The message conveys the following federated vote messages, where where infinity is 2^{32} (a value greater than any ballot counter representable in marshaled form):

A node computes the fields in the SCPCommit messages it sends as follows:

ballot

This field is maintained identically to how it is maintained in the PREPARE phase, though ballot.value can no longer change, only ballot.counter. Note that the value ballot.counter does not figure in any of the federated voting messages. The purpose of continuing to update and send this field is to assist other nodes still in the PREPARE phase in synchronizing their counters.
preparedCounter

This field is the counter of the highest accepted prepared ballot--maintained identically to the prepared field in the PREPARE phase. Since the value field will always be the same as ballot, only the counter is sent in the COMMIT phase.
cCounter

The counter of the lowest ballot c for which the node has accepted commit c. (No value is included in messages since c.value == ballot.value.)
hCounter

The counter of the highest ballot h for which the node has accepted commit c. (No value is included in messages since h.value == ballot.value.)

As soon as a node confirms commit b for any ballot b, it moves to the EXTERNALIZE stage.

3.8. Externalize messages

A node enters the EXTERNALIZE when it confirms commit b for any ballot b. As soon as this happens, SCP outputs b.value as the value of the current slot. In order to help other nodes achieve consensus on the slot more quickly, a node reaching this phase also sends the following message:

struct SCPExternalize
{
    SCPBallot commit;         // c
    uint32 hCounter;          // h.counter
} externalize;

An SCPExternalize message conveys the following federated voting messages:

commit

The lowest confirmed committed ballot.
hCounter

The counter of the highest confirmed committed ballot.

confirmed committed ballot.

3.9. Message envelopes

In order to provide full context for each signed message, all signed messages are part of an SCPStatement union type that includes the slotIndex naming the slot to which the message applies, as well as the type of the message. A signed message and its signature are packed together in an SCPEnvelope structure.

enum SCPStatementType
{
    SCP_ST_PREPARE = 0,
    SCP_ST_COMMIT = 1,
    SCP_ST_EXTERNALIZE = 2,
    SCP_ST_NOMINATE = 3
};

struct SCPStatement
{
    NodeID nodeID;    // v (node signing message)
    uint64 slotIndex; // i

    union switch (SCPStatementType type)
    {
    case SCP_ST_PREPARE:
        SCPPrepare prepare;
    case SCP_ST_COMMIT:
        SCPCommit confirm;
    case SCP_ST_EXTERNALIZE:
        SCPExternalize externalize;
    case SCP_ST_NOMINATE:
        SCPNomination nominate;
    }
    pledges;
};

struct SCPEnvelope
{
    SCPStatement statement;
    Signature signature;
};

4. Security considerations

If nodes do not pick quorum slices well, the protocol will not be safe.

5. Acknowledgments

The Stellar development foundation supported development of the protocol and produced the first production deployment of SCP. The IRTF DIN group including Dirk Kutscher, Sydney Li, Colin Man, Melinda Shore, and Jean-Luc Watson helped with the framing and motivation for this specification.

6. References

6.1. Normative References

[RFC4506] Eisler, M., "XDR: External Data Representation Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 2006.
[RFC6234] Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms (SHA and SHA-based HMAC and HKDF)", RFC 6234, DOI 10.17487/RFC6234, May 2011.
[RFC8032] Josefsson, S. and I. Liusvaara, "Edwards-Curve Digital Signature Algorithm (EdDSA)", RFC 8032, DOI 10.17487/RFC8032, January 2017.

6.2. Informative References

[FLP] Fischer, M., Lynch, N. and M. Lynch, "Impossibility of Distributed Consensus with One Faulty Process", Journal of the ACM 32(2):374-382, 1985.
[I-D.paillisse-sidrops-blockchain] Paillisse, J., Rodriguez-Natal, A., Ermagan, V., Maino, F. and A. Cabellos-Aparicio, "An analysis of the applicability of blockchain to secure IP addresses allocation, delegation and bindings.", Internet-Draft draft-paillisse-sidrops-blockchain-01, October 2017.
[RFC6797] Hodges, J., Jackson, C. and A. Barth, "HTTP Strict Transport Security (HSTS)", RFC 6797, DOI 10.17487/RFC6797, November 2012.
[RFC6962] Laurie, B., Langley, A. and E. Kasper, "Certificate Transparency", RFC 6962, DOI 10.17487/RFC6962, June 2013.
[RFC7469] Evans, C., Palmer, C. and R. Sleevi, "Public Key Pinning Extension for HTTP", RFC 7469, DOI 10.17487/RFC7469, April 2015.
[SCP] Mazieres, D., "The Stellar Consensus Protocol: A Federated Model for Internet-level Consensus", Stellar Development Foundation whitepaper , 2015.

Authors' Addresses

Nicolas Barry Stellar Development Foundation 170 Capp St., Suite A San Francisco, CA, 94110 US EMail: nicolas@stellar.org
David Mazieres Stanford University 353 Serra Mall, Room 290 Stanford, CA, 94305 US EMail: dm@uun.org
Jed McCaleb Stellar Development Foundation 170 Capp St., Suite A San Francisco, CA, 94110 US EMail: jed@stellar.org
Stanislas Polu Stripe Inc. 185 Berry Street, Suite 550 San Francisco, CA, 94107 US EMail: stan@stripe.com