Replication

Replication means that multiple copies of each key/value pair are stored across multiple nodes in a cluster. Each copy of a key/value pair is called a replica. Each replica is stored on a different node, and each of these replica nodes is in a different rack.

Replication ensures that your application will continue to work when a server, rack or entire data center (DC) fails. In other words, replication is a core component of high availability (HA).

Replicas and replica nodes

A replica node is a node that owns the token range containing a key/value pair’s data token. A data token is generated by hashing the key portion of a key/value pair with a consistent hashing algorithm. A copy of the key/value pair (i.e. a replica) is written to each of its replica nodes across the cluster.

DynomiteDB has a masterless, peer-to-peer architecture, which means that each replica node is equal to every other replica node. This is important because a server or rack failure that affects one replica node has no effect on the other replica nodes.

Data token ownership

Data token ownership

The diagram above focuses on a single rack with three servers. s1’s node token is 0, s2’s node token is 1431655765 and s3’s node token is 2863311530.

We can determine which nodes are replica nodes for a given key/value pair as follows:

  1. Calculate the data token by hashing the key portion of the key/value pair with a consistent hashing algorithm
  2. For each node set the data token ownership range from nodeToken to nextNodeToken - 1
  3. Check if the data token is within the range of data tokens owned by each node
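
As a rough sketch of these three steps, the snippet below hashes a key into the 32-bit token space and looks up the owning node for the single rack in the diagram above. The MD5-based hash is purely illustrative; the actual hash algorithm DynomiteDB uses is an internal implementation detail.

```python
import bisect
import hashlib

# Node tokens for the three-node rack shown in the diagram above.
NODE_TOKENS = {
    "s1": 0,
    "s2": 1431655765,
    "s3": 2863311530,
}

TOKEN_SPACE = 2**32  # data tokens range from 0 to 4294967295


def data_token(key: str) -> int:
    """Step 1: hash the key into the 32-bit token space (illustrative hash only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % TOKEN_SPACE


def replica_node(token: int) -> str:
    """Steps 2 and 3: find the node whose range [nodeToken, nextNodeToken - 1] covers the token."""
    tokens = sorted(NODE_TOKENS.values())
    owner_token = tokens[bisect.bisect_right(tokens, token) - 1]
    return next(name for name, t in NODE_TOKENS.items() if t == owner_token)


print(replica_node(data_token("firstName")))  # some node, depending on the hash
print(replica_node(100))                      # s1, which owns 0..1431655764
print(replica_node(4000000000))               # s3, which owns 2863311530..4294967295
```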

Data token ownership (token ring)

The diagram above displays the same token ownership information as the earlier diagram, except that a token ring is used for display.

To continue the example in the diagram above, we will determine the replica node for a key/value pair with data token = 100 and for another key/value pair with data token = 4000000000.

s1 owns all data tokens from 0 to 1431655764. Therefore, a key/value pair with a data token = 100 will be written to node s1.

s3 owns all data tokens from 2863311530 to 4294967295. Therefore, a key/value pair with a data token = 4000000000 will be written to node s3.

Don’t worry too much about the data tokens themselves, as they are an implementation detail managed internally by DynomiteDB.

Importantly, an application developer never sees or thinks about data tokens. Instead, an application developer focuses on key/value pairs that make sense for the given application, such as firstName: Jane or age: 23. DynomiteDB’s use of data tokens is an implementation detail that is not exposed to the application layer.
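
For example, an application can use an off-the-shelf Redis client such as redis-py to talk to a DynomiteDB node. The hostname and port below are placeholders for your own deployment; note that no data tokens appear anywhere in application code.

```python
import redis

# A standard Redis client pointed at a Dynomite node's Redis-compatible API.
# Hostname and port are placeholders for your own deployment.
client = redis.Redis(host="dynomite-node.example.com", port=8102)

# The application thinks only in keys and values that make sense for it.
client.set("firstName", "Jane")
client.set("age", 23)
print(client.get("firstName"))  # b'Jane'
```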

Replication factor (RF)

3 racks means replication factor (RF) = 3

Replication factor (RF) determines how many replicas exist per DC. The RF for a DC is equal to the number of racks in a DC. For example, if a DC has 3 racks then we know that RF = 3 which means that there are 3 replicas of each key/value pair.

Each rack contains the entire token range from 0 to 4,294,967,295. Each node within a rack is assigned a node token, and together the nodes in the rack own the entire token range. Therefore, a data token is always owned by exactly one node within each rack, which is why the number of racks per DC determines the RF.
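
The sketch below illustrates the point, assuming a DC with three racks that all use the node tokens from the earlier diagrams: because every rack covers the full token range, a data token resolves to exactly one node per rack, so the number of replicas per DC equals the number of racks.

```python
# Hypothetical topology: three racks, each covering the full 32-bit token range.
RACKS = {
    "rack1": {"s1": 0, "s2": 1431655765, "s3": 2863311530},
    "rack2": {"s1": 0, "s2": 1431655765, "s3": 2863311530},
    "rack3": {"s1": 0, "s2": 1431655765, "s3": 2863311530},
}


def replica_nodes(token: int) -> list[str]:
    """Exactly one owning node per rack, so len(result) == RF == number of racks."""
    owners = []
    for rack, nodes in RACKS.items():
        owner = max((t, name) for name, t in nodes.items() if t <= token)[1]
        owners.append(f"{rack}/{owner}")
    return owners


print(replica_nodes(3000000000))  # ['rack1/s3', 'rack2/s3', 'rack3/s3'] -> RF = 3
```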

Replication

Replication happens on write and is a mix of synchronous and asynchronous operations. The number of nodes involved in the synchronous portion of the write is determined by the write consistency level (CL). If the write CL = dc_one, then one replica node within the local DC is written to synchronously while all other replica nodes are written to asynchronously. If the write CL = dc_quorum, then a quorum (i.e. a simple majority) of replica nodes within the local DC are written to synchronously while all other replica nodes are written to asynchronously. Replica nodes in remote DCs are always written to asynchronously.
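
As a minimal sketch, the write CL maps to a number of synchronously written local replicas roughly as follows (assuming the local DC holds RF replicas):

```python
def sync_replica_count(rf: int, write_cl: str) -> int:
    """Local-DC replicas that must acknowledge before the client receives a response."""
    if write_cl == "dc_one":
        return 1
    if write_cl == "dc_quorum":
        return rf // 2 + 1  # a simple majority of the local DC's replicas
    raise ValueError(f"unknown write consistency level: {write_cl}")


print(sync_replica_count(3, "dc_one"))     # 1 synchronous, 2 asynchronous
print(sync_replica_count(3, "dc_quorum"))  # 2 synchronous, 1 asynchronous
```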

In the next few sections we will discuss a few variations of replication:

  • Coordinator’s role in replication
  • Symmetric replication: When racks contain the same number of servers and all servers are of equal size
  • Asymmetric replication: When the number and size of servers differ across racks
  • Cross-DC replication: Replication from a local DC to one or more remote DCs

Coordinator’s role in replication

Coordinator

The diagram above shows how a write request flows from a client to Dynomite to a backend. Importantly, the client (which is typically an API server) uses a standard Redis client to communicate with the Redis-compatible API on a dynomite instance. From the client’s perspective, it is communicating with a single server.

dynomite receives the request and passes it to the coordinator subsystem. The coordinator hashes the key from the key/value pair to generate a data token. The coordinator then uses the data token and its knowledge of the network topology to determine which nodes are the replica nodes for this request.

Write consistency level (CL) = dc_one

Write CL dc_one with topology awareness

When write CL = dc_one and the coordinator’s node is a replica node, then a write request flows as follows:

  • the coordinator synchronously sends the write request to its backend
  • the coordinator waits for a response from its backend
  • if the coordinator receives a success response from its backend then:
    • the coordinator sends a success response to the client
    • the coordinator asynchronously replicates the write request to all other replica nodes
  • if the coordinator receives a failure response from its backend then:
    • the coordinator sends a failure message to the client

Write CL dc_one without topology awareness

When write CL = dc_one and the coordinator’s node is not a replica node, then a write request flows as follows:

  • the coordinator synchronously sends the write request to the replica node within its rack
  • the coordinator waits for a response from the replica node
  • if the coordinator receives a success response from the replica node then:
    • the coordinator sends a success response to the client
    • the coordinator asynchronously replicates the write request to all other replica nodes
  • if the coordinator receives a failure response from the replica node then:
    • the coordinator sends a failure message to the client
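
The two dc_one flows above can be sketched as follows. The Node type and send_write stub are hypothetical stand-ins for Dynomite’s internal networking; they exist only to make the control flow concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str


def send_write(node: Node, key: str, value: str, sync: bool) -> bool:
    """Stub for a write request sent to a replica node's backend."""
    print(f"{'sync' if sync else 'async'} write {key}={value} -> {node.name} ({node.rack})")
    return True


def write_dc_one(coordinator: Node, key: str, value: str, replicas: list[Node]) -> str:
    # Pick the synchronous target: the coordinator's own backend when it is a
    # replica node (topology aware), otherwise the replica node in its rack.
    if coordinator in replicas:
        sync_target = coordinator
    else:
        sync_target = next(n for n in replicas if n.rack == coordinator.rack)

    if not send_write(sync_target, key, value, sync=True):
        return "failure"  # the synchronous write failed, so the client sees a failure

    # The client already has its answer; all other replicas are written asynchronously.
    for node in replicas:
        if node != sync_target:
            send_write(node, key, value, sync=False)
    return "success"


replicas = [Node("sfo-r1-s3", "r1"), Node("sfo-r2-s3", "r2")]
print(write_dc_one(Node("sfo-r1-s3", "r1"), "firstName", "Jane", replicas))
```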

Write consistency level (CL) = dc_quorum

Write CL dc_quorum with topology awareness

When write CL = dc_quorum and the coordinator’s node is a replica node, then a write request flows as follows:

  • the coordinator synchronously sends the write request to its backend and enough replica nodes in other racks to obtain a quorum
  • the coordinator waits for a response from a quorum of replica nodes
  • if the coordinator receives a success response from a quorum of replica nodes then:
    • the coordinator sends a success response to the client
    • the coordinator asynchronously replicates the write request to all other replica nodes
  • if the coordinator does not receive a success response from a quorum of replica nodes then:
    • the coordinator sends a failure message to the client

Write CL dc_quorum without topology awareness

When write CL = dc_quorum and the coordinator’s node is not a replica node, then a write request flows as follows:

  • the coordinator synchronously sends the write request to enough replica nodes to obtain a quorum
  • the coordinator waits for a response from a quorum of replica nodes
  • if the coordinator receives a success response from a quorum of replica nodes then:
    • the coordinator sends a success response to the client
    • the coordinator asynchronously replicates the write request to all other replica nodes
  • if the coordinator does not receive a success response from a quorum of replica nodes then:
    • the coordinator sends a failure message to the client
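
The dc_quorum flows can be sketched in the same style, reusing the hypothetical Node and send_write helpers from the dc_one sketch above.

```python
def write_dc_quorum(coordinator: Node, key: str, value: str,
                    local_replicas: list[Node], remote_replicas: list[Node]) -> str:
    quorum = len(local_replicas) // 2 + 1  # simple majority of the local DC's replicas

    # Write synchronously to enough local replicas to reach a quorum, preferring
    # the coordinator's own backend when it happens to be a replica node.
    ordered = sorted(local_replicas, key=lambda n: n != coordinator)
    acks = sum(send_write(n, key, value, sync=True) for n in ordered[:quorum])

    if acks < quorum:
        return "failure"  # quorum not reached, so the client sees a failure

    # Remaining local replicas and all remote-DC replicas are written asynchronously.
    for node in ordered[quorum:] + remote_replicas:
        send_write(node, key, value, sync=False)
    return "success"


local = [Node("sfo-r1-s3", "r1"), Node("sfo-r2-s3", "r2"), Node("sfo-r3-s3", "r3")]
remote = [Node("lax-r1-s3", "r1"), Node("lax-r2-s3", "r2"), Node("lax-r3-s3", "r3")]
print(write_dc_quorum(Node("sfo-r1-s3", "r1"), "firstName", "Jane", local, remote))
```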

Topology-aware Dyno client

Topology-aware Dyno Client

The Dyno client is topology aware, which means that it always sends each request to a replica node. The benefit of this feature is that the coordinator is always on a replica node, which reduces latency and intra-cluster communication.

Symmetric replication

Symmetric replication

The diagram above shows symmetric replication: the single DC has two racks, and each rack has three servers of equal size (i.e. same CPU, memory, disk, etc.).

A server in each rack owns the same data token range as a server in the other rack. The table below shows each of the three data token ranges and the two nodes (one from each rack) that own each range.

Node token    Min data token    Max data token    Replica nodes
0             0                 1431655764        sfo-r1-s1, sfo-r2-s1
1431655765    1431655765        2863311529        sfo-r1-s2, sfo-r2-s2
2863311530    2863311530        4294967295        sfo-r1-s3, sfo-r2-s3

When the client sends a write request with write CL = dc_one for a key/value pair with data token = 3000000000, the data is written synchronously to one replica node and asynchronously to the other.
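
A short sketch of this lookup, using the node tokens from the table above: because both racks use the same node tokens, the data token maps to the node in the same position in each rack.

```python
# Symmetric layout from the table above: both racks use identical node tokens.
SYMMETRIC_RACKS = {
    "r1": {"sfo-r1-s1": 0, "sfo-r1-s2": 1431655765, "sfo-r1-s3": 2863311530},
    "r2": {"sfo-r2-s1": 0, "sfo-r2-s2": 1431655765, "sfo-r2-s3": 2863311530},
}

token = 3000000000
for rack, nodes in SYMMETRIC_RACKS.items():
    owner = max((t, name) for name, t in nodes.items() if t <= token)[1]
    print(rack, "->", owner)  # r1 -> sfo-r1-s3, r2 -> sfo-r2-s3
```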

Asymmetric replication

Asymmetric replication

The diagram above shows asymmetric replication: the single DC has two racks, where one rack has three servers and the other rack has six servers.

Each node in rack r1 owns 1/3 of the data token range, as shown in the table below.

Node token    Min data token    Max data token    Replica nodes
0             0                 1431655764        sfo-r1-s1
1431655765    1431655765        2863311529        sfo-r1-s2
2863311530    2863311530        4294967295        sfo-r1-s3

Each node in rack r2 owns 1/6 of the data token range, as shown in the table below.

Node token    Min data token    Max data token    Replica nodes
0             0                 715827881         sfo-r2-s1
715827882     715827882         1431655764        sfo-r2-s2
1431655765    1431655765        2147483646        sfo-r2-s3
2147483647    2147483647        2863311529        sfo-r2-s4
2863311530    2863311530        3579139411        sfo-r2-s5
3579139412    3579139412        4294967295        sfo-r2-s6

If the client connects to a node in r1 and sends a write request with write CL = dc_one for a key/value pair with data token = 3000000000, then the data is written synchronously to sfo-r1-s3 and asynchronously to sfo-r2-s5.

In general, the data on a node in r1 will be split among two nodes in r2. Conversely, data stored separately on two nodes in r2 is consolidated on one node in r1.
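
The same per-rack lookup with the asymmetric node tokens from the two tables above shows both effects: each rack still resolves a data token to exactly one node, but a node in r2 covers half the range of its counterpart in r1.

```python
# Asymmetric layout from the tables above: r1 has three nodes, r2 has six.
ASYMMETRIC_RACKS = {
    "r1": {"sfo-r1-s1": 0, "sfo-r1-s2": 1431655765, "sfo-r1-s3": 2863311530},
    "r2": {"sfo-r2-s1": 0, "sfo-r2-s2": 715827882, "sfo-r2-s3": 1431655765,
           "sfo-r2-s4": 2147483647, "sfo-r2-s5": 2863311530, "sfo-r2-s6": 3579139412},
}


def owners(token: int) -> dict[str, str]:
    """The owning node per rack for a given data token."""
    return {
        rack: max((t, name) for name, t in nodes.items() if t <= token)[1]
        for rack, nodes in ASYMMETRIC_RACKS.items()
    }


print(owners(3000000000))  # {'r1': 'sfo-r1-s3', 'r2': 'sfo-r2-s5'}
print(owners(3600000000))  # {'r1': 'sfo-r1-s3', 'r2': 'sfo-r2-s6'} -- sfo-r1-s3's data splits across two r2 nodes
```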

Multi-DC replication

Multi-DC symmetric replication

The diagram above shows a cluster with two DCs that each contain three racks with three servers per rack. The servers are all of equal size. Therefore, this example shows multi-DC symmetric replication.

Multi-DC replication is nearly identical to the replication within a single DC with multiple racks. During multi-DC replication, the coordinator in the local DC picks a coordinator in the remote DC. The local coordinator then forwards the request to the remote coordinator asynchronously. The remote coordinator then forwards the request to all replica nodes in the remote DC.

The local coordinator uses a remote coordinator to reduce cross-DC traffic. In our example, the remote DC has three replicas, yet only a single request is sent over the network from the local coordinator to the remote coordinator. The remote coordinator is then responsible for all communication within its DC.
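
A sketch of the cross-DC hop, again reusing the hypothetical Node and send_write helpers from the write-path sketches above: only one request crosses the DC boundary, and the remote coordinator fans the write out within its own DC.

```python
def forward_to_remote_dc(key: str, value: str, remote_replicas: list[Node]) -> Node:
    """Local coordinator side: a single asynchronous request crosses the DC boundary."""
    remote_coordinator = remote_replicas[0]  # one replica node chosen as the remote coordinator (illustrative)
    send_write(remote_coordinator, key, value, sync=False)
    return remote_coordinator


def fan_out_in_remote_dc(remote_coordinator: Node, key: str, value: str,
                         remote_replicas: list[Node]) -> None:
    """Remote coordinator side: replicate to every other replica node in its own DC."""
    for node in remote_replicas:
        if node != remote_coordinator:
            send_write(node, key, value, sync=False)
```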

Summary

DynomiteDB replicates each key/value pair across multiple servers and racks. A copy of a key/value pair is called a replica. Replicas are stored on nodes that own the data token range that matches the replica’s data token. The data token is generated by hashing the key portion of a key/value pair with a consistent hashing algorithm.

Each rack owns the complete token range. The nodes within a rack each own a portion of the range. As a result, each key/value pair is stored on one node per rack.

The number of racks per DC defines the replication factor (RF). The RF is the number of replicas per DC.

During a write request, the write consistency level (CL) determines how many replica nodes are written to synchronously, while the remainder are written to asynchronously. Replica nodes in remote DCs are always written to asynchronously.

DynomiteDB supports both symmetric replication and asymmetric replication across racks and DCs.