Causal Consistency Specification
- Status: Accepted
- Minimum Server Version: 3.6
Abstract
Version 3.6 of the server introduces support for causal consistency. This spec builds upon the Sessions Specification to define how an application requests causal consistency and how a driver interacts with the server to implement causal consistency.
Definitions
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Terms
Causal consistency
A property that guarantees that an application can read its own writes and that a later read
will never observe a version of the data that is older than an earlier read.
ClientSession
The driver object representing a client session and the operations that can be performed on it.
Cluster time
The current cluster time. The server reports its view of the current cluster time in the
$clusterTime
field in responses from the server and the driver participates in distributing the current cluster time
to all nodes (called "gossipping the cluster time") by sending the highest $clusterTime
it has seen so far in messages
it sends to mongos servers. The current cluster time is a logical time, but is digitally signed to prevent malicious
clients from propagating invalid cluster times. Cluster time is only used in replica sets and sharded clusters.
Logical time
A time-like quantity that can be used to determine the order in which events occurred. Logical time is
represented as a BsonTimestamp.
MongoClient
The root object of a driver's API. MAY be named differently in some drivers.
MongoCollection
The driver object representing a collection and the operations that can be performed on it. MAY be
named differently in some drivers.
MongoDatabase
The driver object representing a database and the operations that can be performed on it. MAY be
named differently in some drivers.
Operation time
The logical time at which an operation occurred. The server reports the operation time in the
response to all commands, including error responses. The operation time by definition is always less than or equal to
the cluster time. Operation times are tracked on a per ClientSession
basis, so the operationTime
of each
ClientSession
corresponds to the time of the last operation performed in that particular ClientSession
.
ServerSession
The driver object representing a server session.
Session
A session is an abstract concept that represents a set of sequential operations executed by an application
that are related in some way. This specification defines how sessions are used to implement causal consistency.
Unacknowledged writes
Unacknowledged writes are write operations that are sent to the server without waiting for a
reply acknowledging the write. See the "Unacknowledged Writes" section below for information on how unacknowledged
writes interact with causal consistency.
Specification
An application requests causal consistency by creating a ClientSession
with options that specify that causal
consistency is desired. An application then passes the session as an argument to methods in the MongoDatabase
and
MongoCollection
classes. Any operations performed against that session will then be causally consistent.
Naming variations
This specification defines names for new methods and types. To the extent possible you SHOULD use these names in your
driver. However, where your driver's and/or language's naming conventions differ you SHOULD continue to use them
instead. For example, you might use StartSession
or start_session
instead of startSession
.
High level summary of the API changes for causal consistency
Causal consistency is built on top of client sessions.
Applications will start a new client session for causal consistency like this:
options = new SessionOptions(causalConsistency = true);
session = client.startSession(options);
All read operations performed using this session will now be causally consistent.
If no value is provided for causalConsistency
and snapshot reads are not requested a value of true is implied. See the
causalConsistency
section.
MongoClient changes
There are no API changes to MongoClient
to support causal consistency. Applications indicate whether they want causal
consistency by setting the causalConsistency
field in the options passed to the startSession
method.
SessionOptions changes
SessionOptions
change summary
class SessionOptions {
Optional<bool> causalConsistency;
// other options defined by other specs
}
In order to support causal consistency a new property named causalConsistency
is added to SessionOptions
.
Applications set causalConsistency
when starting a client session to indicate whether they want causal consistency.
All read operations performed using that client session are then causally consistent.
Each new member is documented below.
causalConsistency
Applications set causalConsistency
when starting a session to indicate whether they want causal consistency.
Note that the causalConsistency
property is optional. For explicit sessions, the default value of this property is
not supplied
. If no value is supplied for causalConsistency
the value will be inherited. Currently it is inherited
from the global default which is defined to be true. In the future it might be inherited from client settings. For
implicit sessions, the value of this property MUST be set to false
in order to avoid potential conflicts with an
operation's read concern level.
Causal consistency is provided at the session level by tracking the clusterTime
and operationTime
for each session.
In some cases an application may wish subsequent operations in one session to be causally consistent with operations
that were executed in a different session. In that case the application can call the advanceClusterTime
and
advanceOperationTime
methods in ClientSession
to advance the clusterTime
and operationTime
of one session to the
clusterTime
and operationTime
from another session.
ClientSession changes
ClientSession
changes summary
interface ClientSession {
Optional<BsonTimestamp> operationTime;
void advanceOperationTime(BsonTimestamp operationTime);
// other members as defined in other specs
}
Each new member is documented below.
operationTime
This property returns the operation time of the most recent operation performed using this session. If no operations
have been performed using this session the value will be null unless advanceOperationTime
has been called. This value
will also be null when the cluster does not report operation times.
advanceOperationTime
This method advances the operationTime
for a session. If the new operationTime
is greater than the session's current
operationTime
then the session's operationTime
MUST be advanced to the new operationTime
. If the new
operationTime
is less than or equal to the session's current operationTime
then the session's operationTime
MUST
NOT be changed.
Drivers MUST NOT attempt to validate the supplied operationTime
. While the server requires that operationTime
be
less than or equal to clusterTime
we don't want to check that when advanceOperationTime
is called. This allows an
application to call advanceClusterTime
and advanceOperationTime
in any order, or perhaps to not call
advanceClusterTime
at all and let the clusterTime
that is sent to the server be implied by the clusterTime
in
MongoClient
.
MongoDatabase changes
There are no additional API changes to MongoDatabase
beyond those specified in the Sessions Specification. All
MongoDatabase
methods that talk to the server have been overloaded to take a session parameter. If that session was
started with causalConsistency = true
then all operations using that session will be causally consistent.
MongoCollection changes
There are no additional API changes to MongoCollection
beyond those specified in the Sessions Specification. All
MongoCollection
methods that talk to the server have been overloaded to take a session parameter. If that session was
started with causalConsistency = true
then all operations using that session will be causally consistent.
Server Commands
There are no new server commands related to causal consistency. Instead, causal consistency is implemented by:
- Saving the
operationTime
returned by 3.6+ servers for all operations in a property of theClientSession
object. The server reports theoperationTime
whether the operation succeeded or not and drivers MUST save theoperationTime
in theClientSession
whether the operation succeeded or not. - Passing that
operationTime
in theafterClusterTime
field of thereadConcern
field for subsequent causally consistent read operations (for all commands that support areadConcern
) - Gossiping clusterTime (described in the Driver Session Specification)
Server Command Responses
To support causal consistency the server returns the operationTime
in responses it sends to the driver (for both read
and write operations).
{
ok : 1 or 0,
... // the rest of the command reply
operationTime : <BsonTimestamp>
$clusterTime : <BsonDocument> // only in deployments that support cluster times
}
The operationTime
MUST be stored in the ClientSession
to later be passed as the afterClusterTime
field of the
readConcern
field in subsequent read operations. The operationTime
is returned whether the command succeeded or not
and MUST be stored in either case.
Drivers MUST examine all responses from the server for the presence of an operationTime
field and store the value in
the ClientSession
.
When connected to a standalone node command replies do not include an operationTime
field. All operations against a
standalone node are causally consistent automatically because there is only one node.
When connected to a deployment that supports cluster times the command response also includes a field called
$clusterTime
that drivers MUST use to gossip the cluster time. See the Sessions Specification for details.
Causally consistent read commands
For causal consistency the driver MUST send the operationTime
saved in the ClientSession
as the value of the
afterClusterTime
field of the readConcern
field:
{
find : <string>, // or other read command
... // the rest of the command parameters
readConcern :
{
level : ..., // from the operation's read concern (only if specified)
afterClusterTime : <BsonTimestamp>
}
}
For the lists of commands that support causally consistent reads, see ReadConcern spec.
The driver MUST merge the ReadConcern
specified for the operation with the operationTime
from the ClientSession
(which goes in the afterClusterTime
field) to generate the combined readConcern
to send to the server. If the level
property of the read concern for the operation is null then the driver MUST NOT include a level
field alongside the
afterClusterTime
of the readConcern
value sent to the server. Drivers MUST NOT attempt to verify whether the server
supports causally consistent reads or not for a given read concern level. The server will return an error if a given
level does not support causal consistency.
The Read and Write Concern specification states that when a user has not specified a ReadConcern
or has specified the
server's default ReadConcern
, drivers MUST omit the ReadConcern
parameter when sending the command. For causally
consistent reads this requirement is modified to state that when the ReadConcern
parameter would normally be omitted
drivers MUST send a ReadConcern
after all because that is how the afterClusterTime
value is sent to the server.
The Read and Write Concern Specification states that drivers MUST NOT add a readConcern
field to commands that are run
using a generic runCommand
method. The same is true for causal consistency, so commands that are run using
runCommand
MUST NOT have an afterClusterTime
field added to them.
When executing a causally consistent read, the afterClusterTime
field MUST be sent when connected to a deployment that
supports cluster times, and MUST NOT be sent when connected to a deployment that does not support cluster times.
Unacknowledged writes
The implementation of causal consistency relies on the operationTime
returned by the server in the acknowledgement of
a write. Since unacknowledged writes don't receive a response from the server (or don't wait for a response) the
ClientSession
's operationTime
is not updated after an unacknowledged write. That means that a causally consistent
read after an unacknowledged write cannot be causally consistent with the unacknowledged write. Rather than prohibiting
unacknowledged writes in a causally consistent session we have decided to accept this limitation. Drivers MUST document
that causally consistent reads are not causally consistent with unacknowledged writes.
Test Plan
Below is a list of test cases to write.
Note: some tests are only relevant to certain deployments. For the purpose of deciding which tests to run assume that any deployment that is version 3.6 or higher and is either a replica set or a sharded cluster supports cluster times.
- When a
ClientSession
is first created theoperationTime
has no value.session = client.startSession()
- assert
session.operationTime
has no value
- The first read in a causally consistent session must not send
afterClusterTime
to the server (because theoperationTime
has not yet been determined)session = client.startSession(causalConsistency = true)
document = collection.anyReadOperation(session, ...)
- capture the command sent to the server (using APM or other mechanism)
- assert that the command does not have an
afterClusterTime
- The first read or write on a
ClientSession
should update theoperationTime
of theClientSession
, even if there is an error.- skip this test if connected to a deployment that does not support cluster times
session = client.startSession() // with or without causal consistency
collection.anyReadOrWriteOperation(session, ...) // test with errors also if possible
- capture the response sent from the server (using APM or other mechanism)
- assert
session.operationTime
has the same value that is in the response from the server
- A
findOne
followed by any other read operation (test them all) should include theoperationTime
returned by the server for the first operation in theafterClusterTime
parameter of the second operation- skip this test if connected to a deployment that does not support cluster times
session = client.startSession(causalConsistency = true)
collection.findOne(session, {})
operationTime = session.operationTime
collection.anyReadOperation(session, ...)
- capture the command sent to the server (using APM or other mechanism)
- assert that the command has an
afterClusterTime
field with a value ofoperationTime
- Any write operation (test them all) followed by a
findOne
operation should include theoperationTime
of the first operation in theafterClusterTime
parameter of the second operation, including the case where the first operation returned an error.- skip this test if connected to a deployment that does not support cluster times
session = client.startSession(causalConsistency = true)
collection.anyWriteOperation(session, ...) // test with errors also where possible
operationTime = session.operationTime
collection.findOne(session, {})
- capture the command sent to the server (using APM or other mechanism)
- assert that the command has an
afterClusterTime
field with a value ofoperationTime
- A read operation in a
ClientSession
that is not causally consistent should not include theafterClusterTime
parameter in the command sent to the server.- skip this test if connected to a deployment that does not support cluster times
session = client.startSession(causalConsistency = false)
collection.anyReadOperation(session, {})
operationTime = session.operationTime
- capture the command sent to the server (using APM or other mechanism)
- assert that the command does not have an
afterClusterTime
field
- A read operation in a causally consistent session against a deployment that does not support cluster times does not
include the
afterClusterTime
parameter in the command sent to the server.- skip this test if connected to a deployment that does support cluster times
session = client.startSession(causalConsistency = true)
collection.anyReadOperation(session, {})
- capture the command sent to the server (using APM or other mechanism)
- assert that the command does not have an
afterClusterTime
field
- When using the default server
ReadConcern
thereadConcern
parameter in the command sent to the server should not include alevel
field.- skip this test if connected to a deployment that does not support cluster times
session = client.startSession(causalConsistency = true)
- configure
collection
to use default serverReadConcern
collection.findOne(session, {})
operationTime = session.operationTime
collection.anyReadOperation(session, ...)
- capture the command sent to the server (using APM or other mechanism)
- assert that the command does not have a
`level
field - assert that the command has a
afterClusterTime
field with a value ofoperationTime
- When using a custom
ReadConcern
thereadConcern
field in the command sent to the server should be a merger of theReadConcern
value and theafterClusterTime
field.- skip this test if connected to a deployment that does not support cluster times
session = client.startSession(causalConsistency = true)
- configure collection to use a custom ReadConcern
collection.findOne(session, {})
operationTime = session.operationTime
collection.anyReadOperation(session, ...)
- capture the command sent to the server (using APM or other mechanism)
- assert that the command has a
level
field with a value matching the custom readConcern - assert that the command has an
afterClusterTime
field with a value ofoperationTime
- Removed
- When connected to a deployment that does not support cluster times messages sent to the server should not include
$clusterTime
.- skip this test when connected to a deployment that does support cluster times
document = collection.findOne({})
- capture the command sent to the server
- assert that the command does not include a
$clusterTime
field
- When connected to a deployment that does support cluster times messages sent to the server should include
$clusterTime
.- skip this test when connected to a deployment that does not support cluster times
document = collection.findOne({})
- capture the command sent to the server
- assert that the command includes a
$clusterTime
field
Motivation
To support causal consistency. Only supported with server version 3.6 or newer.
Design Rationale
The goal is to modify the driver API as little as possible so that existing programs that don't need causal consistency
don't have to be changed. This goal is met by defining a SessionOptions
field that applications use to start a
ClientSession
that can be used for causal consistency. Any operations performed with such a session are then causally
consistent.
The operationTime
is tracked on a per ClientSession
basis. This allows each ClientSession
to have an
operationTime
that is sufficiently new to guarantee causal consistency for that session, but no newer. Using an
operationTime
that is newer than necessary can cause reads to block longer than necessary when sent to a lagging
secondary. The goal is to block for just long enough to guarantee causal consistency and no longer.
Backwards Compatibility
The API changes to support sessions extend the existing API but do not introduce any backward breaking changes. Existing programs that don't use causal consistency continue to compile and run correctly.
Reference Implementation
A reference implementation must be completed before any spec is given status "Final", but it need not be completed before the spec is "Accepted". While there is merit to the approach of reaching consensus on the specification and rationale before writing code, the principle of "rough consensus and running code" is still useful when it comes to resolving many discussions of spec details. A final reference implementation must include test code and documentation.
Q&A
Changelog
-
2024-02-08: Migrated from reStructuredText to Markdown.
-
2022-11-11: Require
causalConsistency=false
for implicit sessions. -
2022-10-05: Remove spec front matter and reformat changelog.
-
2022-01-28: Fix formatting for prose tests
-
2022-01-22: Remove outdated prose test #10
-
2021-06-26: Default value for causalConsistency is influenced by snapshot reads
-
2017-11-17: Added link to ReadConcern spec which lists commands that support readConcern
-
2017-10-06: advanceOperationTime MUST NOT validate operationTime
-
2017-10-05: How to handle default read concern
-
2017-10-04: Added advanceOperationTime
-
2017-09-28: Remove remaining references to collections being associated with
sessions. Update spec to reflect that replica sets use $clusterTime also now. -
2017-09-13: Renamed "causally consistent reads" to "causal consistency". If no
value is supplied forcausallyConsistent
assume true.