MongoDB Driver Specifications
The modern MongoDB driver consists of a number of components, each of which is thoroughly documented in this repository. Though this information is readily available and extremely helpful, what it lacks is a high-level overview that ties the specs together into a cohesive picture of what a MongoDB driver is.
Architecturally an implicit hierarchy exists within the drivers, so expressing drivers in terms of an onion model feels appropriate.
Layers of the Onion
The "drivers onion" is meant to represent how various concepts, components and APIs can be layered atop each other to build a MongoDB driver from the ground up, or to help understand how existing drivers have been structured. Hopefully this representation of MongoDB’s drivers helps provide some clarity, as the complexity of these libraries - like the onion above - could otherwise bring you to tears.
Serialization
At their lowest level all MongoDB drivers will need to know how to work with BSON. BSON (short for "Binary JSON") is a binary-encoded serialization of JSON-like documents, and like JSON, it supports the nesting of arrays and documents. BSON also contains extensions that allow representation of data types that are not part of the JSON spec.
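For illustration, here is a minimal round trip through BSON using the Node.js driver, which re-exports the bson library (any driver exposes an equivalent codec):

// Round-trip a document through BSON bytes using the "bson" library
// that the Node.js driver re-exports as `BSON`.
const { BSON } = require('mongodb');

const doc = {
  _id: new BSON.ObjectId(),
  total: BSON.Decimal128.fromString('19.99'), // exact decimal, not a double
  tags: ['a', 'b'],
};
const bytes = BSON.serialize(doc);        // Uint8Array of BSON bytes
const decoded = BSON.deserialize(bytes);
console.log(BSON.EJSON.stringify(decoded)); // Extended JSON preserves the types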
Specifications: BSON, ObjectId, Decimal128, UUID, DBRef, Extended JSON
Communication
Once BSON documents can be created and manipulated, the foundation for interacting with a MongoDB host process has been laid. Drivers communicate by sending database commands as serialized BSON documents using MongoDB’s wire protocol.
From the provided connection string and options, a socket connection is established to a host. An initial handshake then verifies that the host is in fact a valid MongoDB deployment by sending a simple hello command. Based on the response to this first command a driver can continue to establish and authenticate connections.
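As a sketch of what that first exchange yields, the handshake command can also be issued by hand through a connected client (Node.js driver shown; the field names come from the server's hello reply):

// Run the handshake command manually to see what a driver learns from it.
const { MongoClient } = require('mongodb');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  // The reply describes the node's role and supported wire version range.
  const reply = await client.db('admin').command({ hello: 1 });
  console.log(reply.isWritablePrimary, reply.minWireVersion, reply.maxWireVersion);
  await client.close();
}
main().catch(console.error);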
Specifications: OP_MSG, Command Execution, Connection String, URI Options, OCSP, Initial Handshake, Wire Compression, SOCKS5, Initial DNS Seedlist Discovery
Connectivity
Now that a valid host has been found, the cluster’s topology can be discovered and monitoring connections can be established. Connection pools can then be created and populated with connections. The monitoring connections will subsequently be used for ensuring operations are routed to available hosts, or hosts that meet certain criteria (such as a configured read preference or acceptable latency window).
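Such routing criteria are typically supplied as URI options that feed directly into monitoring and server selection; a hypothetical three-host replica set is shown here for illustration:

// Route reads to secondaries within a 15 ms latency window; both
// readPreference and localThresholdMS are standard URI options.
const { MongoClient } = require('mongodb');

const client = new MongoClient(
  'mongodb://host1,host2,host3/?replicaSet=rs0' +
  '&readPreference=secondaryPreferred&localThresholdMS=15'
);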
Specifications: SDAM, CMAP, Load Balancer Support
Authentication
Establishing and monitoring connections to MongoDB ensures they’re available, but MongoDB server processes will typically require the connection to be authenticated before commands are accepted. MongoDB offers many authentication mechanisms such as SCRAM, x.509, Kerberos, LDAP, OpenID Connect and AWS IAM, which MongoDB drivers support using the Simple Authentication and Security Layer (SASL) framework.
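For illustration, the mechanism is typically selected through standard URI options (the credentials here are placeholders):

// Select an authentication mechanism explicitly via URI options.
const { MongoClient } = require('mongodb');

const client = new MongoClient(
  'mongodb://user:secret@localhost:27017/?authSource=admin&authMechanism=SCRAM-SHA-256'
);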
Specifications: Authentication
Availability
All client operations will be serialized as BSON and sent to MongoDB over a connection that will first be checked out of a connection pool. Various monitoring processes exist to ensure a driver’s internal state machine contains an accurate view of the cluster’s topology so that read and write requests can always be appropriately routed according to MongoDB’s server selection algorithm.
Specifications: Server Monitoring, SRV Polling for mongos Discovery, Server Selection, Max Staleness
Resilience
At their core, database drivers are client libraries meant to facilitate interactions between an application and the database. MongoDB’s drivers are no different in that regard, as they abstract away the underlying serialization, communication, connectivity, and availability functions required to programmatically interact with your data.
To further enhance the developer experience while working with MongoDB, various resilience features can be added based on logical sessions such as retryable writes, causal consistency, and transactions.
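A minimal sketch of those features with the Node.js driver, assuming a connected MongoClient named client and an async context:

// Causally consistent session running a multi-document transaction;
// the driver retries the callback on transient transaction errors.
const session = client.startSession({ causalConsistency: true });
try {
  await session.withTransaction(async () => {
    const orders = client.db('test').collection('orders');
    await orders.insertOne({ status: 'A' }, { session });
    await orders.updateOne({ status: 'A' }, { $set: { status: 'D' } }, { session });
  });
} finally {
  await session.endSession();
}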
Specifications: Retryability (Reads, Writes), CSOT, Consistency (Sessions, Causal Consistency, Snapshot Reads, Transactions, Convenient Transactions API)
Programmability
Now that we can serialize commands and send them over the wire through an authenticated connection, we can begin actually manipulating data. Since all database interactions are in the form of commands, if we wanted to remove all documents matching a filter we might issue a delete command such as the following:
db.runCommand({
  delete: "orders",
  deletes: [{ q: { status: "D" }, limit: 0 }]
})
Though not exceedingly complex, this is the kind of interaction that more single-purpose APIs can improve upon, allowing the above example to be expressed as:
db.orders.deleteMany({ status: "D" })
To provide a cleaner and clearer developer experience, many specifications exist to describe how these APIs should be consistently presented across driver implementations, while still providing the flexibility to make APIs more idiomatic for each language.
Advanced security features such as client-side field level encryption are also defined at this layer.
Specifications: Resource Management (Databases, Collections, Indexes), Data Management (CRUD, Collation, Write Commands, Bulk API, Bulk Write, R/W Concern), Cursors (Change Streams, find/getMore/killCursors), GridFS, Stable API, Security (Client Side Encryption, BSON Binary Subtype 6)
Observability
With database commands being serialized and sent to MongoDB servers and responses being received and deserialized, our driver can be considered fully functional for most read and write operations. As MongoDB drivers abstract away most of the complexity involved with creating and maintaining the connections these commands will be sent over, providing mechanisms for introspection into a driver’s functionality can provide developers with added confidence that things are working as expected.
The inner workings of connection pools, connection lifecycle, server monitoring, topology changes, command execution and other driver components are exposed by means of events for which developers can register listeners. This can be an invaluable troubleshooting tool and helps facilitate monitoring the health of an application.
const { MongoClient, BSON: { EJSON } } = require('mongodb');

function debugPrint(label, event) {
  console.log(`${label}: ${EJSON.stringify(event)}`);
}

async function main() {
  const client = new MongoClient("mongodb://localhost:27017", { monitorCommands: true });
  client.on('commandStarted', (event) => debugPrint('commandStarted', event));
  client.on('connectionCheckedOut', (event) => debugPrint('connectionCheckedOut', event));
  await client.connect();
  const coll = client.db("test").collection("foo");
  const result = await coll.findOne();
  await client.close();
}

main();
Given the example above (using the Node.js driver) the specified connection events and command events would be logged as they’re emitted by the driver:
connectionCheckedOut: {"time":{"$date":"2024-05-17T15:18:18.589Z"},"address":"localhost:27018","name":"connectionCheckedOut","connectionId":1}
commandStarted: {"name":"commandStarted","address":"127.0.0.1:27018","connectionId":1,"serviceId":null,"requestId":5,"databaseName":"test","commandName":"find","command":{"find":"foo","filter":{},"limit":1,"singleBatch":true,"batchSize":1,"lsid":{"id":{"$binary":{"base64":"4B1kOPCGRUe/641MKhGT4Q==","subType":"04"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1715959097,"i":1}},"signature":{"hash":{"$binary":"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"00"}},"keyId":0}},"$db":"test"},"serverConnectionId":140}
The preferred method of observing internal behavior will be standardized logging once it is available in all drivers (DRIVERS-1204); until then, only event logging is consistently available. In the future, additional observability tooling such as OpenTelemetry support may also be introduced.
Specifications: Command Logging and Monitoring, SDAM Logging and Monitoring, Standardized Logging, Connection Pool Logging
Testability
To ensure that existing as well as net-new drivers can be effectively tested for correctness and performance, most specifications define a standard set of YAML tests to improve driver conformance. This allows specification authors and maintainers to describe functionality once, with the confidence that the tests can be executed alike by language-specific test runners across all drivers.
Though the unified test format greatly simplifies language-specific implementations, not all tests can be represented in this fashion. In those cases the specifications may describe tests to be manually implemented as prose. By limiting the number of prose tests that each driver must implement, engineers can deliver functionality with greater confidence while also minimizing the burden of upstream verification.
Specifications: Unified Test Format, Atlas Data Federation Testing, Performance Benchmarking, BSON Corpus, Replication Event Resilience, FAAS Automated Testing, Atlas Serverless Testing
Conclusion
Most (if not all) of the information required to build a new driver or maintain existing drivers technically exists within the specifications; however, without a mental model of their composition and architecture it can be extremely challenging to know where to look.
Peeling the "drivers onion" should hopefully make reasoning about them a little easier, especially with the understanding that everything can be tested to validate individual implementations are "up to spec".
Driver Mantras
When developing specifications -- and the drivers themselves -- we adhere to the following principles:
Strive to be idiomatic, but favor consistency
Drivers attempt to provide the easiest way to work with MongoDB in a given language ecosystem, while specifications attempt to provide a consistent behavior and experience across all languages. Drivers should strive to be as idiomatic as possible while meeting the specification and staying true to the original intent.
No Knobs
Too many choices stress out users. Whenever possible, we aim to minimize the number of configuration options exposed to users. In particular, if a typical user would have no idea how to choose a correct value, we pick a good default instead of adding a knob.
Topology agnostic
Users test and deploy against different topologies or might scale up from replica sets to sharded clusters. Applications should never need to use the driver differently based on topology type.
Where possible, depend on server to return errors
The features available to users depend on a server's version, topology, storage engine and configuration. So that drivers don't need to code and test all possible variations, and to maximize forward compatibility, always let users attempt operations and let the server error when it can't comply. Exceptions should be rare: for cases where the server might not error and correctness is at stake.
Minimize administrative helpers
Administrative helpers are methods for admin tasks, like user creation. These are rarely used and have maintenance costs as the server changes the administrative API. Don't create administrative helpers; let users rely on "RunCommand" for administrative commands.
Check wire version, not server version
When determining server capabilities within the driver, rely only on the maxWireVersion in the hello response, not on the X.Y.Z server version. An exception is testing server development releases, as the server bumps wire version early and then continues to add features until the GA.
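A minimal sketch of that check in JavaScript, assuming a connected client and an async context (13 is the wire version introduced by server 5.0 per the table below):

// Gate a feature on maxWireVersion from the hello response,
// never on the X.Y.Z server version string.
const hello = await client.db('admin').command({ hello: 1 });
if (hello.maxWireVersion >= 13) {
  // 5.0-era behavior is safe to rely on here
}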
When in doubt, use "MUST" not "SHOULD" in specs
Specs guide our work. While there are occasionally valid technical reasons for drivers to differ in their behavior, avoid encouraging it with a wishy-washy "SHOULD" instead of a more assertive "MUST".
Defy augury
While we have some idea of what the server will do in the future, don't design features with those expectations in mind. Design and implement based on what is expected in the next release.
Case Study: In designing OP_MSG, we held off on designing support for Document Sequences in Replies in drivers until the server would support it. We subsequently decided not to implement that feature in the server.
The best way to see what the server does is to test it
For any unusual case, relying on documentation or anecdote to anticipate the server's behavior in different versions/topologies/etc. is error-prone. The best way to check the server's behavior is to use a driver or the shell and test it directly.
Drivers follow semantic versioning
Drivers should follow X.Y.Z versioning, where breaking API changes require a bump to X. See semver.org for more.
Backward breaking behavior changes and semver
Backward breaking behavior changes can be more dangerous and disruptive than backward breaking API changes. When thinking about the implications of a behavior change, ask yourself what could happen if a user upgraded your library without carefully reading the changelog and/or adequately testing the change.
Server Wire version and Feature List
Server version | Wire version | Feature List |
---|---|---|
2.6 | 1 | Aggregation cursor, Auth commands |
2.6 | 2 | Write commands (insert/update/delete), Aggregation $out pipeline operator |
3.0 | 3 | listCollections, listIndexes, SCRAM-SHA-1, explain command |
3.2 | 4 | (find/getMore/killCursors) commands, currentOp command, fsyncUnlock command, findAndModify takes write concern, Commands take read concern, Document-level validation, explain command supports distinct and findAndModify |
3.4 | 5 | Commands take write concern, Commands take collation |
3.6 | 6 | Supports OP_MSG, Collection-level ChangeStream support, Retryable Writes, Causally Consistent Reads, Logical Sessions, update "arrayFilters" option |
4.0 | 7 | ReplicaSet transactions, Database and cluster-level change streams and startAtOperationTime option |
4.2 | 8 | Sharded transactions, Aggregation $merge pipeline operator, update "hint" option |
4.4 | 9 | Streaming protocol for SDAM, ResumableChangeStreamError error label, delete "hint" option, findAndModify "hint" option, createIndexes "commitQuorum" option |
5.0 | 13 | $out and $merge on secondaries (technically FCV 4.4+) |
5.1 | 14 | |
5.2 | 15 | |
5.3 | 16 | |
6.0 | 17 | Support for Partial Indexes, Sharded Time Series Collections, FCV set to 5.0 |
6.1 | 18 | Update Perl Compatible Regular Expressions version to PCRE2, Add |
6.2 | 19 | Collection validation ensures BSON documents conform to BSON spec, Collection validation checks time series collections for internal consistency |
7.0 | 21 | Atlas Search Index Management, Compound Wildcard Indexes, Support large change stream events via, Slot Based Query Execution |
7.1 | 22 | Improved Index Builds, Exhaust Cursors Enabled for Sharded Clusters, New Sharding Statistics for Chunk Migrations, Self-Managed Backups of Sharded Clusters |
7.2 | 23 | Database Validation on, Default Chunks Per Shard |
7.3 | 24 | Compaction Improvements, New |
8.0 | 25 | Range Encryption GA, OIDC authentication mechanism, New |
In server versions 5.0 and earlier, the wire version was defined as a numeric literal in src/mongo/db/wire_version.h. Since server version 5.1 (SERVER-58346), the wire version is derived from the number of releases since 4.0 (using src/mongo/util/version/releases.h.tpl and src/mongo/util/version/releases.yml).
BSON
Latest version of the specification can be found at https://bsonspec.org/spec.html.
Specification Version 1.1
BSON is a binary format in which zero or more ordered key/value pairs are stored as a single entity. We call this entity a document.
The following grammar specifies version 1.1 of the BSON standard. We've written the grammar using a pseudo-BNF syntax. Valid BSON data is represented by the document non-terminal.
Basic Types
The following basic types are used as terminals in the rest of the grammar. Each type must be serialized in little-endian format.
Type | Definition |
---|---|
byte | 1 byte (8-bits) |
signed_byte(n) | 8-bit, two's complement signed integer for which the value is n |
unsigned_byte(n) | 8-bit unsigned integer for which the value is n |
int32 | 4 bytes (32-bit signed integer, two's complement) |
int64 | 8 bytes (64-bit signed integer, two's complement) |
uint64 | 8 bytes (64-bit unsigned integer) |
double | 8 bytes (64-bit IEEE 754-2008 binary floating point) |
decimal128 | 16 bytes (128-bit IEEE 754-2008 decimal floating point) |
Non-terminals
The following specifies the rest of the BSON grammar. Note that we use the * operator as shorthand for repetition (e.g. (byte*2) is byte byte). When used as a unary operator, * means that the repetition can occur 0 or more times.
document ::= int32 e_list unsigned_byte(0) BSON Document. int32 is the total number of bytes comprising the document.
e_list ::= element e_list
| ""
element ::= signed_byte(1) e_name double 64-bit binary floating point
| signed_byte(2) e_name string UTF-8 string
| signed_byte(3) e_name document Embedded document
| signed_byte(4) e_name document Array
| signed_byte(5) e_name binary Binary data
| signed_byte(6) e_name Undefined (value) — Deprecated
| signed_byte(7) e_name (byte*12) ObjectId
| signed_byte(8) e_name unsigned_byte(0) Boolean - false
| signed_byte(8) e_name unsigned_byte(1) Boolean - true
| signed_byte(9) e_name int64 UTC datetime
| signed_byte(10) e_name Null value
| signed_byte(11) e_name cstring cstring Regular expression - The first cstring is the regex pattern, the second is the regex options string. Options are identified by characters, which must be stored in alphabetical order. Valid option characters are i for case insensitive matching, m for multiline matching, s for dotall mode ("." matches everything), x for verbose mode, and u to make "\w", "\W", etc. match Unicode.
| signed_byte(12) e_name string (byte*12) DBPointer — Deprecated
| signed_byte(13) e_name string JavaScript code
| signed_byte(14) e_name string Symbol — Deprecated
| signed_byte(15) e_name code_w_s JavaScript code with scope — Deprecated
| signed_byte(16) e_name int32 32-bit integer
| signed_byte(17) e_name uint64 Timestamp
| signed_byte(18) e_name int64 64-bit integer
| signed_byte(19) e_name decimal128 128-bit decimal floating point
| signed_byte(-1) e_name Min key
| signed_byte(127) e_name Max key
e_name ::= cstring Key name
string ::= int32 (byte*) unsigned_byte(0) String - The int32 is the number of bytes in the (byte*) plus one for the trailing null byte. The (byte*) is zero or more UTF-8 encoded characters.
cstring ::= (byte*) unsigned_byte(0) Zero or more modified UTF-8 encoded characters followed by the null byte. The (byte*) MUST NOT contain unsigned_byte(0), hence it is not full UTF-8.
binary ::= int32 subtype (byte*) Binary - The int32 is the number of bytes in the (byte*).
subtype ::= unsigned_byte(0) Generic binary subtype
| unsigned_byte(1) Function
| unsigned_byte(2) Binary (Old)
| unsigned_byte(3) UUID (Old)
| unsigned_byte(4) UUID
| unsigned_byte(5) MD5
| unsigned_byte(6) Encrypted BSON value
| unsigned_byte(7) Compressed BSON column
| unsigned_byte(8) Sensitive
| unsigned_byte(128)—unsigned_byte(255) User defined
code_w_s ::= int32 string document Code with scope — Deprecated
Notes
- Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.
- UTC datetime - The int64 is UTC milliseconds since the Unix epoch.
- Timestamp - Special internal type used by MongoDB replication and sharding. First 4 bytes are an increment, second 4 are a timestamp.
- Min key - Special type which compares lower than all other possible BSON element values.
- Max key - Special type which compares higher than all other possible BSON element values.
- Generic binary subtype - This is the most commonly used binary subtype and should be the 'default' for drivers and tools.
- Compressed BSON Column - Compact storage of BSON data. This data type uses delta and delta-of-delta compression and run-length-encoding for efficient element storage. Also has an encoding for sparse arrays containing missing values.
- The BSON "binary" or "BinData" datatype is used to represent arrays of bytes. It is somewhat analogous to the Java notion of a ByteArray. BSON binary values have a subtype. This is used to indicate what kind of data is in the byte array. Subtypes from 0 to 127 are predefined or reserved. Subtypes from 128 to 255 are user-defined.
- unsigned_byte(2) Binary (Old) - This used to be the default subtype, but was deprecated in favor of subtype 0. Drivers and tools should be sure to handle subtype 2 appropriately. The structure of the binary data (the byte* array in the binary non-terminal) must be an int32 followed by a (byte*). The int32 is the number of bytes in the repetition.
- unsigned_byte(3) UUID (Old) - This used to be the UUID subtype, but was deprecated in favor of subtype 4. Drivers and tools for languages with a native UUID type should handle subtype 3 appropriately.
- unsigned_byte(128)—unsigned_byte(255) User defined subtypes. The binary data can be anything.
- Code with scope - Deprecated. The int32 is the length in bytes of the entire code_w_s value. The string is JavaScript code. The document is a mapping from identifiers to values, representing the scope in which the string should be evaluated.
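As a worked example of the grammar above (not part of the specification text), the document {"hello": "world"} serializes to 22 bytes, verified here with the Node.js bson library:

// {"hello": "world"} serialized per the grammar above:
//   16 00 00 00          document length: 22 bytes (int32, little-endian)
//   02                   element type: UTF-8 string
//   68 65 6c 6c 6f 00    e_name: cstring "hello"
//   06 00 00 00          string length: 6 bytes, including the trailing null
//   77 6f 72 6c 64 00    "world" plus trailing null
//   00                   document terminator
const { BSON } = require('mongodb');
console.log(Buffer.from(BSON.serialize({ hello: 'world' })).toString('hex'));
// => 16000000 02 68656c6c6f00 06000000 776f726c6400 00 (spaces added for clarity)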
BSON Binary Subtype 9 - Vector
- Status: Accepted
- Minimum Server Version: N/A
Abstract
This document describes the subtype of the Binary BSON type used for efficient storage and retrieval of vectors. Vectors here refer to densely packed arrays of numbers, all of the same type.
Motivation
These representations correspond to the numeric types supported by popular numerical libraries for vector processing, such as NumPy, PyTorch, TensorFlow and Apache Arrow. Storing and retrieving vector data using the same densely packed format used by these libraries can result in significant memory savings and processing efficiency.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
This specification introduces a new BSON binary subtype, the vector, with value 9.
Drivers SHOULD provide idiomatic APIs to translate between arrays of numbers and this BSON Binary specification.
Data Types (dtypes)
Each vector can take one of multiple data types (dtypes). The following table lists the dtypes implemented.
Vector data type | Alias | Bits per vector element | Arrow Data Type (for illustration) |
---|---|---|---|
0x03 | INT8 | 8 | INT8 |
0x27 | FLOAT32 | 32 | FLOAT |
0x10 | PACKED_BIT | 1 * | BOOL |
* A Binary Quantized (PACKED_BIT) Vector is a vector of 0s and 1s (bits), but it is represented in memory as a list of integers in [0, 255]. So, for example, the vector [0, 255] would be shorthand for the 16-bit vector [0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]. The idea is that each number (a uint8) can be stored as a single byte. Of course, some languages, Python for one, do not have a uint8 type, so each number must be represented as an int in memory, but not on disk.
Byte padding
As not all data types have a bit length equal to a multiple of 8, and hence do not fit squarely into a certain number of bytes, a second piece of metadata, the "padding", is included. It tells the driver how many bits in the final byte are to be ignored. The least-significant bits are ignored.
Binary structure
Following the binary subtype 9, a two-element byte array of metadata precedes the packed numbers.

- The first byte (dtype) describes its data type. The table above shows those that MUST be implemented. This table may be extended over time. dtype is an unsigned integer.
- The second byte (padding) prescribes the number of bits to ignore in the final byte of the value. It is a non-negative integer. It must be present, even in cases where it is not applicable, and set to zero.
- The remainder contains the actual vector elements packed according to dtype.

All values use the little-endian format.
Example
Let's take a vector [238, 224] of dtype PACKED_BIT (\x10) with a padding of 4.

In hex, it looks like this: b"\x10\x04\xee\xe0": 1 byte for dtype, 1 for padding, and 1 for each uint8.
We can visualize the binary representation like so:
1st byte: dtype (from list in previous table) | 2nd byte: padding (values in [0,7]) | 1st uint8: 238 | 2nd uint8: 224 |
---|---|---|---|
0 0 0 1 0 0 0 0 | 0 0 0 0 0 1 0 0 | 1 1 1 0 1 1 1 0 | 1 1 1 0 0 0 0 0 |

Finally, after we remove the last 4 bits of padding, the actual bit vector has a length of 12 and looks like this!

1 1 1 0 1 1 1 0 1 1 1 0
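A minimal sketch (not a driver API) of how such a payload could be packed by hand, reproducing the example above; the bytes produced are those that follow the binary length and subtype:

// Build a subtype-9 PACKED_BIT payload: dtype byte, padding byte, packed bits.
function packBitVector(bits /* array of 0s and 1s */) {
  const padding = (8 - (bits.length % 8)) % 8;      // bits to ignore in last byte
  const bytes = new Uint8Array(2 + Math.ceil(bits.length / 8));
  bytes[0] = 0x10;     // dtype: PACKED_BIT
  bytes[1] = padding;  // padding metadata byte
  bits.forEach((bit, i) => {
    if (bit) bytes[2 + (i >> 3)] |= 0x80 >> (i % 8); // most-significant bit first
  });
  return bytes;
}

// The 12-bit vector from the example yields b"\x10\x04\xee\xe0".
const payload = packBitVector([1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0]);
console.log(Buffer.from(payload).toString('hex')); // "1004eee0"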
API Guidance
Drivers MUST implement methods for explicit encoding and decoding that adhere to the pattern described below while following idioms of the language of the driver.
Encoding
Function from_vector(vector: Iterable<Number>, dtype: DtypeEnum, padding: Integer = 0) -> Binary
# Converts a numeric vector into a binary representation based on the specified dtype and padding.
# :param vector: A sequence or iterable of numbers (either float or int)
# :param dtype: Data type for binary conversion (from DtypeEnum)
# :param padding: Optional integer specifying how many bits to ignore in the final byte
# :return: A binary representation of the vector
Declare binary_data as Binary
# Process each number in vector and convert according to dtype
For each number in vector
binary_element = convert_to_binary(number, dtype)
binary_data.append(binary_element)
End For
# Apply padding to the binary data if needed
If padding > 0
apply_padding(binary_data, padding)
End If
Return binary_data
End Function
Note: If a driver chooses to implement a Vector type (or several) like those suggested in the Data Structures subsection below, it MAY provide a from_vector variant that takes a single argument, a Vector.
Decoding
Function as_vector() -> Vector
# Unpacks binary data (BSON or similar) into a Vector structure.
# This process involves extracting numeric values, the data type, and padding information.
# :return: A BinaryVector containing the unpacked numeric values, dtype, and padding.
Declare binary_vector as BinaryVector # Struct to hold the unpacked data
# Extract dtype (data type) from the binary data
binary_vector.dtype = extract_dtype_from_binary()
# Extract padding from the binary data
binary_vector.padding = extract_padding_from_binary()
# Unpack the actual numeric values from the binary data according to the dtype
binary_vector.data = unpack_numeric_values(binary_vector.dtype)
Return binary_vector
End Function
Validation
Drivers MUST validate vector metadata and raise an error if any invariant is violated:
- Padding MUST be 0 for all dtypes where padding doesn’t apply, and MUST be within [0, 7] for PACKED_BIT.
- A PACKED_BIT vector MUST NOT be empty if padding is in the range [1, 7].
- When unpacking binary data into a FLOAT32 Vector structure, the length of the binary data following the dtype and padding MUST be a multiple of 4 bytes.
Drivers MUST perform this validation when a numeric vector and padding are provided through the API, and when unpacking binary data (BSON or similar) into a Vector structure.
Data Structures
Drivers MAY find the following structures useful for representing the dtype and vector.
Enum Dtype
# Enum for data types (dtype)
# FLOAT32: Represents packing of list of floats as float32
# Value: 0x27 (hexadecimal byte value)
# INT8: Represents packing of list of signed integers in the range [-128, 127] as signed int8
# Value: 0x03 (hexadecimal byte value)
# PACKED_BIT: Special case where vector values are 0 or 1, packed as unsigned uint8 in range [0, 255]
# Packed into groups of 8 (a byte)
# Value: 0x10 (hexadecimal byte value)
# Documentation:
# Each value is a byte (length of one), a convenient choice for decoding.
End Enum
Struct Vector
# Numeric vector with metadata for binary interoperability
# Fields:
# data: Sequence of numeric values (either float or int)
# dtype: Data type of vector (from enum BinaryVectorDtype)
# padding: Number of bits to ignore in the final byte for alignment
data # Sequence of float or int
dtype # Type: DtypeEnum
padding # Integer: Number of padding bits
End Struct
Reference Implementation
- PYTHON (PYTHON-4577)
Test Plan
See the README for tests.
FAQ
- What MongoDB Server version does this apply to?
- Files in the "specifications" repository have no version scheme. They are not tied to a MongoDB server version.
- In PACKED_BIT, why would one choose to use integers in [0, 256)?
- This follows a well-established precedent for packing binary-valued arrays into bytes (8 bits). This technique is widely used across different fields, such as data compression, communication protocols, and file formats, where you want to store or transmit binary data more efficiently by grouping 8 bits into a single byte (uint8). For an example in Python, see numpy.unpackbits.
Changelog
- 2025-02-04: Update validation for decoding into a FLOAT32 vector.
- 2024-11-01: BSON Binary Subtype 9 accepted DRIVERS-2926 (#1708)
BSON ObjectID
- Status: Accepted
- Minimum Server Version: N/A
Abstract
This specification documents the format and data contents of ObjectID BSON values that the drivers and the server
generate when no field values have been specified (e.g. creating an ObjectID BSON value when no _id
field is present
in a document). It is primarily aimed to provide an alternative to the historical use of the MD5 hashing algorithm for
the machine information field of the ObjectID, which is problematic when providing a FIPS compliant implementation. It
also documents existing best practices for the timestamp and counter fields.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
The ObjectID BSON type is a 12-byte value consisting of three different portions (fields):
- a 4-byte value representing the seconds since the Unix epoch in the highest order bytes,
- a 5-byte random number unique to a machine and process,
- a 3-byte counter, starting with a random value.
4 byte timestamp 5 byte process unique 3 byte counter
|<----------------->|<---------------------->|<------------>|
[----|----|----|----|----|----|----|----|----|----|----|----]
0 4 8 12
Timestamp Field
This 4-byte big endian field represents the seconds since the Unix epoch (Jan 1st, 1970, midnight UTC). It is an ever-increasing value whose range will last until about Jan 7th, 2106.
Drivers MUST create ObjectIDs with this value representing the number of seconds since the Unix epoch.
Drivers MUST interpret this value as an unsigned 32-bit integer when conversions to language-specific date/time values are created, and when converting this to a timestamp.
Drivers SHOULD have an accessor method on an ObjectID class for obtaining the timestamp value.
Random Value
A 5-byte field consisting of a random value generated once per process. This random value is unique to the machine and process.
Drivers MUST NOT have an accessor method on an ObjectID class for obtaining this value.
The random number does not have to be cryptographic. If possible, use a PRNG with OS supplied entropy that SHOULD NOT block to wait for more entropy to become available. Otherwise, seed a deterministic PRNG to ensure uniqueness of process and machine by combining time, process ID, and hostname.
Counter
A 3-byte big endian counter.
This counter MUST be initialised to a random value when the driver is first activated. After initialisation, the counter MUST be increased by 1 for every ObjectID creation.
When the counter overflows (i.e., hits 16777215+1), the counter MUST be reset to 0.
Drivers MUST NOT have an accessor method on an ObjectID class for obtaining this value.
The random number does not have to be cryptographic. If possible, use a PRNG with OS supplied entropy that SHOULD NOT block to wait for more entropy to become available. Otherwise, seed a deterministic PRNG to ensure uniqueness of process and machine by combining time, process ID, and hostname.
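A minimal sketch of the generation algorithm in JavaScript (illustrative only; identifiers such as processRandom and generateObjectId are not from the specification):

// Spec-compliant ObjectID generation: 4-byte timestamp, 5-byte
// process-unique random value, 3-byte big-endian counter.
const crypto = require('crypto');

const processRandom = crypto.randomBytes(5);           // generated once per process
let counter = crypto.randomBytes(3).readUIntBE(0, 3);  // counter starts at a random value

function generateObjectId() {
  const buf = Buffer.alloc(12);
  buf.writeUInt32BE(Math.floor(Date.now() / 1000), 0); // seconds since epoch, big-endian
  processRandom.copy(buf, 4);                          // 5-byte process-unique value
  buf.writeUIntBE(counter, 9, 3);                      // 3-byte big-endian counter
  counter = (counter + 1) % 0x1000000;                 // reset to 0 on overflow (2^24)
  return buf;
}

console.log(generateObjectId().toString('hex'));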
Test Plan
Drivers MUST:
- Ensure that the Timestamp field is represented as an unsigned 32-bit representing the number of seconds since the Epoch for the Timestamp values:
  - 0x00000000: To match "Jan 1st, 1970 00:00:00 UTC"
  - 0x7FFFFFFF: To match "Jan 19th, 2038 03:14:07 UTC"
  - 0x80000000: To match "Jan 19th, 2038 03:14:08 UTC"
  - 0xFFFFFFFF: To match "Feb 7th, 2106 06:28:15 UTC"
- Ensure that the Counter field successfully overflows its sequence from 0xFFFFFF to 0x000000.
- Ensure that after a new process is created through a fork() or similar process creation operation, the "random number unique to a machine and process" is no longer the same as the parent process that created the new process.
Motivation for Change
Besides the specific exclusion of MD5 as an allowed hashing algorithm, the information in this specification is meant to align the ObjectID generation algorithm of both drivers and the server.
Design Rationale
Timestamp: The timestamp is a 32-bit unsigned integer, as it allows us to extend the furthest date that the timestamp can represent from the year 2038 to 2106. There is no reason why MongoDB would generate a timestamp to mean a date before 1970, as MongoDB did not exist back then.
Random Value: Originally, this field consisted of the Machine ID and Process ID fields. There were numerous divergences between drivers due to implementation choices, and the Machine ID field traditionally used the MD5 hashing algorithm which can't be used on FIPS compliant machines. In order to allow for a similar behaviour among all drivers and the MongoDB Server, these two fields have been collated together into a single 5-byte random value, unique to a machine and process.
Counter: The counter makes it possible to have multiple ObjectIDs per second, per server, and per process. As the counter can overflow, there is a possibility of having duplicate ObjectIDs if you create more than 16 million ObjectIDs per second in the same process on a single machine.
Endianness: The Timestamp and Counter are big endian because we can then use memcmp
to order ObjectIDs, and we
want to ensure an increasing order.
Backwards Compatibility
This specification requires that the existing Machine ID and Process ID fields are merged into a single 5-byte value. This will change the behaviour of ObjectID generation, as well as the behaviour of drivers that currently have getters and setters for the original Machine ID and Process ID fields.
Reference Implementation
There is currently no full reference implementation.
Changelog
- 2024-07-30: Migrated from reStructuredText to Markdown.
- 2022-10-05: Remove spec front matter and reformat changelog.
- 2019-01-14: Clarify that the random numbers don't need to be cryptographically secure. Add a test to test that the unique value is different in forked processes.
- 2018-10-11: Clarify that the Timestamp and Counter fields are big endian, and add the reason why.
- 2018-07-02: Replaced Machine ID and Process ID fields with a single 5-byte unique value
- 2018-05-22: Initial Release
BSON Decimal128
- Status: Accepted
- Minimum Server Version: 3.4
Abstract
MongoDB 3.4 introduces a new BSON type representing high precision decimal ("\x13"
), known as Decimal128. 3.4
compatible drivers must support this type by creating a Value Object for it, possibly with accessor functions for
retrieving its value in data types supported by the respective languages.
Round-tripping Decimal128 values between driver and server MUST NOT change their value or representation in any way. Conversion to and from native language types is complicated, and there are many pitfalls in representing Decimal128 precisely in all languages.
While many languages offer a native decimal type, the precision of these types often does not exactly match that of the
MongoDB implementation. To ensure error-free conversion and consistency between official MongoDB drivers, this
specification does not allow automatically converting the BSON Decimal128
type into a language-defined decimal type.
Language drivers will wrap their native type in value objects by default and SHOULD offer accessor functions for
retrieving its value represented by language-defined types if appropriate. A driver that offers the ability to configure
mappings to/from BSON types to native types MAY allow the option to automatically convert the BSON Decimal128
type to
a native type. It should however be made abundantly clear to the user that converting to native data types risks
incurring data loss.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Terminology
IEEE 754-2008 128-bit decimal floating point (Decimal128)
The Decimal128 specification supports 34 decimal digits of precision, a max value of approximately 10^6145
, and min
value of approximately -10^6145
. This is the new BSON Decimal128
type ("\x13"
).
Clamping
Clamping happens when a value's exponent is too large for the destination format. This works by adding zeros to the coefficient to reduce the exponent to the largest usable value. An overflow occurs if the number of digits required is more than allowed in the destination format.
Binary Integer Decimal (BID)
MongoDB uses this binary encoding for the coefficient as specified in IEEE 754-2008
section 3.5.2 using method 2
"binary encoding" rather than method 1 "decimal encoding". The byte order is little-endian, like the rest of the BSON
types.
Value Object
An immutable container type representing a value (e.g. Decimal128). This Value Object MAY provide accessors that
retrieve the abstracted value as a different type (e.g. casting it). double x = valueObject.getAsDouble();
Specification
BSON Decimal128 implementation details
The BSON Decimal128
data type implements the
Decimal Arithmetic Encodings specification, with certain exceptions
around value integrity and the coefficient encoding. When a value cannot be represented exactly, the value will be
rejected.
The coefficient MUST be stored as an unsigned binary integer (BID) rather than the densely-packed decimal (DPD) shown in
the specification. See either the IEEE Std 754-2008
spec or the driver examples for further detail.
The specification defines several statuses which are meant to signal exceptional circumstances, such as when overflowing occurs, and how to handle them.
BSON Decimal128
Value Objects MUST implement these actions for these exceptions:
- Overflow: When overflow occurs, the operation MUST emit an error and result in a failure.
- Underflow: When underflow occurs, the operation MUST emit an error and result in a failure.
- Clamping: Since clamping does not change the actual value, only the representation of it, clamping MUST occur without emitting an error.
- Rounding: When the coefficient requires more digits than Decimal128 provides, rounding MUST be done without emitting an error, unless it would result in inexact rounding, in which case the operation MUST emit an error and result in a failure.
- Conversion Syntax: Invalid strings MUST emit an error and result in a failure.
It should be noted that the given exponent is a preferred representation. If the value cannot be stored due to the value of the exponent being too large or too small, but can be stored using an alternative representation by clamping and/or rounding, a BSON Decimal128 compatible Value Object MUST do so, unless such an operation results in inexact rounding or other underflow or overflow.
Reading from BSON
A BSON type "\x13"
MUST be represented by an immutable Value Object by default and MUST NOT be automatically converted
into language native numeric type by default. A driver that offers users a way to configure the exact type mapping to
and from BSON types MAY allow the BSON Decimal128
type to be converted to the user configured type.
A driver SHOULD provide accessors for this immutable Value Object, which can return a language-specific representation
of the Decimal128 value, after converting it into the respective type. For example, Java may choose to provide
Decimal128.getBigDecimal()
.
All drivers MUST provide an accessor for retrieving the value as a string. Drivers MAY provide other accessors, retrieving the value as other types.
Serializing and writing BSON
Drivers MUST provide a way of constructing the Value Object, as the driver representation of the BSON Decimal128
is an
immutable Value Object by default.
A driver MUST have a way to construct this Value Object from a string. For example, Java MUST provide a method similar
to Decimal128.valueOf("2.000")
.
A driver that has accessors for different types SHOULD provide a way to construct the Value Object from those types.
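For illustration, the Node.js driver's value object follows this pattern:

// String in, string out: the Decimal128 value object preserves the
// original precision rather than converting to a native float.
const { BSON: { Decimal128 } } = require('mongodb');

const price = Decimal128.fromString('2.000');
console.log(price.toString()); // "2.000", not "2" — trailing zeros survive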
Reading from Extended JSON
The Extended JSON representation of Decimal128 is a document with the key $numberDecimal
and a value of the Decimal128
as a string. Drivers that support Extended JSON formatting MUST support the $numberDecimal
type specifier.
When an Extended JSON $numberDecimal
is parsed, its type should be the same as that of a deserialized
BSON Decimal128
, as described in Reading from BSON.
The Extended JSON $numberDecimal
value follows the same stringification rules as defined in
From String Representation.
Writing to Extended JSON
The Extended JSON type identifier is $numberDecimal
, while the value itself is a string. Drivers that support
converting values to Extended JSON MUST be able to convert its Decimal128 value object to Extended JSON.
Converting a Decimal128 Value Object to Extended JSON MUST follow the conversion rules in To String Representation, and other stringification rules as when converting Decimal128 Value Object to a String.
Operator overloading and math on Decimal128 Value Objects
Drivers MUST NOT allow any mathematical operator overloading for the Decimal128 Value Objects. This includes adding two Decimal128 Value Objects and assigning the result to a new object.
If a user wants to perform mathematical operations on Decimal128 Value Objects, the user must explicitly retrieve the native language value representations of the objects and perform the operations on those native representations. The user will then create a new Decimal128 Value Object and optionally overwrite the original Decimal128 Value Object.
From String Representation
For finite numbers, we will use the definition at https://speleotrove.com/decimal/daconvs.html. It has been modified to account for a different NaN representation and whitespace rules and copied here:
Strings which are acceptable for conversion to the abstract representation of
numbers, or which might result from conversion from the abstract representation
to a string, are called numeric strings.
A numeric string is a character string that describes either a finite
number or a special value.
* If it describes a finite number, it includes one or more decimal digits,
with an optional decimal point. The decimal point may be embedded in the
digits, or may be prefixed or suffixed to them. The group of digits (and
optional point) thus constructed may have an optional sign ('+' or '-')
which must come before any digits or decimal point.
* The string thus described may optionally be followed by an 'E'
(indicating an exponential part), an optional sign, and an integer
following the sign that represents a power of ten that is to be applied.
The 'E' may be in uppercase or lowercase.
* If it describes a special value, it is one of the case-independent names
'Infinity', 'Inf', or 'NaN' (where the first two represent infinity and
the last represents NaN). The name may be preceded by an optional sign,
as for finite numbers.
* No blanks or other whitespace characters are permitted in a numeric string.
Formally
sign ::= '+' | '-'
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' |
'8' | '9'
indicator ::= 'e' | 'E'
digits ::= digit [digit]...
decimal-part ::= digits '.' [digits] | ['.'] digits
exponent-part ::= indicator [sign] digits
infinity ::= 'Infinity' | 'Inf'
nan ::= 'NaN'
numeric-value ::= decimal-part [exponent-part] | infinity
numeric-string ::= [sign] numeric-value | [sign] nan
where the characters in the strings accepted for 'infinity' and 'nan' may be in
any case. If an implementation supports the concept of diagnostic information
on NaNs, the numeric strings for NaNs MAY include one or more digits, as shown
above.[3] These digits encode the diagnostic information in an
implementation-defined manner; however, conversions to and from string for
diagnostic NaNs should be reversible if possible. If an implementation does not
support diagnostic information on NaNs, these digits should be ignored where
necessary. A plain 'NaN' is usually the same as 'NaN0'.
Drivers MAY choose to support signed NaN (sNaN), along with sNaN with
diagnostic information.
Examples:
Some numeric strings are:
"0" -- zero
"12" -- a whole number
"-76" -- a signed whole number
"12.70" -- some decimal places
"+0.003" -- a plus sign is allowed, too
"017." -- the same as 17
".5" -- the same as 0.5
"4E+9" -- exponential notation
"0.73e-7" -- exponential notation, negative power
"Inf" -- the same as Infinity
"-infinity" -- the same as -Infinity
"NaN" -- not-a-Number
Notes:
1. A single period alone or with a sign is not a valid numeric string.
2. A sign alone is not a valid numeric string.
3. Significant (after the decimal point) and insignificant leading zeros
are permitted.
To String Representation
For finite numbers, we will use the definition at https://speleotrove.com/decimal/daconvs.html. It has been copied here:
The coefficient is first converted to a string in base ten using the characters
0 through 9 with no leading zeros (except if its value is zero, in which case a
single 0 character is used).
Next, the adjusted exponent is calculated; this is the exponent, plus the
number of characters in the converted coefficient, less one. That is,
exponent+(clength-1), where clength is the length of the coefficient in decimal
digits.
If the exponent is less than or equal to zero and the adjusted exponent is
greater than or equal to -6, the number will be converted to a character form
without using exponential notation. In this case, if the exponent is zero then
no decimal point is added. Otherwise (the exponent will be negative), a decimal
point will be inserted with the absolute value of the exponent specifying the
number of characters to the right of the decimal point. '0' characters are
added to the left of the converted coefficient as necessary. If no character
precedes the decimal point after this insertion then a conventional '0'
character is prefixed.
Otherwise (that is, if the exponent is positive, or the adjusted exponent is
less than -6), the number will be converted to a character form using
exponential notation. In this case, if the converted coefficient has more than
one digit a decimal point is inserted after the first digit. An exponent in
character form is then suffixed to the converted coefficient (perhaps with
inserted decimal point); this comprises the letter 'E' followed immediately by
the adjusted exponent converted to a character form. The latter is in base ten,
using the characters 0 through 9 with no leading zeros, always prefixed by a
sign character ('-' if the calculated exponent is negative, '+' otherwise).
This corresponds to the following code snippet:
var adjusted_exponent = _exponent + (clength - 1);
if (_exponent > 0 || adjusted_exponent < -6) {
// exponential notation
} else {
// character form without using exponential notation
}
For special numbers such as infinity or the not-a-number (NaN) variants, the table below is used:
Value | String |
---|---|
Positive Infinite | Infinity |
Negative Infinite | -Infinity |
Positive NaN | NaN |
Negative NaN | NaN |
Signaled NaN | NaN |
Negative Signaled NaN | NaN |
NaN with a payload | NaN |
Finally, there are certain other invalid representations that must be treated as zeros, as per IEEE 754-2008
. The
tests will verify that each special value has been accounted for.
The server log files as well as the Extended JSON Format for Decimal128 use this format.
Motivation for Change
BSON already contains support for double
("\x01"
), but this type is insufficient for certain values that require
strict precision and representation, such as money, where it is necessary to perform exact decimal rounding.
The new BSON type is the 128-bit IEEE 754-2008
decimal floating point number, which is specifically designed to cope
with these issues.
Design Rationale
For simplicity and consistency between drivers, drivers must not automatically convert this type into a native type by default. This also ensures original data preservation, which is crucial to Decimal128. It is however recommended that drivers offer a way to convert the Value Object to a native type through accessors, and to create a new BSON type from native types. This forces the user to explicitly do the conversion and thus understand the difference between the MongoDB type and possible language precision and representation. Representations via conversions done outside MongoDB are not guaranteed to be identical.
Backwards Compatibility
There should be no backwards compatibility concerns. This specification merely deals with how to encode and decode BSON/Extended JSON Decimal128.
Reference Implementations
Tests
See the BSON Corpus for tests.
Most of the tests are converted from the General Decimal Arithmetic Testcases.
Q&A
- Is it true Decimal128 doesn't normalize the value?
  - Yes. As a result of non-normalization rules of the Decimal128 data type, precision is represented exactly. For example, '2.00' always remains stored as 200E-2 in Decimal128, and it differs from the representation of '2.0' (20E-1). These two values compare equally, but represent different ideas.
- How does Decimal128 "2.000" look in the shell?
  - NumberDecimal("2.000")
- Should a driver avoid sending Decimal128 values to pre-3.4 servers?
  - No
- Is there a wire version bump or something for Decimal128?
  - No
Changelog
- 2024-02-08: Migrated from reStructuredText to Markdown.
- 2022-10-05: Remove spec front matter.
BSON Binary UUID
- Status: Accepted
- Minimum Server Version: N/A
Abstract
The Java, C#, and Python drivers natively support platform types for UUID, all of which by default encode them to and decode them from BSON binary subtype 3. However, each encode the bytes in a different order from the others. To improve interoperability, BSON binary subtype 4 was introduced and defined the byte order according to RFC 4122, and a mechanism to configure each driver to encode UUIDs this way was added to each driver. The legacy representation remained as the default for each driver.
This specification moves MongoDB drivers further towards the standard UUID representation by requiring an application relying on native UUID support to explicitly specify the representation it requires.
Drivers that support native UUID types will additionally create helpers on their BsonBinary class that will aid in conversion to and from the platform native UUID type.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Terms
UUID
A Universally Unique IDentifier
BsonBinary
An object that wraps an instance of a BSON binary value
Naming Deviations
All drivers MUST name operations, objects, and parameters as defined in the following sections.
The following deviations are permitted:
- Drivers can use the platform's name for a UUID. For instance, in C# the platform class is Guid, whereas in Java it is UUID.
- Drivers can use a "to" prefix instead of an "as" prefix for the BsonBinary method names.
Explicit encoding and decoding
Any driver with a native UUID type MUST add the following UuidRepresentation enumeration, and associated methods to its BsonBinary (or equivalent) class:
enum UuidRepresentation {
/**
* An unspecified representation of UUID. Essentially, this is the null
* representation value. This value is not required for languages that
* have better ways of indicating, or preventing use of, a null value.
*/
UNSPECIFIED("unspecified"),
/**
* The canonical representation of UUID according to RFC 4122,
* section 4.1.2
*
* It encodes as BSON binary subtype 4
*/
STANDARD("standard"),
/**
* The legacy representation of UUID used by the C# driver.
*
* In this representation the order of bytes 0-3 are reversed, the
* order of bytes 4-5 are reversed, and the order of bytes 6-7 are
* reversed.
*
* It encodes as BSON binary subtype 3
*/
C_SHARP_LEGACY("csharpLegacy"),
/**
* The legacy representation of UUID used by the Java driver.
*
* In this representation the order of bytes 0-7 are reversed, and the
* order of bytes 8-15 are reversed.
*
* It encodes as BSON binary subtype 3
*/
JAVA_LEGACY("javaLegacy"),
/**
* The legacy representation of UUID used by the Python driver.
*
* As with STANDARD, this representation conforms with RFC 4122, section
* 4.1.2
*
* It encodes as BSON binary subtype 3
*/
PYTHON_LEGACY("pythonLegacy")
}
class BsonBinary {
/*
* Construct from a UUID using the standard UUID representation
* [Specification] This constructor SHOULD be included but MAY be
* omitted if it creates backwards compatibility issues
*/
constructor(Uuid uuid)
/*
* Construct from a UUID using the given UUID representation.
*
* The representation must not be equal to UNSPECIFIED
*/
constructor(Uuid uuid, UuidRepresentation representation)
/*
* Decode a subtype 4 binary to a UUID, erroring when the subtype is not 4.
*/
Uuid asUuid()
/*
* Decode a subtype 3 or 4 to a UUID, according to the UUID
* representation, erroring when subtype does not match the
* representation.
*/
Uuid asUuid(UuidRepresentation representation)
}
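For illustration, a hand-rolled sketch of the JAVA_LEGACY byte reordering described above (toJavaLegacy is a hypothetical helper, not a driver API):

// Java-legacy ordering: both 8-byte halves of the UUID are reversed.
function toJavaLegacy(uuidBytes /* Buffer of 16 bytes in RFC 4122 order */) {
  const out = Buffer.alloc(16);
  for (let i = 0; i < 8; i++) out[i] = uuidBytes[7 - i];    // reverse bytes 0-7
  for (let i = 8; i < 16; i++) out[i] = uuidBytes[23 - i];  // reverse bytes 8-15
  return out;
}

const uuid = Buffer.from('00112233445566778899AABBCCDDEEFF', 'hex');
console.log(toJavaLegacy(uuid).toString('hex').toUpperCase());
// => "7766554433221100FFEEDDCCBBAA9988" (matches the test plan below)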
Implicit decoding and encoding
A new driver for a language with a native UUID type MUST NOT implicitly encode from or decode to the native UUID type. Rather, explicit conversion MUST be used as described in the previous section.
Drivers that already do such implicit encoding and decoding SHOULD support a URI option, uuidRepresentation, which controls the default behavior of the UUID codec. Alternatively, a driver MAY specify the UUID representation via global state.
Value | Default? | Encode to | Decode subtype 4 to | Decode subtype 3 to |
---|---|---|---|---|
unspecified | yes | raise error | BsonBinary | BsonBinary |
standard | no | BSON binary subtype 4 | native UUID | BsonBinary |
csharpLegacy | no | BSON binary subtype 3 with C# legacy byte order | BsonBinary | native UUID |
javaLegacy | no | BSON binary subtype 3 with Java legacy byte order | BsonBinary | native UUID |
pythonLegacy | no | BSON binary subtype 3 with standard byte order | BsonBinary | native UUID |
For scenarios where the application makes the choice (e.g. a POJO with a field of type UUID), or when serializers are strongly typed and are constrained to always return values of a certain type, the driver will raise an exception in cases where otherwise it would be required to decode to a different type (e.g. BsonBinary instead of UUID or vice versa).
Note also that none of the above applies when decoding to strictly typed maps, e.g. a Map<String, BsonValue>
like Java
or .NET's BsonDocument class. In those cases the driver is always decoding to BsonBinary, and applications would use the
asUuid methods to explicitly convert from BsonBinary to UUID.
Implementation Notes
Since changing the default UUID representation can reasonably be considered a backwards-breaking change, drivers that implement the full specification should stage implementation according to semantic versioning guidelines. Specifically, support for this specification can be added to a minor release, but with several exceptions:
The default UUID representation should be left as is (e.g. JAVA_LEGACY for the Java driver) rather than be changed to UNSPECIFIED. In a subsequent major release, the default UUID representation can be changed to UNSPECIFIED (along with appropriate documentation indicating the backwards-breaking change). Drivers MUST document this in a prior minor release.
Test Plan
The test plan consists of a series of prose tests. They all operate on the same UUID, with the String representation of "00112233-4455-6677-8899-aabbccddeeff".
Explicit encoding
- Create a BsonBinary instance with the given UUID
- Assert that the BsonBinary instance's subtype is equal to 4 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"
- Create a BsonBinary instance with the given UUID and UuidRepresentation equal to STANDARD
- Assert that the BsonBinary instance's subtype is equal to 4 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"
- Create a BsonBinary instance with the given UUID and UuidRepresentation equal to JAVA_LEGACY
- Assert that the BsonBinary instance's subtype is equal to 3 and data equal to the hex-encoded string "7766554433221100FFEEDDCCBBAA9988"
- Create a BsonBinary instance with the given UUID and UuidRepresentation equal to CSHARP_LEGACY
- Assert that the BsonBinary instance's subtype is equal to 3 and data equal to the hex-encoded string "33221100554477668899AABBCCDDEEFF"
- Create a BsonBinary instance with the given UUID and UuidRepresentation equal to PYTHON_LEGACY
- Assert that the BsonBinary instance's subtype is equal to 3 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"
- Create a BsonBinary instance with the given UUID and UuidRepresentation equal to UNSPECIFIED
- Assert that an error is raised
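A sketch of these encoding assertions against PyMongo's `bson` package (assumed available); the expected hex strings are taken verbatim from the steps above:

```python
from uuid import UUID

from bson.binary import Binary, UuidRepresentation

uuid = UUID("00112233-4455-6677-8899-aabbccddeeff")

cases = {
    UuidRepresentation.STANDARD: (4, "00112233445566778899aabbccddeeff"),
    UuidRepresentation.JAVA_LEGACY: (3, "7766554433221100ffeeddccbbaa9988"),
    UuidRepresentation.CSHARP_LEGACY: (3, "33221100554477668899aabbccddeeff"),
    UuidRepresentation.PYTHON_LEGACY: (3, "00112233445566778899aabbccddeeff"),
}
for representation, (subtype, data_hex) in cases.items():
    binary = Binary.from_uuid(uuid, representation)
    assert binary.subtype == subtype
    assert bytes(binary).hex() == data_hex
```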
Explicit Decoding
- Create a BsonBinary instance with subtype equal to 4 and data equal to the hex-encoded string
"00112233445566778899AABBCCDDEEFF"
- Assert that a call to BsonBinary.asUuid() returns the given UUID
- Assert that a call to BsonBinary.asUuid(STANDARD) returns the given UUID
- Assert that a call to BsonBinary.asUuid(UNSPECIFIED) raises an error
- Assert that a call to BsonBinary.asUuid(JAVA_LEGACY) raises an error
- Assert that a call to BsonBinary.asUuid(CSHARP_LEGACY) raises an error
- Assert that a call to BsonBinary.asUuid(PYTHON_LEGACY) raises an error
- Create a BsonBinary instance with subtype equal to 3 and data equal to the hex-encoded string
"7766554433221100FFEEDDCCBBAA9988"
- Assert that a call to BsonBinary.asUuid() raises an error
- Assert that a call to BsonBinary.asUuid(STANDARD) raises an error
- Assert that a call to BsonBinary.asUuid(UNSPECIFIED) raises an error
- Assert that a call to BsonBinary.asUuid(JAVA_LEGACY) returns the given UUID
- Create a BsonBinary instance with subtype equal to 3 and data equal to the hex-encoded string
"33221100554477668899AABBCCDDEEFF"
- Assert that a call to BsonBinary.asUuid() raises an error
- Assert that a call to BsonBinary.asUuid(STANDARD) raises an error
- Assert that a call to BsonBinary.asUuid(UNSPECIFIED) raises an error
- Assert that a call to BsonBinary.asUuid(CSHARP_LEGACY) returns the given UUID
- Create a BsonBinary instance with subtype equal to 3 and data equal to the hex-encoded string
"00112233445566778899AABBCCDDEEFF"
- Assert that a call to BsonBinary.asUuid() raises an error
- Assert that a call to BsonBinary.asUuid(STANDARD) raises an error
- Assert that a call to BsonBinary.asUuid(UNSPECIFIED) raises an error
- Assert that a call to BsonBinary.asUuid(PYTHON_LEGACY) returns the given UUID
Implicit encoding
- Set the uuidRepresentation of the client to "javaLegacy". Insert a document with an "_id" key set to the given
native UUID value.
- Assert that the actual value inserted is a BSON binary with subtype 3 and data equal to the hex-encoded string "7766554433221100FFEEDDCCBBAA9988"
- Set the uuidRepresentation of the client to "csharpLegacy". Insert a document with an "_id" key set to the given
native UUID value.
- Assert that the actual value inserted is a BSON binary with subtype 3 and data equal to the hex-encoded string "33221100554477668899AABBCCDDEEFF"
- Set the uuidRepresentation of the client to "pythonLegacy". Insert a document with an "_id" key set to the given
native UUID value.
- Assert that the actual value inserted is a BSON binary with subtype 3 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"
- Set the uuidRepresentation of the client to "standard". Insert a document with an "_id" key set to the given native
UUID value.
- Assert that the actual value inserted is a BSON binary with subtype 4 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"
- Set the uuidRepresentation of the client to "unspecified". Insert a document with an "_id" key set to the given
native UUID value.
- Assert that a BSON serialization exception is thrown
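A sketch of the first implicit-encoding case in PyMongo; the connection string, database, and collection names are placeholders, and a running server is assumed:

```python
from uuid import UUID

from pymongo import MongoClient

uuid = UUID("00112233-4455-6677-8899-aabbccddeeff")

client = MongoClient("mongodb://localhost/?uuidRepresentation=javaLegacy")
client.test.uuid_test.insert_one({"_id": uuid})
# Reading the document back with a codec that preserves BsonBinary would
# show subtype 3 and data 7766554433221100FFEEDDCCBBAA9988, per the
# assertion above.
```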
Implicit Decoding
- Set the uuidRepresentation of the client to "javaLegacy". Insert a document containing two fields. The "standard" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the STANDARD UuidRepresentation. The "legacy" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the JAVA_LEGACY UuidRepresentation. Find the document.
  - Assert that the value of the "standard" field is of type BsonBinary and is equal to the inserted value.
  - Assert that the value of the "legacy" field is of the native UUID type and is equal to the given UUID.

  Repeat this test with the uuidRepresentation of the client set to "csharpLegacy" and "pythonLegacy".

- Set the uuidRepresentation of the client to "standard". Insert a document containing two fields. The "standard" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the STANDARD UuidRepresentation. The "legacy" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the PYTHON_LEGACY UuidRepresentation. Find the document.
  - Assert that the value of the "standard" field is of the native UUID type and is equal to the given UUID.
  - Assert that the value of the "legacy" field is of type BsonBinary and is equal to the inserted value.

- Set the uuidRepresentation of the client to "unspecified". Insert a document containing two fields. The "standard" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the STANDARD UuidRepresentation. The "legacy" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the PYTHON_LEGACY UuidRepresentation. Find the document.
  - Assert that the value of the "standard" field is of type BsonBinary and is equal to the inserted value.
  - Assert that the value of the "legacy" field is of type BsonBinary and is equal to the inserted value.

  Repeat this test with the uuidRepresentation of the client set to "csharpLegacy" and "pythonLegacy".
Note: the assertions will be different in the release prior to the major release, to avoid breaking changes. Adjust accordingly!
Q & A
What's the rationale for the deviations allowed by the specification?
In short, the C# driver has existing behavior that makes it infeasible for it to work the same way as other drivers.
The C# driver has a global serialization registry. Since it's global and not per-MongoClient, it's not feasible to override the UUID representation on a per-MongoClient basis, since doing so would require a per-MongoClient registry. Instead, the specification allows for a global override so that the C# driver can implement the specification.
Additionally, the C# driver has an existing configuration parameter that controls the behavior of BSON readers and writers at a level below the serializers. This configuration affects the semantics of the existing BsonBinary class in a way that doesn't allow for the constructor(UUID) mentioned in the specification. For this reason, that constructor is specified as optional.
Changelog
- 2024-08-01: Migrated from reStructuredText to Markdown.
- 2022-10-05: Remove spec front matter.
DBRef
- Status: Accepted
- Minimum Server Version: N/A
Abstract
DBRefs are a convention for expressing a reference to another document as an embedded document (i.e. BSON type 0x03). Several drivers provide a model class for encoding and/or decoding DBRef documents. This specification will both define the structure of a DBRef and provide guidance for implementing model classes in drivers that choose to do so.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
This specification presents documents as Extended JSON for readability and expressing special types (e.g. ObjectId). Although JSON fields are unordered, the order of fields presented herein should be considered pertinent. This is especially relevant for the Test Plan.
Specification
DBRef Structure
A DBRef is an embedded document with the following fields:
- `$ref`: required string field. Contains the name of the collection where the referenced document resides. This MUST be the first field in the DBRef.
- `$id`: required field. Contains the value of the `_id` field of the referenced document. This MUST be the second field in the DBRef.
- `$db`: optional string field. Contains the name of the database where the referenced document resides. If specified, this MUST be the third field in the DBRef. If omitted, the referenced document is assumed to reside in the same database as the DBRef.
- Extra, optional fields may follow after `$id` or `$db` (if specified). There are no inherent restrictions on extra field names; however, older server versions may impose their own restrictions (e.g. no dots or dollars).
DBRefs have no relation to the deprecated DBPointer BSON type (i.e. type 0x0C).
Examples of Valid DBRefs
The following examples are all valid DBRefs:
// Basic DBRef with only $ref and $id fields
{ "$ref": "coll0", "$id": { "$oid": "60a6fe9a54f4180c86309efa" } }
// DBRef $id is not necessarily an ObjectId
{ "$ref": "coll0", "$id": 1 }
// DBRef with optional $db field
{ "$ref": "coll0", "$id": 1, "$db": "db0" }
// DBRef with extra, optional fields (with or without $db)
{ "$ref": "coll0", "$id": 1, "$db": "db0", "foo": "bar" }
{ "$ref": "coll0", "$id": 1, "foo": true }
// Extra field names have no inherent restrictions
{ "$ref": "coll0", "$id": 1, "$foo": "bar" }
{ "$ref": "coll0", "$id": 1, "foo.bar": 0 }
Examples of Invalid DBRefs
The following examples are all invalid DBRefs:
// Required fields are omitted
{ "$ref": "coll0" }
{ "$id": { "$oid": "60a6fe9a54f4180c86309efa" } }
// Invalid types for $ref or $db
{ "$ref": true, "$id": 1 }
{ "$ref": "coll0", "$id": 1, "$db": 1 }
// Fields are out of order
{ "$id": 1, "$ref": "coll0" }
Implementing a DBRef Model
Drivers MAY provide a model class for encoding and/or decoding DBRef documents. For those drivers that do, this section defines expected behavior of that class. This section does not prohibit drivers from implementing additional functionality, provided it does not conflict with any of these guidelines.
Constructing a DBRef model
Drivers MAY provide an API for constructing a DBRef model directly from its constituent parts. If so:
- Drivers MUST solicit a string value for `$ref`.
- Drivers MUST solicit an arbitrary value for `$id`. Drivers SHOULD NOT enforce any restrictions on this value; however, this may be necessary if the driver is unable to differentiate between certain BSON types (e.g. `null`, `undefined`) and the parameter being unspecified.
- Drivers SHOULD solicit an optional string value for `$db`.
- Drivers MUST require `$ref` and `$db` (if specified) to be strings but MUST NOT enforce any naming restrictions on the string values.
- Drivers MAY solicit extra, optional fields.
Decoding a BSON document to a DBRef model
Drivers MAY support explicit and/or implicit decoding. An example of explicit decoding might be a DBRef model constructor that takes a BSON document. An example of implicit decoding might be configuring the driver's BSON codec to automatically convert embedded documents that comply with the DBRef Structure into a DBRef model.
Drivers that provide implicit decoding SHOULD provide some way for applications to opt out and allow DBRefs to be decoded like any other embedded document.
When decoding a BSON document to a DBRef model:
- Drivers MUST require `$ref` and `$id` to be present.
- Drivers MUST require `$ref` and `$db` (if present) to be strings but MUST NOT enforce any naming restrictions on the string values.
- Drivers MUST accept any BSON type for `$id` and MUST NOT enforce any restrictions on its value.
- Drivers MUST preserve extra, optional fields (beyond `$ref`, `$id`, and `$db`) and MUST provide some way to access those fields via the DBRef model. For example, an accessor method that returns the original BSON document (including `$ref`, etc.) would fulfill this requirement.
If a BSON document cannot be implicitly decoded to a DBRef model, it MUST be left as-is (like any other embedded document). If a BSON document cannot be explicitly decoded to a DBRef model, the driver MUST raise an error.
Since DBRefs are a special type of embedded document, a DBRef model class used for decoding SHOULD inherit the class used to represent an embedded document (e.g. Hash in Ruby). This will allow applications to always expect an instance of a common class when decoding an embedded document (if desired) and should also support the requirement for DBRef models to provide access to any extra, optional fields.
Encoding a DBRef model to a BSON document
Drivers MAY support explicit and/or implicit encoding. An example of explicit encoding might be a DBRef method that returns its corresponding representation as a BSON document. An example of implicit encoding might be configuring the driver's BSON codec to automatically convert DBRef models to the corresponding BSON document representation as needed.
If a driver supports implicit decoding of BSON to a DBRef model, it SHOULD also support implicit encoding. Doing so will allow applications to more easily round-trip DBRefs through the driver.
When encoding a DBRef model to a BSON document:

- Drivers MUST encode all fields in the order defined in DBRef Structure.
- Drivers MUST encode `$ref` and `$id`. If `$db` was specified, it MUST be encoded after `$id`. If any extra, optional fields were specified, they MUST be encoded after `$id` or `$db`.
- If the DBRef includes any extra, optional fields after `$id` or `$db`, drivers SHOULD attempt to preserve the original order of those fields relative to one another.
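To make the constructing, decoding, and encoding rules above concrete, here is a minimal sketch of such a model class in Python; the class and method names are hypothetical, not a prescribed API:

```python
class DBRefModel:
    """Minimal DBRef model (hypothetical API): validates on decode,
    emits fields in the specified order on encode."""

    def __init__(self, ref, id, db=None, extra=None):
        # $ref and $db (if specified) MUST be strings; $id may be anything.
        if not isinstance(ref, str):
            raise TypeError("$ref must be a string")
        if db is not None and not isinstance(db, str):
            raise TypeError("$db must be a string")
        self.ref, self.id, self.db = ref, id, db
        self.extra = dict(extra or {})  # extra fields, original order retained

    @classmethod
    def decode(cls, doc):
        # $ref and $id MUST be present; field order is NOT significant here.
        if "$ref" not in doc or "$id" not in doc:
            raise ValueError("DBRef requires $ref and $id")
        extra = {k: v for k, v in doc.items() if k not in ("$ref", "$id", "$db")}
        return cls(doc["$ref"], doc["$id"], doc.get("$db"), extra)

    def encode(self):
        # Emit $ref, $id, then $db (if any), then extras, per DBRef Structure.
        out = {"$ref": self.ref, "$id": self.id}
        if self.db is not None:
            out["$db"] = self.db
        out.update(self.extra)
        return out
```

Because Python dicts preserve insertion order, round-tripping the out-of-order documents in the Test Plan below through `decode` and `encode` naturally yields the re-ordered forms those tests expect.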
Test Plan
The test plan consists of a series of prose tests. These tests are only relevant to drivers that provide a DBRef model class.
The documents in these tests are presented as Extended JSON for readability; however, readers should consider the field order pertinent when translating to BSON (or their language equivalent). These tests are not intended to exercise a driver's Extended JSON parser. Implementations SHOULD construct the documents directly using native BSON types (e.g. Document, ObjectId).
Decoding
These tests are only relevant to drivers that allow decoding into a DBRef model. Drivers SHOULD implement these tests for both explicit and implicit decoding code paths as needed.
- Valid documents MUST be decoded to a DBRef model. For each of the following:

  { "$ref": "coll0", "$id": { "$oid": "60a6fe9a54f4180c86309efa" } }
  { "$ref": "coll0", "$id": 1 }
  { "$ref": "coll0", "$id": null }
  { "$ref": "coll0", "$id": 1, "$db": "db0" }

  Assert that each document is successfully decoded to a DBRef model. Assert that the `$ref`, `$id`, and `$db` (if applicable) fields have their expected value.

- Valid documents with extra fields MUST be decoded to a DBRef model and the model MUST provide some way to access those extra fields. For each of the following:

  { "$ref": "coll0", "$id": 1, "$db": "db0", "foo": "bar" }
  { "$ref": "coll0", "$id": 1, "foo": true, "bar": false }
  { "$ref": "coll0", "$id": 1, "meta": { "foo": 1, "bar": 2 } }
  { "$ref": "coll0", "$id": 1, "$foo": "bar" }
  { "$ref": "coll0", "$id": 1, "foo.bar": 0 }

  Assert that each document is successfully decoded to a DBRef model. Assert that the `$ref`, `$id`, and `$db` (if applicable) fields have their expected value. Assert that it is possible to access all extra fields and that those fields have their expected value.

- Documents with out of order fields that are otherwise valid MUST be decoded to a DBRef model. For each of the following:

  { "$id": 1, "$ref": "coll0" }
  { "$db": "db0", "$ref": "coll0", "$id": 1 }
  { "foo": 1, "$id": 1, "$ref": "coll0" }
  { "foo": 1, "$ref": "coll0", "$id": 1, "$db": "db0" }
  { "foo": 1, "$ref": "coll0", "$id": 1, "$db": "db0", "bar": 1 }

  Assert that each document is successfully decoded to a DBRef model. Assert that the `$ref`, `$id`, `$db` (if applicable), and any extra fields (if applicable) have their expected value.

- Documents missing required fields MUST NOT be decoded to a DBRef model. For each of the following:

  { "$ref": "coll0" }
  { "$id": { "$oid": "60a6fe9a54f4180c86309efa" } }
  { "$db": "db0" }

  Assert that each document is not decoded to a DBRef model. In the context of implicit decoding, the document MUST be decoded like any other embedded document. In the context of explicit decoding, the DBRef decoding method MUST raise an error.

- Documents with invalid types for `$ref` or `$db` MUST NOT be decoded to a DBRef model. For each of the following:

  { "$ref": true, "$id": 1 }
  { "$ref": "coll0", "$id": 1, "$db": 1 }

  Assert that each document is not decoded to a DBRef model. In the context of implicit decoding, the document MUST be decoded like any other embedded document. In the context of explicit decoding, the DBRef decoding method MUST raise an error.
Encoding
These tests are only relevant to drivers that allow encoding a DBRef model. Drivers SHOULD implement these tests for both explicit and implicit encoding code paths as needed.
Drivers MAY use any method to create the DBRef model for each test (e.g. constructor, explicit decoding method).
Drivers MAY skip tests that cannot be implemented as written (e.g. DBRef model constructor does not support extra, optional fields and the driver also does not support explicit/implicit decoding).
- Encoding DBRefs with basic fields.

  - For each of the following:

    { "$ref": "coll0", "$id": { "$oid": "60a6fe9a54f4180c86309efa" } }
    { "$ref": "coll0", "$id": 1 }
    { "$ref": "coll0", "$id": null }
    { "$ref": "coll0", "$id": 1, "$db": "db0" }

  - Assert that each DBRef model is successfully encoded to a BSON document. Assert that the `$ref`, `$id`, and `$db` (if applicable) fields appear in the correct order and have their expected values.

- Encoding DBRefs with extra, optional fields.

  - For each of the following:

    { "$ref": "coll0", "$id": 1, "$db": "db0", "foo": "bar" }
    { "$ref": "coll0", "$id": 1, "foo": true, "bar": false }
    { "$ref": "coll0", "$id": 1, "meta": { "foo": 1, "bar": 2 } }
    { "$ref": "coll0", "$id": 1, "$foo": "bar" }
    { "$ref": "coll0", "$id": 1, "foo.bar": 0 }

  - Assert that each DBRef model is successfully encoded to a BSON document. Assert that the `$ref`, `$id`, `$db` (if applicable), and any extra fields appear in the correct order and have their expected values.

- Encoding DBRefs re-orders any out of order fields during decoding. This test MUST NOT use a constructor that solicits fields individually.

  - For each of the following:

    { "$id": 1, "$ref": "coll0" }
    { "$db": "db0", "$ref": "coll0", "$id": 1 }
    { "foo": 1, "$id": 1, "$ref": "coll0" }
    { "foo": 1, "$ref": "coll0", "$id": 1, "$db": "db0" }
    { "foo": 1, "$ref": "coll0", "$id": 1, "$db": "db0", "bar": 1 }

  - Assert that each document is successfully decoded to a DBRef model and then successfully encoded back to a BSON document. Assert that the order of fields in each encoded BSON document matches the following, respectively:

    { "$ref": "coll0", "$id": 1 }
    { "$ref": "coll0", "$id": 1, "$db": "db0" }
    { "$ref": "coll0", "$id": 1, "foo": 1 }
    { "$ref": "coll0", "$id": 1, "$db": "db0", "foo": 1 }
    { "$ref": "coll0", "$id": 1, "$db": "db0", "foo": 1, "bar": 1 }
Design Rationale
In contrast to always encoding DBRefs with the correct field order, decoding permits fields to be out of order (provided the document is otherwise valid). This follows the robustness principle in having the driver be liberal in what it accepts and conservative in what it emits. This does mean that round-tripping an out of order DBRef through a driver could result in its field order being changed; however, this behavior is consistent with existing behavior in drivers that model DBRefs (e.g. C#, Java, Node, Python, Ruby) and applications can opt out of implicit decoding if desired.
Changelog
- 2024-02-26: Migrated from reStructuredText to Markdown.
- 2022-10-05: Remove spec front matter.
Extended JSON
- Status: Accepted
- Minimum Server Version: N/A
Abstract
MongoDB Extended JSON is a string format for representing BSON documents. This specification defines the canonical format for representing each BSON type in the Extended JSON format. Thus, a tool that implements Extended JSON will be able to parse the output of any tool that emits Canonical Extended JSON. It also defines a Relaxed Extended JSON format that improves readability at the expense of type information preservation.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Naming
Acceptable naming deviations should fall within the basic style of the language. For example, `CanonicalExtendedJSON` would be a name in Java, where camel-case method names are used, but in Ruby `canonical_extended_json` would be acceptable.
Terms
Type wrapper object - a JSON value consisting of an object with one or more `$`-prefixed keys that collectively encode a BSON type and its corresponding value using only JSON value primitives.
Extended JSON - A general term for one of many string formats based on the JSON standard that describes how to represent BSON documents in JSON using standard JSON types and/or type wrapper objects. This specification gives a formal definition to variations of such a format.
Relaxed Extended JSON - A string format based on the JSON standard that describes BSON documents. Relaxed Extended JSON emphasizes readability and interoperability at the expense of type preservation.
Canonical Extended JSON - A string format based on the JSON standard that describes BSON documents. Canonical Extended JSON emphasizes type preservation at the expense of readability and interoperability.
Legacy Extended JSON - A string format based on the JSON standard that describes a BSON document. The Legacy Extended JSON format does not describe a specific, standardized format, and many tools, drivers, and libraries implement Extended JSON in conflicting ways.
Specification
Extended JSON Format
The Extended JSON grammar extends the JSON grammar as defined in section 2 of the JSON specification by augmenting the possible JSON values as defined in Section 3. This specification defines two formats for Extended JSON:
- Canonical Extended JSON
- Relaxed Extended JSON
An Extended JSON value MUST conform to one of these two formats as described in the table below.
Notes on grammar
- Key order:
- Keys within Canonical Extended JSON type wrapper objects SHOULD be emitted in the order described.
- Keys within Relaxed Extended JSON type wrapper objects are unordered.
- Terms in italics represent types defined elsewhere in the table or in the JSON specification.
- JSON numbers (as defined in Section 6 of the JSON specification) include both integer and floating point types. For the purpose of this document, we define the following subtypes:
  - Type integer means a JSON number without frac or exp components; this is expressed in the JSON spec grammar as `[minus] int`.
  - Type non-integer means a JSON number that is not an integer; it must include either a frac or exp component or both.
  - Type pos-integer means a non-negative JSON number without frac or exp components; this is expressed in the JSON spec grammar as `int`.
- A hex string is a JSON string that contains only hexadecimal digits `[0-9a-f]`. It SHOULD be emitted lower-case, but MUST be read in a case-insensitive fashion.
- < Angle brackets > detail the contents of a value, including type information.
- [Square brackets] specify a type constraint that restricts the specification to a particular range or set of values.
Conversion table
BSON 1.1 Type or Convention | Canonical Extended JSON Format | Relaxed Extended JSON Format |
---|---|---|
ObjectId | {"$oid": < ObjectId bytes as 24-character, big-endian hex string > } | < Same as Canonical Extended JSON > |
Symbol | {"$symbol": string } | < Same as Canonical Extended JSON > |
String | string | < Same as Canonical Extended JSON > |
Int32 | {"$numberInt": < 32-bit signed integer as a string > } | integer |
Int64 | {"$numberLong": < 64-bit signed integer as a string > } | integer |
Double [finite] | {"$numberDouble": < 64-bit signed floating point as a decimal string > } | non-integer |
Double [non-finite] | {"$numberDouble": < One of the strings: "Infinity" , "-Infinity" , or "NaN" . > } | < Same as Canonical Extended JSON > |
Decimal128 | {"$numberDecimal": < decimal as a string1 > } | < Same as Canonical Extended JSON > |
Binary | {"$binary": {"base64": < base64-encoded (with padding as = ) payload as a string > , "subType": < BSON binary type as a one- or two-character hex string > }} | < Same as Canonical Extended JSON > |
Code | {"$code": string } | < Same as Canonical Extended JSON > |
CodeWScope | {"$code": string , "$scope": Document } | < Same as Canonical Extended JSON > |
Document | object (with Extended JSON extensions) | < Same as Canonical Extended JSON > |
Timestamp | {"$timestamp": {"t": pos-integer , "i": pos-integer }} | < Same as Canonical Extended JSON > |
Regular Expression | {"$regularExpression": {pattern: string , "options": < BSON regular expression options as a string or "" 2 > }} | < Same as Canonical Extended JSON > |
DBPointer | {"$dbPointer": {"$ref": < namespace3 as a string > , "$id": ObjectId }} | < Same as Canonical Extended JSON > |
Datetime [year from 1970 to 9999 inclusive] | {"$date": {"$numberLong": < 64-bit signed integer giving millisecs relative to the epoch, as a string > }} | {"$date": ISO-8601 Internet Date/Time Format as described in RFC-33394 with maximum time precision of milliseconds5 as a string } |
Datetime [year before 1970 or after 9999] | {"$date": {"$numberLong": < 64-bit signed integer giving millisecs relative to the epoch, as a string > }} | < Same as Canonical Extended JSON > |
DBRef6 Note: this is not technically a BSON type, but it is a common convention. | {"$ref": < collection name as a string > , "$id": < Extended JSON for the id > } If the generator supports DBRefs with a database component, and the database component is nonempty: {"$ref": < collection name as a string > , "$id": < Extended JSON for the id > , "$db": < database name as a string > } DBRefs may also have other fields, which MUST appear after $id and $db (if supported). | < Same as Canonical Extended JSON > |
MinKey | {"$minKey": 1} | < Same as Canonical Extended JSON > |
MaxKey | {"$maxKey": 1} | < Same as Canonical Extended JSON > |
Undefined | {"$undefined": true} | < Same as Canonical Extended JSON > |
Array | array | < Same as Canonical Extended JSON > |
Boolean | true or false | < Same as Canonical Extended JSON > |
Null | null | < Same as Canonical Extended JSON > |
Representation of Non-finite Numeric Values
Following the Extended JSON format for the Decimal128 type, non-finite numeric values are encoded as follows:
Value | String |
---|---|
Positive Infinity | Infinity |
Negative Infinity | -Infinity |
NaN (all variants) | NaN |
For example, a BSON floating-point number with a value of negative infinity would be encoded as Extended JSON as follows:
{"$numberDouble": "-Infinity"}
Parsers
An Extended JSON parser (hereafter just "parser") is a tool that transforms an Extended JSON string into another representation, such as BSON or a language-native data structure.
By default, a parser MUST accept values in either Canonical Extended JSON format or Relaxed Extended JSON format as described in this specification. A parser MAY allow users to restrict parsing to only Canonical Extended JSON format or only Relaxed Extended JSON format.
A parser MAY also accept strings that adhere to other formats, such as Legacy Extended JSON formats emitted by old versions of mongoexport or other tools, but only if explicitly configured to do so.
A parser that accepts Legacy Extended JSON MUST be configurable such that a JSON text of a MongoDB query filter containing the regex query operator can be parsed, e.g.:
{ "$regex": {
"$regularExpression" : { "pattern": "foo*", "options": "" }
},
"$options" : "ix"
}
or:
{ "$regex": {
"$regularExpression" : { "pattern": "foo*", "options": "" }
}
}
A parser that accepts Legacy Extended JSON MUST be configurable such that a JSON text of a MongoDB query filter containing the type query operator can be parsed, e.g.:
{ "zipCode" : { $type : 2 } }
or:
{ "zipCode" : { $type : "string" } }
A parser SHOULD support at least 200 levels of nesting in an Extended JSON document but MAY set other limits on strings it can accept as defined in section 9 of the JSON specification.
When parsing a JSON object other than the top-level object, the presence of a `$`-prefixed key indicates the object could be a type wrapper object as described in the Extended JSON Conversion table. In such a case, the parser MUST follow these rules, unless configured to allow Legacy Extended JSON, in which case it SHOULD follow these rules:
- Parsers MUST NOT consider key order as having significance. For example, the document `{"$code": "function(){}", "$scope": {}}` must be considered identical to `{"$scope": {}, "$code": "function(){}"}`.
- If the parsed object contains any of the special keys for a type in the Conversion table (e.g. `"$binary"`, `"$timestamp"`) then it must contain exactly the keys of the type wrapper. Any missing or extra keys constitute an error. DBRef is the lone exception to this rule, as it is only a common convention and not a proper type. An object that resembles a DBRef but fails to fully comply with its structure (e.g. has `$ref` but is missing `$id`) MUST be left as-is and MUST NOT constitute an error.
- If the keys of the parsed object exactly match the keys of a type wrapper in the Conversion table, and the values of the parsed object have the correct type for the type wrapper as described in the Conversion table, then the parser MUST interpret the parsed object as a type wrapper object of the corresponding type.
- If the keys of the parsed object exactly match the keys of a type wrapper in the Conversion table, but any of the values are of an incorrect type, then the parser MUST report an error.
- If the `$`-prefixed key does not match a known type wrapper in the Conversion table, the parser MUST NOT raise an error and MUST leave the value as-is. See Restrictions and limitations for additional information.
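A sketch of the wrapper-detection logic these rules imply for a parser built on a language-native JSON decoder; the wrapper-key list is abbreviated to a few types, and value-type validation (rules 3 and 4) is left as a comment:

```python
# Abbreviated: a full parser would enumerate every wrapper in the table.
WRAPPER_KEYS = [
    {"$oid"}, {"$numberInt"}, {"$numberLong"}, {"$numberDouble"},
    {"$binary"}, {"$timestamp"}, {"$code"}, {"$code", "$scope"},
]

def classify(obj: dict) -> str:
    """Classify a parsed JSON object per the rules above."""
    keys = set(obj)                    # key order is deliberately ignored
    if keys in WRAPPER_KEYS:
        return "type wrapper"          # rules 3/4: now validate value types
    if {"$ref", "$id"} <= keys:
        return "dbref convention"      # DBRef-like objects are never an error
    for wrapper in WRAPPER_KEYS:
        if keys & wrapper:
            return "error"             # rule 2: missing or extra wrapper keys
    return "plain document"            # rule 5: unknown $-keys left as-is
```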
Special rules for parsing JSON numbers
The Relaxed Extended JSON format uses JSON numbers for several different BSON types. In order to allow parsers to use language-native JSON decoders (which may not distinguish numeric type when parsing), the following rules apply to parsing JSON numbers:
- If the number is a non-integer, parsers SHOULD interpret it as BSON Double.
- If the number is an integer, parsers SHOULD interpret it as being of the smallest BSON integer type that can represent the number exactly. If a parser is unable to represent the number exactly as an integer (e.g. a large 64-bit number on a 32-bit platform), it MUST interpret it as a BSON Double even if this results in a loss of precision. The parser MUST NOT interpret it as a BSON String containing a decimal representation of the number.
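A sketch of these number rules on top of a language-native JSON decoder; the `Int32`/`Int64`/`Double` labels stand in for whatever numeric types a driver actually uses:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def bson_number(value):
    """Map a JSON number from a native decoder onto a BSON numeric type."""
    if isinstance(value, float):          # non-integer: had a frac or exp part
        return ("Double", value)
    if INT32_MIN <= value <= INT32_MAX:   # smallest integer type that fits
        return ("Int32", value)
    if INT64_MIN <= value <= INT64_MAX:
        return ("Int64", value)
    # Too large for Int64: fall back to Double, accepting precision loss,
    # rather than a string representation.
    return ("Double", float(value))
```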
Special rules for parsing $uuid fields
As per the UUID specification, Binary subtype 3 or 4 are used to represent UUIDs in BSON. Consequently, UUIDs are handled as per the convention described for the Binary type in the Conversion table, e.g. the following document written with the MongoDB Python Driver:
{"Binary": uuid.UUID("c8edabc3-f738-4ca3-b68d-ab92a91478a3")}
is transformed into the following (newlines and spaces added for readability):
{"Binary": {
"$binary": {
"base64": "yO2rw/c4TKO2jauSqRR4ow==",
"subType": "04"}
}
}
[!NOTE] The above described type conversion assumes that UUID representation is set to STANDARD. See the UUID specification for more information about UUID representations.
While this transformation preserves BSON subtype information (since UUIDs can be represented as BSON subtype 3 or 4), base64-encoding is not the standard way of representing UUIDs and using it makes comparing these values against textual representations coming from platform libraries difficult. Consequently, we also allow UUIDs to be represented in extended JSON as:
{"$uuid": <canonical textual representation of a UUID>}
The rules for generating the canonical string representation of a UUID are defined in RFC 4122 Section 3. Use of this format results in a more readable extended JSON representation of the UUID from the previous example:
{"Binary": {
"$uuid": "c8edabc3-f738-4ca3-b68d-ab92a91478a3"
}
}
Parsers MUST interpret the `$uuid` key as BSON Binary subtype 4. Parsers MUST accept textual representations of UUIDs that omit the URN prefix (usually `urn:uuid:`). Parsers MAY also accept textual representations of UUIDs that omit the hyphens between hex character groups (e.g. `c8edabc3f7384ca3b68dab92a91478a3`).
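A sketch of a conforming `$uuid` parser using PyMongo's `bson` package; Python's `uuid.UUID` constructor already accepts the hyphenated, un-hyphenated, and URN-prefixed textual forms:

```python
from uuid import UUID

from bson.binary import Binary, UuidRepresentation

def parse_uuid_wrapper(text: str) -> Binary:
    """Parse the value of a $uuid key into BSON Binary subtype 4.

    uuid.UUID accepts the canonical hyphenated form, the 32-hex-digit
    form without hyphens, and the urn:uuid:-prefixed form.
    """
    return Binary.from_uuid(UUID(text), UuidRepresentation.STANDARD)
```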
Generators
An Extended JSON generator (hereafter just "generator") produces strings in an Extended JSON format.
A generator MUST allow users to produce strings in either the Canonical Extended JSON format or the Relaxed Extended JSON format. If generators provide a default format, the default SHOULD be the Relaxed Extended JSON format.
A generator MAY be capable of exporting strings that adhere to other formats, such as Legacy Extended JSON formats.
A generator SHOULD support at least 100 levels of nesting in a BSON document.
Transforming BSON
Given a BSON document (e.g. a buffer of bytes meeting the requirements of the BSON specification), a generator MUST use the corresponding JSON values or Extended JSON type wrapper objects for the BSON type given in the Extended JSON Conversion table for the desired format. When transforming a BSON document into Extended JSON text, a generator SHOULD emit the JSON keys and values in the same order as given in the BSON document.
Transforming Language-Native data
Given language-native data (e.g. type primitives, container types, classes, etc.), if there is a semantically-equivalent BSON type for a given language-native type, a generator MUST use the corresponding JSON values or Extended JSON type wrapper objects for the BSON type given in the Extended JSON Conversion table for the desired format. For example, a Python `datetime` object must be represented the same as a BSON datetime type. A generator SHOULD error if a language-native type has no semantically-equivalent BSON type.
Format and Method Names
The following format names SHOULD be used for selecting formats for generator output:

- `canonicalExtendedJSON` (references Canonical Extended JSON as described in this specification)
- `relaxedExtendedJSON` (references Relaxed Extended JSON as described in this specification)
- `legacyExtendedJSON` (if supported: references Legacy Extended JSON, with implementation-defined behavior)
Generators MAY use these format names as part of function/method names or MAY use them as arguments or constants, as needed.
If a generator provides a generic `to_json` or `to_extended_json` method, it MUST default to producing Relaxed Extended JSON or MUST be deprecated in favor of a spec-compliant method.
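For example, PyMongo's `json_util` selects the format via `JSONOptions` constants; a sketch, with output shown approximately in comments:

```python
import datetime

from bson.json_util import CANONICAL_JSON_OPTIONS, RELAXED_JSON_OPTIONS, dumps

doc = {"n": 42, "when": datetime.datetime(1970, 1, 1)}

dumps(doc, json_options=CANONICAL_JSON_OPTIONS)
# -> '{"n": {"$numberInt": "42"}, "when": {"$date": {"$numberLong": "0"}}}'

dumps(doc, json_options=RELAXED_JSON_OPTIONS)
# -> '{"n": 42, "when": {"$date": "1970-01-01T00:00:00Z"}}'
```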
Restrictions and limitations
Extended JSON is designed primarily for testing and human inspection of BSON documents. It is not designed to reliably round-trip BSON documents. One fundamental limitation is that JSON objects are inherently unordered and BSON objects are ordered.
Further, Extended JSON uses `$`-prefixed keys in type wrappers and has no provision for escaping a leading `$` used elsewhere in a document. This means that the Extended JSON representation of a document with `$`-prefixed keys could be indistinguishable from another document with a type wrapper with the same keys.

Extended JSON formats SHOULD NOT be used in contexts where `$`-prefixed keys could exist in BSON documents (with the exception of the DBRef convention, which is accounted for in this spec).
Test Plan
Drivers, tools, and libraries can test their compliance to this specification by running the tests in version 2.0 and above of the BSON Corpus Test Suite.
Examples
Canonical Extended JSON Example
Consider the following document, written with the MongoDB Python Driver:
{
"_id": bson.ObjectId("57e193d7a9cc81b4027498b5"),
"String": "string",
"Int32": 42,
"Int64": bson.Int64(42),
"Double": 42.42,
"Decimal": bson.Decimal128("1234.5"),
"Binary": uuid.UUID("c8edabc3-f738-4ca3-b68d-ab92a91478a3"),
"BinaryUserDefined": bson.Binary(b'123', 80),
"Code": bson.Code("function() {}"),
"CodeWithScope": bson.Code("function() {}", scope={}),
"Subdocument": {"foo": "bar"},
"Array": [1, 2, 3, 4, 5],
"Timestamp": bson.Timestamp(42, 1),
"RegularExpression": bson.Regex("foo*", "xi"),
"DatetimeEpoch": datetime.datetime.utcfromtimestamp(0),
"DatetimePositive": datetime.datetime.max,
"DatetimeNegative": datetime.datetime.min,
"True": True,
"False": False,
"DBRef": bson.DBRef(
"collection", bson.ObjectId("57e193d7a9cc81b4027498b1"), database="database"),
"DBRefNoDB": bson.DBRef(
"collection", bson.ObjectId("57fd71e96e32ab4225b723fb")),
"Minkey": bson.MinKey(),
"Maxkey": bson.MaxKey(),
"Null": None
}
The above document is transformed into the following (newlines and spaces added for readability):
{
"_id": {
"$oid": "57e193d7a9cc81b4027498b5"
},
"String": "string",
"Int32": {
"$numberInt": "42"
},
"Int64": {
"$numberLong": "42"
},
"Double": {
"$numberDouble": "42.42"
},
"Decimal": {
"$numberDecimal": "1234.5"
},
"Binary": {
"$binary": {
"base64": "yO2rw/c4TKO2jauSqRR4ow==",
"subType": "04"
}
},
"BinaryUserDefined": {
"$binary": {
"base64": "MTIz",
"subType": "80"
}
},
"Code": {
"$code": "function() {}"
},
"CodeWithScope": {
"$code": "function() {}",
"$scope": {}
},
"Subdocument": {
"foo": "bar"
},
"Array": [
{"$numberInt": "1"},
{"$numberInt": "2"},
{"$numberInt": "3"},
{"$numberInt": "4"},
{"$numberInt": "5"}
],
"Timestamp": {
"$timestamp": { "t": 42, "i": 1 }
},
"RegularExpression": {
"$regularExpression": {
"pattern": "foo*",
"options": "ix"
}
},
"DatetimeEpoch": {
"$date": {
"$numberLong": "0"
}
},
"DatetimePositive": {
"$date": {
"$numberLong": "253402300799999"
}
},
"DatetimeNegative": {
"$date": {
"$numberLong": "-62135596800000"
}
},
"True": true,
"False": false,
"DBRef": {
"$ref": "collection",
"$id": {
"$oid": "57e193d7a9cc81b4027498b1"
},
"$db": "database"
},
"DBRefNoDB": {
"$ref": "collection",
"$id": {
"$oid": "57fd71e96e32ab4225b723fb"
}
},
"Minkey": {
"$minKey": 1
},
"Maxkey": {
"$maxKey": 1
},
"Null": null
}
Relaxed Extended JSON Example
In Relaxed Extended JSON, the example document is transformed similarly to Canonical Extended JSON, with the exception of the following keys (newlines and spaces added for readability):
{
...
"Int32": 42,
"Int64": 42,
"Double": 42.42,
...
"DatetimeEpoch": {
"$date": "1970-01-01T00:00:00.000Z"
},
...
}
Motivation for Change
There existed many Extended JSON parser and generator implementations prior to this specification that used conflicting formats, since there was no agreement on the precise format of Extended JSON. This resulted in problems where the output of some generators could not be consumed by some parsers.
MongoDB drivers needed a single, standard Extended JSON format for testing that covers all BSON types. However, there were BSON types that had no defined Extended JSON representation. This spec primarily addresses that need, but provides for slightly broader use as well.
Design Rationale
Of Relaxed and Canonical Formats
There are various use cases for expressing BSON documents in a text rather than binary format. They broadly fall into two categories:
- Type preserving: for things like testing, where one has to describe the expected form of a BSON document, it's helpful to be able to precisely specify expected types. In particular, numeric types need to differentiate between Int32, Int64 and Double forms.
- JSON-like: for things like a web API, where one is sending a document (or a projection of a document) that only uses ordinary JSON type primitives, it's desirable to represent numbers in the native JSON format. This output is also the most human readable and is useful for debugging and documentation.
The two formats in this specification address these two categories of use cases.
Of Parsers and Generators
Parsers need to accept any valid Extended JSON string that a generator can produce. Parsers and generators are permitted to accept and output strings in other formats as well for backwards compatibility.
Acceptable nesting depth has implications for resource usage so unlimited nesting is not permitted.
Generators support at least 100 levels of nesting in a BSON document being transformed to Extended JSON. This aligns with MongoDB's own limitation of 100 levels of nesting.
Parsers support at least 200 levels of nesting in Extended JSON text, since the Extended JSON language can double the level of apparent nesting of a BSON document by wrapping certain types in their own documents.
Of Canonical Type Wrapper Formats
Prior to this specification, BSON types fell into three categories with respect to Legacy Extended JSON:
- A single, portable representation for the type already existed.
- Multiple representations for the type existed among various Extended JSON generators, and those representations were in conflict with each other or with current portability goals.
- No Legacy Extended JSON representation existed.
If a BSON type fell into category (1), this specification just declares that form to be canonical, since all drivers, tools, and libraries already know how to parse or output this form. There are two exceptions:
RegularExpression
The form `{"$regex": <string>, "$options": <string>}` has until this specification been canonical. The change to `{"$regularExpression": {"pattern": <string>, "options": <string>}}` is motivated by a conflict between the previous canonical form and the `$regex` MongoDB query operator. The form specified here disambiguates between the two, such that a parser can accept any MongoDB query filter, even one containing the `$regex` operator.
Binary
The form `{"$binary": "AQIDBAU=", "$type": "80"}` has until this specification been canonical. The change to `{"$binary": {"base64": "AQIDBAU=", "subType": "80"}}` is motivated by a conflict between the previous canonical form and the `$type` MongoDB query operator. The form specified here disambiguates between the two, such that a parser can accept any MongoDB query filter, even one containing the `$type` operator.
Reconciled type wrappers
If a BSON type fell into category (2), this specification selects a new common representation for the type to be canonical. Conflicting formats were gathered by surveying a number of Extended JSON generators, including the MongoDB Java Driver (version 3.3.0), the MongoDB Python Driver (version 3.4.0.dev0), the MongoDB Extended JSON module on NPM (version 1.7.1), and each minor version of mongoexport from 2.4.14 through 3.3.12. When possible, we set the "strict" option on the JSON codec. The following BSON types had conflicting Extended JSON representations:
Binary
Some implementations write the Extended JSON form of a Binary object with a strict two-hexadecimal digit subtype (e.g. they output a leading `0` for subtypes < 16). However, the NPM mongodb-extended-json module and Java driver use a single hexadecimal digit to represent subtypes less than 16. This specification makes both one- and two-digit representations acceptable.
Code
Mongoexport 2.4 does not quote the Code value when writing out the extended JSON form of a BSON Code object. All other implementations do so. This spec canonicalises the form where the JavaScript code is quoted, since the latter form adheres to the JSON specification and the former does not. As an additional note, the NPM mongodb-extended-json module uses the form `{"code": "<javascript code>"}`, omitting the dollar sign (`$`) from the key. This specification does not accommodate the eccentricity of a single library.
CodeWithScope
In addition to the same variants as BSON Code types, there are other variations when turning CodeWithScope objects into Extended JSON. Mongoexport 2.4 and 2.6 omit the scope portion of CodeWithScope if it is empty, making the output indistinguishable from a Code type. All other implementations include the empty scope. This specification therefore canonicalises the form where the scope is always included. The presence of `$scope` is what differentiates Code from CodeWithScope.
Datetime
Mongoexport 2.4 and the Java driver always transform a Datetime object into an Extended JSON string of the form `{"$date": <ms since epoch>}`. This form has the problem of a potential loss of precision or range on the Datetimes that can be represented. Mongoexport 2.6 transforms Datetime objects into an extended JSON string of the form `{"$date": <ISO-8601 date string in local time>}` for dates starting at or after the Unix epoch (UTC). Dates prior to the epoch take the form `{"$date": {"$numberLong": "<ms since epoch>"}}`. Starting in version 3.0, mongoexport always turns Datetime objects into strings of the form `{"$date": <ISO-8601 date string in UTC>}`. The NPM mongodb-extended-json module does the same. The Python driver can also transform Datetime objects into strings like `{"$date": {"$numberLong": "<ms since epoch>"}}`. This specification canonicalises this form, since this form is the most portable. In Relaxed Extended JSON format, this specification provides for ISO-8601 representation for better readability, but limits it to a portable subset, from the epoch to the end of the largest year that can be represented with four digits. This should encompass most typical use of dates in applications.
DBPointer
Mongoexport 2.4 and 2.6 use the form `{"$ref": <namespace>, "$id": <hex string>}`. All other implementations studied include the canonical ObjectId form: `{"$ref": <namespace>, "$id": {"$oid": <hex string>}}`. Neither of these forms is distinguishable from that of DBRef, so this specification creates a new format: `{"$dbPointer": {"$ref": <namespace>, "$id": {"$oid": <hex string>}}}`.
Newly-added type wrappers
If a BSON type fell into category (3), above, this specification creates a type wrapper format for the type. The following new Extended JSON type wrappers are introduced by this spec:
- `$dbPointer` - See above.
- `$numberInt` - This is used to preserve the "int32" BSON type in Canonical Extended JSON. Without using `$numberInt`, this type will be indistinguishable from a double in certain languages where the distinction does not exist, such as JavaScript.
- `$numberDouble` - This is used to preserve the `double` type in Canonical Extended JSON, as some JSON generators might omit a trailing ".0" for integral types. It also supports representing non-finite values like NaN or Infinity which are prohibited in the JSON specification for numbers.
- `$symbol` - The use of the `$symbol` key preserves the symbol type in Canonical Extended JSON, distinguishing it from JSON strings.
Reference Implementation
Canonical Extended JSON format reference implementation needs to be updated
PyMongo implements the Canonical Extended JSON format, which must be chosen by selecting the right option on the `JSONOptions` object:
from bson.json_util import dumps, DatetimeRepresentation, CANONICAL_JSON_OPTIONS
dumps(document, json_options=CANONICAL_JSON_OPTIONS)
Relaxed Extended JSON format reference implementation is TBD
Implementation Notes
JSON File Format
Some applications like mongoexport may wish to write multiple Extended JSON documents to a single file. One way to do this is to list each JSON document one-per-line. When doing this, it is important to ensure that special characters like newlines are encoded properly (e.g. `\n`).
Duplicate Keys
The BSON specification does not prohibit duplicate key names within the same BSON document, but provides no semantics for the interpretation of duplicate keys. The JSON specification says that names within an object should be unique, and many JSON libraries are incapable of handling this scenario. This specification is silent on the matter, so as not to conflict with a future change by either specification.
Future Work
This specification will need to be amended if future BSON types are added to the BSON specification.
Q&A
Q. Why was version 2 of the spec necessary?
A. After Version 1 was released, several stakeholders raised concerns that not providing an option to output BSON numbers as ordinary JSON numbers limited the utility of Extended JSON for common historical uses. We decided to provide a second format option and more clearly distinguish the use cases (and limitations) inherent in each format.
Q. My BSON parser doesn't distinguish every BSON type. Does my Extended JSON generator need to distinguish these types?
A. No. Some BSON parsers do not emit a unique type for each BSON type, making round-tripping BSON through such libraries impossible without changing the document. For example, a `DBPointer` will be parsed into a `DBRef` by PyMongo. In such cases, a generator must emit the Extended JSON form for whatever type the BSON parser emitted. It does not need to preserve type information when that information has been lost by the BSON parser.
Q. How can implementations which require backwards compatibility with Legacy Extended JSON, in which BSON regular expressions were represented with `$regex`, handle parsing of extended JSON text representing a MongoDB query filter containing the `$regex` operator?
A. An implementation can handle this in a number of ways:

- Introduce an enumeration that determines the behavior of the parser. If the value is LEGACY, it will parse `$regex` and not treat `$regularExpression` specially, and if the value is CANONICAL, it will parse `$regularExpression` and not treat `$regex` specially.
- Support both legacy and canonical forms in the parser without requiring the application to specify one or the other. Making that work for the `$regex` query operator use case will require that the rules set forth in the 1.0.0 version of this specification are followed for `$regex`; specifically, that a document with a `$regex` key whose value is a JSON object should be parsed as a normal document and not reported as an error.
Q. How can implementations which require backwards compatibility with Legacy Extended JSON, in which BSON binary values were represented like `{"$binary": "AQIDBAU=", "$type": "80"}`, handle parsing of extended JSON text representing a MongoDB query filter containing the `$type` operator?
A. An implementation can handle this in a number of ways:

- Introduce an enumeration that determines the behavior of the parser. If the value is LEGACY, it will parse the legacy binary form and not treat the new one specially, and if the value is CANONICAL, it will parse the new form and not treat the legacy form specially.
- Support both legacy and canonical forms in the parser without requiring the application to specify one or the other. Making that work for the `$type` query operator use case will require that the rules set forth in the 1.0.0 version of this specification are followed for `$type`; specifically, that a document with a `$type` key whose value is an integral type, or a document with a `$type` key but without a `$binary` key, should be parsed as a normal document and not reported as an error.
Q. Sometimes I see the term "extjson" used in other specifications. Is "extjson" related to this specification?
A. Yes, "extjson" is short for "Extended JSON".
Changelog
- 2024-05-29: Migrated from reStructuredText to Markdown.
- 2022-10-05: Remove spec front matter and reformat changelog.
- 2021-05-26:
  - Remove any mention of extra dollar-prefixed keys being prohibited in a DBRef. MongoDB 5.0 and compatible drivers no longer enforce such restrictions.
  - Objects that resemble a DBRef without fully complying to its structure should be left as-is during parsing.
- 2020-09-01: Note that `$`-prefixed keys not matching a known type MUST be left as-is when parsing. This is a patch-level change as this behavior was already required in the BSON corpus tests ("Document with keys that start with `$`").
- 2020-09-08:
  - Added support for parsing `$uuid` fields as BSON Binary subtype 4.
  - Changed the example to using the MongoDB Python Driver. It previously used the MongoDB Java Driver. The new example excludes the following BSON types that are unsupported in Python - `Symbol`, `SpecialFloat`, `DBPointer`, and `Undefined`. Transformations for these types are now only documented in the Conversion table.
- 2017-07-20:
  - Bumped specification to version 2.0.
  - Added "Relaxed" format.
  - Changed BSON timestamp type wrapper back to `{"t": *int*, "i": *int*}` for backwards compatibility. (The change in v1 to unsigned 64-bit string was premature optimization.)
  - Changed BSON regular expression type wrapper to `{"$regularExpression": {pattern: *string*, "options": *string*}}`.
  - Changed BSON binary type wrapper to `{"$binary": {"base64": <base64-encoded payload as a *string*>, "subType": <BSON binary type as a one- or two-character *hex string*>}}`.
  - Added "Restrictions and limitations" section.
  - Clarified parser and generator rules.
- 2017-02-01: Initial specification version 1.0.
Footnotes referenced from the Conversion table above:

- This MUST conform to the Decimal128 specification.
- BSON Regular Expression options MUST be in alphabetical order.
- See the docs manual.
- Fractional seconds SHOULD have exactly 3 decimal places if the fractional part is non-zero. Otherwise, fractional seconds SHOULD be omitted if zero.
- See the docs manual.
OP_MSG
- Status: Accepted
- Minimum Server Version: 3.6
Abstract
`OP_MSG` is a bi-directional wire protocol opcode introduced in MongoDB 3.6 with the goal of replacing most existing opcodes, merging their use into one extendable opcode.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Usage
`OP_MSG` is only available in MongoDB 3.6 (`maxWireVersion >= 6`) and later. MongoDB drivers MUST perform the MongoDB handshake using `OP_MSG` if an API version was declared on the client.

If no API version was declared, drivers that have historically supported MongoDB 3.4 and earlier MUST perform the handshake using `OP_QUERY` to determine if the node supports `OP_MSG`. Drivers that have only ever supported MongoDB 3.6 and newer MAY default to using `OP_MSG`.

If the node supports `OP_MSG`, any and all messages MUST use `OP_MSG`, optionally compressed with `OP_COMPRESSED`. Authentication messages MUST also use `OP_MSG` when it is supported, but MUST NOT use `OP_COMPRESSED`.
OP_MSG
Types used in this document
Type | Meaning |
---|---|
document | A BSON document |
cstring | NULL terminated string |
int32 | 4 bytes (32-bit signed integer, two's complement) |
uint8 | 1 byte (8-bit unsigned integer) |
uint32 | 4 bytes (32-bit unsigned integer) |
union | One of the listed members |
Elements inside brackets (`[` `]`) are optional parts of the message.

- Zero or more instances: `*`
- One or more instances: `+`
The new opcode, called `OP_MSG`, has the following structure:
struct Section {
uint8 payloadType;
union payload {
document document; // payloadType == 0
struct sequence { // payloadType == 1
int32 size;
cstring identifier;
document* documents;
};
};
};
struct OP_MSG {
struct MsgHeader {
int32 messageLength;
int32 requestID;
int32 responseTo;
int32 opCode = 2013;
};
uint32 flagBits;
Section+ sections;
[uint32 checksum;]
};
Each `OP_MSG` MUST NOT exceed the `maxMessageSizeBytes` as configured by the MongoDB Handshake.

Each `OP_MSG` MUST have one section with Payload Type 0, and zero or more Payload Type 1. Bulk writes SHOULD use Payload Type 1, and MUST do so when the batch contains more than one entry.

Sections may exist in any order. Each `OP_MSG` MAY contain a checksum, and MUST set the relevant `flagBits` when that field is included.
Field | Description |
---|---|
flagBits | Network level flags, such as signaling recipient that another message is incoming without any other actions in the meantime, and availability of message checksums |
sections | An array of one or more sections |
checksum | crc32c message checksum. When present, the appropriate flag MUST be set in the flagBits. |
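As a concrete reading of the struct above, a sketch that frames a single command document as an `OP_MSG` with one Payload Type 0 section; PyMongo's `bson` package is assumed for document encoding, and request-ID management is left to the caller:

```python
import struct

import bson  # PyMongo's bson package

OP_MSG = 2013

def build_op_msg(command: dict, request_id: int) -> bytes:
    """Frame one command as an OP_MSG: a single Payload Type 0 section,
    no flag bits set, and no optional checksum."""
    flag_bits = struct.pack("<I", 0)          # uint32, all bits zero
    section = b"\x00" + bson.encode(command)  # payloadType 0 + BSON document
    # MsgHeader: messageLength, requestID, responseTo, opCode (4 x int32)
    length = 16 + len(flag_bits) + len(section)
    header = struct.pack("<iiii", length, request_id, 0, OP_MSG)
    return header + flag_bits + section
```

A real driver would additionally enforce `maxMessageSizeBytes`, track request IDs, and set the `checksumPresent` flag bit when appending the optional crc32c checksum.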
flagBits
flagBits contains a bit vector of specialized network flags. The low 16 bits declare what the current message contains, and what the expectations of the recipient are. The high 16 bits are designed to declare optional attributes of the current message and expectations of the recipient.
All unused bits MUST be set to 0.
Clients MUST error if any unsupported or undefined required bits are set to 1 and MUST ignore all undefined optional bits.
The currently defined flags are:
Bit | Name | Request | Response | Description |
---|---|---|---|---|
0 | checksumPresent | x | x | Checksum present |
1 | moreToCome | x | x | Sender will send another message and is not prepared for overlapping messages |
16 | exhaustAllowed | x | | Client is prepared for multiple replies (using the moreToCome bit) to this request |
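To make the required-versus-optional distinction concrete, here is a minimal sketch (assuming only the flags defined above) of the validation rule: an unknown bit in the low 16 MUST raise an error, while an unknown bit in the high 16 is ignored:

```typescript
// Sketch of flagBits validation per the rule above.
const CHECKSUM_PRESENT = 1 << 0; // bit 0
const MORE_TO_COME = 1 << 1;     // bit 1
const EXHAUST_ALLOWED = 1 << 16; // bit 16 (optional; listed for completeness)

function validateFlagBits(flagBits: number): void {
  const knownRequired = CHECKSUM_PRESENT | MORE_TO_COME;
  // Low 16 bits are "required": any unknown one is an error.
  const unknownRequired = flagBits & 0xffff & ~knownRequired;
  if (unknownRequired !== 0) {
    throw new Error(`Unsupported OP_MSG flag bits: 0x${unknownRequired.toString(16)}`);
  }
  // High 16 bits (such as EXHAUST_ALLOWED) are "optional": unknown ones are ignored.
}
```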
checksumPresent
This is a reserved field for future support of `crc32c` checksums.
moreToCome
The `OP_MSG` message is essentially a request-response protocol, one message per turn. However, setting the `moreToCome` flag indicates to the recipient that the sender is not ready to give up its turn and will send another message.
moreToCome On Requests
When the `moreToCome` flag is set on a request it signals to the recipient that the sender does not want to know the outcome of the message. There is no response to a request where `moreToCome` has been set. Clients doing unacknowledged writes MUST set the `moreToCome` flag, and MUST set the writeConcern to `w=0`.
If, during the processing of a `moreToCome` flagged write request, a server discovers that it is no longer primary, then the server will close the connection. All other errors during processing will be silently dropped, and will not result in the connection being closed.
moreToCome On Responses
When the `moreToCome` flag is set on a response it signals to the recipient that the sender will send additional responses on the connection. The recipient MUST continue to read responses until it reads a response with the `moreToCome` flag not set, and MUST NOT send any more requests on this connection until it reads a response with the `moreToCome` flag not set. The client MUST either consume all messages with the `moreToCome` flag set or close the connection.
When the server sends responses with the `moreToCome` flag set, each of these responses will have a unique `messageId`, and the `responseTo` field of every follow-up response will be the `messageId` of the previous response.
The client MUST be prepared to receive a response without `moreToCome` set prior to completing iteration of a cursor, even if an earlier response for the same cursor had the `moreToCome` flag set. To continue iterating such a cursor, the client MUST issue an explicit `getMore` request.
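As an illustration only (the `readMessage` helper and message shape are assumptions for this sketch, not part of the specification), a client consuming such an exhaust stream might look like:

```typescript
import { Socket } from "net";

type BSONDocument = Record<string, unknown>;
// Hypothetical helper that reads and parses one OP_MSG off the socket.
declare function readMessage(socket: Socket): Promise<{ flagBits: number; document: BSONDocument }>;

const MORE_TO_COME = 1 << 1;

// Keep reading replies until one arrives without the moreToCome bit set.
async function readExhaustReplies(socket: Socket): Promise<BSONDocument[]> {
  const replies: BSONDocument[] = [];
  while (true) {
    const msg = await readMessage(socket);
    replies.push(msg.document);
    if ((msg.flagBits & MORE_TO_COME) === 0) return replies;
  }
}
```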
exhaustAllowed
Setting this flag on a request indicates to the recipient that the sender is prepared to handle multiple replies (using the `moreToCome` bit) to this request. The server will never produce replies with the `moreToCome` bit set unless the request has the `exhaustAllowed` bit set.
Setting the `exhaustAllowed` bit on a request does not guarantee that the responses will have the `moreToCome` bit set.
The MongoDB server only handles the `exhaustAllowed` bit on the following operations. A driver MUST NOT set the `exhaustAllowed` bit on other operations.
Operation | Minimum MongoDB Version |
---|---|
getMore | 4.2 |
hello (including legacy hello) | 4.4 (discoverable via topologyVersion) |
sections
Each message contains one or more sections. A section is composed of an uint8 which determines the payload's type, and a separate payload field. The payload size for payload type 0 and 1 is determined by the first 4 bytes of the payload field (includes the 4 bytes holding the size but not the payload type).
Field | Description |
---|---|
type | A byte indicating the layout and semantics of payload |
payload | The payload of a section can either be a single document, or a document sequence. |
When the Payload Type is 0, the content of the payload is:
Field | Description |
---|---|
document | The BSON document. The payload size is inferred from the document's leading int32. |
When the Payload Type is 1, the content of the payload is:
Field | Description |
---|---|
size | Payload size (includes this 4-byte field) |
identifier | A unique identifier (for this message). Generally the name of the "command argument" it contains the value for |
documents | 0 or more BSON documents. Each BSON document cannot be larger than maxBSONObjectSize. |
Any unknown Payload Types MUST result in an error and the socket MUST be closed. There is no ordering implied by payload types. A section with payload type 1 can be serialized before payload type 0.
A fully constructed `OP_MSG` MUST contain exactly one section with Payload Type 0, and optionally any number of sections with Payload Type 1, where each identifier MUST be unique per message.
Command Arguments As Payload
Certain commands support "pulling out" certain arguments to the command, and providing them as Payload Type 1, where the `identifier` is the command argument's name. Specifying a command argument as a separate payload removes the need to use a BSON Array. For example, Payload Type 1 allows an array of documents to be specified as a sequence of BSON documents on the wire without the overhead of array keys.
MongoDB 3.6 only allows certain command arguments to be provided this way. These are:
Command Name | Command Argument |
---|---|
insert | documents |
update | updates |
delete | deletes |
Global Command Arguments
The new opcode contains no field for providing the database name. Instead, the protocol now has the concept of global command arguments. These global command arguments can be passed to all MongoDB commands alongside the rest of the command arguments.
Currently defined global arguments:
Argument Name | Default Value | Description |
---|---|---|
$db | | The database name to execute the command on. MUST be provided and be a valid database name. |
$readPreference | { "mode": "primary" } | Determines server selection, and also whether a secondary server permits reads or responds "not writable primary". See Server Selection Spec for rules about when read preference must or must not be included, and for rules about when read preference "primaryPreferred" must be added automatically. |
Additional global arguments are likely to be introduced in the future and defined in their own specs.
User originating commands
Drivers MUST NOT mutate user-provided command documents in any way, whether it is adding required arguments, pulling out arguments, compressing them, adding supplemental APM data, or any other modification.
Examples
Command Arguments As Payload Examples
For example, an insert can be represented like:
{
"insert": "collectionName",
"documents": [
{"_id": "Document#1", "example": 1},
{"_id": "Document#2", "example": 2},
{"_id": "Document#3", "example": 3}
],
"writeConcern": { w: "majority" }
}
Or, pulling out the "documents"
argument out of the command document and Into Payload Type 1
. The Payload Type 0
would then be:
{
"insert": "collectionName",
"$db": "databaseName",
"writeConcern": { w: "majority" }
}
And Payload Type 1:
identifier: "documents"
documents: {"_id": "Document#1", "example": 1}{"_id": "Document#2", "example": 2}{"_id": "Document#3", "example": 3}
Note that the BSON documents are placed immediately after each other, not with any separator. The writeConcern is also left intact as a command argument in the Payload Type 0 section. The command name MUST continue to be the first key of the command arguments in the Payload Type 0 section.
An update can, for example, be represented like:
{
"update": "collectionName",
"updates": [
{
"q": {"example": 1},
"u": { "$set": { "example": 4} }
},
{
"q": {"example": 2},
"u": { "$set": { "example": 5} }
}
]
}
Or, pulling out the "update"
argument out of the command document and Into Payload Type 1
. The Payload Type 0
would then be:
{
"update": "collectionName",
"$db": "databaseName"
}
And Payload Type 1:
identifier: "updates"
documents: {"q": {"example": 1}, "u": { "$set": { "example": 4}}}{"q": {"example": 2}, "u": { "$set": { "example": 5}}}
Note that the BSON documents are placed immediately after each other, not with any separator.
A delete can for example be represented like:
{
"delete": "collectionName",
"deletes": [
{
"q": {"example": 3},
"limit": 1
},
{
"q": {"example": 4},
"limit": 1
}
]
}
Or, pulling out the "deletes"
argument out of the command document and into Payload Type 1
. The Payload Type 0
would then be:
{
"delete": "collectionName",
"$db": "databaseName"
}
And Payload Type 1:
identifier: "deletes"
documents: {"q": {"example": 3}, "limit": 1}{"q": {"example": 4}, "limit": 1}
Note that the BSON documents are placed immediately after each other, not with any separator.
Test Plan
- Create a single document and insert it over `OP_MSG`, ensure it works.
- Create two documents and insert them over `OP_MSG`, ensure each document is pulled out and presented as a document sequence.
- hello.maxWriteBatchSize might change and be bumped to 100,000.
- Repeat the previous 5 tests as updates, and then deletes.
- Create one small document, and one large 16mb document. Ensure they are inserted, updated and deleted in one roundtrip.
Motivation For Change
MongoDB clients are currently required to work around various issues that each current opcode has, such as having to determine what sort of node is on the other end, as it affects the actual structure of certain messages. MongoDB 3.6 introduces a new wire protocol opcode, `OP_MSG`, which aims to resolve most historical issues along with providing a future compatible and extendable opcode.
Backwards Compatibility
The hello.maxWriteBatchSize is being bumped, which also affects `OP_QUERY`, not only `OP_MSG`. As a side effect, write errors will now have the message truncated, instead of overflowing the maxMessageSize, if the server determines it would overflow the allowed size. This applies to all commands that write. The error documents are structurally the same, with the error messages simply replaced with empty strings.
Reference Implementations
- mongoc
- .net
Future Work
In the near future, this opcode is expected to be extended and include support for:
- Message checksum (crc32c)
- Output document sequences
`moreToCome` can also be used for other commands, such as `killCursors`, to restore `OP_KILL_CURSORS` behaviour, as currently any errors/replies are ignored.
Q & A
-
Has the maximum number of documents per batch changed ?
- The maximum number of documents per batch is dictated by the
maxWriteBatchSize
value returned during the MongoDB Handshake. It is likely this value will be bumped from 1,000 to 100,000.
- The maximum number of documents per batch is dictated by the
-
Has the maximum size of the message changed?
- No. The maximum message size is still the
maxMessageSizeBytes
value returned during the MongoDB Handshake.
- No. The maximum message size is still the
-
Is everything still little-endian?
- Yes. As with BSON, all MongoDB opcodes must be serialized in little-endian format.
-
How does fire-and-forget (w=0 / unacknowledged write) work over
OP_MSG
?- The client sets the
moreToCome
flag on the request. The server will not send a response to such requests. - Malformed operation or errors such as duplicate key errors are not discoverable and will be swallowed by the server.
- Write errors due to not-primary will close the connection, which clients will pickup on next time it uses the connection. This means at least one unacknowledged write operation will be lost as the client does not discover the failover until next time the socket is used.
- The client sets the
-
Should we provide
runMoreToComeCommand()
helpers? Since the protocol allows any command to be tagged withmoreToCome
, effectively allowing any operation to becomefire & forget
, it might be a good idea to add such helper, rather then adding wire protocol headers as options to the existingrunCommand
helpers.
Changelog
- 2024-04-30: Convert from RestructuredText to Markdown.
- 2022-10-05: Remove spec front matter.
- 2022-01-13: Clarify that `OP_MSG` must be used when using stable API.
- 2021-12-16: Clarify that old drivers should default to OP_QUERY handshakes.
- 2021-04-20: Suggest using OP_MSG for initial handshake when using stable API.
- 2021-04-06: Updated to use hello and not writable primary.
- 2017-11-12: Specify read preferences for OP_MSG with direct connection.
- 2017-08-17: Added the User originating command section.
- 2017-07-18: Published initial version.
Run Command
- Status: Accepted
- Minimum Server Version: N/A
Abstract
This specification defines requirements and behaviors for drivers' run command and related APIs.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Terms
Command
A structure representing a BSON document that has a shape matching a supported MongoDB operation.
Implementation requirements
All drivers MAY offer the operations defined in the following sections. This does not preclude a driver from offering more.
Deviations
Please refer to The CRUD specification's Guidance on how APIs may deviate between languages.
Cursor iterating APIs MAY be offered via language syntax or predefined iterable methods.
runCommand
The following represents how a runCommand API SHOULD be exposed.
interface Database {
/**
* Takes an argument representing an arbitrary BSON document and executes it against the server.
*/
runCommand(command: BSONDocument, options: RunCommandOptions): BSONDocument;
}
interface RunCommandOptions {
/**
* An optional readPreference setting to apply to server selection logic.
* This value MUST be applied to the command document as the $readPreference global command argument if not set to primary.
*
* @defaultValue ReadPreference(mode: primary)
*
* @see ../server-selection/server-selection.md#read-preference
*/
readPreference?: ReadPreference;
/**
* An optional explicit client session.
* The associated logical session id (`lsid`) the driver MUST apply to the command.
*
* @see ../sessions/driver-sessions.md#clientsession
*/
session?: ClientSession;
/**
* An optional timeout option to govern the amount of time that a single operation can execute before control is returned to the user.
* This timeout applies to all of the work done to execute the operation, including but not limited to server selection, connection checkout, and server-side execution.
*
 * @see https://github.com/mongodb/specifications/blob/master/source/client-side-operations-timeout/client-side-operations-timeout.md
*/
timeoutMS?: number;
}
RunCommand implementation details
RunCommand provides a way to access MongoDB server commands directly without requiring a driver to implement a bespoke helper. The API is intended to take a document from a user and apply a number of common driver internal concerns before forwarding the command to a server. A driver MUST NOT inspect the user's command; this includes checking for the fields a driver MUST attach to the command sent as described below. Depending on a driver's BSON implementation this can result in these fields being overwritten or duplicated, so a driver SHOULD document that using these fields has undefined behavior. A driver MUST NOT modify the user's command; a clone SHOULD be created before the driver attaches any of the required fields to the command.
Drivers that have historically modified user input SHOULD strive to instead clone the input, such that appended fields do not affect the user's input, in their next major version.
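A minimal sketch of the clone-then-append approach (the helper name and session shape here are illustrative assumptions, not part of this specification):

```typescript
type BSONDocument = Record<string, unknown>;

// Hypothetical: append driver-internal fields to a shallow clone so the
// user's document is never mutated and never inspected.
function prepareCommand(
  userCommand: BSONDocument,
  dbName: string,
  sessionId: BSONDocument
): BSONDocument {
  const command = { ...userCommand }; // clone first; do not touch the input
  command.$db = dbName;               // required global command argument
  command.lsid = sessionId;           // implicit or explicit session id
  // $readPreference, $clusterTime, stable API fields, etc. are appended
  // here in the same way.
  return command;
}
```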
OP_MSG
The `$db` global command argument MUST be set on the command sent to the server, and it MUST equal the database name RunCommand was invoked on.
- See OP_MSG's section on Global Command Arguments
ReadPreference
For the purposes of server selection RunCommand MUST assume all commands are read operations. To facilitate server selection the RunCommand operation MUST accept an optional `readPreference` option.
- See Server Selection's section on Use of read preferences with commands
If the provided ReadPreference is NOT `{mode: primary}` and the selected server is NOT a standalone, the command sent MUST include the `$readPreference` global command argument.
- See OP_MSG's section on Global Command Arguments
Driver Sessions
A driver's RunCommand MUST provide an optional session option to support explicit sessions and transactions. If a session is not provided the driver MUST attach an implicit session if the connection supports sessions. Drivers MUST NOT attempt to check the command document for the presence of an `lsid`.
Every ClientSession has a corresponding logical session ID representing the server-side session ID. The logical session ID MUST be included under `lsid` in the command sent to the server without modifying user input.
- See Driver Sessions' section on Sending the session ID to the server on all commands
The command sent to the server MUST gossip the `$clusterTime` if cluster time support is detected.
- See Driver Sessions' section on Gossipping the cluster time
Transactions
If RunCommand is used within a transaction the read preference MUST be sourced from the transaction's options. The command sent to the server MUST include the transaction specific fields, summarized as follows:
- If `runCommand` is executing within a transaction:
  - `autocommit` - The autocommit flag MUST be set to false.
  - `txnNumber` - MUST be set.
- If `runCommand` is the first operation of the transaction:
  - `startTransaction` - MUST be set to true.
  - `readConcern` - MUST be set to the transaction's read concern if it is NOT the default.
- See Generic RunCommand helper within a transaction in the Transactions specification.
ReadConcern and WriteConcern
RunCommand MUST NOT support read concern and write concern options. Drivers MUST NOT attempt to check the command document for the presence of a `readConcern` and `writeConcern` field.
Additionally, unless executing within a transaction, RunCommand MUST NOT set the `readConcern` or `writeConcern` fields in the command document. For example, default values MUST NOT be inherited from client, database, or collection options.
If the user-provided command document already includes `readConcern` or `writeConcern` fields, the values MUST be left as-is.
- See Read Concern's section on Generic Command Method
- See Write Concern's section on Generic Command Method
Retryability
All commands executed via RunCommand are non-retryable operations. Drivers MUST NOT inspect the command to determine if it is a write and MUST NOT attach a `txnNumber`.
- See Retryable Reads' section on Unsupported Read Operations
- See Retryable Writes' section on Behavioral Changes for Write Commands
Stable API
The command sent MUST attach stable API fields as configured on the MongoClient.
- See Stable API's section on Generic Command Helper Behaviour
Client Side Operations Timeout
RunCommand MUST provide an optional `timeoutMS` option to support client side operations timeout. Drivers MUST NOT attempt to check the command document for the presence of a `maxTimeMS` field. Drivers MUST document the behavior of RunCommand if a `maxTimeMS` field is already set on the command (such as overwriting the command field).
- See Client Side Operations Timeout's section on runCommand
- See Client Side Operations Timeout's section on runCommand behavior
runCursorCommand
Drivers MAY expose a runCursorCommand API with the following syntax.
interface Database {
/**
* Takes an argument representing an arbitrary BSON document and executes it against the server.
*/
runCursorCommand(command: BSONDocument, options: RunCursorCommandOptions): RunCommandCursor;
}
interface RunCursorCommandOptions extends RunCommandOptions {
/**
* This option is an enum with possible values CURSOR_LIFETIME and ITERATION.
* For operations that create cursors, timeoutMS can either cap the lifetime of the cursor or be applied separately to the original operation and all subsequent calls.
* To support both of these use cases, these operations MUST support a timeoutMode option.
*
* @defaultValue CURSOR_LIFETIME
*
* @see https://github.com/mongodb/specifications/blob/master/source/client-side-operations-timeout/client-side-operations-timeout.md
*/
timeoutMode?: ITERATION | CURSOR_LIFETIME;
/**
* See the `cursorType` enum defined in the crud specification.
* @see https://github.com/mongodb/specifications/blob/master/source/crud/crud.md#read
*
* Identifies the type of cursor this is for client side operations timeout to properly apply timeoutMode settings.
*
* A tailable cursor can receive empty `nextBatch` arrays in `getMore` responses.
* However, subsequent `getMore` operations may return documents if new data has become available.
*
* A tailableAwait cursor is an enhancement where instead of dealing with empty responses the server will block until data becomes available.
*
* @defaultValue NON_TAILABLE
*/
cursorType?: CursorType;
}
/**
* The following are the configurations a driver MUST provide to control how getMores are constructed.
* How the options are controlled should be idiomatic to the driver's language.
* See Executing ``getMore`` Commands.
*/
interface RunCursorCommandGetMoreOptions {
/** Any positive integer is permitted. */
batchSize?: int;
/** Any non-negative integer is permitted. */
maxTimeMS?: int;
comment?: BSONValue;
}
RunCursorCommand implementation details
RunCursorCommand provides a way to access MongoDB server commands that return a cursor directly, without requiring a driver to implement a bespoke cursor implementation. The API is intended to be built upon RunCommand and take a document from a user and apply a number of common driver internal concerns before forwarding the command to a server. A driver can expect that the result from running this command will return a document with a `cursor` field and MUST provide the caller with a language native abstraction to continue iterating the results from the server. If the response from the server does not include a `cursor` field the driver MUST throw an error either before returning from runCursorCommand or upon first iteration of the cursor.
High level RunCursorCommand steps (a sketch follows this list):
- Run the cursor creating command provided by the caller and retain the ClientSession used as well as the server the command was executed on.
- Create a local cursor instance and store the `firstBatch`, `ns`, and `id` from the response.
- When the current batch has been fully iterated, execute a `getMore` using the same server the initial command was executed on.
- Store the `nextBatch` from the `getMore` response and update the cursor's `id`.
- Continue to execute `getMore` commands as needed when the caller empties local batches until the cursor is exhausted or closed (i.e. `id` is zero).
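A condensed sketch of that loop (the `executeOnServer` helper, pinned to the original server and session, is an assumption for illustration):

```typescript
type BSONDocument = Record<string, unknown>;
// Hypothetical helper that runs a command on the retained server/session.
declare function executeOnServer(db: string, command: BSONDocument): Promise<any>;

interface LocalCursor {
  id: bigint;            // cursor id from the server; 0n when exhausted
  ns: string;            // "<db>.<collection>" namespace from the response
  batch: BSONDocument[]; // firstBatch, then each nextBatch
}

// Yield documents, issuing getMore commands as local batches empty out.
async function* iterate(cursor: LocalCursor): AsyncGenerator<BSONDocument> {
  const dot = cursor.ns.indexOf(".");
  const db = cursor.ns.slice(0, dot);
  const collection = cursor.ns.slice(dot + 1);
  while (true) {
    while (cursor.batch.length > 0) yield cursor.batch.shift()!;
    if (cursor.id === 0n) return; // exhausted or closed
    const reply = await executeOnServer(db, { getMore: cursor.id, collection });
    cursor.id = BigInt(reply.cursor.id);
    cursor.batch = reply.cursor.nextBatch;
  }
}
```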
Driver Sessions
A driver MUST create an implicit ClientSession if none is provided and it MUST be attached for the duration of the cursor's lifetime. All `getMore` commands constructed for this cursor MUST send the same `lsid` used on the initial command. A cursor is considered exhausted or closed when the server reports its `id` as zero. When the cursor is exhausted the client session MUST be ended and the server session returned to the pool as early as possible, rather than waiting for a caller to completely iterate the final batch.
- See Driver Sessions' section on Sessions and Cursors
Server Selection
RunCursorCommand MUST support a `readPreference` option that MUST be used to determine server selection. The selected server MUST be used for subsequent `getMore` commands.
Load Balancers
When in `loadBalanced` mode, a driver MUST pin the connection used to execute the initial operation, and reuse it for subsequent `getMore` operations.
- See Load Balancer's section on Behaviour With Cursors
Iterating the Cursor
Drivers MUST provide an API, typically a method named `next()`, that returns one document per invocation. If the cursor's batch is empty and the cursor id is nonzero, the driver MUST perform a `getMore` operation.
Executing getMore Commands
The cursor API returned to the caller MUST offer an API to configure `batchSize`, `maxTimeMS`, and `comment` options that are sent on subsequent `getMore` commands. If it is idiomatic for a driver to allow setting these options in `RunCursorCommandOptions`, the driver MUST document that the options only pertain to `getMore` commands. A driver MAY permit users to change `getMore` field settings at any time during the cursor's lifetime, and subsequent `getMore` commands MUST be constructed with the changes to those fields. If that API is offered drivers MUST write tests asserting `getMore` commands are constructed with any updated fields.
- See Find, getMore and killCursors commands' section on GetMore
Tailable and TailableAwait
- See first: Find, getMore and killCursors commands' section on Tailable cursors
It is the responsibility of the caller to construct their initial command with `awaitData` and `tailable` flags, as well as inform RunCursorCommand of the `cursorType` that should be constructed. Requesting a `cursorType` that does not align with the fields sent to the server on the initial command SHOULD be documented as undefined behavior.
Resource Cleanup
Drivers MUST provide an explicit mechanism for releasing the cursor resources, typically a `.close()` method. If the cursor id is nonzero a KillCursors operation MUST be attempted; the result of the operation SHOULD be ignored. The ClientSession associated with the cursor MUST be ended and the ServerSession returned to the pool.
- See Driver Sessions' section on When sending a killCursors command
- See Find, getMore and killCursors commands' section on killCursors
Client Side Operations Timeout
RunCursorCommand MUST provide an optional `timeoutMS` option to support client side operations timeout. Drivers MUST NOT attempt to check the command document for the presence of a `maxTimeMS` field. Drivers MUST document the behavior of RunCursorCommand if a `maxTimeMS` field is already set on the command. Drivers SHOULD raise an error if both `timeoutMS` and the `getMore`-specific `maxTimeMS` option are specified (see: Executing getMore Commands). Drivers MUST document that attempting to set both options can have undefined behavior and is not supported.
When `timeoutMS` and `timeoutMode` are provided the driver MUST support timeout functionality as described in the CSOT specification.
- See Client Side Operations Timeout's section on Cursors
Changelog
- 2024-09-02: Migrated from reStructuredText to Markdown.
- 2023-05-10: Add runCursorCommand API specification.
- 2023-05-08: `$readPreference` is not sent to standalone servers.
- 2023-04-20: Add run command specification.
Connection String Spec
- Status: Accepted
- Minimum Server Version: N/A
Abstract
The purpose of the Connection String is to provide a machine readable way of configuring a MongoClient, allowing users to configure and change the connection to their MongoDB system without requiring any application code changes.
This specification defines how the connection string is constructed and parsed. The aim is not to list all of the connection string options and their semantics. Rather, it defines the syntax of the connection string, including rules for parsing, naming conventions for options, and standard data types.
It should be noted that while the connection string specification is inspired by the URI specification as described in RFC 3986 and uses similar terminology, it does not conform to that specification.
Definitions
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
General Syntax
In general we follow URI style conventions, however unlike a URI the connection string supports multiple hosts.
mongodb://username:password@example.com:27017,example2.com:27017,...,example.comN:27017/database?key=value&keyN=valueN
\_____/ \_______________/ \_________/ \__/ \_______________________________________/ \______/ \_/ \___/
| | | | | | | |
Scheme | Host Port Alternative host identifiers | Key Value
Userinfo \_____________/ | \_______/
| Auth database |
Host Identifier Key Value Pair
\_______________________________________________________/ \___________________/
| |
Host Information Connection Options
Scheme
The scheme `mongodb` represents that this is a connection string for a MongoClient.
Other schemes are also possible and are introduced through additional specifications. These additional schemes build on top of the connection string as documented in this specification.
For example the `mongodb+srv` specification, introduced with Initial DNS Seedlist Discovery, obtains information from DNS in addition to just the connection string.
Userinfo (optional)
The user information, if present, is followed by a commercial at-sign ("@") that delimits it from the host.
A password may be supplied as part of the user information and is anything after the first colon (":") up until the end of the user information.
RFC 3986 has guidance for encoding user information in Section 2.1 ("Percent-Encoding"), Section 2.2 ("Reserved Characters"), and Section 3.2.1 ("User Information").
Specifically, Section 3.2.1 provides for the following allowed characters:
userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
If the user information contains an at-sign ("@"), more than one colon (":"), or a percent-sign ("%") that does not match the rules for "pct-encoded", then an exception MUST be thrown informing the user that the username and password must be URL encoded.
Above and beyond that restriction, drivers SHOULD require connection string user information to follow the "userinfo" encoding rules of RFC 3986 and SHOULD throw an exception if disallowed characters are detected. However, for backwards-compatibility reasons, drivers MAY allow reserved characters other than "@" and ":" to be present in user information without percent-encoding.
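A sketch of the mandatory part of that check (the regular expressions here are illustrative, not normative):

```typescript
// Throw if the raw userinfo contains "@", more than one ":", or a "%" that
// is not a valid percent-encoded sequence, per the rules above.
function validateUserinfo(userinfo: string): void {
  const colonCount = (userinfo.match(/:/g) ?? []).length;
  const badPercent = /%(?![0-9A-Fa-f]{2})/.test(userinfo);
  if (userinfo.includes("@") || colonCount > 1 || badPercent) {
    throw new Error("Username and password must be URL encoded");
  }
}
```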
Host Information
Unlike a standard URI, the connection string allows for identifying multiple hosts. The host information section of the connection string MAY be delimited by the trailing slash ("/") or end of string.
The host information must contain at least one host identifier but may contain more (see the alternative hosts / ports in the general syntax diagram above). Multiple host identifiers are delimited by a comma (",").
Host Identifier
A host identifier consists of a host and an optional port.
Host
Identifies a server address to connect to. It can identify either a hostname, IP address, IP Literal, or UNIX domain socket. For definitions of hostname, IP address and IP Literal formats see RFC 3986 Section 3.2.2 .
UNIX domain sockets MUST end in ".sock" and MUST be URL encoded, for example:
mongodb://user:pass@%2Ftmp%2Fmongodb-27017.sock/authDB?replicaSet=rs
The host information cannot contain an unescaped slash ("/"), if it does then an exception MUST be thrown informing users that paths must be URL encoded. For example:
Unsupported host '/tmp/mongodb-27017.sock', UNIX socket domain paths must be URL encoded.
Support for UNIX domain sockets and IP Literals is OPTIONAL.
Unsupported host types MUST throw an exception informing the user they are not supported.
This specification does not define how host types should be differentiated (e.g. determining if a parsed host string is a socket path or hostname). It is merely concerned with extracting the host identifiers from the URI.
Port (optional)
The port is an integer between 1 and 65535 (inclusive) that identifies the port to connect to. See RFC 3986 3.2.3.
Auth Database (optional)
The database to authenticate against. If provided it is everything after the Host Information (ending with "/") and up to the first question mark ("?") or end of string. The auth database MUST be URL decoded by the parser.
The following characters MUST NOT appear in the database name, once it has been decoded: slash ("/"), backslash ("\"), space (" "), double-quote ("""), or dollar sign ("$"). The MongoDB Manual says that period (".") is also prohibited, but drivers MAY allow periods in order to express a namespace (database and collection name, perhaps containing multiple periods) in this part of the URL.
The presence of the auth database component without other credential data such as Userinfo or authentication parameters in connection options MUST NOT be interpreted as a request for authentication.
Connection Options (optional)
Any extra options to configure the MongoClient connection can be specified in the connection options part of the connection string. If provided, it is everything after the Host Information (ending with "/"), optional auth database, and first question mark ("?") to the end of the string. Connection Options consist of an ordered list of Key Value Pairs that are delimited by an ampersand ("&"). A delimiter of a semi colon (";") MAY also be supported for connection options for legacy reasons.
Key Value Pair
A key value pair represents the option key and its associated value. The key is everything up to the first equals sign ("=") and the value is everything afterwards. Key values contain the following information:
- Key:
The connection option's key string. Keys should be normalised and character case should be ignored.
- Value: (optional)
The value if provided otherwise it defaults to an empty string.
Defining connection options
Connection option key values MUST be defined in the relevant specification that describes the usage of the key and value. The value data type MUST also be defined there. The value's default value SHOULD also be defined if it is relevant.
Keys
Keys are strings, and character case must be normalized by lowercasing the uppercase ASCII characters A through Z; other characters are left as-is.
When defining and documenting keys, specifications should follow the camelCase naming convention with the first letter in lowercase; snake_case MUST NOT be used. Keys that aren't supported by a driver MUST be ignored, and a WARN level logging message MUST be issued for unsupported keys. For example:
Unsupported option 'connectMS'.
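A sketch of that normalization rule, lowercasing only ASCII A-Z and leaving every other character untouched:

```typescript
// Normalize a connection option key: lowercase ASCII A-Z only.
function normalizeKey(key: string): string {
  return key.replace(/[A-Z]/g, (c) => String.fromCharCode(c.charCodeAt(0) + 32));
}

// normalizeKey("connectTimeoutMS") === "connecttimeoutms"
```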
Keys should be descriptive and follow existing conventions:
Time based keys
If a key represents a unit of time it MUST end with that unit of time. Key authors SHOULD follow the existing convention of defaulting to milliseconds as the unit of time (e.g. `connectionTimeoutMS`).
Values
The values in connection options MUST be URL decoded by the parser. The values can represent the following data types:
- Strings: The value.
- Integer: The value parsed as an integer. If the value is the empty string, the key MUST be ignored.
- Boolean: "true" and "false" strings MUST be supported. If the value is the empty string, the key MUST be ignored.
  - For legacy reasons it is RECOMMENDED that alternative values for true and false be supported: "1", "yes", "y" and "t" for true; "0", "-1", "no", "n" and "f" for false. Alternative values are deprecated and MUST be removed from documentation and examples. If any of these alternative values are used, drivers MUST log a deprecation notice or issue a logging message at the WARNING level (as appropriate for your language). For example: Deprecated boolean value for "journal" : "1", please update to "journal=true"
- Lists: Repeated keys represent a list in the Connection String consisting of the corresponding values in the same order as they appear in the Connection String. For example: ?readPreferenceTags=dc:ny,rack:1&readPreferenceTags=dc:ny&readPreferenceTags=
- Key value pairs: A value that represents one or more key and value pairs. Multiple key value pairs are delimited by a comma (","). The key is everything up to the first colon sign (":") and the value is everything afterwards. For example: ?readPreferenceTags=dc:ny,rack:1
  Drivers MUST handle unencoded colon signs (":") within the value. For example, given the connection string option authMechanismProperties=TOKEN_RESOURCE:mongodb://foo, the driver MUST interpret the key as TOKEN_RESOURCE and the value as mongodb://foo (a sketch of this rule follows below). For any option key-value pair that may contain a comma (such as TOKEN_RESOURCE), drivers MUST document that a value containing a comma (",") MUST NOT be provided as part of the connection string. This prevents use of values that would interfere with parsing.
Any invalid Values for a given key MUST be ignored and MUST log a WARN level message. For example:
Unsupported value for "fsync" : "ifPossible"
Repeated Keys
If a key is repeated and the corresponding data type is not a List then the precedence of which key value pair will be used is undefined except where defined otherwise by the URI options spec.
Where possible, a warning SHOULD be raised to inform the user that multiple options were found for the same value.
Deprecated Key Value Pairs
If a key name was deprecated due to renaming it MUST still be supported. Users aren't expected to be vigilant about changes to key names.
If the renamed key is also defined in the connection string the deprecated key MUST NOT be applied and a WARN level message MUST be logged. For example:
Deprecated key "wtimeout" present and ignored as found replacement "wtimeoutms" value.
Deprecated keys MUST log a WARN level message informing the user that the option is deprecated and supply the alternative key name. For example:
Deprecated key "wtimeout" has been replaced with "wtimeoutms"
Legacy support
Semi colon (";") query parameter delimiters and alternative string representations of Boolean values MAY be supported only for legacy reasons.
As these options are not standard they might not be supported across all drivers. As such, these alternatives MUST NOT be used as general examples or documentation.
Language specific connection options
Connection strings are a mechanism to configure a MongoClient outside the user's application. As each driver may have language specific configuration options, those options SHOULD also be supported via the connection string. Where suitable, specifications MUST be updated to reflect new options.
Keys MUST follow existing connection option naming conventions as defined above. Values MUST also follow the existing, specific data types.
Any options that are not supported MUST result in a WARN level log message as described in the Keys section.
Connection options precedence
If a driver allows URI options to be specified outside of the connection string (e.g. a dictionary parameter to the MongoClient constructor) it MUST document the precedence rules between all such mechanisms. For instance, a driver MAY allow a value for option `foo` in a dictionary parameter to override the value of `foo` in the connection string (or vice versa) so long as that behavior is documented.
Test Plan
See the README for tests.
Motivation for Change
The motivation for this specification is to publish how connection strings are formed and how they should be parsed. This is important because although the connection string follows the terminology of a standard URI format (as described in RFC 3986) it is not a standard URI and cannot be parsed by standard URI parsers.
The specification also formalizes the standard practice for the definition of new connection options and where the responsibility for their definition should be.
Design Rationale
The rationale for the Connection String is to provide a consistent, driver independent way to define the connection to a MongoDB system outside of the application. The connection string is an existing standard and is already widely used.
Backwards Compatibility
Connection Strings are already generally supported across languages and driver implementations. As the responsibility for the definitions of connections options relies on the specifications defining them, there should be no backwards compatibility breaks caused by this specification with regards to options.
Connection options precedence may cause some backwards incompatibilities as existing driver behaviour differs here. As such, it is currently only a recommendation.
Reference Implementation
The Java driver implements a `ConnectionString` class for the parsing of the connection string; however, it does not support UNIX domain sockets. The Python driver's `uri_parser` module implements connection string parsing for both hosts and UNIX domain sockets.
The following example parses a connection string into its components and can be used as a guide.
Given the string `mongodb://foo:bar%3A@mongodb.example.com,%2Ftmp%2Fmongodb-27018.sock/admin?w=1`:
- Validate and remove the scheme prefix `mongodb://`, leaving: `foo:bar%3A@mongodb.example.com,%2Ftmp%2Fmongodb-27018.sock/admin?w=1`
- Split the string by the first, unescaped `/` (if any), yielding:
  - User information and host identifiers: `foo:bar%3A@mongodb.example.com,%2Ftmp%2Fmongodb-27018.sock`.
  - Auth database and connection options: `admin?w=1`.
- Split the user information and host identifiers string by the last, unescaped `@`, yielding:
  - User information: `foo:bar%3A`.
  - Host identifiers: `mongodb.example.com,%2Ftmp%2Fmongodb-27018.sock`.
- Validate, split (if applicable), and URL decode the user information. In this example, the username and password would be `foo` and `bar:`, respectively.
- Validate, split, and URL decode the host identifiers. In this example, the hosts would be `["mongodb.example.com", "/tmp/mongodb-27018.sock"]`.
- Split the auth database and connection options string by the first, unescaped `?`, yielding:
  - Auth database: `admin`.
  - Connection options: `w=1`.
- URL decode the auth database. In this example, the auth database is `admin`.
- Validate the database contains no prohibited characters.
- Validate, split, and URL decode the connection options. In this example, the connection options are `{w: 1}`.
Q&A
Q: What about existing Connection Options that aren't currently defined in a specification
Ideally all MongoClient options would already belong in their relevant specifications. As we iterate and produce more specifications these options should be covered.
Q: Why is it recommended that Connection Options take precedence over application set options
This is only a recommendation but the reasoning is application code is much harder to change across deployments. By making the Connection String take precedence from outside the application it would be easier for the application to be portable across environments. The order of precedence of MongoClient hosts and options is recommended to be from low to high:
- Default values
- MongoClient hosts and options
- Connection String hosts and options
Q: Why WARN level warning on unknown options rather than throwing an exception
It is responsible to inform users of possible misconfigurations and both methods achieve that. However, there are conflicting requirements of a Connection String. One goal is that any given driver should be configurable by a connection string, but different drivers and languages have different feature sets. Another goal is that Connection Strings should be portable, and as such some options supported by language X might not be relevant to language Y. Any given driver does not know whether an option is specific to a different driver, is misspelled, or is just not supported. So the only way to stay portable and support configuration of all options is to not throw an exception, but rather log a warning.
Q: How long should deprecation options be supported
This is not declared in this specification. It's not deemed responsible to give a single timeline for how long deprecated options should be supported. As such any specifications that deprecate options that do have the context of the decision should provide the timeline.
Q: Why can I not use a standard URI parser
The connection string format does not follow the standard URI format (as described in RFC 3986); it differs in two key areas:
- Hosts: The connection string allows for multiple hosts for high availability reasons, but standard URIs only ever define a single host.
- Query Parameters / Connection Options: The connection string provides a concrete definition of how the Connection Options are parsed, including definitions of different data types. RFC 3986 only defines that they are `key=value` pairs and gives no instruction on parsing. In fact, different languages handle the parsing of query parameters in different ways, and as such there is no such thing as a standard URI parser.
Q: Can the connection string contain non-ASCII characters
The connection string can contain non-ASCII characters. The connection string is text, which can be encoded in any way appropriate for the application (e.g. the C Driver requires you to pass it a UTF-8 encoded connection string).
Q: Why does the reference implementation check for a `.sock` suffix when parsing a socket path and possible auth database
To simplify parsing of a socket path followed by an auth database, we rely on MongoDB's naming restrictions, which do not allow database names to contain a dot character, and the fact that socket paths must end with `.sock`. This allows us to differentiate the last part of a socket path from a database name. While we could immediately rule out an auth database on the basis of the dot alone, this specification is primarily concerned with breaking down the components of a URI (e.g. hosts, auth database, options) in a deterministic manner, rather than applying strict validation to those parts (e.g. host types, database names, allowed values for an option). Additionally, some drivers might allow a namespace (e.g. `"db.collection"`) for the auth database part, so we do not want to be more strict than is necessary for parsing.
Q: Why throw an exception if the userinfo contains a percent sign ("%"), at-sign ("@"), or more than one colon (":")
This is done to help users format the connection string correctly. Although at-signs ("@") or colons (":") in the username must be URL encoded, users may not be aware of that requirement. Take the following example:
mongodb://anne:bob:pass@localhost:27017
Is the username `anne` and the password `bob:pass`, or is the username `anne:bob` and the password `pass`? Accepting this as the userinfo could cause authentication to fail, causing confusion for the user as to why. Allowing unescaped at-signs and percent symbols would invite further ambiguity. By throwing an exception users are made aware and can then update the connection string so as to be explicit about what forms the username and password.
Q: Why must UNIX domain sockets be URL encoded
This has been done to reduce ambiguity between the socket name and the database name. Take the following example:
mongodb:///tmp/mongodb.sock/mongodb.sock
Is the host `/tmp/mongodb.sock` and the auth database `mongodb.sock`, or does the connection string just contain the host `/tmp/mongodb.sock/mongodb.sock` and no auth database? By enforcing URL encoding on UNIX domain sockets, users are made to be explicit about the host and the auth database. By requiring an exception to be thrown when the host contains a slash ("/"), users can be informed on how to migrate their connection strings.
Q: Why must the auth database be URL decoded by the parser
On Linux systems database names can contain a question mark ("?"), in these rare cases the auth database must be URL encoded. This disambiguates between the auth database and the connection options. Take the following example:
mongodb://localhost/admin%3F?w=1
In this case the auth database would be `admin?` and the connection options `w=1`.
Q: How should the space character be encoded in a connection string
Space characters SHOULD be encoded as `%20` rather than `+`, as this will be portable across all implementations. Implementations MAY support decoding `+` into a space, as many languages treat strings as `x-www-form-urlencoded` data by default.
Changelog
- 2024-05-29: Clarify handling of key-value pairs and add specification test.
- 2024-02-15: Migrated from reStructuredText to Markdown.
- 2016-07-22: In Port section, clarify that zero is not an acceptable port.
- 2017-01-09: In Userinfo section, clarify that percent signs must be encoded.
- 2017-06-10: In Userinfo section, require username and password to be fully URI encoded, not just "%", "@", and ":". In Auth Database, list the prohibited characters. In Reference Implementation, split at the first "/", not the last.
- 2018-01-09: Clarified that space characters should be encoded to `%20`.
- 2018-06-04: Revised Userinfo section to provide an explicit list of allowed characters and clarify rules for exceptions.
- 2019-02-04: In Repeated Keys section, clarified that the URI options spec may override the repeated key behavior described here for certain options.
- 2019-03-04: Require drivers to document option precedence rules.
- 2019-04-26: Database name in URI alone does not trigger authentication.
- 2020-01-21: Clarified how empty values in a connection string are parsed.
- 2022-10-05: Remove spec front matter and reformat changelog.
- 2022-12-27: Note that host information ends with a "/" character in connection options description.
- 2023-08-02: Make delimiting slash between host information and connection options optional and update tests.
URI Options Specification
- Status: Accepted
- Minimum Server Version: N/A
Abstract
Historically, URI options have been defined in individual specs, and drivers have defined any additional options independently of one another. Because of the frustration due to there not being a single place where all of the URI options are defined, this spec aims to do just that—namely, provide a canonical list of URI options that each driver defines.
THIS SPEC DOES NOT REQUIRE DRIVERS TO MAKE ANY BREAKING CHANGES.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Conflicting TLS options
Per the Connection String spec, the behavior of duplicates of most URI options is undefined. However, due to the security implications of certain options, drivers MUST raise an error to the user during parsing if any of the following circumstances occur:
- Both `tlsInsecure` and `tlsAllowInvalidCertificates` appear in the URI options.
- Both `tlsInsecure` and `tlsAllowInvalidHostnames` appear in the URI options.
- Both `tlsInsecure` and `tlsDisableOCSPEndpointCheck` appear in the URI options.
- Both `tlsInsecure` and `tlsDisableCertificateRevocationCheck` appear in the URI options.
- Both `tlsAllowInvalidCertificates` and `tlsDisableOCSPEndpointCheck` appear in the URI options.
- Both `tlsAllowInvalidCertificates` and `tlsDisableCertificateRevocationCheck` appear in the URI options.
- Both `tlsDisableOCSPEndpointCheck` and `tlsDisableCertificateRevocationCheck` appear in the URI options.
- All instances of `tls` and `ssl` in the URI options do not have the same value. If all instances of `tls` and `ssl` have the same value, an error MUST NOT be raised.
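A sketch of those pairwise checks (the option map shape is an assumption; keys are assumed already normalized to their canonical casing):

```typescript
// Raise if any conflicting TLS option pair appears, or if "tls"/"ssl"
// values disagree across all their instances.
const TLS_CONFLICTS: [string, string][] = [
  ["tlsInsecure", "tlsAllowInvalidCertificates"],
  ["tlsInsecure", "tlsAllowInvalidHostnames"],
  ["tlsInsecure", "tlsDisableOCSPEndpointCheck"],
  ["tlsInsecure", "tlsDisableCertificateRevocationCheck"],
  ["tlsAllowInvalidCertificates", "tlsDisableOCSPEndpointCheck"],
  ["tlsAllowInvalidCertificates", "tlsDisableCertificateRevocationCheck"],
  ["tlsDisableOCSPEndpointCheck", "tlsDisableCertificateRevocationCheck"],
];

function checkTlsConflicts(options: Map<string, string[]>): void {
  for (const [a, b] of TLS_CONFLICTS) {
    if (options.has(a) && options.has(b)) {
      throw new Error(`URI options "${a}" and "${b}" cannot both be specified`);
    }
  }
  const tlsValues = [...(options.get("tls") ?? []), ...(options.get("ssl") ?? [])];
  if (new Set(tlsValues).size > 1) {
    throw new Error('All "tls" and "ssl" URI options must have the same value');
  }
}
```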
directConnection URI option with multiple seeds or SRV URI
The driver MUST report an error if the `directConnection=true` URI option is specified with multiple seeds.
The driver MUST report an error if the `directConnection=true` URI option is specified with an SRV URI, because the URI may resolve to multiple hosts. The driver MUST allow specifying the `directConnection=false` URI option with an SRV URI.
srvServiceName and srvMaxHosts URI options
For URI option validation pertaining to `srvServiceName` and `srvMaxHosts`, please see the Initial DNS Seedlist Discovery spec for details.
Load Balancer Mode
For URI option validation in Load Balancer mode (i.e. `loadBalanced=true`), please see the Load Balancer spec for details.
SOCKS5 options
For URI option validation pertaining to `proxyHost`, `proxyPort`, `proxyUsername` and `proxyPassword`, please see the SOCKS5 support spec for details.
List of specified options
Each driver option below MUST be implemented in each driver unless marked as optional. If an option is marked as optional, a driver MUST meet any conditions specified for leaving it out if it is not included. If a driver already provides the option under a different name, the driver MAY implement the old and new names as aliases. All keys and values MUST be encoded in UTF-8. All integer options are 32-bit unless specified otherwise. Note that all requirements and recommendations described in the Connection String spec pertaining to URI options apply here.
Name | Accepted Values | Default Value | Optional to implement? | Description |
---|---|---|---|---|
appname | any string that meets the criteria listed in the handshake spec | no appname specified | no | Passed into the server in the client metadata as part of the connection handshake |
authMechanism | any string; valid values are defined in the auth spec | None; default values for authentication exist for constructing authentication credentials per the auth spec, but there is no default for the URI option itself. | no | The authentication mechanism method to use for connection to the server |
authMechanismProperties | comma separated key:value pairs, e.g. "opt1:val1,opt2:val2" | no properties specified | no | Additional options provided for authentication (e.g. to enable hostname canonicalization for GSSAPI) |
authSource | any string | None; default values for authentication exist for constructing authentication credentials per the auth spec, but there is no default for the URI option itself. | no | The database that connections should authenticate against |
compressors | comma separated list of strings, e.g. "snappy,zlib" | defined in compression spec | no | The list of allowed compression types for wire protocol messages sent or received from the server |
connectTimeoutMS | non-negative integer; 0 means "no timeout" | 10,000 ms (unless a driver already has a different default) | no | Amount of time to wait for a single TCP socket connection to the server to be established before erroring; note that this applies to SDAM hello and legacy hello operations |
directConnection | "true" or "false" | defined in SDAM spec | no | Whether to connect to the deployment in Single topology. |
heartbeatFrequencyMS | integer greater than or equal to 500 | defined in SDAM spec | no | the interval between regular server monitoring checks |
journal | "true" or "false" | no "j" field specified | no | Default write concern "j" field for the client |
loadBalanced | "true" or "false" | defined in Load Balancer spec | no | Whether the driver is connecting to a load balancer. |
localThresholdMS | non-negative integer; 0 means 0 ms (i.e. the fastest eligible server must be selected) | defined in the server selection spec | no | The amount of time beyond the fastest round trip time that a given server’s round trip time can take and still be eligible for server selection |
maxIdleTimeMS | non-negative integer; 0 means no minimum | defined in the Connection Pooling spec | required for drivers with connection pools | The amount of time a connection can be idle before it's closed |
maxPoolSize | non-negative integer; 0 means no maximum | defined in the Connection Pooling spec | required for drivers with connection pools | The maximum number of clients or connections able to be created by a pool at a given time. This count includes connections which are currently checked out. |
maxConnecting | positive integer | defined in the Connection Pooling spec | required for drivers with connection pools | The maximum number of Connections a Pool may be establishing concurrently. |
maxStalenessSeconds | -1 (no max staleness check) or integer >= 90 | defined in max staleness spec | no | The maximum replication lag, in wall clock time, that a secondary can suffer and still be eligible for server selection |
minPoolSize | non-negative integer | defined in the Connection Pooling spec | required for drivers with connection pools | The number of connections the driver should create and maintain in the pool even when no operations are occurring. This count includes connections which are currently checked out. |
proxyHost | any string | defined in the SOCKS5 support spec | no | The IPv4/IPv6 address or domain name of a SOCKS5 proxy server used for connecting to MongoDB services. |
proxyPort | non-negative integer | defined in the SOCKS5 support spec | no | The port of the SOCKS5 proxy server specified in proxyHost . |
proxyUsername | any string | defined in the SOCKS5 support spec | no | The username for username/password authentication to the SOCKS5 proxy server specified in proxyHost . |
proxyPassword | any string | defined in the SOCKS5 support spec | no | The password for username/password authentication to the SOCKS5 proxy server specified in proxyHost . |
readConcernLevel | any string (to allow for forwards compatibility with the server) | no read concern specified | no | Default read concern for the client |
readPreference | any string; currently supported values are defined in the server selection spec, but must be lowercase camelCase, e.g. "primaryPreferred" | defined in server selection spec | no | Default read preference for the client (excluding tags) |
readPreferenceTags | comma-separated key:value pairs (e.g. "dc:ny,rack:1" and "dc:ny"); can be specified multiple times; each instance of this key is a separate tag set | no tags specified | no | Default read preference tags for the client; only valid if the read preference mode is not primary. The order of the tag sets in the read preference is the same as the order they are specified in the URI |
replicaSet | any string | no replica set name provided | no | The name of the replica set to connect to |
retryReads | "true" or "false" | defined in retryable reads spec | no | Enables retryable reads on server 3.6+ |
retryWrites | "true" or "false" | defined in retryable writes spec | no | Enables retryable writes on server 3.6+ |
serverMonitoringMode | "stream", "poll", or "auto" | defined in SDAM spec | required for multi-threaded or asynchronous drivers | Configures which server monitoring protocol to use. |
serverSelectionTimeoutMS | positive integer; a driver may also accept 0 to be used for a special case, provided that it documents the meaning | defined in server selection spec | no | A timeout in milliseconds to block for server selection before raising an error |
serverSelectionTryOnce | "true" or "false" | defined in server selection spec | required for single-threaded drivers | Scan the topology only once after a server selection failure instead of repeatedly until the server selection times out |
socketTimeoutMS | non-negative integer; 0 means no timeout | no timeout | no | NOTE: This option is deprecated in favor of timeoutMS. Amount of time spent attempting to send or receive on a socket before timing out; note that this only applies to application operations, not SDAM. |
srvMaxHosts | non-negative integer; 0 means no maximum | defined in the Initial DNS Seedlist Discovery spec | no | The maximum number of SRV results to randomly select when initially populating the seedlist or, during SRV polling, adding new hosts to the topology. |
srvServiceName | a valid SRV service name according to RFC 6335 | "mongodb" | no | The service name to use for SRV lookup in initial DNS seedlist discovery and SRV polling |
ssl | "true" or "false" | same as "tls" | no | alias of "tls"; required to ensure that Atlas connection strings continue to work |
tls | "true" or "false" | TLS required if "mongodb+srv" scheme; otherwise, drivers may may enable TLS by default if other "tls"-prefixed options are present Drivers MUST clearly document the conditions under which TLS is enabled implicitly | no | Whether or not to require TLS for connections to the server |
tlsAllowInvalidCertificates | "true" or "false" | error on invalid certificates | required if the driver’s language/runtime allows bypassing hostname verification | Specifies whether or not the driver should error when the server’s TLS certificate is invalid |
tlsAllowInvalidHostnames | "true" or "false" | error on invalid hostnames | required if the driver’s language/runtime allows bypassing hostname verification | Specifies whether or not the driver should error when there is a mismatch between the server’s hostname and the hostname specified by the TLS certificate |
tlsCAFile | any string | no certificate authorities specified | required if the driver's language/runtime allows non-global configuration | Path to file with either a single or bundle of certificate authorities to be considered trusted when making a TLS connection |
tlsCertificateKeyFile | any string | no client certificate specified | required if the driver's language/runtime allows non-global configuration | Path to the client certificate file or the client private key file; in the case that they both are needed, the files should be concatenated |
tlsCertificateKeyFilePassword | any string | no password specified | required if the driver's language/runtime allows non-global configuration | Password to decrypt the client private key to be used for TLS connections |
tlsDisableCertificateRevocationCheck | "true" or "false" | false i.e. driver will check a certificate's revocation status | yes | Controls whether or not the driver will check a certificate's revocation status via CRLs or OCSP. See the OCSP Support Spec for additional information. |
tlsDisableOCSPEndpointCheck | "true" or "false" | false i.e. driver will reach out to OCSP endpoints if needed. | yes | Controls whether or not the driver will reach out to OCSP endpoints if needed. See the OCSP Support Spec for additional information. |
tlsInsecure | "true" or "false" | No TLS constraints are relaxed | no | Relax TLS constraints as much as possible (e.g. allowing invalid certificates or hostname mismatches); drivers must document the exact constraints which are relaxed by this option being true |
w | non-negative integer or string | no "w" value specified | no | Default write concern "w" field for the client |
waitQueueTimeoutMS | positive number | defined in the Connection Pooling spec | required for drivers with connection pools, with exceptions described in the Connection Pooling spec | NOTE: This option is deprecated in favor of timeoutMS. Amount of time spent attempting to check out a connection from a server's connection pool before timing out |
wTimeoutMS | non-negative 64-bit integer; 0 means no timeout | no timeout | no | NOTE: This option is deprecated in favor of timeoutMS. Default write concern "wtimeout" field for the client |
zlibCompressionLevel | integer between -1 and 9 (inclusive) | -1 (default compression level of the driver) | no | Specifies the level of compression when using zlib to compress wire protocol messages; -1 signifies the default level, 0 signifies no compression, 1 signifies the fastest speed, and 9 signifies the best compression |
Test Plan
Tests are implemented and described in the tests directory.
Design Rationale
Why allow drivers to provide the canonical names as aliases to existing options?
First and foremost, this spec aims not to introduce any breaking changes to drivers. Forcing a driver to change the name of an option that it provides will break any applications that use the old option. Moreover, it is already possible to provide duplicate options in the URI by specifying the same option more than once; drivers can use the same semantics to resolve the conflicts as they did before, whether it's raising an error, using the first option provided, using the last option provided, or simply telling users that the behavior is not defined.
Why use "tls" as the prefix instead of "ssl" for related options?
Technically speaking, drivers already only support TLS, which supersedes SSL. While SSL is commonly used in parlance to refer to TLS connections, the fact remains that SSL is a weaker cryptographic protocol than TLS, and we want to accurately reflect the strict requirements that drivers have in ensuring the security of a TLS connection.
Why use the names "tlsAllowInvalidHostnames" and "tlsAllowInvalidCertificates"?
The "tls" prefix is used for the same reasons described above. The use of the terms "AllowInvalidHostnames" and "AllowInvalidCertificates" is an intentional choice in order to convey the inherent unsafety of these options, which should only be used for testing purposes. Additionally, both the server and the shell use "AllowInvalid" for their equivalent options.
Why provide multiple implementation options for the insecure TLS options (i.e. "tlsInsecure" vs. "tlsAllowInvalidHostnames"/"tlsAllowInvalidCertificates")?
Some TLS libraries (e.g. Go's standard library implementation) do not provide the ability to distinguish between allowing invalid certificates and allowing invalid hostnames, meaning either both are allowed or neither is. However, when more granular options are available, it's better to expose these to the user to allow them to relax security constraints as little as they need.
Why leave the decision up to drivers to enable TLS implicitly when TLS options are present?
It can be useful to turn on TLS implicitly when options such as "tlsCAFile" are present and "tls" is not present. However, with options such as "tlsAllowInvalidHostnames", some drivers may not have the ability to distinguish between "false" being provided and the option not being specified. To keep the implicit enabling of TLS consistent between such options, we defer the decision to enable TLS based on the presence of "tls"-prefixed options (besides "tls" itself) to drivers.
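As an illustration, a driver that can distinguish unset options might implement the implicit enabling described above roughly as follows. This is a minimal sketch, assuming `options` is the parsed URI options dictionary with boolean values already coerced; the helper name and the exact set contents are illustrative, not normative:

```python
# Sketch: implicitly enable TLS when any "tls"-prefixed option (other than
# "tls" itself) appears in the parsed URI options.
TLS_PREFIXED_OPTIONS = {
    "tlsCAFile", "tlsCertificateKeyFile", "tlsCertificateKeyFilePassword",
    "tlsAllowInvalidCertificates", "tlsAllowInvalidHostnames", "tlsInsecure",
    "tlsDisableOCSPEndpointCheck", "tlsDisableCertificateRevocationCheck",
}

def resolve_tls(options):
    if "tls" in options:
        return options["tls"]  # an explicit setting always wins
    # Implicit enabling: any other "tls"-prefixed option turns TLS on.
    return any(name in options for name in TLS_PREFIXED_OPTIONS)
```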
Reference Implementations
Ruby and Python
Security Implication
Each of the "insecure" TLS options (i.e. "tlsInsecure", "tlsAllowInvalidHostnames", "tlsAllowInvalidCertificates", "tlsDisableOCSPEndpointCheck", and "tlsDisableCertificateRevocationCheck") default to the more secure option when TLS is enabled. In order to be backwards compatible with existing driver behavior, neither TLS nor authentication is enabled by default.
Future Work
This specification is intended to represent the current state of drivers' URI options rather than be a static description of the options at the time it was written. Whenever another specification is written or modified in a way that changes the name or the semantics of a URI option or adds a new URI option, this specification MUST be updated to reflect those changes.
Changelog
- 2024-05-08: Migrated from reStructuredText to Markdown.
- 2023-08-21: Add serverMonitoringMode option.
- 2022-10-05: Remove spec front matter and reformat changelog.
- 2022-01-19: Add the timeoutMS option and deprecate some existing timeout options.
- 2021-12-14: Add SOCKS5 options.
- 2021-11-08: Add maxConnecting option.
- 2021-10-14: Add srvMaxHosts option. Merge headings discussing URI validation for directConnection option.
- 2021-09-15: Add srvServiceName option.
- 2021-09-13: Fix link to load balancer spec.
- 2021-04-15: Add behaviour for load balancer mode.
- 2021-04-08: Updated to refer to hello and legacy hello.
- 2020-03-03: Add tlsDisableCertificateRevocationCheck option.
- 2020-02-26: Add tlsDisableOCSPEndpointCheck option.
- 2019-09-08: Add retryReads option.
- 2019-04-26: authSource and authMechanism have no default value.
- 2019-02-04: Specified errors for conflicting TLS-related URI options.
- 2019-01-25: Updated to reflect new Connection Monitoring and Pooling Spec.
OCSP Support
- Status: Accepted
- Minimum Server Version: 4.4
Abstract
This specification is about the ability for drivers to support OCSP—Online Certificate Status Protocol (RFC 6960)—and two of its related extensions: OCSP stapling (RFC 6066) and Must-Staple (RFC 7633).
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Required Server Versions
The server supports attaching a stapled OCSP response with versions ≥ 4.4. Future backports will bring stapling support to server versions ≥ 3.6. Drivers need not worry about the version of the server as a driver's TLS library should automatically perform the proper certificate revocation checking behavior once OCSP is enabled.
Enabling OCSP Support by Default
Drivers whose TLS libraries utilize application-wide settings for OCSP MUST respect the application's settings and MUST NOT change any OCSP settings. Otherwise:
- If a driver's TLS library supports OCSP, OCSP MUST be enabled by default whenever possible (even if this also enables Certificate Revocation List (CRL) checking).
- If a driver's TLS library supports verifying stapled OCSP responses, this option MUST be enabled by default whenever possible (even if this also enables CRL checking).
Suggested OCSP Behavior
Drivers SHOULD implement the OCSP behavior defined below to the extent that their TLS library allows. At any point in the steps defined below, if a certificate in, or necessary to validate, the chain is found to be invalid, the driver SHOULD end the connection.
- If a driver's TLS library supports Stapled OCSP, the server has a Must-Staple certificate and the server does not present a stapled OCSP response, a driver SHOULD end the connection.
- If a driver's TLS library supports Stapled OCSP and the server staples an OCSP response that does not cover the certificate it presents or is invalid per RFC 6960 Section 3.2, a driver SHOULD end the connection.
- If a driver's TLS library supports Stapled OCSP and the server staples an OCSP response that does cover the certificate it presents, a driver SHOULD accept the stapled OCSP response and validate all of the certificates that are presented in the response.
- If any unvalidated certificates in the chain remain and the client possesses an OCSP cache, the driver SHOULD attempt to validate the status of the unvalidated certificates using the cache.
- If any unvalidated certificates in the chain remain and the driver has a user specified CRL, the driver SHOULD attempt to validate the status of the unvalidated certificates using the user-specified CRL.
- If any unvalidated certificates in the chain remain and the driver has access to cached CRLs (e.g. OS-level/application-level/user-level caches), the driver SHOULD attempt to validate the status of the unvalidated certificates using the cached CRLs.
- If the server's certificate remains unvalidated, that certificate has a list of OCSP responder endpoints, and `tlsDisableOCSPEndpointCheck` or `tlsDisableCertificateRevocationCheck` is false (if the driver supports these options), the driver SHOULD send HTTP requests to the responders in parallel (a sketch follows this list). The first valid response that concretely marks the certificate status as good or revoked should be used. A timeout should be applied to requests per the Client Side Operations Timeout specification, with a default timeout of five seconds. The status for a response should only be checked if the response is valid per RFC 6960 Section 3.2.
- If any unvalidated intermediate certificates remain and those certificates have OCSP endpoints, for each certificate, the driver SHOULD NOT reach out to the OCSP endpoint specified and attempt to validate that certificate.*
- If any unvalidated intermediate certificates remain and those certificates have CRL distribution points, the driver SHOULD NOT download those CRLs and attempt to validate the status of all the other certificates using those CRLs.*
- Finally, the driver SHOULD continue the connection, even if the status of all the unvalidated certificates has not been confirmed yet. That is, the driver SHOULD default to "soft fail" behavior, connecting as long as there are no explicitly invalid certificates, even when certificate statuses cannot be confirmed (e.g. because an OCSP responder is down).
*: See Design Rationale: Suggested OCSP Behavior
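For drivers with this level of control, the parallel endpoint check might be sketched as follows. This is illustrative only: `check_ocsp_responder` is a hypothetical helper that performs a single HTTP OCSP request and returns `"good"`, `"revoked"`, or `None` for unknown/invalid responses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError

def query_responders(responder_urls, certificate, timeout_sec=5.0):
    """Query all OCSP responders in parallel and return the first conclusive
    status ("good" or "revoked"), or None if no conclusive answer arrives."""
    if not responder_urls:
        return None
    with ThreadPoolExecutor(max_workers=len(responder_urls)) as pool:
        futures = [
            # check_ocsp_responder is a hypothetical single-request helper.
            pool.submit(check_ocsp_responder, url, certificate, timeout_sec)
            for url in responder_urls
        ]
        try:
            for future in as_completed(futures, timeout=timeout_sec):
                try:
                    status = future.result()
                except Exception:
                    continue  # a network error is treated as "no answer"
                if status in ("good", "revoked"):
                    return status
        except TimeoutError:
            pass  # deadline reached without a conclusive answer
    # NB: a real implementation would also cancel outstanding requests.
    return None  # caller soft-fails and continues the connection
```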
Suggested OCSP Response Validation Behavior
Drivers SHOULD validate OCSP Responses in the manner specified in RFC 6960: 3.2 to the extent that their TLS library allows.
Suggested OCSP Caching Behavior
Drivers with sufficient control over their TLS library's OCSP behavior SHOULD implement an OCSP cache. The key for this cache SHOULD be the certificate identifier (CertID) of the OCSP request as specified in RFC 6960: 4.1.1. For convenience, the relevant section has been duplicated below:
CertID ::= SEQUENCE {
hashAlgorithm AlgorithmIdentifier,
issuerNameHash OCTET STRING, -- Hash of issuer's DN
issuerKeyHash OCTET STRING, -- Hash of issuer's public key
serialNumber CertificateSerialNumber }
If a driver would accept a conclusive OCSP response (stapled or non-stapled), the driver SHOULD cache that response. We define a conclusive OCSP response as an OCSP response that indicates that a certificate is either valid or revoked. Thus, an unknown certificate status SHOULD NOT be considered conclusive, and the corresponding OCSP response SHOULD NOT be cached.
In accordance with RFC 6960: 3.2, a cached response SHOULD be considered valid up to and excluding the time specified in the response's `nextUpdate` field. In other words, if the current time is t, then the cache entry SHOULD be considered valid if thisUpdate ⩽ t < nextUpdate.
If a driver would accept a stapled OCSP response and that response has a later `nextUpdate` than the response already in the cache, drivers SHOULD replace the older entry in the cache with the fresher response.
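Put together, a cache honoring these rules could be sketched as follows. The names here are illustrative: the sketch assumes a hashable `cert_id` key and response objects exposing `status`, `this_update`, and `next_update` attributes.

```python
import datetime

class OCSPCache:
    """Cache conclusive OCSP responses keyed by CertID (RFC 6960: 4.1.1)."""

    def __init__(self):
        self._entries = {}  # cert_id -> response

    def put(self, cert_id, response):
        if response.status not in ("good", "revoked"):
            return  # "unknown" is not conclusive and MUST NOT be cached
        cached = self._entries.get(cert_id)
        # Replace an existing entry only with a fresher response.
        if cached is None or response.next_update > cached.next_update:
            self._entries[cert_id] = response

    def get(self, cert_id):
        response = self._entries.get(cert_id)
        if response is None:
            return None
        now = datetime.datetime.now(datetime.timezone.utc)
        # Valid if thisUpdate <= now < nextUpdate.
        if response.this_update <= now < response.next_update:
            return response
        del self._entries[cert_id]  # expired entry
        return None
```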
MongoClient Configuration
This specification introduces the client-level configuration options defined below.
tlsDisableOCSPEndpointCheck
Drivers that can, on a per MongoClient basis, disable non-stapled OCSP while keeping stapled OCSP enabled MUST implement this option.
This boolean option determines whether a MongoClient should refrain from reaching out to an OCSP endpoint, i.e. whether non-stapled OCSP should be disabled. When set to true, a driver MUST NOT reach out to OCSP endpoints. When set to false, a driver MUST reach out to OCSP endpoints if needed (as described in Specification: Suggested OCSP Behavior).
For drivers that pass the "Soft Fail Test", this option MUST default to false.
For drivers that fail the "Soft Fail Test" because their TLS library exhibits hard-fail behavior when a responder is unreachable, this option MUST default to true, and a driver MUST document this behavior. If this hard-failure behavior is specific to a particular platform (e.g. the TLS library hard-fails only on Windows) then this option MUST default to true only on the platform where the driver exhibits hard-fail behavior, and a driver MUST document this behavior.
tlsDisableCertificateRevocationCheck
Drivers whose TLS libraries support an option to toggle general certificate revocation checking MUST implement this option if enabling general certificate revocation checking causes hard-fail behavior when no revocation mechanisms are available (i.e. no methods are defined or the CRL distribution points/OCSP endpoints are unreachable).
This boolean option determines whether a MongoClient should refrain from checking certificate revocation status. When set to true, a driver MUST NOT check certificate revocation status via CRLs or OCSP. When set to false, a driver MUST check certificate revocation status, reaching out to OCSP endpoints if needed (as described in Specification: Suggested OCSP Behavior).
For drivers that pass the "Soft Fail Test", this option MUST default to false.
If a driver does not support `tlsDisableOCSPEndpointCheck` and that driver fails the "Soft Fail Test" because its TLS library exhibits hard-fail behavior when a responder is unreachable, then that driver MUST default `tlsDisableCertificateRevocationCheck` to true. Such a driver also MUST document this behavior. If this hard-fail behavior is specific to a particular platform (e.g. the TLS library hard-fails only on Windows) then this option MUST default to true only on the platform where the driver exhibits hard-fail behavior, and a driver MUST document this behavior.
Naming Deviations
Drivers MUST use the defined names of `tlsDisableOCSPEndpointCheck` and `tlsDisableCertificateRevocationCheck` for the connection string parameters to ensure portability of connection strings across applications and drivers. If drivers solicit MongoClient options through another mechanism (e.g. an options dictionary provided to the MongoClient constructor), drivers SHOULD use the defined name but MAY deviate to comply with their existing conventions. For example, a driver may use `tls_disable_ocsp_endpoint_check` instead of `tlsDisableOCSPEndpointCheck`.
How OCSP interacts with existing configuration options
The following requirements apply only to drivers that are able to enable/disable OCSP on a per MongoClient basis.
- If a connection string specifies `tlsInsecure=true` then the driver MUST disable OCSP.
- If a connection string contains both `tlsInsecure` and `tlsDisableOCSPEndpointCheck` then the driver MUST throw an error.
- If a driver supports `tlsAllowInvalidCertificates`, and a connection string specifies `tlsAllowInvalidCertificates=true`, then the driver MUST disable OCSP.
- If a driver supports `tlsAllowInvalidCertificates`, and a connection string specifies both `tlsAllowInvalidCertificates` and `tlsDisableOCSPEndpointCheck`, then the driver MUST throw an error.
The remaining requirements in this section apply only to drivers that expose an option to enable/disable certificate revocation checking on a per MongoClient basis.
- Drivers MUST enable OCSP support (with stapling if possible) when certificate revocation checking is enabled, unless their TLS library exhibits hard-fail behavior (see tlsDisableCertificateRevocationCheck). In such a case, a driver MUST disable OCSP support on the platforms where its TLS library exhibits hard-fail behavior.
- Drivers SHOULD throw an error if any of `tlsInsecure=true` or `tlsAllowInvalidCertificates=true` or `tlsDisableOCSPEndpointCheck=true` is specified alongside the option to enable certificate revocation checking.
- If a connection string contains both `tlsInsecure` and `tlsDisableCertificateRevocationCheck` then the driver MUST throw an error.
- If a driver supports `tlsAllowInvalidCertificates` and a connection string specifies both `tlsAllowInvalidCertificates` and `tlsDisableCertificateRevocationCheck`, then the driver MUST throw an error.
- If a driver supports `tlsDisableOCSPEndpointCheck`, and a connection string specifies `tlsDisableCertificateRevocationCheck`, then the driver MUST throw an error (these conflict checks are illustrated in the sketch below).
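As an illustration, a driver's URI validation for the conflict rules above might look like the following sketch, where `options` is assumed to be the parsed URI options dictionary (the function name and return convention are illustrative, not normative):

```python
def validate_revocation_options(options):
    """Raise on conflicting combinations; return True when non-stapled OCSP
    must be disabled because TLS constraints have been relaxed."""
    conflicts = [
        ("tlsInsecure", "tlsDisableOCSPEndpointCheck"),
        ("tlsAllowInvalidCertificates", "tlsDisableOCSPEndpointCheck"),
        ("tlsInsecure", "tlsDisableCertificateRevocationCheck"),
        ("tlsAllowInvalidCertificates", "tlsDisableCertificateRevocationCheck"),
        ("tlsDisableOCSPEndpointCheck", "tlsDisableCertificateRevocationCheck"),
    ]
    for a, b in conflicts:
        if a in options and b in options:
            raise ValueError(f"URI options {a} and {b} conflict")
    # Relaxed TLS constraints imply that OCSP must be disabled.
    return (options.get("tlsInsecure") is True
            or options.get("tlsAllowInvalidCertificates") is True)
```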
TLS Requirements
Server Name Indication (SNI) MUST be used in the TLS connection that obtains the server's certificate, otherwise the server may present the incorrect certificate. This requirement is especially relevant to drivers whose TLS libraries allow for finer-grained control over their TLS behavior (e.g. Python, C).
Documentation Requirements
Drivers that cannot support OCSP MUST document this lack of support. Additionally, such drivers MUST document the following:
- They MUST document that they will be unable to support certificate revocation checking with Atlas when Atlas moves to OCSP-only certificates.
- They MUST document that users should be aware that if they use a Certificate Authority (CA) that issues OCSP-only certificates, then the driver cannot perform certificate revocation checking.
Drivers that support OCSP without stapling MUST document this lack of support for stapling. They also MUST document their behavior when an OCSP responder is unavailable and a server has a Must-Staple certificate. If a driver is able to connect in such a scenario due to the prevalence of "soft-fail" behavior in TLS libraries (where a certificate is accepted when an answer from an OCSP responder cannot be obtained), it additionally MUST document that this ability to connect to a server with a Must-Staple certificate when an OCSP responder is unavailable differs from the mongo shell or a driver that does support OCSP stapling, both of which will fail to connect (i.e. "hard-fail") in such a scenario.
If a driver (e.g. Python, C) allows the user to provide their own certificate revocation list (CRL), then that driver MUST document their TLS library's preference between the user-provided CRL and OCSP.
Drivers that cannot enable OCSP by default on a per MongoClient basis (e.g. Java) MUST document this limitation.
Drivers that fail either of the "Malicious Server Tests" (i.e. the driver connects to a test server without TLS constraints being relaxed) as defined in the test plan below MUST document that their chosen TLS library will connect in the case that a server with a Must-Staple certificate does not staple a response.
Drivers that fail "Malicious Server Test 2" (i.e. the driver connects to the test server without TLS constraints being relaxed) as defined in the test plan below MUST document that their chosen TLS library will connect in the case that a server with a Must-Staple certificate does not staple a response and the OCSP responder is down.
Drivers that fail "Soft Fail Test" MUST document that their driver's TLS library utilizes "hard fail" behavior in the case of an unavailable OCSP responder in contrast to the mongo shell and drivers that utilize "soft fail" behavior. They also MUST document the change in defaults for the applicable options (see MongoClient Configuration).
If any changes related to defaults for OCSP behavior are made after a driver version that supports OCSP has been released, the driver MUST document potential backwards compatibility issues as noted in the Backwards Compatibility section.
Test Plan
See tests/README for tests.
Motivation for Change
MongoDB Atlas intends to use Let's Encrypt, a Certificate Authority (CA) that does not use CRLs and only uses OCSP. (Atlas currently uses DigiCert certificates which specify both OCSP endpoints and CRL distribution points.) Therefore, the MongoDB server is adding support for OCSP, and drivers need to support OCSP in order for applications to continue to have the ability to verify the revocation status of an Atlas server's certificate. Other CAs have also stopped using CRLs, so enabling OCSP support will ensure that a customer's choice in CAs is not limited by a driver's lack of OCSP support.
OCSP stapling will also help applications deployed behind a firewall with an outbound allowList. It's a very natural mistake to neglect to allowList the CRL distribution points and the OCSP endpoints, which can prevent an application from connecting to a MongoDB instance if certificate revocation checking is enabled but the driver does not support OCSP stapling.
Finally, drivers whose TLS libraries support the OCSP stapling extension will be able to minimize the number of network round trips for the client because the driver's TLS library will read an OCSP response stapled to the server's certificate that the server provides as part of the TLS handshake. Drivers whose TLS libraries support OCSP but not stapling will need to make an additional round trip to contact the OCSP endpoint.
Design Rationale
We have chosen not to force drivers whose TLS libraries do not support OCSP/stapling "out of the box" to implement OCSP support due to the extra work and research that this might require. Similarly, this specification uses "SHOULD" more commonly (when other specs would prefer "MUST") to account for the fact that some drivers may not be able to fully customize OCSP behavior in their TLS library.
We are requiring drivers to support both stapled OCSP and non-stapled OCSP in order to support revocation checking for server versions in Atlas that do not support stapling, especially after Atlas switches to Let's Encrypt certificates (which do not have CRLs). Additionally, even when servers do support stapling, in the case of a non-"Must Staple" certificate (which is the type that Atlas is planning to use), if the server is unable to contact the OCSP responder (e.g. due to a network error) and staple a certificate, the driver being able to query the certificate's OCSP endpoint allows for one final chance to attempt to verify the certificate's validity.
Malicious Server Tests
"Malicious Server Test 2" is designed to reveal the behavior of TLS libraries of drivers in one of the worst case scenarios. Since a majority of the drivers will not have fine-grained control over their OCSP behavior, this test case provides signal about the soft/hard fail behavior in a driver's TLS library so that we can document this.
A driver with control over its OCSP behavior will react the same in "Malicious Server Test 1" and "Malicious Server Test 2", terminating the connection as long as TLS constraints have not been relaxed.
Atlas Connectivity Tests
No additional Atlas connectivity tests will be added because the existing tests should provide sufficient coverage (provided that one of the non-free tier clusters is upgraded to ≥ 3.6).
Suggested OCSP Behavior
For drivers with finer-grain control over their OCSP behavior, the suggested OCSP behavior was chosen as a balance between security and availability, erring on availability while minimizing network round trips. Therefore, in order to minimize network round trips, drivers are advised not to reach out to OCSP endpoints and CRL distribution points in order to verify the revocation status of intermediate certificates.
Backwards Compatibility
An application behind a firewall with an outbound allowList that upgrades to a driver implementing this specification may experience connectivity issues when OCSP is enabled. This is because the driver may need to contact OCSP endpoints or CRL distribution points[^1] specified in the server's certificate, and if these OCSP endpoints and/or CRL distribution points are not accessible, then the connection to the server may fail. (N.B.: TLS libraries typically implement "soft fail" such that connections can continue even if the OCSP server is inaccessible, so this issue is much more likely in the case of a server with a certificate that only contains CRL distribution points.) In such a scenario, connectivity may be restored by disabling non-stapled OCSP via `tlsDisableOCSPEndpointCheck` or by disabling certificate revocation checking altogether via `tlsDisableCertificateRevocationCheck`.
An application that uses a driver that utilizes hard-fail behavior when no certificate revocation mechanisms are available may also experience connectivity issues. Cases in which no certificate revocation mechanisms are available include:
- When a server's certificate defines neither OCSP endpoints nor CRL distribution points
- When a certificate defines CRL distribution points and/or OCSP endpoints but these points are unavailable (e.g. the points are down or the application is deployed behind a restrictive firewall)
In such a scenario, connectivity may be restored by disabling non-stapled OCSP via `tlsDisableOCSPEndpointCheck` or by disabling certificate revocation checking via `tlsDisableCertificateRevocationCheck`.
Reference Implementation
The .NET/C#, Python, C, and Go drivers will provide the reference implementations. See CSHARP-2817, PYTHON-2093, CDRIVER-3408 and GODRIVER-1467.
Security Implications
Customers should be aware that if they choose to use a CA that only supports OCSP, they will not be able to check certificate validity in drivers that cannot support OCSP.
In the case that the server has a Must-Staple certificate and its OCSP responder is down (for longer than the server is able to cache and staple a previously acquired response), the mongo shell or a driver that supports OCSP stapling will not be able to connect while a driver that supports OCSP but not stapling will be able to connect.
TLS libraries may implement "soft-fail" in the case of non-stapled OCSP which may be undesirable in highly secure contexts.
Drivers that fail the "Malicious Server" tests as defined in Test Plan will connect in the case that server with a Must-Staple certificate does not staple a response.
Testing Against Valid Certificate Chains
Some TLS libraries are stricter about the types of certificate chains they're willing to accept (and it can be difficult to debug why a particular certificate chain is considered invalid by a TLS library). Clients and servers with more control over their OCSP implementation may incur fewer up-front costs, but this may be at the cost of not fully implementing every single aspect of OCSP.
For example, the server team's certificate generation tool generated X509 V1 certificates which were used for testing OCSP without any issues in the server team's tests. However, while we were creating a test plan for drivers, we discovered that Java's keytool refused to import X509 V1 certificates into its trust store, and we thus had to modify the server team's certificate generation tool to generate V3 certificates.
Another example comes from .NET on Linux, which currently enforces the CA/Browser forum requirement that while a leaf certificate can be covered solely by OCSP, "public CAs have to have CRL[s] covering their issuing CAs". This requirement is not enforced with Java's default TLS libraries. See also: Future Work: CA/Browser Forum Requirements Complications.
Future Work
When the server work is backported, drivers will need to update their prose tests so that tests are run against a wider range of compatible servers.
Automated Atlas connectivity tests (DRIVERS-382) may be updated with additional OCSP-related URIs when 4.4 becomes available for Atlas; alternatively, the clusters behind those URIs may be updated to 4.4 (or an earlier version where OCSP has been backported). Note: While the free tier cluster used for the Automated Atlas connectivity tests will automatically get updated to 4.4 when it is available, Atlas currently does not plan to enable OCSP for free and shared tier instances (i.e. Atlas Proxy).
Options to configure failure behavior (e.g. to maximize security or availability) may be added in the future.
CA/Browser Forum Requirements Complications
The test plan may need to be reworked if we discover that a driver's TLS library strictly implements CA/Browser forum requirements (e.g. .NET on Linux). This is because our current chain of certificates does not fulfill the following requirement: while a leaf certificate can be covered solely by OCSP, "public CAs have to have CRL[s] covering their issuing CAs." This rework of the test plan may happen during the initial implementation of OCSP support or happen later if a driver's TLS library implements the relevant CA/Browser forum requirement.
Extending the chain to fulfill the CA/Browser requirement should solve this issue, although drivers that don't support manually supplying a CRL may need to host a web server that serves the required CRL during testing.
Q&A
Can we use one Evergreen task combined with distinct certificates for each column in the test matrix to prevent OCSP caching from affecting testing?
No. This is because Evergreen may reuse a host with an OCSP cache from a previous execution, so using distinct certificates per column would not obviate the need to clear all relevant OCSP caches prior to each test run. Since Evergreen does perform some cleanup between executions, having separate tasks for each test column offers an additional layer of safety in protecting against stale data in OCSP caches.
Should drivers use a nonce when creating an OCSP request?
A driver MAY use a nonce if desired, but including a nonce in an OCSP request is not required as the server does not explicitly support nonces.
Should drivers utilize a tolerance period when accepting OCSP responses?
No. Although RFC 5019, The Lightweight Online Certificate Status Protocol (OCSP) Profile for High-Volume Environments, allows for the configuration of a tolerance period for the acceptance of OCSP responses after `nextUpdate`, this spec is not adhering to that RFC.
Why was the decision made to allow OCSP endpoint checking to be enabled/disabled via a URI option?
We initially hoped that we would be able to avoid exposing any options specifically related to OCSP to the user, in accordance with the "No Knobs" drivers mantra. However, we later decided that users may benefit from having the ability to disable OCSP endpoint checking when applications are deployed behind a restrictive firewall with outbound allowLists, and this benefit is worth adding another URI option.
Appendix
OS-Level OCSP Cache Manipulation
Windows
On Windows, the OCSP cache can be viewed like so:
certutil -urlcache
To search the cache for "Lets Encrypt" OCSP cache entries, the following command could be used:
certutil -urlcache | findstr letsencrypt.org
On Windows, the OCSP cache can be cleared like so:
certutil -urlcache * delete
To delete only "Let's Encrypt" related entries, the following command could be used:
certutil -urlcache letsencrypt.org delete
macOS
On macOS 10.14, the OCSP cache can be viewed like so:
find ~/profile/Library/Keychains -name 'ocspcache.sqlite3' \
-exec sqlite3 "{}" 'SELECT responderURI FROM responses;' \;
To search the cache for "Let's Encrypt" OCSP cache entries, the following command could be used:
find ~/profile/Library/Keychains \
-name 'ocspcache.sqlite3' \
-exec sqlite3 "{}" \
'SELECT responderURI FROM responses WHERE responderURI LIKE "http://%.letsencrypt.org%";' \;
On macOS 10.14, the OCSP cache can be cleared like so:
find ~/profile/Library/Keychains -name 'ocspcache.sqlite3' \
-exec sqlite3 "{}" 'DELETE FROM responses ;' \;
To delete only "Let's Encrypt" related entries, the following command could be used:
find ~/profile/Library/Keychains -name 'ocspcache.sqlite3' \
-exec sqlite3 "{}" \
'DELETE FROM responses WHERE responderURI LIKE "http://%.letsencrypt.org%";' \;
Optional Quick Manual Validation Tests of OCSP Support
These optional validation tests are not a required part of the test plan. However, these optional tests may be useful for drivers trying to quickly determine if their TLS library supports OCSP and/or as an initial manual testing goal when implementing OCSP support.
Optional test to ensure that the driver's TLS library supports OCSP stapling
Create a test application with a connection string with TLS enabled that connects to any server that has an OCSP-only certificate and supports OCSP stapling.
For example, the test application could connect to CV, one of the special testing Atlas clusters with a valid OCSP-only certificate (see Future Work for additional information).
Alternatively, the test application can attempt to connect to a non-mongod server that supports OCSP-stapling and has a valid OCSP-only certificate. The connection will fail of course, but we are only interested in the TLS handshake and the OCSP requests that may follow. For example, the following connection string could be used:
mongodb://valid-isrgrootx1.letsencrypt.org:443/?tls=true
Run the test application and verify through packet analysis that the driver's ClientHello message's TLS extension section includes the `status_request` extension, thus indicating that the driver is advertising that it supports OCSP stapling.
Note: If using WireShark as your chosen packet analyzer, the `tls` (case-sensitive) display filter may be useful in this endeavor.
OCSP Caching and the optional test to ensure that the driver's TLS library supports non-stapled OCSP
The "Optional test to ensure that the driver's TLS library supports non-stapled OCSP" is complicated by the fact that OCSP allows the client to cache the OCSP responses, so clearing an OCSP cache may be needed in order to force the TLS library to reach out to an OCSP endpoint. This cache may exist at the OS-level, application-level and/or at the user-level.
Optional test to ensure that the driver's TLS library supports non-stapled OCSP
Create a test application with a connection string with TLS enabled that connects to any server with an OCSP-only certificate.
Alternatively, the test application can attempt to connect to a non-mongod server that does not support OCSP-stapling and has a valid OCSP-only certificate. The connection will fail of course, but we are only interested in the TLS handshake and the OCSP requests that may follow.
Alternatively, if it's known that a driver's TLS library does not support stapling or if stapling support can be toggled off, then any non-mongod server that has a valid OCSP-only certificate will work, including the example shown in the "Optional test to ensure that the driver's TLS library supports OCSP stapling."
Clear the OS/user/application OCSP cache, if one exists and the TLS library makes use of it.
Run the test application and ensure that the TLS handshake and connection succeed. Ensure that the driver's TLS library has contacted the OCSP endpoint specified in the server's certificate. Two simple ways of checking this are:
- Use a packet analyzer while the test application is running to ensure that the driver's TLS library contacts the OCSP endpoint. When using WireShark, the `ocsp` and `tls` (case-sensitive) display filters may be useful in this endeavor.
- If the TLS library utilizes an OCSP cache and the cache was cleared prior to starting the test application, check the OCSP cache for a response from an OCSP endpoint specified in the server's certificate.
Changelog
- 2024-08-20: Migrated from reStructuredText to Markdown.
- 2022-10-05: Remove spec front matter and reformat changelog.
- 2022-01-19: Require that timeouts be applied per the client-side operations timeout spec.
- 2021-04-07: Updated terminology to use allowList.
- 2020-07-01: Default tlsDisableOCSPEndpointCheck or tlsDisableCertificateRevocationCheck to true in the case that a driver's TLS library exhibits hard-fail behavior and add provision for platform-specific defaults.
- 2020-03-20: Clarify OCSP documentation requirements for drivers unable to enable OCSP by default on a per MongoClient basis.
- 2020-03-03: Add tlsDisableCertificateRevocationCheck URI option. Add Go as a reference implementation. Add hard-fail backwards compatibility documentation requirements.
- 2020-02-26: Add tlsDisableOCSPEndpointCheck URI option.
- 2020-02-19: Clarify behavior for reaching out to OCSP responders.
- 2020-02-10: Add cache requirement.
- 2020-01-31: Add SNI requirement and clarify design rationale regarding minimizing round trips.
- 2020-01-28: Clarify behavior regarding nonces and tolerance periods.
- 2020-01-16: Initial commit.
Endnotes
[^1]: Since this specification mandates that a driver must enable OCSP when possible, this may involve enabling certificate revocation checking in general, and thus the accessibility of CRL distribution points can become a factor.
MongoDB Handshake
- Status: Accepted
- Minimum Server Version: 3.4
Abstract
MongoDB 3.4 has the ability to annotate connections with metadata provided by the connecting client. The intent of this metadata is to be able to identify client level information about the connection, such as application name, driver name and version. The provided information will be logged through the `mongo[d|s].log` and the profile logs; this should enable sysadmins to easily backtrack log entries to the offending application. The active connection data will also be queryable through the aggregation pipeline, to enable collecting and analyzing driver trends.
After connecting to a MongoDB node a hello command (if Stable API is requested) or a legacy hello command is issued, followed by authentication, if appropriate. This specification augments this handshake and defines certain arguments that clients provide as part of the handshake.
This spec furthermore adds a new connection string argument for applications to declare their application name to the server.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Terms
hello command
The command named `hello`. It is the preferred and modern command for handshakes and topology monitoring.
legacy hello command
The command named `isMaster`. It is the deprecated equivalent of the `hello` command. It was deprecated in MongoDB 5.0.
isMaster / ismaster
The correct casing is `isMaster`, but servers will accept the alternate casing `ismaster`. Other case variations result in `CommandNotFound`. Drivers MUST take this case variation into account when determining which commands to encrypt, redact, or otherwise treat specially.
Specification
Connection handshake
MongoDB uses the `hello` or `isMaster` commands for handshakes and topology monitoring. `hello` is the modern and preferred command. `hello` must always be sent using the `OP_MSG` protocol. `isMaster` is referred to as "legacy hello" and is maintained for backwards compatibility with servers that do not support the `hello` command.
If a server API version is requested or `loadBalanced: True`, drivers MUST use the `hello` command for the initial handshake and use the `OP_MSG` protocol. If a server API version is not requested and `loadBalanced: False`, drivers MUST use legacy hello for the first message of the initial handshake with the `OP_QUERY` protocol (before switching to `OP_MSG` if the `maxWireVersion` indicates compatibility), and include `helloOk: true` in the handshake request.
ASIDE: If the legacy handshake response includes `helloOk: true`, then subsequent topology monitoring commands MUST use the `hello` command. If the legacy handshake response does not include `helloOk: true`, then subsequent topology monitoring commands MUST use the legacy hello command. See the Server Discovery and Monitoring spec for further information.
The initial handshake MUST be performed on every socket to any and all servers upon establishing the connection to MongoDB, including reconnects of dropped connections and newly discovered members of a cluster. It MUST be the first command sent over the respective socket. If the command fails the client MUST disconnect. Timeouts MUST be applied to this command per the Client Side Operations Timeout specification.
`hello` and legacy hello commands issued after the initial connection handshake MUST NOT contain handshake arguments. Any subsequent `hello` or legacy hello calls, such as the ones for topology monitoring purposes, MUST NOT include the `client` argument described below.
Example Implementation
Consider the following pseudo-code for establishing a new connection:
```python
conn = Connection()
conn.connect()  # Connect via TCP / TLS.
if stable_api_configured or client_options.load_balanced:
    cmd = {"hello": 1}
    conn.supports_op_msg = True  # Send the initial command via OP_MSG.
else:
    # "isMaster" is the legacy hello command.
    cmd = {"isMaster": 1, "helloOk": True}
    conn.supports_op_msg = False  # Send the initial command via OP_QUERY.
cmd["client"] = client_metadata
if client_options.compressors:
    cmd["compression"] = client_options.compressors
if client_options.load_balanced:
    cmd["loadBalanced"] = True
creds = client_options.credentials
if creds:
    # Negotiate auth mechanism and perform speculative auth. See Auth spec for details.
    if not creds.has_mechanism_configured():
        cmd["saslSupportedMechs"] = ...
    cmd["speculativeAuthenticate"] = ...
reply = conn.send_command("admin", cmd)
if reply["maxWireVersion"] >= 6:
    # Use OP_MSG for all future commands, including authentication.
    conn.supports_op_msg = True
# Store the negotiated compressor, see OP_COMPRESSED spec.
if reply.get("compression"):
    conn.compressor = reply["compression"][0]
# Perform connection authentication. See Auth spec for details.
negotiated_mechs = reply.get("saslSupportedMechs")
speculative_auth = reply.get("speculativeAuthenticate")
conn.authenticate(creds, negotiated_mechs, speculative_auth)
```
Hello Command
The initial handshake, as of MongoDB 3.4, supports a new argument, `client`, provided as a BSON object. This object has the following structure:
```
{
    hello: 1,
    helloOk: true,
    client: {
        /* OPTIONAL. If present, the "name" is REQUIRED */
        application: {
            name: "<string>"
        },
        /* REQUIRED, including all sub fields */
        driver: {
            name: "<string>",
            version: "<string>"
        },
        /* REQUIRED */
        os: {
            type: "<string>",         /* REQUIRED */
            name: "<string>",         /* OPTIONAL */
            architecture: "<string>", /* OPTIONAL */
            version: "<string>"       /* OPTIONAL */
        },
        /* OPTIONAL */
        platform: "<string>",
        /* OPTIONAL */
        env: {
            name: "<string>",   /* OPTIONAL */
            timeout_sec: 42,    /* OPTIONAL */
            memory_mb: 1024,    /* OPTIONAL */
            region: "<string>", /* OPTIONAL */
            /* OPTIONAL */
            container: {
                runtime: "<string>",     /* OPTIONAL */
                orchestrator: "<string>" /* OPTIONAL */
            }
        }
    }
}
```
client.application.name
This value is application configurable.
The application name is printed to the mongod logs upon establishing the connection. It is also recorded in the slow query logs and profile collections.
The recommended way for applications to provide this value is through the connection URI. The connection string key is `appname`.
Example connection string:
mongodb://server:27017/db?appname=mongodump
This option MAY also be provided on the MongoClient itself, if normal for the driver. It is only valid to set this attribute before any connection has been made to a server. Any attempt to set `client.application.name` MUST result in a failure when doing so would either change the existing value or cause any connections to MongoDB to report inconsistent values.
Drivers MUST NOT provide a default value for this key.
client.driver.name
This value is required and is not application configurable.
The internal driver name. For drivers written on top of other core drivers, the underlying driver will typically expose a function to append an additional name to this field.
Example:
- "pymongo"
- "mongoc / phongo"
client.driver.version
This value is required and is not application configurable.
The internal driver version. The version formatting is not defined. For drivers written on top of other core drivers, the underlying driver will typically expose a function to append an additional version to this field.
Example:
- "1.1.2-beta0"
- "1.4.1 / 1.2.0"
client.os.type
This value is required and is not application configurable.
The Operating System primary identification type the client is running on. Equivalent to `uname -s` on POSIX systems. This field is REQUIRED and clients MUST default to `unknown` when an appropriate value cannot be determined.
Example:
- "Linux"
- "Darwin"
- "Windows"
- "BSD"
- "Unix"
client.os.name
This value is optional but RECOMMENDED; it is not application configurable.
Detailed name of the Operating System, such as the fully qualified distribution name. On systemd systems, this is typically the `PRETTY_NAME` of `os-release(5)` (`/etc/os-release`); on LSB systems, the `DISTRIB_DESCRIPTION` (`/etc/lsb-release`, `lsb_release(1) --description`). The exact value and method to determine this value is undefined.
Example:
- "Ubuntu 16.04 LTS"
- "macOS"
- "CygWin"
- "FreeBSD"
- "AIX"
client.os.architecture
This value is optional, but RECOMMENDED, it is not application configurable. The machine hardware name. Equivalent to
uname -m
on POSIX systems.
Example:
- "x86_64"
- "ppc64le"
client.os.version
This value is optional and is not application configurable.
The Operating System version.
Example:
- "10"
- "8.1"
- "16.04.1"
client.platform
This value is optional and is not application configurable.
Driver specific platform details.
Example:
- clang 3.8.0 CFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium"
- "Oracle JVM EE 9.1.1"
client.env
This value is optional and is not application configurable.
Information about the execution environment, including Function-as-a-Service (FaaS) identification and container runtime.
The contents of `client.env` MUST be adjusted to keep the handshake below the size limit; see Limitations for specifics.
If no fields of `client.env` would be populated, `client.env` MUST be entirely omitted.
FaaS
FaaS details are captured in the `name`, `timeout_sec`, `memory_mb`, and `region` fields of `client.env`. The `name` field is determined by which of the following environment variables are populated:
Name | Environment Variable |
---|---|
aws.lambda | AWS_EXECUTION_ENV 1 or AWS_LAMBDA_RUNTIME_API |
azure.func | FUNCTIONS_WORKER_RUNTIME |
gcp.func | K_SERVICE or FUNCTION_NAME |
vercel | VERCEL |
If none of those variables are populated, the other FaaS values MUST be entirely omitted. When variables for multiple `client.env.name` values are present, `vercel` takes precedence over `aws.lambda`; any other combination MUST cause the other FaaS values to be entirely omitted.
Depending on which `client.env.name` has been selected, other FaaS fields in `client.env` SHOULD be populated:
| Name | Field | Environment Variable | Expected Type |
| --- | --- | --- | --- |
| aws.lambda | client.env.region | AWS_REGION | string |
| | client.env.memory_mb | AWS_LAMBDA_FUNCTION_MEMORY_SIZE | int32 |
| gcp.func | client.env.memory_mb | FUNCTION_MEMORY_MB | int32 |
| | client.env.timeout_sec | FUNCTION_TIMEOUT_SEC | int32 |
| | client.env.region | FUNCTION_REGION | string |
| vercel | client.env.region | VERCEL_REGION | string |
Missing variables or variables with values not matching the expected type MUST cause the corresponding `client.env` field to be omitted and MUST NOT cause a user-visible error.
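The detection logic above might be sketched as follows; the environment variable names come from the tables above, while the function names and data layout are illustrative assumptions:

```python
import os

FAAS_VARS = {
    "aws.lambda": ("AWS_EXECUTION_ENV", "AWS_LAMBDA_RUNTIME_API"),
    "azure.func": ("FUNCTIONS_WORKER_RUNTIME",),
    "gcp.func": ("K_SERVICE", "FUNCTION_NAME"),
    "vercel": ("VERCEL",),
}

def detect_faas_name():
    matches = {name for name, vars_ in FAAS_VARS.items()
               if any(os.environ.get(v) for v in vars_)}
    if not matches:
        return None
    if matches == {"vercel", "aws.lambda"}:
        return "vercel"  # vercel takes precedence over aws.lambda
    if len(matches) > 1:
        return None      # any other combination: omit FaaS values entirely
    return matches.pop()

def env_int(var):
    """Return the variable as an int32-ish value, or None when it is missing
    or malformed; a type mismatch MUST NOT cause a user-visible error."""
    value = os.environ.get(var)
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None
```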
Container
Container runtime information is captured in `client.env.container`.
`client.env.container.runtime` MUST be set to `"docker"` if the file `.dockerenv` exists in the root directory.
`client.env.container.orchestrator` MUST be set to `"kubernetes"` if the environment variable `KUBERNETES_SERVICE_HOST` is populated.
If no fields of `client.env.container` would be populated, `client.env.container` MUST be entirely omitted.
If the runtime environment has both FaaS and container information, both MUST have their metadata included in `client.env`.
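A corresponding container check might look like this minimal sketch (the file and variable names are from this spec; the function name is illustrative):

```python
import os

def collect_container_metadata():
    container = {}
    if os.path.exists("/.dockerenv"):
        container["runtime"] = "docker"
    if os.environ.get("KUBERNETES_SERVICE_HOST"):
        container["orchestrator"] = "kubernetes"
    return container  # empty dict: omit client.env.container entirely
```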
Speculative Authentication
- Since: 4.4
The initial handshake supports a new argument, `speculativeAuthenticate`, provided as a BSON document. Clients specifying this argument to `hello` or legacy hello will speculatively include the first command of an authentication handshake. This command may be provided to the server in parallel with any standard request for supported authentication mechanisms (i.e. `saslSupportedMechs`). This would permit clients to merge the contents of their first authentication command with their initial handshake request, and receive the first authentication reply along with the initial handshake reply.
When the mechanism is `MONGODB-X509`, `speculativeAuthenticate` has the same structure as seen in the `MONGODB-X509` conversation section in the Driver Authentication spec.
When the mechanism is `SCRAM-SHA-1` or `SCRAM-SHA-256`, `speculativeAuthenticate` has the same fields as seen in the conversation subsection of the SCRAM-SHA-1 and SCRAM-SHA-256 sections in the Driver Authentication spec, with an additional `db` field to specify the name of the authentication database.
When the mechanism is `MONGODB-OIDC`, `speculativeAuthenticate` has the same structure as seen in the `MONGODB-OIDC` conversation section in the Driver Authentication spec.
If the initial handshake command with a `speculativeAuthenticate` argument succeeds, the client should proceed with the next step of the exchange. If the initial handshake response does not include a `speculativeAuthenticate` reply and the `ok` field in the initial handshake response is set to 1, drivers MUST authenticate using the standard authentication handshake.
The `speculativeAuthenticate` reply has the same fields, except for the `ok` field, as seen in the conversation sections for MONGODB-X509, SCRAM-SHA-1 and SCRAM-SHA-256 in the Driver Authentication spec.
Drivers MUST NOT validate the contents of the `saslSupportedMechs` attribute of the initial handshake reply. Drivers MUST NOT raise an error if the `saslSupportedMechs` attribute of the reply includes an unknown mechanism.
If an authentication mechanism is not provided either via connection string or code, but a credential is provided, drivers MUST use the SCRAM-SHA-256 mechanism for speculative authentication and drivers MUST send `saslSupportedMechs`.
Older servers will ignore the `speculativeAuthenticate` argument. New servers will participate in the standard authentication conversation if this argument is missing.
Supporting Wrapping Libraries
Drivers MUST allow libraries which wrap the driver to append to the client metadata generated by the driver. The following class definition defines the options which MUST be supported:
```typescript
class DriverInfoOptions {
    /**
     * The name of the library wrapping the driver.
     */
    name: String;

    /**
     * The version of the library wrapping the driver.
     */
    version: Optional<String>;

    /**
     * Optional platform information for the wrapping driver.
     */
    platform: Optional<String>;
}
```
Note that how these options are provided to a driver is left up to the implementer.
If provided, these options MUST NOT replace the values used for metadata generation. The provided options MUST be appended to their respective fields, and be delimited by a `|` character. For example, when Motor wraps PyMongo, the following fields are updated to include Motor's "driver info":
```
{
    client: {
        driver: {
            name: "PyMongo|Motor",
            version: "3.6.0|2.0.0"
        }
    }
}
```
NOTE: All strings provided as part of the driver info MUST NOT contain the delimiter used for metadata concatenation. Drivers MUST throw an error if any of these strings contains that character.
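A driver might implement this append-and-validate step roughly as follows; this is a minimal sketch in which `metadata` mirrors the `client` document shown earlier and the function name is assumed, not part of this spec:

```python
DELIMITER = "|"

def append_driver_info(metadata, name, version=None, platform=None):
    """Append wrapping-library info to driver-generated client metadata."""
    for value in (name, version, platform):
        if value is not None and DELIMITER in value:
            raise ValueError("driver info must not contain the delimiter")
    metadata["driver"]["name"] += DELIMITER + name
    if version is not None:
        metadata["driver"]["version"] += DELIMITER + version
    if platform is not None:
        base = metadata.get("platform")
        metadata["platform"] = base + DELIMITER + platform if base else platform
```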
Deviations
Some drivers have already implemented such functionality, and should not be required to make breaking changes to comply with the requirements set forth here. A non-exhaustive list of acceptable deviations are as follows:
- The name of `DriverInfoOptions` is non-normative; implementers may feel free to name this whatever they like.
- The choice of delimiter is not fixed; `|` is the recommended value, but some drivers currently use `/`.
- For cases where we own a particular stack of drivers (more than two), it may be preferable to accept a list of strings for each field.
Limitations
The entire `client` metadata BSON document MUST NOT exceed 512 bytes. This includes all BSON overhead. The `client.application.name` cannot exceed 128 bytes. MongoDB will return an error if these limits are not adhered to, which will result in handshake failure. Drivers MUST validate these values and truncate or omit driver-provided values if necessary. Implementers SHOULD cumulatively update fields in the following order until the document is under the size limit:
- Omit fields from `env` except `env.name`.
- Omit fields from `os` except `os.type`.
- Omit the `env` document entirely.
- Truncate `platform`.
Additionally, implementers are encouraged to place high-priority information about the platform earlier in the string, in order to avoid possible truncation of those details.
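A minimal sketch of that cumulative truncation, assuming PyMongo's `bson` package is available to measure the encoded size:

```python
import bson  # assumption: PyMongo's bson package

LIMIT = 512  # bytes, including all BSON overhead

def fit_metadata(doc: dict) -> dict:
    """Apply the truncation steps above until the document fits."""
    if "env" in doc and len(bson.encode(doc)) > LIMIT:
        doc["env"] = {"name": doc["env"]["name"]}   # 1. keep only env.name
    if "os" in doc and len(bson.encode(doc)) > LIMIT:
        doc["os"] = {"type": doc["os"]["type"]}     # 2. keep only os.type
    if len(bson.encode(doc)) > LIMIT:
        doc.pop("env", None)                        # 3. omit env entirely
    while doc.get("platform") and len(bson.encode(doc)) > LIMIT:
        doc["platform"] = doc["platform"][:-1]      # 4. truncate platform
    return doc
```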
Motivation For Change
Being able to annotate individual connections with custom data will allow users and sysadmins to easily correlate events happening on their MongoDB deployment to a specific application. For support engineers, it furthermore helps identify potential problems in drivers or client platforms, and paves the way for providing proactive support via Cloud Manager and/or Atlas to advise customers about out of date driver versions.
Design Rationale
Drivers run on a multitude of platforms, languages, environments and systems. There is no defined list of data points that may or may not be valuable to every system. Rather than specifying such a list it was decided we would report the basics; something that everyone can discover and consider valuable. The obvious requirement here being the driver itself and its version. Any additional information is generally very system specific. Scala may care to know the Java runtime, while Python would like to know if it was built with C extensions - and C would like to know the compiler.
Having to define dozens of arguments that may or may not be useful to one or two drivers isn't a good idea. Instead, we define a `platform` argument that is driver dependent. This value will not have a defined value across drivers and is therefore not generically queryable -- however, it will gain a defined schema for that particular driver, and will therefore over time gain defined structure that can be formatted and values extracted from it.
Backwards Compatibility
The legacy hello command currently ignores arguments. (i.e. If arguments are provided the legacy hello command discards them without erroring out). Adding client metadata functionality has therefore no backwards compatibility concerns.
This also allows a driver to determine if the `hello` command is supported. On server versions that support the `hello` command, the legacy hello command with `helloOk: true` will respond with `helloOk: true`. On server versions that do not support the `hello` command, the `helloOk: true` argument is ignored and the legacy hello response will not contain `helloOk: true`.
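In driver code the check can be as small as the following sketch; `run_command` is a hypothetical helper that sends a command on an established connection and returns the reply document, and `isMaster` is the legacy hello command name.

```python
# Hypothetical helper: sends a command over an established connection
# and returns the reply document.
reply = run_command({"isMaster": 1, "helloOk": True})
supports_hello = reply.get("helloOk", False)  # True => use `hello` from now on
```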
Reference Implementation
Q&A
- The 128 bytes `application.name` limit, does that include BSON overhead?
  - No, just the string itself.
- The 512 bytes limit, does that include BSON overhead?
  - Yes.
- The 512 bytes limit, does it apply to the full initial handshake document or just the `client` subdocument?
  - Just the subdocument.
- Should I really try to fill the 512 bytes with data?
  - Not really. The server does not attempt to normalize or compress this data in any way, so it will hold it in memory as-is per connection. 512 bytes for 20,000 connections is ~10 MB of memory the server will need.
- What happens if I pass new arguments in the legacy hello command to previous MongoDB versions?
  - Nothing. Arguments passed to the legacy hello command to prior versions of MongoDB are not treated in any special way and have no effect one way or another.
- Are there wire version bumps or anything accompanying this specification?
  - No.
- Is establishing the handshake required for connecting to MongoDB 3.4?
  - No, it only augments the connection. MongoDB will not reject connections without it.
- Does this affect SDAM implementations?
  - Possibly. There are a couple of gotchas. If the `application.name` is not in the URI...
    - The SDAM monitoring cannot be launched until the user has had the ability to set the application name, because the application name has to be sent in the initial handshake. This means that the connection pool cannot be established until the first user-initiated command, or else some connections will have the application name while others won't.
    - The initial handshake must be called on all sockets, including administrative background sockets to MongoDB.
- My language doesn't have `uname`, but does instead provide its own variation of these values; is that OK?
  - Absolutely. As long as the value is identifiable it is fine. The exact method and values are undefined by this specification.
Changelog
- 2024-11-05: Move handshake prose tests from spec file to prose test file.
- 2024-10-09: Clarify that FaaS and container metadata must both be populated when both are present.
- 2024-08-16: Migrated from reStructuredText to Markdown.
- 2024-04-22: Clarify that driver should not validate `saslSupportedMechs` content.
- 2023-08-24: Added container awareness.
- 2023-05-04: `AWS_EXECUTION_ENV` must start with `"AWS_Lambda_"`.
- 2023-04-03: Simplify truncation for metadata.
- 2023-03-13: Add `env` to `client` document.
- 2022-10-05: Remove spec front matter and reformat changelog.
- 2022-02-24: Rename Versioned API to Stable API.
- 2022-01-19: Require that timeouts be applied per the client-side operations timeout spec.
- 2022-01-13: Updated to disallow `hello` using `OP_QUERY`.
- 2021-04-27: Updated to define `hello` and legacy hello.
- 2020-02-12: Added section about speculative authentication.
- 2019-11-13: Added section about supporting wrapping libraries.
Wire Compression in Drivers
- Status: Accepted
- Minimum Server Version: 3.4
Abstract
This specification describes how to implement Wire Protocol Compression between MongoDB drivers and mongo[d|s].
Compression is achieved through a new bi-directional wire protocol opcode, referred to as OP_COMPRESSED.
Server side compressor support is checked during the initial MongoDB Handshake, and is compatible with all historical versions of MongoDB. If a client detects a compatible compressor it will use the compressor for all valid requests.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Terms
Compressor - A compression algorithm. There are currently three supported algorithms: snappy, zlib, and zstd.
Specification
MongoDB Handshake Amendment
The MongoDB Handshake Protocol describes an argument passed to the handshake command, `client`. This specification adds an additional argument, `compression`, that SHOULD be provided to the handshake command if a driver supports wire compression.

The value of this argument is an array of supported compressors.
Example:
{
hello: 1,
client: {}, /* See MongoDB Handshake */
compression: ["snappy", "zlib", "zstd"]
}
When no compression is enabled on the client, drivers SHOULD send an empty compression argument.
Example:
{
hello: 1,
client: {}, /* See MongoDB Handshake */
compression: []
}
Clients that want to compress their messages need to send a list of the algorithms - in the order they are specified in the client configuration - that they are willing to support to the server during the initial handshake call. For example, a client wishing to compress with the snappy algorithm, should send:
{ hello: 1, ... , compression: [ "snappy" ] }
The server will respond with the intersection of its list of supported compressors and the client's. For example, if the server had both snappy and zlib enabled and the client requested snappy, it would respond with:
{ ... , compression: [ "snappy" ], ok: 1 }
If the client included both snappy and zlib, the server would respond with something like:
{ ... , compression: [ "snappy", "zlib" ], ok: 1 }
If the server has no compression algorithms in common with the client, it sends back a handshake response without a compression field. Clients MAY issue a log level event to inform the user, but MUST NOT error.
When the MongoDB server receives a compressor it supports, it MAY reply to any and all requests using the selected compressor, including the reply to the initial MongoDB Handshake. As each OP_COMPRESSED message contains the compressor ID, clients MUST NOT assume which compressor each message uses, but MUST decompress the message using the compressor identified in the OP_COMPRESSED opcode header.

When compressing, clients MUST use the first compressor in the client's configured compressors list that is also in the server's list.
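A sketch of that selection rule (the names are illustrative, not mandated by this spec):

```python
def select_compressor(configured: list, handshake_reply: dict):
    """Return the first configured compressor the server also supports."""
    server_side = set(handshake_reply.get("compression", []))
    for name in configured:  # configured order is the priority order
        if name in server_side:
            return name
    return None  # no compressor in common: continue uncompressed
```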
Connection String Options
Two new connection string options:
compressors
Comma separated list of compressors the client should present to the server. Unknown compressors MUST yield a warning, as per the Connection String specification, and MUST NOT be included in the handshake. By default, no compressor is configured and thus compression will not be used. When multiple compressors are provided, the list should be treated as a priority ordered list of compressors to use, with highest priority given to the first compressor in the list, and lowest priority to the last compressor in the list.
Example:
mongodb://localhost/?compressors=zstd,snappy,zlib
zlibCompressionLevel
Integer value from `-1` to `9`. This configuration option only applies to the zlib compressor. When zlib is not one of the compressors enumerated by the `compressors` configuration option then providing this option has no meaning, but clients SHOULD NOT issue a warning.
Level | Description |
---|---|
-1 | Default Compression (usually 6) |
0 | No compression |
1 | Best Speed |
9 | Best Compression |
Note that this value only applies to the client side compression level, not the response.
OP_COMPRESSED
The new opcode, called OP_COMPRESSED, has the following structure:
struct OP_COMPRESSED {
struct MsgHeader {
int32 messageLength;
int32 requestID;
int32 responseTo;
int32 opCode = 2012;
};
int32_t originalOpcode;
int32_t uncompressedSize;
uint8_t compressorId;
char *compressedMessage;
};
Field | Description |
---|---|
originalOpcode | Contains the value of the wrapped opcode. |
uncompressedSize | The size of the compressedMessage after decompression, which excludes the MsgHeader |
compressorId | The ID of the compressor that compressed the message |
compressedMessage | The opcode itself, excluding the MsgHeader |
Compressor IDs
Each compressor is assigned a predefined compressor ID.
compressorId | Handshake Value | Description |
---|---|---|
0 | noop | The content of the message is uncompressed. This is realistically only used for testing |
1 | snappy | The content of the message is compressed using snappy. |
2 | zlib | The content of the message is compressed using zlib. |
3 | zstd | The content of the message is compressed using zstd. |
4-255 | reserved | Reserved for future use. |
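To make the wire layout concrete, here is a short, non-normative sketch of wrapping an existing wire message in `OP_COMPRESSED`, assuming the third-party `python-snappy` package for compression:

```python
import struct
import snappy  # assumption: the python-snappy package

OP_COMPRESSED = 2012
SNAPPY_ID = 1

def wrap_in_op_compressed(message: bytes, request_id: int) -> bytes:
    """Compress a complete wire message (16-byte MsgHeader + body)."""
    (original_opcode,) = struct.unpack_from("<i", message, 12)  # opCode field
    body = message[16:]                      # everything after the MsgHeader
    compressed = snappy.compress(body)
    payload = (struct.pack("<iiB", original_opcode, len(body), SNAPPY_ID)
               + compressed)
    header = struct.pack("<iiii", 16 + len(payload), request_id, 0,
                         OP_COMPRESSED)
    return header + payload
```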
Compressible messages
Any opcode can be compressed and wrapped in an `OP_COMPRESSED` header. `OP_COMPRESSED` is strictly a wire protocol wrapper, without regard to which opcode it wraps, be it `OP_QUERY`, `OP_REPLY`, `OP_MSG` or any other future or past opcode. The `compressedMessage` contains the original opcode, excluding the standard `MsgHeader`. The `originalOpcode` value therefore effectively replaces the standard `MsgHeader` of the compressed opcode.
There is no guarantee that a response will be compressed even though compression was negotiated for in the handshake. Clients MUST be able to parse both compressed and uncompressed responses to both compressed and uncompressed requests.
MongoDB 3.4 will always reply with a compressed response when compression has been negotiated, but future versions may not.
A client MAY choose to implement compression for only `OP_QUERY`, `OP_REPLY`, and `OP_MSG`, and perhaps for future opcodes, but not to implement it for `OP_INSERT`, `OP_UPDATE`, `OP_DELETE`, `OP_GETMORE`, and `OP_KILLCURSORS`.
Note that certain messages, such as authentication commands, MUST NOT be compressed. All other messages MUST be compressed, when compression has been negotiated and the driver has implemented compression for the opcode in use.
Messages not allowed to be compressed
In efforts to mitigate against current and previous attacks, certain messages MUST NOT be compressed, such as authentication requests.
Messages using the following commands MUST NOT be compressed:
- hello
- legacy hello (see MongoDB Handshake Protocol for details)
- saslStart
- saslContinue
- getnonce
- authenticate
- createUser
- updateUser
- copydbSaslStart
- copydbgetnonce
- copydb
Test Plan
There are no automated tests accompanying this specification, instead the following is a description of test scenarios clients should implement.
In general, after implementing this functionality and the test cases, running the traditional client test suite against a server with compression enabled, and ensuring the test suite is configured to provide a valid compressor as part of the connection string, is a good idea. MongoDB-supported drivers MUST add such a variant to their CI environment.
The following cases assume a standalone MongoDB 3.4 (or later) node configured with:
mongod --networkMessageCompressors "snappy" -vvv
Create an example application which connects to a provided connection string, runs `ping: 1`, and then quits the program normally.
Connection strings, and results
- `mongodb://localhost:27017/?compressors=snappy`

  mongod should have logged the following (the exact log output may differ depending on server version):
{"t":{"$date":"2021-04-08T13:28:38.885-06:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"Connection accepted","attr":{"remote":"127.0.0.1:50635","uuid":"03961627-aec7-4543-8a17-9690f87273a6","connectionId":2,"connectionCount":1}} {"t":{"$date":"2021-04-08T13:28:38.886-06:00"},"s":"D3", "c":"EXECUTOR", "id":22983, "ctx":"listener","msg":"Starting new executor thread in passthrough mode"} {"t":{"$date":"2021-04-08T13:28:38.887-06:00"},"s":"D3", "c":"-", "id":5127801, "ctx":"thread27","msg":"Setting the Client","attr":{"client":"conn2"}} {"t":{"$date":"2021-04-08T13:28:38.887-06:00"},"s":"D2", "c":"COMMAND", "id":21965, "ctx":"conn2","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"hello":1,"client":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"4.9.0-alpha7-555-g623aa8f"},"os":{"type":"Darwin","name":"Mac OS X","architecture":"x86_64","version":"19.6.0"}},"compression":["snappy"],"apiVersion":"1","apiStrict":true,"$db":"admin"}}} {"t":{"$date":"2021-04-08T13:28:38.888-06:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn2","msg":"client metadata","attr":{"remote":"127.0.0.1:50635","client":"conn2","doc":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"4.9.0-alpha7-555-g623aa8f"},"os":{"type":"Darwin","name":"Mac OS X","architecture":"x86_64","version":"19.6.0"}}}} {"t":{"$date":"2021-04-08T13:28:38.889-06:00"},"s":"D3", "c":"NETWORK", "id":22934, "ctx":"conn2","msg":"Starting server-side compression negotiation"} {"t":{"$date":"2021-04-08T13:28:38.889-06:00"},"s":"D3", "c":"NETWORK", "id":22937, "ctx":"conn2","msg":"supported compressor","attr":{"compressor":"snappy"}} {"t":{"$date":"2021-04-08T13:28:38.889-06:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn2","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"hello":1,"client":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"4.9.0-alpha7-555-g623aa8f"},"os":{"type":"Darwin","name":"Mac OS X","architecture":"x86_64","version":"19.6.0"}},"compression":["snappy"],"apiVersion":"1","apiStrict":true,"$db":"admin"},"numYields":0,"reslen":351,"locks":{},"remote":"127.0.0.1:50635","protocol":"op_query","durationMillis":1}} {"t":{"$date":"2021-04-08T13:28:38.890-06:00"},"s":"D2", "c":"QUERY", "id":22783, "ctx":"conn2","msg":"Received interrupt request for unknown op","attr":{"opId":596,"knownOps":[]}} {"t":{"$date":"2021-04-08T13:28:38.890-06:00"},"s":"D3", "c":"-", "id":5127803, "ctx":"conn2","msg":"Released the Client","attr":{"client":"conn2"}} {"t":{"$date":"2021-04-08T13:28:38.890-06:00"},"s":"D3", "c":"-", "id":5127801, "ctx":"conn2","msg":"Setting the Client","attr":{"client":"conn2"}} {"t":{"$date":"2021-04-08T13:28:38.891-06:00"},"s":"D3", "c":"NETWORK", "id":22927, "ctx":"conn2","msg":"Decompressing message","attr":{"compressor":"snappy"}} {"t":{"$date":"2021-04-08T13:28:38.891-06:00"},"s":"D2", "c":"COMMAND", "id":21965, "ctx":"conn2","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"whatsmyuri":1,"apiStrict":false,"$db":"admin","apiVersion":"1"}}} {"t":{"$date":"2021-04-08T13:28:38.892-06:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn2","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB 
Shell","command":{"whatsmyuri":1,"apiStrict":false,"$db":"admin","apiVersion":"1"},"numYields":0,"reslen":63,"locks":{},"remote":"127.0.0.1:50635","protocol":"op_msg","durationMillis":0}} {"t":{"$date":"2021-04-08T13:28:38.892-06:00"},"s":"D2", "c":"QUERY", "id":22783, "ctx":"conn2","msg":"Received interrupt request for unknown op","attr":{"opId":597,"knownOps":[]}} {"t":{"$date":"2021-04-08T13:28:38.892-06:00"},"s":"D3", "c":"NETWORK", "id":22925, "ctx":"conn2","msg":"Compressing message","attr":{"compressor":"snappy"}} {"t":{"$date":"2021-04-08T13:28:38.893-06:00"},"s":"D3", "c":"-", "id":5127803, "ctx":"conn2","msg":"Released the Client","attr":{"client":"conn2"}} {"t":{"$date":"2021-04-08T13:28:38.893-06:00"},"s":"D3", "c":"-", "id":5127801, "ctx":"conn2","msg":"Setting the Client","attr":{"client":"conn2"}} {"t":{"$date":"2021-04-08T13:28:38.895-06:00"},"s":"D3", "c":"NETWORK", "id":22927, "ctx":"conn2","msg":"Decompressing message","attr":{"compressor":"snappy"}} {"t":{"$date":"2021-04-08T13:28:38.895-06:00"},"s":"D2", "c":"COMMAND", "id":21965, "ctx":"conn2","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"buildinfo":1.0,"apiStrict":false,"$db":"admin","apiVersion":"1"}}} {"t":{"$date":"2021-04-08T13:28:38.896-06:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn2","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"buildinfo":1.0,"apiStrict":false,"$db":"admin","apiVersion":"1"},"numYields":0,"reslen":2606,"locks":{},"remote":"127.0.0.1:50635","protocol":"op_msg","durationMillis":0}} {"t":{"$date":"2021-04-08T13:28:38.896-06:00"},"s":"D2", "c":"QUERY", "id":22783, "ctx":"conn2","msg":"Received interrupt request for unknown op","attr":{"opId":598,"knownOps":[]}} {"t":{"$date":"2021-04-08T13:28:38.897-06:00"},"s":"D3", "c":"NETWORK", "id":22925, "ctx":"conn2","msg":"Compressing message","attr":{"compressor":"snappy"}} {"t":{"$date":"2021-04-08T13:28:38.897-06:00"},"s":"D3", "c":"-", "id":5127803, "ctx":"conn2","msg":"Released the Client","attr":{"client":"conn2"}} {"t":{"$date":"2021-04-08T13:28:38.897-06:00"},"s":"D3", "c":"-", "id":5127801, "ctx":"conn2","msg":"Setting the Client","attr":{"client":"conn2"}} {"t":{"$date":"2021-04-08T13:28:38.898-06:00"},"s":"D3", "c":"NETWORK", "id":22927, "ctx":"conn2","msg":"Decompressing message","attr":{"compressor":"snappy"}} {"t":{"$date":"2021-04-08T13:28:38.899-06:00"},"s":"D2", "c":"COMMAND", "id":21965, "ctx":"conn2","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"endSessions":[{"id":{"$uuid":"c4866af5-ed6b-4f01-808b-51a3f8aaaa08"}}],"$db":"admin","apiVersion":"1","apiStrict":true}}} {"t":{"$date":"2021-04-08T13:28:38.899-06:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn2","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"endSessions":[{"id":{"$uuid":"c4866af5-ed6b-4f01-808b-51a3f8aaaa08"}}],"$db":"admin","apiVersion":"1","apiStrict":true},"numYields":0,"reslen":38,"locks":{},"remote":"127.0.0.1:50635","protocol":"op_msg","durationMillis":0}} {"t":{"$date":"2021-04-08T13:28:38.900-06:00"},"s":"D2", "c":"QUERY", "id":22783, "ctx":"conn2","msg":"Received interrupt request for unknown op","attr":{"opId":599,"knownOps":[]}} {"t":{"$date":"2021-04-08T13:28:38.900-06:00"},"s":"D3", "c":"NETWORK", "id":22925, "ctx":"conn2","msg":"Compressing message","attr":{"compressor":"snappy"}} {"t":{"$date":"2021-04-08T13:28:38.900-06:00"},"s":"D3", "c":"-", "id":5127803, "ctx":"conn2","msg":"Released the 
Client","attr":{"client":"conn2"}} {"t":{"$date":"2021-04-08T13:28:38.901-06:00"},"s":"D3", "c":"-", "id":5127801, "ctx":"conn2","msg":"Setting the Client","attr":{"client":"conn2"}} {"t":{"$date":"2021-04-08T13:28:38.901-06:00"},"s":"D2", "c":"NETWORK", "id":22986, "ctx":"conn2","msg":"Session from remote encountered a network error during SourceMessage","attr":{"remote":"127.0.0.1:50635","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"}}} {"t":{"$date":"2021-04-08T13:28:38.902-06:00"},"s":"D1", "c":"-", "id":23074, "ctx":"conn2","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":410}} {"t":{"$date":"2021-04-08T13:28:38.902-06:00"},"s":"W", "c":"EXECUTOR", "id":4910400, "ctx":"conn2","msg":"Terminating session due to error","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"}}} {"t":{"$date":"2021-04-08T13:28:38.902-06:00"},"s":"I", "c":"NETWORK", "id":5127900, "ctx":"conn2","msg":"Ending session","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"}}} {"t":{"$date":"2021-04-08T13:28:38.903-06:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn2","msg":"Connection ended","attr":{"remote":"127.0.0.1:50635","uuid":"03961627-aec7-4543-8a17-9690f87273a6","connectionId":2,"connectionCount":0}} {"t":{"$date":"2021-04-08T13:28:38.903-06:00"},"s":"D3", "c":"-", "id":5127803, "ctx":"conn2","msg":"Released the Client","attr":{"client":"conn2"}}
The result of the program should have been:
{ "ok" : 1 }
- `mongodb://localhost:27017/?compressors=snoopy`

  mongod should have logged the following:
mongod should have logged the following:
{"t":{"$date":"2021-04-20T09:57:26.823-06:00"},"s":"D2", "c":"COMMAND", "id":21965, "ctx":"conn5","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"hello":1,"client":{"driver":{"name":"mongo-csharp-driver","version":"2.12.2.0"},"os":{"type":"macOS","name":"Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64","architecture":"x86_64","version":"19.6.0"},"platform":".NET 5.0.3"},"compression":[],"$readPreference":{"mode":"secondaryPreferred"},"$db":"admin"}}} {"t":{"$date":"2021-04-20T09:57:26.823-06:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn5","msg":"client metadata","attr":{"remote":"127.0.0.1:54372","client":"conn5","doc":{"driver":{"name":"mongo-csharp-driver","version":"2.12.2.0"},"os":{"type":"macOS","name":"Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64","architecture":"x86_64","version":"19.6.0"},"platform":".NET 5.0.3"}}} {"t":{"$date":"2021-04-20T09:57:26.824-06:00"},"s":"D3", "c":"NETWORK", "id":22934, "ctx":"conn5","msg":"Starting server-side compression negotiation"} {"t":{"$date":"2021-04-20T09:57:26.824-06:00"},"s":"D3", "c":"NETWORK", "id":22936, "ctx":"conn5","msg":"No compressors provided"} {"t":{"$date":"2021-04-20T09:57:26.825-06:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn5","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"hello":1,"client":{"driver":{"name":"mongo-csharp-driver","version":"2.12.2.0"},"os":{"type":"macOS","name":"Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64","architecture":"x86_64","version":"19.6.0"},"platform":".NET 5.0.3"},"compression":[],"$readPreference":{"mode":"secondaryPreferred"},"$db":"admin"},"numYields":0,"reslen":319,"locks":{},"remote":"127.0.0.1:54372","protocol":"op_query","durationMillis":1}}
i.e., an empty `compression: []` array. No operations should have been compressed.
The results of the program should have been:
WARNING: Unsupported compressor: 'snoopy' { "ok" : 1 }
- `mongodb://localhost:27017/?compressors=snappy,zlib`

  mongod should have logged the following:
mongod should have logged the following:
{"t":{"$date":"2021-04-08T13:28:38.898-06:00"},"s":"D3", "c":"NETWORK", "id":22927, "ctx":"conn2","msg":"Decompressing message","attr":{"compressor":"snappy"}}
The results of the program should have been:
{ "ok" : 1 }
- `mongodb://localhost:27017/?compressors=zlib,snappy`

  mongod should have logged the following:
mongod should have logged the following:
{"t":{"$date":"2021-04-08T13:28:38.898-06:00"},"s":"D3", "c":"NETWORK", "id":22927, "ctx":"conn2","msg":"Decompressing message","attr":{"compressor":"zlib"}}
The results of the program should have been:
{ "ok" : 1 }
- Create an example program that authenticates to the server using SCRAM-SHA-1, then creates another user (MONGODB-CR), then runs hello followed by serverStatus.
- Reconnect to the same server using the created MONGODB-CR credentials. Observe that the only command that was decompressed on the server was `serverStatus`, while the server replied with OP_COMPRESSED for at least the serverStatus command.
Motivation For Change
Drivers provide the canonical interface to MongoDB. Most tools for MongoDB are written with the aid of MongoDB drivers. There exist a lot of tools for MongoDB that import massive datasets which could stand to gain a lot from compression. Even day-to-day applications stand to gain from reduced bandwidth utilization at low cpu costs, especially when doing large reads off the network.
Not all use cases fit compression, but we will allow users to decide if wire compression is right for them.
Design rationale
Snappy has minimal cost and provides a reasonable compression ratio, but it is not expected to be available for all languages MongoDB Drivers support. Supporting zlib is therefore important to the ecosystem, but for languages that do support snappy we expect it to be the default choice. While snappy has no knobs to tune, zlib does have support for specifying the compression level (tuned from speed to compression). As we don't anticipate adding support for compression libraries with complex knobs to tune, this specification has opted not to define a complex configuration structure and only defines the currently relevant `zlibCompressionLevel`. When other compression libraries are supported, adding support for configuring that library (if any is needed) should be handled on a case-by-case basis.

More recently, the MongoDB server added Zstandard (zstd) support as another modern alternative to zlib.
Backwards Compatibility
The new argument provided to the MongoDB Handshake has no backwards compatibility implications, as servers that do not expect it will simply ignore it. Such a server will therefore never reply with a list of acceptable compressors, which in turn means a client CANNOT use the OP_COMPRESSED opcode.
Reference Implementation
Future Work
Possible future improvements include defining an API to determine compressor and configuration per operation, rather than requiring two different client pools (one with compression and one without) when the user only needs to compress, or skip compressing, a very few operations.
Q & A
- Statistics?
  - See `serverStatus` in the server.
- How to try this/enable it?
  - `mongod --networkMessageCompressors "snappy"`
- The server MAY reply with compressed data even if the request was not compressed?
  - Yes, and this is in fact the behaviour of MongoDB 3.4.
- Can drivers compress the initial MongoDB Handshake/hello request?
  - No.
- Can the server reply to the MongoDB Handshake/hello compressed?
  - Yes, yes it can. Be aware it is completely acceptable for the server to use compression for any and all replies, using any supported compressor, when the client announced support for compression - this includes the reply to the actual MongoDB Handshake/hello where the support was announced.
- This is billed a MongoDB 3.6 feature -- but I hear it works with MongoDB 3.4?
  - Yes, it does. All MongoDB versions support the `compression` argument to the initial handshake and all MongoDB versions will reply with an intersection of compressors it supports. This works even with MongoDB 3.0, as it will not reply with any compressors. It also works with MongoDB 3.4, which will reply with `snappy` if it was part of the driver's list. MongoDB 3.6 will likely include zlib support.
- Which compressors are currently supported?
  - MongoDB 3.4 supports `snappy`
  - MongoDB 3.6 supports `snappy` and `zlib`
  - MongoDB 4.2 supports `snappy`, `zlib`, and `zstd`
- My language supports the xyz compressor, should I announce them all in the handshake?
  - No. But you are allowed to if you really want to make sure you can use that compressor with MongoDB 42 and your current driver versions.
- My language does not support the xyz compressor. What do I do?
  - That is OK. You don't have to support xyz.
- No MongoDB supported compressors are available for my language
  - That is OK. You don't have to support compressors you can't support. All it means is you can't compress the request, and since you never declared support for any compressor, you won't be served with compressed responses either.
- Why did the server not support zlib in MongoDB 3.4?
  - Snappy was selected for its very low performance hit, while giving reasonable compression, resulting in quite significant bandwidth reduction. Zlib characteristics are slightly different out-of-the-box and did not make sense for the initial goal of reducing bandwidth between replica set nodes.
- If snappy is preferable to zlib, why add support for zlib in MongoDB 3.6?
  - Zlib is available on every platform known to man. Snappy is not. Having zlib support makes sense for client traffic, which could originate on any type of platform, which may or may not support snappy.
Changelog
- 2024-02-16: Migrated from reStructuredText to Markdown.
- 2022-10-05: Remove spec front matter and reformat changelog.
- 2021-04-06: Use 'hello' command
- 2019-05-13: Add zstd as supported compression algorithm
- 2017-06-13: Don't require clients to implement legacy opcodes
- 2017-05-10: Initial commit.
SOCKS5 Support
- Status: Accepted
- Minimum Server Version: N/A
Abstract
SOCKS5 is a standardized protocol for connecting to network services through a separate proxy server. It can be used for connecting to hosts that would otherwise be unreachable from the local network by connecting to a proxy server, which receives the intended target host's address from the client and then connects to that address.
This specification defines driver behaviour when connecting to MongoDB services through a SOCKS5 proxy.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Terms
SOCKS5
The SOCKS protocol, version 5, as defined in RFC1928, restricted to either no authentication or username/password authentication as defined in RFC1929.
MongoClient Configuration
proxyHost
To specify to the driver to connect using a SOCKS5 proxy, a connection string option of `proxyHost=host` MUST be added to the connection string or passed through an equivalent `MongoClient` option. This option specifies a domain name or IPv4 or IPv6 address on which a SOCKS5 proxy is listening. This option MUST only be configurable at the level of a `MongoClient`.
proxyPort
To specify to the driver to connect using a SOCKS5 proxy listening on a non-default port, a connection string option of
proxyPort=port
MUST be added to the connection string or passed through an equivalent MongoClient
option. This
option specifies a TCP port number. The default for this option MUST be 1080
. This option MUST only be configurable at
the level of a MongoClient
. Drivers MUST error if this option was specified and proxyHost
was not specified.
proxyUsername
To specify to the driver to connect using a SOCKS5 proxy requiring username/password authentication, a connection string option of `proxyUsername=username` MUST be added to the connection string or passed through an equivalent `MongoClient` option. This option specifies a string of non-zero length. Drivers MUST ignore this option if it specifies a zero-length string. Drivers MUST error if this option was specified and `proxyHost` was not specified or `proxyPassword` was not specified.
proxyPassword
To specify to the driver to connect using a SOCKS5 proxy requiring username/password authentication, a connection string option of `proxyPassword=password` MUST be added to the connection string or passed through an equivalent `MongoClient` option. This option specifies a string of non-zero length. Drivers MUST ignore this option if it specifies a zero-length string. Drivers MUST error if this option was specified and `proxyHost` was not specified or `proxyUsername` was not specified.
Connection Pooling
Connection Establishment
When establishing a new outgoing TCP connection, drivers MUST perform the following steps if `proxyHost` was specified:

- Connect to the SOCKS5 proxy host, using `proxyHost` and `proxyPort` as specified.
- Perform a SOCKS5 handshake as specified in RFC 1928.

  If `proxyUsername` and `proxyPassword` were passed, drivers MUST indicate in the handshake that both "no authentication" and "username/password authentication" are supported. Otherwise, drivers MUST indicate support for "no authentication" only.

  Drivers MUST NOT attempt to perform DNS A or AAAA record resolution of the destination hostname and instead pass the hostname to the proxy as-is.

- Continue with connection establishment as if the connection was one to the destination host.
Drivers MUST use the SOCKS5 proxy for connections to MongoDB services and client-side field-level encryption KMS servers. Drivers MUST NOT use the SOCKS5 proxy for connections to `mongocryptd` processes spawned for automatic client-side field-level encryption.

Drivers MUST treat a connection failure when connecting to the SOCKS5 proxy, or a SOCKS5 handshake or authentication failure, the same as a network error (e.g. `ECONNREFUSED`).
Events
SOCKS5 proxies are fully transparent to connection monitoring events. In particular, in `CommandStartedEvent`, `CommandSucceededEvent`, and `CommandFailedEvent`, the driver SHOULD NOT reference the SOCKS5 proxy as part of the `connectionId` field or other fields.
Q&A
Why not include DNS requests in the set of proxied network communication?
While SOCKS5 as a protocol does support UDP forwarding, using this feature has a number of downsides. Notably, only a subset of SOCKS5 client libraries and SOCKS5 server implementations support UDP forwarding (e.g. the OpenSSH client's dynamic forwarding feature does not). This would also considerably increase implementation complexity in drivers that do not use DNS libraries in which the driver is in control of how the UDP packets are sent and received.
Why not support other proxy protocols, such as Socks4/Socks4a, HTTP Connect proxies, etc.?
SOCKS5 is a powerful, standardized and widely used proxy protocol. It is likely that almost all users which require tunneling/proxying of some sort will be able to use it, and those who require another protocol or a more advanced setup like proxy chaining, can work around that by using a local SOCKS5 intermediate proxy.
Why are the connection string parameters generic, with no explicit mention of SOCKS5?
In the case that future changes will enable drivers using other proxy protocols, keeping the option names generic allows their re-use. In that case, another option would specify the protocol and SOCKS5 would be the implied default. However, since there is no reason to believe that such additions will be made in the foreseeable future, no option for specifying the proxy protocol is introduced here.
Why is support for authentication methods limited to no authentication and username/password authentication?
This matches the set of authentication methods most commonly implemented by SOCKS5 client libraries and thus reduces implementation complexity for drivers. This advantage is sufficient to ignore the possible advantages that would come with enabling other authentication methods.
Design Rationales
Alternative Designs
Changelog
- 2024-09-04: Migrated from reStructuredText to Markdown.
- 2022-10-05: Remove spec front matter
Initial DNS Seedlist Discovery
- Status: Accepted
- Minimum Server Version: N/A
Abstract
Presently, seeding a driver with an initial list of ReplicaSet or MongoS addresses is somewhat cumbersome, requiring a comma-delimited list of host names to attempt connections to. A standardized answer to this problem exists in the form of SRV records, which allow administrators to configure a single SRV record to return a list of host names. Supporting this feature would assist our users by decreasing maintenance load, primarily by removing the need to maintain seed lists at an application level.
This specification builds on the Connection String specification. It adds a new protocol scheme and modifies how the Host Information is interpreted.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Connection String Format
The connection string parser in the driver is extended with a new protocol `mongodb+srv` as a logical pre-processing step before it considers the connection string and SDAM specifications. In this protocol, the comma separated list of host names is replaced with a single host name. The format is:

mongodb+srv://{hostname}/{options}

`{options}` refers to the optional elements from the Connection String specification following the Host Information. This includes the Auth database and Connection Options.
For the purposes of this document, `{hostname}` will be divided using the following terminology. If an SRV `{hostname}` has:

- Three or more `.` separated parts, then the left-most part is the `{subdomain}` and the remaining portion is the `{domainname}`.
  - Examples:
    - `{hostname}` = `cluster_1.tests.mongodb.co.uk`
      - `{subdomain}` = `cluster_1`
      - `{domainname}` = `tests.mongodb.co.uk`
    - `{hostname}` = `hosts_34.example.com`
      - `{subdomain}` = `hosts_34`
      - `{domainname}` = `example.com`
- One or two `.` separated part(s), then the `{hostname}` is equivalent to the `{domainname}`, and there is no subdomain.
  - Examples:
    - `{hostname}` = `{domainname}` = `localhost`
    - `{hostname}` = `{domainname}` = `mongodb.local`

Only `{domainname}` is used during SRV record verification and `{subdomain}` is ignored.
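A sketch of this split (the helper name is illustrative):

```python
def split_srv_hostname(hostname: str):
    """Return ({subdomain}, {domainname}) per the terminology above."""
    parts = hostname.split(".")
    if len(parts) >= 3:
        return parts[0], ".".join(parts[1:])
    return None, hostname  # one or two parts: no subdomain

# split_srv_hostname("cluster_1.tests.mongodb.co.uk")
#   -> ("cluster_1", "tests.mongodb.co.uk")
```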
MongoClient Configuration
srvMaxHosts
This option is used to limit the number of mongos connections that may be created for sharded topologies. This option limits the number of SRV records used to populate the seedlist during initial discovery, as well as the number of additional hosts that may be added during SRV polling. This option requires a non-negative integer and defaults to zero (i.e. no limit). This option MUST only be configurable at the level of a `MongoClient`.
srvServiceName
This option specifies a valid SRV service name according to RFC 6335, with the exception that it may exceed 15 characters as long as the 63rd (62nd with prepended underscore) character DNS query limit is not surpassed. This option requires a string value and defaults to `"mongodb"`. This option MUST only be configurable at the level of a `MongoClient`.
URI Validation
The driver MUST report an error if either the `srvServiceName` or `srvMaxHosts` URI options are specified with a non-SRV URI (i.e. a scheme other than `mongodb+srv`). The driver MUST allow specifying the `srvServiceName` and `srvMaxHosts` URI options with an SRV URI (i.e. the `mongodb+srv` scheme).

If `srvMaxHosts` is a positive integer, the driver MUST throw an error in the following cases:

- The connection string contains a `replicaSet` option.
- The connection string contains a `loadBalanced` option with a value of `true`.

When validating URI options, the driver MUST first do the SRV and TXT lookup and then perform the validation. For drivers that do SRV lookup asynchronously this may result in a `MongoClient` being instantiated but erroring later during operation execution.
Seedlist Discovery
Validation Before Querying DNS
It is an error to specify a port in a connection string with the `mongodb+srv` protocol, and the driver MUST raise a parse error and MUST NOT do DNS resolution or contact hosts.

It is an error to specify more than one host name in a connection string with the `mongodb+srv` protocol, and the driver MUST raise a parse error and MUST NOT do DNS resolution or contact hosts.

If `mongodb+srv` is used, a driver MUST implicitly also enable TLS. Clients can turn this off by passing `tls=false` in either the Connection String, or options passed in as parameters in code to the MongoClient constructor (or equivalent API for each driver), but not through a TXT record (discussed in a later section).
Querying DNS
In this preprocessing step, the driver will query the DNS server for SRV records on the hostname, prefixed with the SRV service name and protocol. The SRV service name is provided in the `srvServiceName` URI option and defaults to `mongodb`. The protocol is always `tcp`. After prefixing, the URI should look like: `_{srvServiceName}._tcp.{hostname}`. This DNS query is expected to respond with one or more SRV records.

The priority and weight fields in returned SRV records MUST be ignored.
If the DNS result returns no SRV records, or no records at all, or a DNS error happens, an error MUST be raised indicating that the URI could not be used to find hostnames. The error SHALL include the reason why they could not be found.
A driver MUST verify that the host names returned through SRV records share the original SRV's `{domainname}`. In addition, for SRV records with fewer than three `.` separated parts, the returned hostname MUST have at least one more domain level than the SRV record hostname. Drivers MUST raise an error and MUST NOT initiate a connection to any returned hostname which does not fulfill these requirements.
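A sketch of that verification, reusing the `split_srv_hostname` helper from the earlier sketch:

```python
def validate_srv_target(srv_hostname: str, target: str) -> None:
    """Raise if an SRV-returned hostname fails the checks above."""
    _, domainname = split_srv_hostname(srv_hostname)
    # The returned hostname must share the original SRV's {domainname}.
    if not target.endswith("." + domainname):
        raise ValueError(f"{target!r} does not share domain {domainname!r}")
    # Fewer than three parts: require at least one more domain level.
    if len(srv_hostname.split(".")) < 3:
        if len(target.split(".")) <= len(srv_hostname.split(".")):
            raise ValueError(f"{target!r} needs an extra domain level")
```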
The driver MUST NOT attempt to connect to any hosts until the DNS query has returned its results.
If `srvMaxHosts` is zero or greater than or equal to the number of hosts in the DNS result, the driver MUST populate the seedlist with all hosts.

If `srvMaxHosts` is greater than zero and less than the number of hosts in the DNS result, the driver MUST randomly select that many hosts and use them to populate the seedlist. Drivers SHOULD use the Fisher-Yates shuffle for randomization.
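A sketch of this selection; CPython's `random.shuffle` is a Fisher-Yates implementation:

```python
import random

def populate_seedlist(hosts: list, srv_max_hosts: int) -> list:
    if srv_max_hosts == 0 or srv_max_hosts >= len(hosts):
        return list(hosts)           # use every host from the DNS result
    selected = list(hosts)
    random.shuffle(selected)         # Fisher-Yates shuffle
    return selected[:srv_max_hosts]
```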
Default Connection String Options
As a second preprocessing step, a Client MUST also query the DNS server for TXT records on `{hostname}`. If available, a TXT record provides default connection string options. The maximum length of a TXT record string is 255 characters, but there can be multiple strings per TXT record. A Client MUST support multiple TXT record strings and concatenate them as if they were one single string in the order they are defined in each TXT record. The order of multiple character strings in each TXT record is guaranteed. A Client MUST NOT allow multiple TXT records for the same host name and MUST raise an error when multiple TXT records are encountered.
Information returned within a TXT record is a simple URI string, just like the `{options}` in a connection string.

A Client MUST only support the `authSource`, `replicaSet`, and `loadBalanced` options through a TXT record, and MUST raise an error if any other option is encountered. Although using `mongodb+srv://` implicitly enables TLS, a Client MUST NOT allow the `ssl` option to be set through a TXT record option.
TXT records MAY be queried either before, in parallel, or after SRV records. Clients MUST query both the SRV and the TXT records before attempting any connection to MongoDB.
A Client MUST use options specified in the Connection String, and options passed in as parameters in code to the MongoClient constructor (or equivalent API for each driver), to override options provided through TXT records.
If any connection string option in a TXT record is incorrectly formatted, a Client MUST throw a parse exception.
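A sketch of TXT-record option handling under these rules (the helper name is an assumption):

```python
from urllib.parse import parse_qs

ALLOWED_TXT_OPTIONS = {"authSource", "replicaSet", "loadBalanced"}

def apply_txt_defaults(txt_strings: list, uri_options: dict) -> dict:
    """Merge TXT-record defaults under URI options (URI options win)."""
    txt = "".join(txt_strings)   # concatenate the record's strings in order
    if not txt:
        return dict(uri_options)
    # strict_parsing raises ValueError on malformed options, matching the
    # parse-exception rule above.
    defaults = {k: v[-1] for k, v in parse_qs(txt, strict_parsing=True).items()}
    unknown = set(defaults) - ALLOWED_TXT_OPTIONS
    if unknown:
        raise ValueError(f"options not allowed in TXT records: {unknown}")
    return {**defaults, **uri_options}
```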
This specification does not change the behaviour of handling unknown keys or incorrect values as is set out in the Connection String spec. Unknown keys or incorrect values in default options specified through TXT records MUST be handled in the same way as unknown keys or incorrect values directly specified through a Connection String. For example, if a driver that does not support the `authSource` option finds `authSource=db` in a TXT record, it MUST handle the unknown option according to the rules in the Connection String spec.
CNAME not supported
The use of DNS CNAME records is not supported. Clients MUST NOT check for a CNAME record on `{hostname}`. A system's DNS resolver could transparently handle CNAME, but because of how clients validate records returned from SRV queries, use of CNAME could break validation. Seedlist discovery therefore does not recommend or support the use of CNAME records in concert with SRV or TXT records.
Example
If we provide the following URI:
mongodb+srv://server.mongodb.com/
The driver needs to request the DNS server for the SRV record `_mongodb._tcp.server.mongodb.com`. This could return:
Record TTL Class Priority Weight Port Target
_mongodb._tcp.server.mongodb.com. 86400 IN SRV 0 5 27317 mongodb1.mongodb.com.
_mongodb._tcp.server.mongodb.com. 86400 IN SRV 0 5 27017 mongodb2.mongodb.com.
The returned host names (`mongodb1.mongodb.com` and `mongodb2.mongodb.com`) must share the same `{domainname}` (`mongodb.com`) as the provided host name (`server.mongodb.com`).
The driver also needs to request the DNS server for the TXT records on `server.mongodb.com`. This could return:
Record TTL Class Text
server.mongodb.com. 86400 IN TXT "replicaSet=replProduction&authSource=authDB"
From the DNS results, the driver now MUST treat the host information as if the following URI was used instead:
mongodb://mongodb1.mongodb.com:27317,mongodb2.mongodb.com:27017/?ssl=true&replicaSet=replProduction&authSource=authDB
If we provide the following URI with the same DNS (SRV and TXT) records:
mongodb+srv://server.mongodb.com/?authSource=otherDB
Then the default in the TXT record for `authSource` is not used, as the value in the connection string overrides it. The Client MUST treat the host information as if the following URI was used instead:

mongodb://mongodb1.mongodb.com:27317,mongodb2.mongodb.com:27017/?ssl=true&replicaSet=replProduction&authSource=otherDB
Test Plan
Prose Tests
See README.md in the accompanying test directory.
Spec Tests
See README.md in the accompanying test directory.
Additionally, see the `mongodb+srv` test `invalid-uris.yml` in the Connection String Spec tests.
Motivation
Several of our users have asked for this through tickets:
- https://jira.mongodb.org/browse/DRIVERS-201
- https://jira.mongodb.org/browse/NODE-865
- https://jira.mongodb.org/browse/CSHARP-536
Design Rationale
The design specifically calls for a pre-processing stage of the processing of connection URLs to minimize the impact on existing functionality.
Justifications
Why Are Multiple Key-Value Pairs Allowed in One TXT Record?
One could imagine an alternative design in which each TXT record would allow only one URI option. No `&` character would be allowed as a delimiter within TXT records.

In this spec we allow multiple key-value pairs within one TXT record, delimited by `&`, because it will be common for all options to fit in a single 255-character TXT record, and it is much more convenient to configure one record in this case than to configure several.
Secondly, in some cases the order in which options occur is important. For example, readPreferenceTags can appear both multiple times, and the order in which they appear is significant. Because DNS servers may return TXT records in any order, it is only possible to guarantee the order in which readPreferenceTags keys appear by having them in the same TXT record.
Why Is There No Mention of UTF-8 Characters?
Although DNS TXT records allow any octet to exist in their values, many DNS providers do not allow non-ASCII characters to be configured. As it is unlikely that any option names or values in the connection string have non-ASCII characters, we left the behaviour of supporting UTF-8 characters as unspecified.
Reference Implementation
None yet.
Backwards Compatibility
There are no backwards compatibility concerns.
Future Work
In the future we could consider using the priority and weight fields of the SRV records.
Changelog

- 2024-09-24: Removed requirement for URI to have three '.' separated parts; these SRVs have stricter parent domain matching requirements for security. Create terminology section. Remove usage of term `{TLD}`. The `{hostname}` now refers to the entire hostname, not just the `{subdomain}`.
- 2024-03-06: Migrated from reStructuredText to Markdown.
- 2022-10-05: Revise spec front matter and reformat changelog.
- 2021-10-14: Add `srvMaxHosts` MongoClient option and restructure Seedlist Discovery section. Improve documentation for the `srvServiceName` MongoClient option and add a new URI Validation section.
- 2021-09-15: Clarify that service name only defaults to `mongodb`, and should be defined by the `srvServiceName` URI option.
- 2021-04-15: Adding in behaviour for load balancer mode.
- 2019-03-07: Clarify that CNAME is not supported.
- 2018-02-08: Clarify that `{options}` in the Specification section includes all the optional elements from the Connection String specification.
- 2017-11-21: Add clause that using `mongodb+srv://` implies enabling TLS. Add restriction that only `authSource` and `replicaSet` are allowed in TXT records. Add restriction that only one TXT record is supported.
- 2017-11-17: Add new rule that indicates that host names in returned SRV records MUST share the same parent domain name as the given host name. Remove language and tests for non-ASCII characters.
- 2017-11-07: Clarified that all parts of listable options such as readPreferenceTags are ignored if they are also present in options to the MongoClient constructor. Clarified which host names to use for SRV and TXT DNS queries.
- 2017-11-01: Clarified that individual TXT records can have multiple strings.
- 2017-10-31: Added a clause that specifying two host names with a `mongodb+srv://` URI is not allowed. Added a few more test cases.
- 2017-10-18: Removed prohibition of raising DNS related errors when parsing the URI.
- 2017-10-04: Removed from Future Work the line about multiple MongoS discovery. The current specification already allows for it, as multiple host names which are all MongoS servers is already allowed under SDAM. And this specification does not modify SDAM. Added support for connection string options through TXT records.
- 2017-09-19: Clarify that host names in `mongodb+srv://` URLs work like normal host specifications.
- 2017-09-01: Updated test plan with YAML tests, and moved prose tests for URI parsing into invalid-uris.yml in the Connection String Spec tests.
Server Discovery And Monitoring
- Status: Accepted
- Minimum Server Version: 2.4
Abstract
This spec defines how a MongoDB client discovers and monitors one or more servers. It covers monitoring a single server, a set of mongoses, or a replica set. How does the client determine what type of servers they are? How does it keep this information up to date? How does the client find an entire replica set from a seed list, and how does it respond to a stepdown, election, reconfiguration, or network error?
All drivers must answer these questions the same. Or, where platforms' limitations require differences among drivers, there must be as few answers as possible and each must be clearly explained in this spec. Even in cases where several answers seem equally good, drivers must agree on one way to do it.
MongoDB users and driver authors benefit from having one way to discover and monitor servers. Users can substantially understand their driver's behavior without inspecting its code or asking its author. Driver authors can avoid subtle mistakes when they take advantage of a design that has been well-considered, reviewed, and tested.
The server discovery and monitoring method is specified in four sections. First, a client is configured. Second, it begins monitoring by calling hello or legacy hello on all servers. (Multi-threaded and asynchronous monitoring is described first, then single-threaded monitoring.) Third, as hello or legacy hello responses are received the client parses them, and fourth, it updates its view of the topology.
Finally, this spec describes how drivers update their topology view in response to errors, and includes generous implementation notes for driver authors.
This spec does not describe how a client chooses a server for an operation; that is the domain of the Server Selection Spec. But there is a section describing the interaction between monitoring and server selection.
There is no discussion of driver architecture and data structures, nor is there any specification of a user-facing API. This spec is only concerned with the algorithm for monitoring the server topology.
Meta
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
General Requirements
Direct connections: A client MUST be able to connect to a single server of any type. This includes querying hidden replica set members, and connecting to uninitialized members (see RSGhost) in order to run "replSetInitiate". Setting a read preference MUST NOT be necessary to connect to a secondary. Of course, the secondary will reject all operations done with the PRIMARY read preference because the secondaryOk bit is not set, but the initial connection itself succeeds. Drivers MAY allow direct connections to arbiters (for example, to run administrative commands).
Replica sets: A client MUST be able to discover an entire replica set from a seed list containing one or more replica set members. It MUST be able to continue monitoring the replica set even when some members go down, or when reconfigs add and remove members. A client MUST be able to connect to a replica set while there is no primary, or the primary is down.
Mongos: A client MUST be able to connect to a set of mongoses and monitor their availability and round trip time. This spec defines how mongoses are discovered and monitored, but does not define which mongos is selected for a given operation.
Terms
Server
A mongod or mongos process, or a load balancer.
Deployment
One or more servers: either a standalone, a replica set, or one or more mongoses.
Topology
The state of the deployment: its type (standalone, replica set, or sharded), which servers are up, what type of servers they are, which is primary, and so on.
Client
Driver code responsible for connecting to MongoDB.
Seed list
Server addresses provided to the client in its initial configuration, for example from the connection string.
Data-Bearing Server Type
A server type from which a client can receive application data:
- Mongos
- RSPrimary
- RSSecondary
- Standalone
- LoadBalanced
Round trip time
Also known as RTT.
The client's measurement of the duration of one hello or legacy hello call. The round trip time is used to support the "localThresholdMS" option in the Server Selection Spec.
hello or legacy hello outcome
The result of an attempt to call the hello or legacy hello command on a server. It consists of three elements: a boolean indicating the success or failure of the attempt, a document containing the command response (or null if it failed), and the round trip time to execute the command (or null if it failed).
check
The client checks a server by attempting to call hello or legacy hello on it, and recording the outcome.
scan
The process of checking all servers in the deployment.
suitable
A server is judged "suitable" for an operation if the client can use it for a particular operation. For example, a write requires a standalone, primary, or mongos. Suitability is fully specified in the Server Selection Spec.
address
The hostname or IP address, and port number, of a MongoDB server.
network error
An error that occurs while reading from or writing to a network socket.
network timeout
A timeout that occurs while reading from or writing to a network socket.
minHeartbeatFrequencyMS
Defined in the Server Monitoring spec. This value MUST be 500 ms, and it MUST NOT be configurable.
pool generation number
The pool's generation number which starts at 0 and is incremented each time the pool is cleared. Defined in the Connection Monitoring and Pooling spec.
connection generation number
The pool's generation number at the time this connection was created. Defined in the Connection Monitoring and Pooling spec.
error generation number
The error's generation number is the generation of the connection on which the application error occurred. Note that when a network error occurs before the handshake completes, the error's generation number is the generation of the pool at the time the connection attempt was started.
State Change Error
A server reply document indicating a "not writable primary" or "node is recovering" error. Starting in MongoDB 4.4 these errors may also include a topologyVersion field.
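To make the last few terms concrete, here is a minimal, non-normative sketch of how a driver might record a hello or legacy hello outcome. The type and field names are illustrative, not mandated by this spec:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class HelloOutcome:
    """The result of one hello or legacy hello attempt
    (see "hello or legacy hello outcome" above)."""
    succeeded: bool                          # did the command succeed?
    response: Optional[dict] = None          # command response, or None on failure
    round_trip_time: Optional[float] = None  # duration in seconds, or None on failure
```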
Data structures
This spec uses a few data structures to describe the client's view of the topology. It must be emphasized that a driver is free to implement the same behavior using different data structures. This spec uses these enums and structs in order to describe driver behavior, not to mandate how a driver represents the topology, nor to mandate an API.
Constants
clientMinWireVersion and clientMaxWireVersion
Integers. The wire protocol range supported by the client.
Enums
TopologyType
Single, ReplicaSetNoPrimary, ReplicaSetWithPrimary, Sharded, LoadBalanced, or Unknown.
See updating the TopologyDescription.
ServerType
Standalone, Mongos, PossiblePrimary, RSPrimary, RSSecondary, RSArbiter, RSOther, RSGhost, LoadBalancer or Unknown.
See parsing a hello or legacy hello response.
[!NOTE] Single-threaded clients use the PossiblePrimary type to maintain proper scanning order. Multi-threaded and asynchronous clients do not need this ServerType; it is synonymous with Unknown.
TopologyDescription
The client's representation of everything it knows about the deployment's topology.
Fields:

- type: a TopologyType enum value. See initial TopologyType.
- setName: the replica set name. Default null.
- maxElectionId: an ObjectId or null. The largest electionId ever reported by a primary. Default null. Part of the (electionId, setVersion) tuple.
- maxSetVersion: an integer or null. The largest setVersion ever reported by a primary. It may not monotonically increase, as electionId takes precedence in ordering. Default null. Part of the (electionId, setVersion) tuple.
- servers: a set of ServerDescription instances, one for each of the servers in the topology.
- stale: a boolean for single-threaded clients, whether the topology must be re-scanned. (Not related to maxStalenessSeconds, nor to stale primaries.)
- compatible: a boolean. False if any server's wire protocol version range is incompatible with the client's. Default true.
- compatibilityError: a string. The error message if "compatible" is false, otherwise null.
- logicalSessionTimeoutMinutes: integer or null. Default null. See logical session timeout.
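As a non-normative illustration only (a driver is free to choose different structures), these fields map naturally onto a record type such as:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class TopologyDescription:
    type: str = "Unknown"                   # a TopologyType value
    setName: Optional[str] = None
    maxElectionId: Optional[object] = None  # an ObjectId in a real driver
    maxSetVersion: Optional[int] = None
    # The spec describes "servers" as a set of ServerDescriptions; a map
    # keyed by address is a common concrete choice.
    servers: Dict[str, object] = field(default_factory=dict)
    stale: bool = False                     # single-threaded clients only
    compatible: bool = True
    compatibilityError: Optional[str] = None
    logicalSessionTimeoutMinutes: Optional[int] = None
```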
ServerDescription
The client's view of a single server, based on the most recent hello or legacy hello outcome.
Again, drivers may store this information however they choose; this data structure is defined here merely to describe the monitoring algorithm.
Fields:

- address: the hostname or IP, and the port number, that the client connects to. Note that this is not the "me" field in the server's hello or legacy hello response, in the case that the server reports an address different from the address the client uses.
- (=) error: information about the last error related to this server. Default null. MUST contain or be able to produce a string describing the error.
- roundTripTime: the duration of the hello or legacy hello call. Default null.
- minRoundTripTime: the minimum RTT for the server. Default null.
- lastWriteDate: a 64-bit BSON datetime or null. The lastWriteDate from the server's most recent hello or legacy hello response.
- opTime: an opTime or null. An opaque value representing the position in the oplog of the most recently seen write. Default null. (Only mongos and shard servers record this field when monitoring config servers as replica sets, at least until drivers allow applications to use readConcern "afterOptime".)
- (=) type: a ServerType enum value. Default Unknown.
- (=) minWireVersion, maxWireVersion: the wire protocol version range supported by the server. Both default to 0. Use minWireVersion and maxWireVersion only to determine compatibility.
- (=) me: the hostname or IP, and the port number, that this server was configured with in the replica set. Default null.
- (=) hosts, passives, arbiters: sets of addresses. This server's opinion of the replica set's members, if any. These hostnames are normalized to lower-case. Default empty. The client monitors all three types of servers in a replica set.
- (=) tags: map from string to string. Default empty.
- (=) setName: string or null. Default null.
- (=) electionId: an ObjectId, if this is a MongoDB 2.6+ replica set member that believes it is primary. See using electionId and setVersion to detect stale primaries. Default null.
- (=) setVersion: integer or null. Default null.
- (=) primary: an address. This server's opinion of who the primary is. Default null.
- lastUpdateTime: when this server was last checked. Default "infinity ago".
- (=) logicalSessionTimeoutMinutes: integer or null. Default null.
- (=) topologyVersion: a topologyVersion or null. Default null. The "topologyVersion" from the server's most recent hello or legacy hello response or State Change Error.
- (=) iscryptd: boolean indicating if the server is a mongocryptd server. Default null.
"Passives" are priority-zero replica set members that cannot become primary. The client treats them precisely the same as other members.
Fields marked (=) are used for Server Description Equality comparison.
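As a sketch of how the (=) marking can be applied, a driver might compare exactly those fields when deciding whether two descriptions of the same address are equal. This is non-normative; the attribute names follow the list above, and how error values compare (e.g. by message) is left to the driver:

```python
# The fields marked (=) above, used for Server Description Equality.
EQUALITY_FIELDS = (
    "error", "type", "minWireVersion", "maxWireVersion", "me",
    "hosts", "passives", "arbiters", "tags", "setName", "electionId",
    "setVersion", "primary", "logicalSessionTimeoutMinutes",
    "topologyVersion", "iscryptd",
)

def descriptions_equal(a, b) -> bool:
    """Server Description Equality for two descriptions of the same address."""
    return all(getattr(a, f) == getattr(b, f) for f in EQUALITY_FIELDS)
```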
Configuration
No breaking changes
This spec does not intend to require any drivers to make breaking changes regarding what configuration options are available, how options are named, or what combinations of options are allowed.
Initial TopologyDescription
The default values for TopologyDescription fields are described above. Users may override the defaults as follows:
Initial Servers
The user MUST be able to set the initial servers list to a seed list of one or more addresses.
The hostname portion of each address MUST be normalized to lower-case.
Initial TopologyType
If the directConnection
URI option is specified when a MongoClient is constructed, the TopologyType must be
initialized based on the value of the directConnection
option and the presence of the replicaSet
option according to
the following table:
directConnection | replicaSet present | Initial TopologyType |
---|---|---|
true | no | Single |
true | yes | Single |
false | no | Unknown |
false | yes | ReplicaSetNoPrimary |
If the directConnection
option is not specified, newly developed drivers MUST behave as if it was specified with the
false value.
Since changing the starting topology can reasonably be considered a backwards-breaking change, existing drivers SHOULD
stage implementation according to semantic versioning guidelines. Specifically, support for the directConnection
URI
option can be added in a minor release. In a subsequent major release, the default starting topology can be changed to
Unknown. Drivers MUST document this in a prior minor release.
Existing drivers MUST deprecate other URI options, if any, for controlling topology discovery or specifying the
deployment topology. If such a legacy option is specified and the directConnection
option is also specified, and the
values of the two options are semantically different, the driver MUST report an error during URI option parsing.
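Taken together, the table and the defaulting rule reduce to a small function. A non-normative sketch:

```python
from typing import Optional

def initial_topology_type(direct_connection: Optional[bool],
                          replica_set_present: bool) -> str:
    """Initial TopologyType per the table above. An unspecified
    directConnection option is treated as false, as required of
    newly developed drivers."""
    if direct_connection is None:
        direct_connection = False
    if direct_connection:
        return "Single"
    return "ReplicaSetNoPrimary" if replica_set_present else "Unknown"
```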
The API for initializing TopologyType using language-specific native options is not specified here. Drivers might already have a convention, e.g. a single seed means Single, a setName means ReplicaSetNoPrimary, and a list of seeds means Unknown. There are variations, however: In the Java driver a single seed means Single, but a list containing one seed means Unknown, so it can transition to replica-set monitoring if the seed is discovered to be a replica set member. In contrast, PyMongo requires a non-null setName in order to begin replica-set monitoring, regardless of the number of seeds. This spec does not cover language-specific native options that a driver may provide.
Initial setName
It is allowed to use directConnection=true
in conjunction with the replicaSet
URI option. The driver must connect in
Single topology and verify that setName matches the specified name, as per
verifying setName with TopologyType Single.
When a MongoClient is initialized using language-specific native options, the user MUST be able to set the client's initial replica set name. A driver MAY require the set name in order to connect to a replica set, or it MAY be able to discover the replica set name as it connects.
Allowed configuration combinations
Drivers MUST enforce:
- TopologyType Single cannot be used with multiple seeds.
- directConnection=true cannot be used with multiple seeds.
- If setName is not null, only TopologyType ReplicaSetNoPrimary, and possibly Single, are allowed. (See verifying setName with TopologyType Single.)
- loadBalanced=true cannot be used in conjunction with directConnection=true or replicaSet.
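A non-normative sketch of enforcing these combinations at client construction (parameter names are illustrative):

```python
from typing import List, Optional

def validate_options(seeds: List[str],
                     topology_type: str,
                     direct_connection: bool,
                     set_name: Optional[str],
                     load_balanced: bool) -> None:
    """Raise on the disallowed combinations listed above."""
    if topology_type == "Single" and len(seeds) > 1:
        raise ValueError("TopologyType Single cannot be used with multiple seeds")
    if direct_connection and len(seeds) > 1:
        raise ValueError("directConnection=true cannot be used with multiple seeds")
    if set_name is not None and topology_type not in ("ReplicaSetNoPrimary", "Single"):
        raise ValueError("a replica set name requires TopologyType "
                         "ReplicaSetNoPrimary or Single")
    if load_balanced and (direct_connection or set_name is not None):
        raise ValueError("loadBalanced=true cannot be combined with "
                         "directConnection=true or replicaSet")
```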
Handling of SRV URIs resolving to single host
When a driver is given an SRV URI, if the directConnection
URI option is not specified, and the replicaSet
URI
option is not specified, the driver MUST start in Unknown topology, and follow the rules in the
TopologyType table for transitioning to other topologies. In particular, the driver MUST NOT use
the number of hosts from the initial SRV lookup to decide what topology to start in.
heartbeatFrequencyMS
The interval between server checks, counted from the end of the previous check until the beginning of the next one.
For multi-threaded and asynchronous drivers it MUST default to 10 seconds and MUST be configurable. For single-threaded drivers it MUST default to 60 seconds and MUST be configurable. It MUST be called heartbeatFrequencyMS unless this breaks backwards compatibility.
For both multi- and single-threaded drivers, the driver MUST NOT permit users to configure it less than minHeartbeatFrequencyMS (500ms).
(See heartbeatFrequencyMS defaults to 10 seconds or 60 seconds and what's the point of periodic monitoring?)
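A sketch of applying these defaults and the lower bound (values in milliseconds; names are illustrative):

```python
from typing import Optional

MIN_HEARTBEAT_FREQUENCY_MS = 500  # minHeartbeatFrequencyMS: fixed, never configurable

def resolve_heartbeat_frequency(configured_ms: Optional[int],
                                single_threaded: bool) -> int:
    """Apply the defaults and the lower bound described above."""
    if configured_ms is None:
        return 60_000 if single_threaded else 10_000
    if configured_ms < MIN_HEARTBEAT_FREQUENCY_MS:
        raise ValueError("heartbeatFrequencyMS must be at least 500")
    return configured_ms
```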
Client construction
Except for initial DNS seed list discovery when
given a connection string with mongodb+srv
scheme, the client's constructor MUST NOT do any I/O. This means that the
constructor does not throw an exception if servers are unavailable: the topology is not yet known when the constructor
returns. Similarly if a server has an incompatible wire protocol version, the constructor does not throw. Instead, all
subsequent operations on the client fail as long as the error persists.
See clients do no I/O in the constructor for the justification.
Multi-threaded and asynchronous client construction
The constructor MAY start the monitors as background tasks and return immediately. Or the monitors MAY be started by some method separate from the constructor; for example they MAY be started by some "initialize" method (by any name), or on the first use of the client for an operation.
Single-threaded client construction
Single-threaded clients do no I/O in the constructor. They MUST scan the servers on demand, when the first operation is attempted.
Client closing
When a client is closing, before it emits the TopologyClosedEvent as per the Events API, it SHOULD remove all servers from its TopologyDescription and set its TopologyType to Unknown, emitting the corresponding TopologyDescriptionChangedEvent.
Monitoring
See the Server Monitoring spec for how a driver monitors each server. In summary, the client monitors each server in the topology. The scope of server monitoring is to provide the topology with updated ServerDescriptions based on hello or legacy hello command responses.
Parsing a hello or legacy hello response
The client represents its view of each server with a ServerDescription. Each time the client checks a server, it MUST replace its description of that server with a new one if and only if the new ServerDescription's topologyVersion is greater than or equal to the current ServerDescription's topologyVersion.
(See Replacing the TopologyDescription for an example implementation.)
This replacement MUST happen even if the new server description compares equal to the previous one, in order to keep client-tracked attributes like last update time and round trip time up to date.
Drivers MUST be able to handle responses to both hello
and legacy hello commands. When checking results, drivers MUST
first check for the isWritablePrimary
field and fall back to checking for an ismaster
field if isWritablePrimary
was not found.
ServerDescriptions are created from hello or legacy hello outcomes as follows:
type
The new ServerDescription's type field is set to a ServerType. Note that these states do not exactly correspond to replica set member states. For example, some replica set member states like STARTUP and RECOVERING are identical from the client's perspective, so they are merged into "RSOther". Additionally, states like Standalone and Mongos are not replica set member states at all.
State | Symptoms |
---|---|
Unknown | Initial, or after a network error or failed hello or legacy hello call, or "ok: 1" not in hello or legacy hello response. |
Standalone | No "msg: isdbgrid", no setName, and no "isreplicaset: true". |
Mongos | "msg: isdbgrid". |
PossiblePrimary | Not yet checked, but another member thinks it is the primary. |
RSPrimary | "isWritablePrimary: true" or "ismaster: true", "setName" in response. |
RSSecondary | "secondary: true", "setName" in response. |
RSArbiter | "arbiterOnly: true", "setName" in response. |
RSOther | "setName" in response, "hidden: true" or not primary, secondary, nor arbiter. |
RSGhost | "isreplicaset: true" in response. |
LoadBalanced | "loadBalanced=true" in URI. |
A server can transition from any state to any other. For example, an administrator could shut down a secondary and bring up a mongos in its place.
RSGhost and RSOther
The client MUST monitor replica set members even when they cannot be queried. These members are in state RSGhost or RSOther.
RSGhost members occur in at least three situations:
- briefly during server startup,
- in an uninitialized replica set,
- or when the server is shunned (removed from the replica set config).
An RSGhost server has no hosts list nor setName. Therefore the client MUST NOT attempt to use its hosts list nor check its setName (see JAVA-1161 or CSHARP-671.) However, the client MUST keep the RSGhost member in its TopologyDescription, in case the client's only hope for staying connected to the replica set is that this member will transition to a more useful state.
For simplicity, this is the rule: any server is an RSGhost that reports "isreplicaset: true".
Non-ghost replica set members have reported their setNames since MongoDB 1.6.2. See only support replica set members running MongoDB 1.6.2 or later.
[!NOTE] The Java driver does not have a separate state for RSGhost; it is an RSOther server with no hosts list.
RSOther servers may be hidden, starting up, or recovering. They cannot be queried, but their hosts lists are useful for discovering the current replica set configuration.
If a hidden member is provided as a seed, the client can use it to find the primary. Since the hidden member does not appear in the primary's host list, it will be removed once the primary is checked.
error
If the client experiences any error when checking a server, it stores error information in the ServerDescription's error field. The message contained in this field MUST contain the substrings detailed in the table below when the ServerDescription is changed to Unknown in the circumstances outlined.
circumstance | error substring |
---|---|
RSPrimary with a stale electionId/setVersion is discovered | 'primary marked stale due to electionId/setVersion mismatch, <stale tuple> is stale compared to <max tuple>' |
New primary is elected/discovered | 'primary marked stale due to discovery of newer primary' |
roundTripTime
Drivers MUST record the server's round trip time (RTT) after each successful call to hello or legacy hello. The Server Selection Spec describes how RTT is averaged and how it is used in server selection. Drivers MUST also record the server's minimum RTT per Server Monitoring (Measuring RTT).
If a hello or legacy hello call fails, the RTT is not updated. Furthermore, while a server's type is Unknown its RTT is null, and if it changes from a known type to Unknown its RTT is set to null. However, if it changes from one known type to another (e.g. from RSPrimary to RSSecondary) its RTT is updated normally, not set to null nor restarted from scratch.
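A non-normative sketch of this bookkeeping follows. The averaging scheme itself is defined in the Server Selection spec; the exponentially weighted moving average below is only a placeholder assumption:

```python
from typing import Optional

def updated_rtt(new_type: str,
                old_rtt: Optional[float],
                sample: Optional[float],
                alpha: float = 0.2) -> Optional[float]:
    """RTT to store on a new ServerDescription.

    The actual averaging method belongs to the Server Selection spec; an
    exponentially weighted moving average is shown only for concreteness.
    """
    if new_type == "Unknown":
        return None        # Unknown servers carry no RTT
    if sample is None:
        return old_rtt     # failed check: RTT is not updated
    if old_rtt is None:
        return sample      # first sample for a newly known server
    return alpha * sample + (1 - alpha) * old_rtt
```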
lastWriteDate and opTime
The hello or legacy hello response of a replica set member running MongoDB 3.4 and later contains a lastWrite
subdocument with fields lastWriteDate
and opTime
(SERVER-8858). If
these fields are available, parse them from the hello or legacy hello response, otherwise set them to null.
Clients MUST NOT attempt to compensate for the network latency between when the server generated its hello or legacy
hello response and when the client records lastUpdateTime
.
lastUpdateTime
Clients SHOULD set lastUpdateTime with a monotonic clock.
Hostnames are normalized to lower-case
The same as with seeds provided in the initial configuration, all hostnames in the hello or legacy hello response's "me", "hosts", "passives", and "arbiters" entries MUST be lower-cased.
This prevents unnecessary work rediscovering a server if a seed "A" is provided and the server responds that "a" is in the replica set.
Domain Name System (DNS) names are "case insensitive".
logicalSessionTimeoutMinutes
MongoDB 3.6 and later include a logicalSessionTimeoutMinutes
field if logical sessions are enabled in the deployment.
Clients MUST check for this field and set the ServerDescription's logicalSessionTimeoutMinutes field to this value, or
to null otherwise.
topologyVersion
MongoDB 4.4 and later include a topologyVersion
field in all hello or legacy hello and
State Change Error responses. Clients MUST check for this field and set the ServerDescription's
topologyVersion field to this value, if present. The topologyVersion helps the client and server determine the relative
freshness of topology information in concurrent messages. (See
What is the purpose of topologyVersion?)
The topologyVersion is a subdocument with two fields, "processId" and "counter":
{
topologyVersion: {processId: <ObjectId>, counter: <int64>},
( ... other fields ...)
}
topologyVersion Comparison
To compare a topologyVersion from a hello or legacy hello or State Change Error response to the current ServerDescription's topologyVersion:
- If the response topologyVersion is unset or the ServerDescription's topologyVersion is null, the client MUST assume the response is more recent.
- If the response's topologyVersion.processId is not equal to the ServerDescription's, the client MUST assume the response is more recent.
- If the response's topologyVersion.processId is equal to the ServerDescription's, the client MUST use the counter field to determine which topologyVersion is more recent.
See Replacing the TopologyDescription for an example implementation of topologyVersion comparison.
serviceId
MongoDB 5.0 and later, as well as any mongos-like service, include a serviceId
field when the service is configured
behind a load balancer.
Other ServerDescription fields
Other required fields defined in the ServerDescription data structure are parsed from the hello or legacy hello response in the obvious way.
Server Description Equality
For the purpose of determining whether to publish SDAM events, two server descriptions having the same address MUST be considered equal if and only if the values of ServerDescription fields marked (=) are respectively equal.
This specification does not prescribe how to compare server descriptions with different addresses for equality.
Updating the TopologyDescription
Each time the client checks a server, it processes the outcome (successful or not) to create a ServerDescription, and then it processes the ServerDescription to update its TopologyDescription.
The TopologyDescription's TopologyType influences how the ServerDescription is processed. The following subsection specifies how the client updates its TopologyDescription when the TopologyType is Single. The next subsection treats the other types.
TopologyType Single
The TopologyDescription's type was initialized as Single and remains Single forever. There is always one ServerDescription in TopologyDescription.servers.
Whenever the client checks a server (successfully or not), and regardless of whether the new server description is equal to the previous server description as defined in Server Description Equality, the ServerDescription in TopologyDescription.servers MUST be replaced with the new ServerDescription.
Checking wire protocol compatibility
A ServerDescription which is not Unknown is incompatible if:

- minWireVersion > clientMaxWireVersion, or
- maxWireVersion < clientMinWireVersion
If any ServerDescription is incompatible, the client MUST set the TopologyDescription's "compatible" field to false and fill out the TopologyDescription's "compatibilityError" field like so:
- If ServerDescription.minWireVersion > clientMaxWireVersion:

  "Server at <host>:<port> requires wire version <minWireVersion>, but this version of <driverName> only supports up to <clientMaxWireVersion>."

- If ServerDescription.maxWireVersion < clientMinWireVersion:

  "Server at <host>:<port> reports wire version <maxWireVersion>, but this version of <driverName> requires at least <clientMinWireVersion> (MongoDB <mongoVersion>)."
Replace <mongoVersion> with the appropriate MongoDB minor version. For example, if clientMinWireVersion is 2 and the client connects to MongoDB 2.4, format the error like:
"Server at example.com:27017 reports wire version 0, but this version of My Driver requires at least 2 (MongoDB 2.6)."
In this second case, the exact required MongoDB version is known and can be named in the error message, whereas in the first case the implementer does not know which MongoDB versions will be compatible or incompatible in the future.
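A sketch of producing these messages (attribute names follow the ServerDescription field list; the mapping from clientMinWireVersion to a MongoDB release is an assumed, driver-maintained input):

```python
from typing import Optional

def compatibility_error(server,
                        client_min: int,
                        client_max: int,
                        driver_name: str,
                        min_mongo_version: str) -> Optional[str]:
    """Return a compatibilityError string per the templates above, or None.

    min_mongo_version is the MongoDB release corresponding to client_min,
    e.g. "2.6" for wire version 2.
    """
    if server.type == "Unknown":
        return None
    if server.minWireVersion > client_max:
        return (f"Server at {server.address} requires wire version "
                f"{server.minWireVersion}, but this version of {driver_name} "
                f"only supports up to {client_max}.")
    if server.maxWireVersion < client_min:
        return (f"Server at {server.address} reports wire version "
                f"{server.maxWireVersion}, but this version of {driver_name} "
                f"requires at least {client_min} (MongoDB {min_mongo_version}).")
    return None
```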
Verifying setName with TopologyType Single
A client MAY allow the user to supply a setName with an initial TopologyType of Single. In this case, if the ServerDescription's setName is null or wrong, the ServerDescription MUST be replaced with a default ServerDescription of type Unknown.
TopologyType LoadBalanced
See the Load Balancer Specification for details.
Other TopologyTypes
If the TopologyType is not Single, the topology can contain zero or more servers. The state of a topology containing zero servers is terminal (because servers can only be added if they are reported by a server already in the topology). A client SHOULD emit a warning if it is constructed with no seeds in the initial seed list. A client SHOULD emit a warning when, in the process of updating its topology description, it removes the last server from the topology.
Whenever a client completes a hello or legacy hello call, it creates a new ServerDescription with the proper ServerType. It replaces the server's previous description in TopologyDescription.servers with the new one.
Apply the logic for checking wire protocol compatibility to each ServerDescription in the topology. If any server's wire protocol version range does not overlap with the client's, the client updates the "compatible" and "compatibilityError" fields as described above for TopologyType Single. Otherwise "compatible" is set to true.
It is possible for a multi-threaded client to receive a hello or legacy hello outcome from a server after the server has been removed from the TopologyDescription. For example, a monitor begins checking a server "A", then a different monitor receives a response from the primary claiming that "A" has been removed from the replica set, so the client removes "A" from the TopologyDescription. Then, the check of server "A" completes.
In all cases, the client MUST ignore hello or legacy hello outcomes from servers that are not in the TopologyDescription.
The following subsections explain in detail what actions the client takes after replacing the ServerDescription.
TopologyType table
The new ServerDescription's type is the vertical axis, and the current TopologyType is the horizontal. Where a ServerType and a TopologyType intersect, the table shows what action the client takes.
"no-op" means, do nothing after replacing the server's old description with the new one.
| TopologyType Unknown | TopologyType Sharded | TopologyType ReplicaSetNoPrimary | TopologyType ReplicaSetWithPrimary |
---|---|---|---|---|
ServerType Unknown | no-op | no-op | no-op | checkIfHasPrimary |
ServerType Standalone | updateUnknownWithStandalone | remove | remove | remove and checkIfHasPrimary |
ServerType Mongos | Set topology type to Sharded | no-op | remove | remove and checkIfHasPrimary |
ServerType RSPrimary | Set topology type to ReplicaSetWithPrimary then updateRSFromPrimary | remove | Set topology type to ReplicaSetWithPrimary then updateRSFromPrimary | updateRSFromPrimary |
ServerType RSSecondary | Set topology type to ReplicaSetNoPrimary then updateRSWithoutPrimary | remove | updateRSWithoutPrimary | updateRSWithPrimaryFromMember |
ServerType RSArbiter | Set topology type to ReplicaSetNoPrimary then updateRSWithoutPrimary | remove | updateRSWithoutPrimary | updateRSWithPrimaryFromMember |
ServerType RSOther | Set topology type to ReplicaSetNoPrimary then updateRSWithoutPrimary | remove | updateRSWithoutPrimary | updateRSWithPrimaryFromMember |
ServerType RSGhost | no-op | remove | no-op | checkIfHasPrimary |
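Because the table is a pure function of (ServerType, TopologyType), it can be transcribed directly into a dispatch structure. A non-normative sketch, where values name the subroutines defined under "Actions" below, an empty list is a no-op, and setType(X) stands for setting the TopologyType to X before running the named subroutine:

```python
ACTIONS = {
    ("Unknown", "Unknown"): [],
    ("Unknown", "Sharded"): [],
    ("Unknown", "ReplicaSetNoPrimary"): [],
    ("Unknown", "ReplicaSetWithPrimary"): ["checkIfHasPrimary"],
    ("Standalone", "Unknown"): ["updateUnknownWithStandalone"],
    ("Standalone", "Sharded"): ["remove"],
    ("Standalone", "ReplicaSetNoPrimary"): ["remove"],
    ("Standalone", "ReplicaSetWithPrimary"): ["remove", "checkIfHasPrimary"],
    ("Mongos", "Unknown"): ["setType(Sharded)"],
    ("Mongos", "Sharded"): [],
    ("Mongos", "ReplicaSetNoPrimary"): ["remove"],
    ("Mongos", "ReplicaSetWithPrimary"): ["remove", "checkIfHasPrimary"],
    ("RSPrimary", "Unknown"): ["setType(ReplicaSetWithPrimary)", "updateRSFromPrimary"],
    ("RSPrimary", "Sharded"): ["remove"],
    ("RSPrimary", "ReplicaSetNoPrimary"): ["setType(ReplicaSetWithPrimary)", "updateRSFromPrimary"],
    ("RSPrimary", "ReplicaSetWithPrimary"): ["updateRSFromPrimary"],
    ("RSGhost", "Unknown"): [],
    ("RSGhost", "Sharded"): ["remove"],
    ("RSGhost", "ReplicaSetNoPrimary"): [],
    ("RSGhost", "ReplicaSetWithPrimary"): ["checkIfHasPrimary"],
}

# RSSecondary, RSArbiter and RSOther share identical rows.
for member in ("RSSecondary", "RSArbiter", "RSOther"):
    ACTIONS[(member, "Unknown")] = ["setType(ReplicaSetNoPrimary)", "updateRSWithoutPrimary"]
    ACTIONS[(member, "Sharded")] = ["remove"]
    ACTIONS[(member, "ReplicaSetNoPrimary")] = ["updateRSWithoutPrimary"]
    ACTIONS[(member, "ReplicaSetWithPrimary")] = ["updateRSWithPrimaryFromMember"]
```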
TopologyType explanations
This subsection complements the TopologyType table with prose explanations of the TopologyTypes (besides Single and LoadBalanced).
TopologyType Unknown
A starting state.
Actions:
- If the incoming ServerType is Unknown (that is, the hello or legacy hello call failed), keep the server in TopologyDescription.servers. The TopologyType remains Unknown.
- The TopologyType remains Unknown when an RSGhost is discovered, too.
- If the type is Standalone, run updateUnknownWithStandalone.
- If the type is Mongos, set the TopologyType to Sharded.
- If the type is RSPrimary, record its setName and call updateRSFromPrimary.
- If the type is RSSecondary, RSArbiter or RSOther, record its setName, set the TopologyType to ReplicaSetNoPrimary, and call updateRSWithoutPrimary.
TopologyType Sharded
A steady state. Connected to one or more mongoses.
Actions:
- If the server is Unknown or Mongos, keep it.
- Remove others.
TopologyType ReplicaSetNoPrimary
A starting state. The topology is definitely a replica set, but no primary is known.
Actions:
- Keep Unknown servers.
- Keep RSGhost servers: they are members of some replica set, perhaps this one, and may recover. (See RSGhost and RSOther.)
- Remove any Standalones or Mongoses.
- If the type is RSPrimary call updateRSFromPrimary.
- If the type is RSSecondary, RSArbiter or RSOther, run updateRSWithoutPrimary.
TopologyType ReplicaSetWithPrimary
A steady state. The primary is known.
Actions:
- If the server type is Unknown, keep it, and run checkIfHasPrimary.
- Keep RSGhost servers: they are members of some replica set, perhaps this one, and may recover. (See RSGhost and RSOther.) Run checkIfHasPrimary.
- Remove any Standalones or Mongoses and run checkIfHasPrimary.
- If the type is RSPrimary run updateRSFromPrimary.
- If the type is RSSecondary, RSArbiter or RSOther, run updateRSWithPrimaryFromMember.
Actions
updateUnknownWithStandalone
This subroutine is executed with the ServerDescription from Standalone when the TopologyType is Unknown:
if description.address not in topologyDescription.servers:
return
if settings.seeds has one seed:
topologyDescription.type = Single
else:
remove this server from topologyDescription and stop monitoring it
See TopologyType remains Unknown when one of the seeds is a Standalone.
updateRSWithoutPrimary
This subroutine is executed with the ServerDescription from an RSSecondary, RSArbiter, or RSOther when the TopologyType is ReplicaSetNoPrimary:
if description.address not in topologyDescription.servers:
return
if topologyDescription.setName is null:
topologyDescription.setName = description.setName
else if topologyDescription.setName != description.setName:
remove this server from topologyDescription and stop monitoring it
return
for each address in description's "hosts", "passives", and "arbiters":
if address is not in topologyDescription.servers:
add new default ServerDescription of type "Unknown"
begin monitoring the new server
if description.primary is not null:
find the ServerDescription in topologyDescription.servers whose
address equals description.primary
if its type is Unknown, change its type to PossiblePrimary
if description.address != description.me:
remove this server from topologyDescription and stop monitoring it
return
Unlike updateRSFromPrimary, this subroutine does not remove any servers from the TopologyDescription based on the list of servers in the "hosts" field of the hello or legacy hello response. The only server that might be removed is the server itself that the hello or legacy hello response is from.
The special handling of description.primary ensures that a single-threaded client scans the possible primary before other members.
See replica set monitoring with and without a primary.
updateRSWithPrimaryFromMember
This subroutine is executed with the ServerDescription from an RSSecondary, RSArbiter, or RSOther when the TopologyType is ReplicaSetWithPrimary:
if description.address not in topologyDescription.servers:
# While we were checking this server, another thread heard from the
# primary that this server is not in the replica set.
return
# SetName is never null here.
if topologyDescription.setName != description.setName:
remove this server from topologyDescription and stop monitoring it
checkIfHasPrimary()
return
if description.address != description.me:
remove this server from topologyDescription and stop monitoring it
checkIfHasPrimary()
return
# Had this member been the primary?
if there is no primary in topologyDescription.servers:
topologyDescription.type = ReplicaSetNoPrimary
if description.primary is not null:
find the ServerDescription in topologyDescription.servers whose
address equals description.primary
if its type is Unknown, change its type to PossiblePrimary
The special handling of description.primary ensures that a single-threaded client scans the possible primary before other members.
updateRSFromPrimary
This subroutine is executed with a ServerDescription of type RSPrimary:
if serverDescription.address not in topologyDescription.servers:
return
if topologyDescription.setName is null:
topologyDescription.setName = serverDescription.setName
else if topologyDescription.setName != serverDescription.setName:
# We found a primary but it doesn't have the setName
# provided by the user or previously discovered.
remove this server from topologyDescription and stop monitoring it
checkIfHasPrimary()
return
# Election ids are ObjectIds; see
# "Using electionId and setVersion to detect stale primaries"
# for comparison rules.
if serverDescription.maxWireVersion >= 17: # MongoDB 6.0+
# Null values for both electionId and setVersion are always considered less than
if serverDescription.electionId > topologyDescription.maxElectionId or (
serverDescription.electionId == topologyDescription.maxElectionId
and serverDescription.setVersion >= topologyDescription.maxSetVersion
):
topologyDescription.maxElectionId = serverDescription.electionId
topologyDescription.maxSetVersion = serverDescription.setVersion
else:
# Stale primary.
# The error field MUST include the substring "primary marked stale due to electionId/setVersion mismatch"
replace serverDescription with a default ServerDescription of type "Unknown"
checkIfHasPrimary()
return
else:
# Maintain old comparison rules, namely setVersion is checked before electionId
if serverDescription.setVersion is not null and serverDescription.electionId is not null:
if (
topologyDescription.maxSetVersion is not null
and topologyDescription.maxElectionId is not null
and (
topologyDescription.maxSetVersion > serverDescription.setVersion
or (
topologyDescription.maxSetVersion == serverDescription.setVersion
and topologyDescription.maxElectionId > serverDescription.electionId
)
)
):
# Stale primary.
# The error field MUST include the substring "primary marked stale due to electionId/setVersion mismatch"
replace serverDescription with a default ServerDescription of type "Unknown"
checkIfHasPrimary()
return
topologyDescription.maxElectionId = serverDescription.electionId
if serverDescription.setVersion is not null and (
topologyDescription.maxSetVersion is null
or serverDescription.setVersion > topologyDescription.maxSetVersion
):
topologyDescription.maxSetVersion = serverDescription.setVersion
for each server in topologyDescription.servers:
if server.address != serverDescription.address:
if server.type is RSPrimary:
# See note below about invalidating an old primary.
# the error field MUST include the substring "primary marked stale due to discovery of newer primary"
replace the server with a default ServerDescription of type "Unknown"
for each address in serverDescription's "hosts", "passives", and "arbiters":
if address is not in topologyDescription.servers:
add new default ServerDescription of type "Unknown"
begin monitoring the new server
for each server in topologyDescription.servers:
if server.address not in serverDescription's "hosts", "passives", or "arbiters":
remove the server and stop monitoring it
checkIfHasPrimary()
A note on invalidating the old primary: when a new primary is discovered, the client finds the previous primary (there
should be none or one) and replaces its description with a default ServerDescription of type "Unknown". Additionally,
the error
field of the new ServerDescription
object MUST include a descriptive error explaining that it was
invalidated because the primary was determined to be stale. A multi-threaded client MUST
request an immediate check for that server as soon as possible.
If the old primary server version is 4.0 or earlier, the client MUST clear its connection pool for the old primary, too: the connections are all bad because the old primary has closed its sockets. If the old primary server version is 4.2 or newer, the client MUST NOT clear its connection pool for the old primary.
See replica set monitoring with and without a primary.
If the server is a primary with an obsolete electionId or setVersion, it is likely a stale primary that is going to step down. Mark it Unknown and let periodic monitoring detect when it becomes secondary. See using electionId and setVersion to detect stale primaries. Drivers MAY additionally specify whether this was due to an electionId or setVersion mismatch, as described in the ServerDescription.error section.
A note on checking "me": Unlike updateRSWithPrimaryFromMember
, there is no need to remove the server if the address is
not equal to "me": since the server address will not be a member of either "hosts", "passives", or "arbiters", the
server will already have been removed.
checkIfHasPrimary
Set TopologyType to ReplicaSetWithPrimary if there is an RSPrimary in TopologyDescription.servers, otherwise set it to ReplicaSetNoPrimary.
For example, if the TopologyType is ReplicaSetWithPrimary and the client is processing a new ServerDescription of type Unknown, that could mean the primary just disconnected, so checkIfHasPrimary must run to check if the TopologyType should become ReplicaSetNoPrimary.
Another example is if the client first reaches the primary via its external IP, but the response's host list includes only internal IPs. In that case the client adds the primary's internal IP to the TopologyDescription and begins monitoring it, and removes the external IP. Right after removing the external IP from the description, the TopologyType MUST be ReplicaSetNoPrimary, since no primary is available at this moment.
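Unlike the other actions, checkIfHasPrimary needs only a few lines. A non-normative sketch, using the map-of-servers representation from the earlier sketches:

```python
def check_if_has_primary(topology_description) -> None:
    """checkIfHasPrimary, as described above."""
    if any(s.type == "RSPrimary" for s in topology_description.servers.values()):
        topology_description.type = "ReplicaSetWithPrimary"
    else:
        topology_description.type = "ReplicaSetNoPrimary"
```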
remove
Remove the server from TopologyDescription.servers and stop monitoring it.
In multi-threaded clients, a monitor may be currently checking this server and may not immediately abort. Once the check completes, this server's hello or legacy hello outcome MUST be ignored, and the monitor SHOULD halt.
Logical Session Timeout
Whenever a client updates the TopologyDescription from a hello or legacy hello response, it MUST set TopologyDescription.logicalSessionTimeoutMinutes to the smallest logicalSessionTimeoutMinutes value among ServerDescriptions of all data-bearing server types. If any have a null logicalSessionTimeoutMinutes, then TopologyDescription.logicalSessionTimeoutMinutes MUST be set to null.
See the Driver Sessions Spec for the purpose of this value.
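A non-normative sketch of this computation, where DATA_BEARING_TYPES lists the data-bearing server types defined in Terms:

```python
from typing import Optional

DATA_BEARING_TYPES = {"Mongos", "RSPrimary", "RSSecondary", "Standalone", "LoadBalanced"}

def logical_session_timeout(topology_description) -> Optional[int]:
    """Smallest logicalSessionTimeoutMinutes among data-bearing servers;
    None if any of them reports null (or if there are none)."""
    timeouts = [s.logicalSessionTimeoutMinutes
                for s in topology_description.servers.values()
                if s.type in DATA_BEARING_TYPES]
    if not timeouts or any(t is None for t in timeouts):
        return None
    return min(timeouts)
```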
Connection Pool Management
For drivers that support connection pools, after a server check completes successfully, if the server is determined to be data-bearing or a direct connection to the server is requested, and the server does not already have a connection pool, the driver MUST create the connection pool for the server. Additionally, if a driver implements a CMAP compliant connection pool, the server's pool (even if it already existed) MUST be marked as "ready". See the Server Monitoring spec for more information.
Clearing the connection pool for a server MUST be synchronized with the update to the corresponding ServerDescription (e.g. by holding the lock on the TopologyDescription when clearing the pool). This prevents a possible race between the monitors and application threads. See Why synchronize clearing a server's pool with updating the topology? for more information.
Error handling
Network error during server check
See error handling in the Server Monitoring spec.
Application errors
When processing a network or command error, clients MUST first check the error's generation number. If the error's generation number is equal to the pool's generation number then error handling MUST continue according to Network error when reading or writing or "not writable primary" and "node is recovering". Otherwise, the error is considered stale and the client MUST NOT update any topology state. (See Why ignore errors based on CMAP's generation number?)
Error handling pseudocode
Application operations can fail in various places, for example:
- A network error, network timeout, or command error may occur while establishing a new connection. Establishing a connection includes the MongoDB handshake and completing authentication (if configured).
- A network error or network timeout may occur while reading or writing to an established connection.
- A command error may be returned from the server.
- A "writeConcernError" field may be included in the command response.
Depending on the context, these errors may update SDAM state by marking the server Unknown and may clear the server's connection pool. Some errors also require other side effects, like cancelling a check or requesting an immediate check. Drivers may use the following pseudocode to guide their implementation:
def handleError(error):
    address = error.address
    topologyVersion = error.topologyVersion

    with client.lock:
        # Ignore stale errors based on generation and topologyVersion.
        if isStaleError(client.topologyDescription, error):
            return

        if isStateChangeError(error):
            # Don't mark the server Unknown in load balanced mode.
            if type != LoadBalanced:
                # Mark the server Unknown.
                unknown = new ServerDescription(type=Unknown, error=error, topologyVersion=topologyVersion)
                onServerDescriptionChanged(unknown, connection pool for server)
            if isShutdown(code) or (error was from <4.2):
                # The pools must only be cleared while the lock is held.
                if type == LoadBalanced:
                    clear connection pool for serviceId
                else:
                    clear connection pool for server
            if multi-threaded:
                request immediate check
            else:
                # Check right now if this is "not writable primary", since it might be a
                # useful secondary. If it's "node is recovering" leave it for the
                # next full scan.
                if isNotWritablePrimary(error):
                    check failing server
        elif isNetworkError(error) or (not error.completedHandshake and (isNetworkTimeout(error) or isAuthError(error))):
            if type != LoadBalanced:
                # Mark the server Unknown.
                unknown = new ServerDescription(type=Unknown, error=error)
                onServerDescriptionChanged(unknown, connection pool for server)
                clear connection pool for server
            else:
                if serviceId:
                    clear connection pool for serviceId
            # Cancel the in-progress check.
            cancel monitor check
def isStaleError(topologyDescription, error):
    currentServer = topologyDescription.servers[error.address]
    currentGeneration = currentServer.pool.generation
    generation = get connection generation from error
    if generation < currentGeneration:
        # Stale generation number.
        return True

    currentTopologyVersion = currentServer.topologyVersion
    # The error is stale if its topologyVersion is less than or equal to the
    # current ServerDescription's. We use >= instead of > because any state
    # change should result in a new topologyVersion.
    return compareTopologyVersion(currentTopologyVersion, error.commandResponse.get("topologyVersion")) >= 0
The following pseudocode checks a response for a "not master" or "node is recovering" error:
recoveringCodes = [11600, 11602, 13436, 189, 91]
notWritablePrimaryCodes = [10107, 13435, 10058]
shutdownCodes = [11600, 91]
def isRecovering(message, code):
if code:
if code in recoveringCodes:
return true
else:
# if no code, use the error message.
return ("not master or secondary" in message
or "node is recovering" in message)
def isNotWritablePrimary(message, code):
if code:
if code in notWritablePrimaryCodes:
return true
else:
# if no code, use the error message.
if isRecovering(message, None):
return false
return ("not master" in message)
def isShutdown(code):
if code and code in shutdownCodes:
return true
return false
def isStateChangeError(error):
message = error.errmsg
code = error.code
return isRecovering(message, code) or isNotWritablePrimary(message, code)
def parseGle(response):
if "err" in response:
handleError(CommandError(response, response["err"], response["code"]))
# Parse response to any command
def parseCommandResponse(response):
if not response["ok"]:
handleError(CommandError(response, response["errmsg"], response["code"]))
else if response["writeConcernError"]:
wce = response["writeConcernError"]
handleError(WriteConcernError(response, wce["errmsg"], wce["code"]))
def parseQueryResponse(response):
if the "QueryFailure" bit is set in response flags:
handleError(CommandError(response, response["$err"], response["code"]))
The following sections describe the handling of different classes of application errors in detail including network errors, network timeout errors, state change errors, and authentication errors.
Network error when reading or writing
To describe how the client responds to network errors during application operations, we distinguish two phases of connecting to a server and using it for application operations:
- Before the handshake completes: the client establishes a new connection to the server and completes an initial handshake by calling "hello" or legacy hello and reading the response, and optionally completing authentication
- After the handshake completes: the client uses the established connection for application operations
If there is a network error or timeout on the connection before the handshake completes, the client MUST replace the server's description with a default ServerDescription of type Unknown when the TopologyType is not LoadBalanced, and fill the ServerDescription's error field with useful information.
If there is a network error or timeout on the connection before the handshake completes, and the TopologyType is LoadBalanced, the client MUST keep the ServerDescription as LoadBalancer.
If there is a network timeout on the connection after the handshake completes, the client MUST NOT mark the server Unknown. (A timeout may indicate a slow operation on the server, rather than an unavailable server.) If, however, there is some other network error on the connection after the handshake completes, the client MUST replace the server's description with a default ServerDescription of type Unknown if the TopologyType is not LoadBalanced, and fill the ServerDescription's error field with useful information, the same as if an error or timeout occurred before the handshake completed.
When the client marks a server Unknown due to a network error or timeout, the Unknown ServerDescription MUST be sent through the same process for updating the TopologyDescription as if it had been a failed hello or legacy hello outcome from a server check: for example, if the TopologyType is ReplicaSetWithPrimary and a write to the RSPrimary server fails because of a network error (other than timeout), then a new ServerDescription is created for the primary, with type Unknown, and the client executes the proper subroutine for an Unknown server when the TopologyType is ReplicaSetWithPrimary: referring to the table above we see the subroutine is checkIfHasPrimary. The result is the TopologyType changes to ReplicaSetNoPrimary. See the test scenario called "Network error writing to primary".
The client MUST close all idle sockets in its connection pool for the server: if one socket is bad, it is likely that all are.
Clients MUST NOT request an immediate check of the server; since application sockets are used frequently, a network error likely means the server has just become unavailable, so an immediate refresh is likely to get a network error, too.
The server will not remain Unknown forever. It will be refreshed by the next periodic check or, if an application operation needs the server sooner than that, then a re-check will be triggered by the server selection algorithm.
"not writable primary" and "node is recovering"
These errors are detected from a write command response or query response. Clients MUST check if the server error is a "node is recovering" error or a "not writable primary" error.
If the response includes an error code, it MUST be solely used to determine if the error is a "node is recovering" or "not writable primary" error. Clients MUST match the errors by the numeric error code and not by the code name, as the code name can change from one server version to the next.
The following error codes indicate a replica set member is temporarily unusable. These are called "node is recovering" errors:
Error Name | Error Code |
---|---|
InterruptedAtShutdown | 11600 |
InterruptedDueToReplStateChange | 11602 |
NotPrimaryOrSecondary | 13436 |
PrimarySteppedDown | 189 |
ShutdownInProgress | 91 |
And the following error codes indicate a "not writable primary" error:
Error Name | Error Code |
---|---|
NotWritablePrimary | 10107 |
NotPrimaryNoSecondaryOk | 13435 |
LegacyNotPrimary | 10058 |
Clients MUST fall back to checking the error message if and only if the response does not include an error code. The error is considered a "node is recovering" error if the substrings "node is recovering" or "not master or secondary" are anywhere in the error message. Otherwise, if the substring "not master" is in the error message it is a "not writable primary" error.
Additionally, if the response includes a write concern error, then the code and message of the write concern error MUST be checked the same way a response error is checked above.
Errors contained within the writeErrors field MUST NOT be checked.
See the test scenario called "parsing 'not writable primary' and 'node is recovering' errors" for example response documents.
When the client sees a "not writable primary" or "node is recovering" error and the error's topologyVersion is strictly greater than the current ServerDescription's topologyVersion it MUST replace the server's description with a ServerDescription of type Unknown. Clients MUST store useful information in the new ServerDescription's error field, including the error message from the server. Clients MUST store the error's topologyVersion field in the new ServerDescription if present. (See What is the purpose of topologyVersion?)
Multi-threaded and asynchronous clients MUST request an immediate check of the server. Unlike in the "network error" scenario above, a "not writable primary" or "node is recovering" error means the server is available but the client is wrong about its type, thus an immediate re-check is likely to provide useful information.
For single-threaded clients, in the case of a "not writable primary" or "node is shutting down" error, the client MUST mark the topology as "stale" so the next server selection scans all servers. For a "node is recovering" error, single-threaded clients MUST NOT mark the topology as "stale". If a node is recovering for some time, an immediate scan may not gain useful information.
The following subset of "node is recovering" errors is defined to be "node is shutting down" errors:
Error Name | Error Code |
---|---|
InterruptedAtShutdown | 11600 |
ShutdownInProgress | 91 |
When handling a "not writable primary" or "node is recovering" error, the client MUST clear the server's connection pool if and only if the error is "node is shutting down" or the error originated from server version < 4.2.
(See when does a client see "not writable primary" or "node is recovering"?, use error messages to detect "not master" and "node is recovering", and other transient errors and Why close connections when a node is shutting down?.)
Authentication and Handshake errors
If the driver encounters errors when establishing application connections (this includes the initial handshake and authentication), the driver MUST mark the server Unknown and clear the server's connection pool if the TopologyType is not LoadBalanced. (See Why mark a server Unknown after an auth error?)
Monitoring SDAM events
The required driver specification for providing lifecycle hooks into server discovery and monitoring for applications to consume can be found in the SDAM Monitoring Specification.
Implementation notes
This section intends to provide generous guidance to driver authors. It is complementary to the reference implementations. Words like "should", "may", and so on are used more casually here.
See also, the implementation notes in the Server Monitoring spec.
Multi-threaded or asynchronous server selection
While no suitable server is available for an operation, the client MUST re-check all servers every minHeartbeatFrequencyMS. (See requesting an immediate check.)
Single-threaded server selection
When a client that uses single-threaded monitoring fails to select a suitable server for any operation, it scans the servers, then attempts selection again, to see if the scan discovered suitable servers. It repeats, waiting minHeartbeatFrequencyMS after each scan, until a timeout.
Documentation
Giant seed lists
Drivers' manuals should warn against huge seed lists, since they will slow initialization for single-threaded clients and generate load for multi-threaded and asynchronous drivers.
Multi-threaded
Warning about the maxWireVersion from a monitor's hello or legacy hello response
Clients consult some fields from a server's hello or legacy hello response to decide how to communicate with it:
- maxWireVersion
- maxBsonObjectSize
- maxMessageSizeBytes
- maxWriteBatchSize
It is tempting to take these values from the last hello or legacy hello response a monitor received and store them in the ServerDescription, but this is an anti-pattern. Multi-threaded and asynchronous clients that do so are prone to several classes of race, for example:
- Setup: A MongoDB 3.0 Standalone with authentication enabled, the client must log in with SCRAM-SHA-1.
- The monitor thread discovers the server and stores maxWireVersion on the ServerDescription
- An application thread wants a socket, selects the Standalone, and is about to check the maxWireVersion on its ServerDescription when...
- The monitor thread gets disconnected from server and marks it Unknown, with default maxWireVersion of 0.
- The application thread resumes, creates a socket, and attempts to log in using MONGODB-CR, since maxWireVersion is now reported as 0.
- Authentication fails, the server requires SCRAM-SHA-1.
Better to call hello or legacy hello for each new socket, as required by the Auth Spec, and use the hello or legacy hello response associated with that socket for maxWireVersion, maxBsonObjectSize, etc.: all the fields required to correctly communicate with the server.
The hello or legacy hello responses received by monitors determine if the topology as a whole is compatible with the driver, and which servers are suitable for selection. The monitors' responses should not be used to determine how to format wire protocol messages to the servers.
Immutable data
Multi-threaded drivers should treat ServerDescriptions and TopologyDescriptions as immutable: the client replaces them, rather than modifying them, in response to new information about the topology. Thus readers of these data structures can simply acquire a reference to the current one and read it, without holding a lock that would block a monitor from making further updates.
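A non-normative sketch of this copy-and-swap pattern (class and method names are illustrative):

```python
import threading

class Client:
    """Sketch of the copy-and-swap pattern for an immutable topology."""

    def __init__(self, initial_description):
        self._lock = threading.Lock()
        self._topology_description = initial_description

    @property
    def topology_description(self):
        # Readers grab a reference to the current immutable snapshot;
        # they never observe a half-applied update and need no lock.
        return self._topology_description

    def _publish(self, new_description):
        # Monitors build a complete replacement and swap it in under
        # the lock, so updates serialize without blocking readers.
        with self._lock:
            self._topology_description = new_description
```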
Process one hello or legacy hello outcome at a time
Although servers are checked in parallel, the function that actually creates the new TopologyDescription should be synchronized so only one thread can run it at a time.
Replacing the TopologyDescription
Drivers may use the following pseudocode to guide their implementation. The client object has a lock and a condition variable. It uses the lock to ensure that only one new ServerDescription is processed at a time, and it must be acquired before invoking this function. Once the client has taken the lock it must do no I/O:
def onServerDescriptionChanged(server, pool):
    # "server" is the new ServerDescription.
    # "pool" is the pool associated with the server.

    if server.address not in client.topologyDescription.servers:
        # The server was once in the topologyDescription, otherwise
        # we wouldn't have been monitoring it, but an intervening
        # state-change removed it. E.g., we got a host list from
        # the primary that didn't include this server.
        return

    newTopologyDescription = client.topologyDescription.copy()

    # Ignore this update if the current topologyVersion is greater than
    # the new ServerDescription's.
    if isStaleServerDescription(client.topologyDescription, server):
        return

    # Replace server's previous description.
    address = server.address
    newTopologyDescription.servers[address] = server

    # For drivers that implement CMAP, mark the connection pool as ready
    # after a successful check.
    if (server.type in (Mongos, RSPrimary, RSSecondary, Standalone, LoadBalanced)
            or (server.type != Unknown and newTopologyDescription.type == Single)):
        pool.ready()

    take any additional actions,
    depending on the TopologyType and server...

    # Replace TopologyDescription and notify waiters.
    client.topologyDescription = newTopologyDescription
    client.condition.notifyAll()
def compareTopologyVersion(tv1, tv2):
"""Return -1 if tv1<tv2, 0 if tv1==tv2, 1 if tv1>tv2"""
if tv1 is None or tv2 is None:
# Assume greater.
return -1
pid1 = tv1['processId']
pid2 = tv2['processId']
if pid1 == pid2:
counter1 = tv1['counter']
counter2 = tv2['counter']
if counter1 == counter2:
return 0
elif counter1 < counter2:
return -1
else:
return 1
else:
# Assume greater.
return -1
def isStaleServerDescription(topologyDescription, server):
    # The new ServerDescription is stale if its topologyVersion is strictly
    # less than the current server's. (An equal topologyVersion still
    # replaces the description, to keep attributes like lastUpdateTime and
    # round trip time up to date.)
    currentServer = topologyDescription.servers[server.address]
    currentTopologyVersion = currentServer.topologyVersion
    return compareTopologyVersion(currentTopologyVersion, server.topologyVersion) > 0
Notifying the condition unblocks threads waiting in the server-selection loop for a suitable server to be discovered.
[!NOTE] The Java driver uses a CountDownLatch instead of a condition variable, and it atomically swaps the old and new CountDownLatches so it does not need "client.lock". It does, however, use a lock to ensure that only one thread runs onServerDescriptionChanged at a time.
Rationale
Clients do no I/O in the constructor
An alternative proposal was to distinguish between "discovery" and "monitoring". When discovery begins, the client checks all its seeds, and discovery is complete once all servers have been checked, or after some maximum time. Application operations cannot proceed until discovery is complete.
If the discovery phase is distinct, then single- and multi-threaded drivers could accomplish discovery in the constructor, and throw an exception from the constructor if the deployment is unavailable or misconfigured. This is consistent with prior behavior for many drivers. It will surprise some users that the constructor now succeeds, but all operations fail.
Similarly for misconfigured seed lists: the client may discover a mix of mongoses and standalones, or find multiple replica set names. It may surprise some users that the constructor succeeds and the client attempts to proceed with a compatible subset of the deployment.
Nevertheless, this spec prohibits I/O in the constructor for the following reasons:
Common case
In the common case, the deployment is available and usable. This spec favors allowing operations to proceed as soon as possible in the common case, at the cost of surprising behavior in uncommon cases.
Simplicity
It is simpler to omit a special discovery phase and treat all server checks the same.
Consistency
Asynchronous clients cannot do I/O in a constructor, so it is consistent to prohibit I/O in other clients' constructors as well.
Restarts
If clients can be constructed when the deployment is in some states but not in other states, it leads to an unfortunate scenario: When the deployment is passing through a strange state, long-running clients may keep working, but any clients restarted during this period fail.
Say an administrator changes one replica set member's setName. Clients that are already constructed remove the bad member and stay usable, but if any client is restarted its constructor fails. Web servers that dynamically adjust their process pools will show particularly undesirable behavior.
heartbeatFrequencyMS defaults to 10 seconds or 60 seconds
Many drivers have different values. The time has come to standardize. Lacking a rigorous methodology for calculating the best frequency, this spec chooses 10 seconds for multi-threaded or asynchronous drivers because some already use that value.
Because scanning has a greater impact on the performance of single-threaded drivers, they MUST default to a longer frequency (60 seconds).
An alternative is to check servers less and less frequently the longer they remain unchanged. This idea is rejected because it is a goal of this spec to answer questions about monitoring such as,
- "How rapidly can I rotate a replica set to a new set of hosts?"
- "How soon after I add a secondary will query load be rebalanced?"
- "How soon will a client notice a change in round trip time, or tags?"
Having a constant monitoring frequency allows us to answer these questions simply and definitively. Losing the ability to answer these questions is not worth any minor gain in efficiency from a more complex scheduling method.
The client MUST re-check all servers every minHeartbeatFrequencyMS
While an application is waiting to do an operation for which there is no suitable server, a multi-threaded client MUST re-check all servers very frequently. The slight cost is worthwhile in many scenarios. For example:
- A client and a MongoDB server are started simultaneously.
- The client checks the server before it begins listening, so the check fails.
- The client waits in the server-selection loop for the topology to change.
In this state, the client should check the server very frequently, to give it ample opportunity to connect to the server before timing out in server selection.
No knobs
This spec does not intend to introduce any new configuration options unless absolutely necessary.
The client MUST monitor arbiters
Mongos 2.6 does not monitor arbiters, but it costs little to do so, and in the rare case that all data members are moved to new hosts in a short time, an arbiter may be the client's last hope to find the new replica set configuration.
Only support replica set members running MongoDB 1.6.2 or later
Replica set members began reporting their setNames in that version. Supporting earlier versions is impractical.
TopologyType remains Unknown when an RSGhost is discovered
If the TopologyType is Unknown and the client receives a hello or legacy hello response from an RSGhost, the TopologyType could be set to ReplicaSetNoPrimary. However, an RSGhost does not report its setName, so the setName would still be unknown. This adds an additional state to the existing list: "TopologyType ReplicaSetNoPrimary and no setName." The additional state adds substantial complexity without any benefit, so this spec says clients MUST NOT change the TopologyType when an RSGhost is discovered.
TopologyType remains Unknown when one of the seeds is a Standalone
If TopologyType is Unknown and there are multiple seeds, and one of them is discovered to be a standalone, it MUST be removed. The TopologyType remains Unknown.
This rule supports the following common scenario:
- Servers A and B are in a replica set.
- A seed list with A and B is stored in a configuration file.
- An administrator removes B from the set and brings it up as standalone for maintenance, without changing its port number.
- The client is initialized with seeds A and B, TopologyType Unknown, and no setName.
- The first hello or legacy hello response is from B, the standalone.
What if the client changed TopologyType to Single at this point? It would be unable to use the replica set; it would have to remove A from the TopologyDescription once A's hello or legacy hello response arrives.
The user's intent in this case is clearly to use the replica set, despite the outdated seed list. So this spec requires clients to remove B from the TopologyDescription and keep the TopologyType as Unknown. Then when A's response arrives, the client can set its TopologyType to ReplicaSet (with or without primary).
On the other hand, if there is only one seed and the seed is discovered to be a Standalone, the TopologyType MUST be set to Single.
See the "member brought up as standalone" test scenario.
Replica set monitoring with and without a primary
The client strives to fill the "servers" list only with servers that the primary said were members of the replica set, when the client most recently contacted the primary.
The primary's view of the replica set is authoritative for two reasons:
- The primary is never on the minority side of a network partition. During a partition it is the primary's list of servers the client should use.
- Since reconfigs must be executed on the primary, the primary is the first to know of them. Reconfigs propagate to non-primaries eventually, but the client can receive hello or legacy hello responses from non-primaries that reflect any past state of the replica set. See the "Replica set discovery" test scenario.
If at any time the client believes there is no primary, the TopologyDescription's type is set to ReplicaSetNoPrimary. While there is no known primary, the client MUST add servers from non-primaries' host lists, but it MUST NOT remove servers from the TopologyDescription.
Eventually, when a primary is discovered, any hosts not in the primary's host list are removed.
Using electionId and setVersion to detect stale primaries
Replica set members running MongoDB 2.6.10+ or 3.0+ include an integer called "setVersion" and an ObjectId called "electionId" in their hello or legacy hello response. Starting with MongoDB 3.2.0, replica sets can use two different replication protocol versions; electionIds from one protocol version must not be compared to electionIds from a different protocol.
Because protocol version changes require replica set reconfiguration, clients use the tuple (electionId, setVersion) to detect stale primaries. The tuple MUST be compared in the order of electionId followed by setVersion, since that order of comparison guarantees monotonicity.
The client remembers the greatest electionId and setVersion reported by a primary, and distrusts primaries from older electionIds or from the same electionId but with lesser setVersion.
- It compares electionIds as 12-byte sequences, i.e., with a bytewise memory comparison.
- It compares setVersions as integer values.
This prevents the client from oscillating between the old and new primary during a split-brain period, and helps provide read-your-writes consistency with write concern "majority" and read preference "primary".
Prior to MongoDB server version 6.0, drivers used the opposite order from the server-side Replica Set Management logic, comparing the tuple by setVersion before electionId. In order to remain compatible with backup systems and similar tooling, drivers continue to use the reversed ordering when connected to a topology that reports a maxWireVersion less than 17. When connected to server versions 6.0 and beyond, drivers MUST order the tuple by electionId then setVersion.
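As a minimal sketch (not taken from the spec), the 6.0+ comparison can be expressed with Python's lexicographic tuple ordering; the helper name is_stale_primary and the inline values are illustrative, assuming electionIds are held as 12-byte sequences and setVersions as integers:

```python
# Hypothetical helper: True if a newly observed primary should be distrusted.
# Python compares tuples lexicographically and bytes bytewise, which matches
# the required (electionId, setVersion) comparison order for 6.0+ topologies.
def is_stale_primary(max_tuple, election_id, set_version):
    return (election_id, set_version) < max_tuple

max_tuple = (bytes.fromhex("00" * 11 + "02"), 1)  # greatest pair seen so far
old_primary = (bytes.fromhex("00" * 11 + "01"), 50)

# Distrusted despite its higher setVersion, because its electionId is older.
assert is_stale_primary(max_tuple, *old_primary)
```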
Requirements for read-your-writes consistency
Using (electionId, setVersion) only provides read-your-writes consistency if:

- The application uses the same MongoClient instance for write-concern "majority" writes and read-preference "primary" reads, and
- All members use MongoDB 2.6.10+, 3.0.0+, or 3.2.0+ with replication protocol 0 and clocks are skewed less than 30 seconds, or
- All members run MongoDB 3.2.0 with replication protocol 1 and clocks are skewed less than the election timeout (electionTimeoutMillis, which defaults to 10 seconds), or
- All members run MongoDB 3.2.1+ with replication protocol 1 (in which case clocks need not be synchronized).
Scenario
Consider the following situation:
- Server A is primary.
- A network partition isolates A from the set, but the client still sees it.
- Server B is elected primary.
- The client discovers that B is primary, does a write-concern "majority" write operation on B and receives acknowledgment.
- The client receives a hello or legacy hello response from A, claiming A is still primary.
- If the client trusts that A is primary, the next read-preference "primary" read sees stale data from A that may not include the write sent to B.
See SERVER-17975, "Stale reads with WriteConcern Majority and ReadPreference Primary."
Detecting a stale primary
To prevent this scenario, the client uses electionId and setVersion to determine which primary was elected last. In this case, it would not consider "A" a primary, nor read from it because server B will have a greater electionId but the same setVersion.
Monotonicity
The electionId is an ObjectId compared bytewise in order (i.e., 000000000000000000000001 > 000000000000000000000000, FF0000000000000000000000 > FE0000000000000000000000, etc.).

In some server versions it is monotonic with respect to a particular server's system clock, but it is not globally monotonic across a deployment. However, if inter-server clock skew is small, it can be treated as a monotonic value.
In MongoDB 2.6.10+ (which has SERVER-13542 backported), MongoDB 3.0.0+ or MongoDB 3.2+ (under replication protocol version 0), the electionId's leading bytes are a server timestamp. As long as server clocks are skewed less than 30 seconds, electionIds can be reliably compared. (This is precise enough, because in replication protocol version 0, servers are designed not to complete more than one election every 30 seconds. Elections do not take 30 seconds--they are typically much faster than that--but there is a 30-second cooldown before the next election can complete.)
Beginning in MongoDB 3.2.0, under replication protocol version 1, the electionId begins with a timestamp, but the cooldown is shorter. As long as inter-server clock skew is less than the configured election timeout (electionTimeoutMillis, which defaults to 10 seconds), electionIds can be reliably compared.
Beginning in MongoDB 3.2.1, under replication protocol version 1, the electionId is guaranteed monotonic without relying on any clock synchronization.
Using me field to detect seed list members that do not match host names in the replica set configuration
Removing from the topology any seed list member whose "me" field does not match the address used to connect prevents the client from selecting a server only to fail to re-select that same server once the primary has responded.
This scenario illustrates the problems that arise if this is NOT done:
- The client specifies a seed list of A, B, C
- Server A responds as a secondary with hosts D, E, F
- The client executes a query with read preference of secondary, and server A is selected
- Server B responds as a primary with hosts D, E, F. Servers A, B, C are removed, as they don't appear in the primary's hosts list
- The client iterates the cursor and attempts to execute a getMore against server A.
- Server selection fails because server A is no longer part of the topology.
With checking for "me" in place, it looks like this instead:
- The client specifies a seed list of A, B, C
- Server A responds as a secondary with hosts D, E, F, where "me" is D, and so the client adds D, E, F as type "Unknown" and starts monitoring them, but removes A from the topology.
- The client executes a query with read preference of secondary, and goes into the server selection loop
- Server D responds as a secondary where "me" is D
- Server selection completes by matching D
- The client iterates the cursor and attempts to execute a getMore against server D.
- Server selection completes by matching D.
Ignore setVersion unless the server is primary
It was thought that if all replica set members report a setVersion, and a secondary's response has a higher setVersion than any seen, that the secondary's host list could be considered as authoritative as the primary's. (See Replica set monitoring with and without a primary.)
This scenario illustrates the problem with setVersion:
- We have a replica set with servers A, B, and C.
- Server A is the primary, with setVersion 4.
- An administrator runs replSetReconfig on A, which increments its setVersion to 5.
- The client checks Server A and receives the new config.
- Server A crashes before any secondary receives the new config.
- Server B is elected primary. It has the old setVersion 4.
- The client ignores B's version of the config because its setVersion is not greater than 5.
The client may never correct its view of the topology.
Even worse:
- An administrator runs replSetReconfig on Server B, which increments its setVersion to 5.
- Server A restarts. This results in two versions of the config, both claiming to be version 5.
If the client trusted the setVersion in this scenario, it would trust whichever config it received first.
mongos 2.6 ignores setVersion and only trusts the primary. This spec requires all clients to ignore setVersion from non-primaries.
Use error messages to detect "not master" and "node is recovering"
When error codes are not available, error messages are checked for the substrings "not master" and "node is recovering". This is because older server versions returned unstable error codes or no error codes in many circumstances.
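A minimal sketch of this fallback, with an illustrative helper name, might look like:

```python
# Hypothetical fallback used only when the server reply carries no error
# code; modern servers are matched on stable error codes instead.
def is_state_change_error_message(error_message):
    msg = error_message or ""
    return "not master" in msg or "node is recovering" in msg
```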
Other transient errors
There are other transient errors a server may return, e.g. retryable errors listed in the retryable writes spec. SDAM does not consider these because they do not imply the connected server should be marked as "Unknown". For example, the following errors may be returned from a mongos when it cannot route to a shard:
| Error Name | Error Code |
| --- | --- |
| HostNotFound | 7 |
| HostUnreachable | 6 |
| NetworkTimeout | 89 |
| SocketException | 9001 |
When these are returned, the mongos should not be marked as "Unknown", since it is more likely an issue with the shard.
Why ignore errors based on CMAP's generation number?
Using CMAP's generation number solves the following race condition among application threads and the monitor during error handling:
- Two concurrent writes begin on application threads A and B.
- The server restarts.
- Thread A receives the first non-timeout network error, and the client marks the server Unknown, and clears the server's pool.
- The client re-checks the server and marks it Primary.
- Thread B receives the second non-timeout network error and the client marks the server Unknown again.
The core issue is that the client processes errors in arbitrary order and may overwrite fresh information about the server's status with stale information. Using CMAP's generation number avoids the race condition because the duplicate (or stale) network error can be identified (changes in bold):
- Two concurrent writes begin on application threads A and B, **with generation 1**.
- The server restarts.
- Thread A receives the first non-timeout network error, and the client marks the server Unknown, and clears the server's pool. **The pool's generation is now 2.**
- The client re-checks the server and marks it Primary.
- Thread B receives the second non-timeout network error, and **the client ignores the error because the error originated from a connection with generation 1**.
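A sketch of that generation check follows; the handler and field names (mark_server_unknown, connection.generation) are assumptions for illustration:

```python
# Illustrative error handler: ignore errors from connections created before
# the pool's most recent clear (their generation is older than the pool's).
def handle_non_timeout_network_error(client, server, connection, error):
    if connection.generation < server.pool.generation:
        return  # Stale error from before the last clear; ignore it.
    client.mark_server_unknown(server.address, error)
    server.pool.clear()  # Increments the pool's generation.
```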
Why synchronize clearing a server's pool with updating the topology?
Doing so solves the following race condition among application threads and the monitor during error handling, similar to the previous example:
- A write begins on an application thread.
- The server restarts.
- The application thread receives a non-timeout network error.
- The application thread acquires the lock on the TopologyDescription, marks the Server as Unknown, and releases the lock.
- The monitor re-checks the server and marks it Primary and its pool as "ready".
- Several other application threads enter the WaitQueue of the server's pool.
- The application thread clears the server's pool, evicting all those new threads from the WaitQueue, causing them to return errors or to retry. Additionally, the pool is now "paused", but the server is considered the Primary, meaning future operations will be routed to the server and fail until the next heartbeat marks the pool as "ready" again.
If marking the server as Unknown and clearing its pool were synchronized, then the monitor marking the server as Primary after its check would happen after the pool was cleared and thus avoid putting it in an inconsistent state.
What is the purpose of topologyVersion?
topologyVersion solves the following race condition among application threads and the monitor when handling State Change Errors:
- Two concurrent writes begin on application threads A and B.
- The primary steps down.
- Thread A receives the first State Change Error, the client marks the server Unknown.
- The client re-checks the server and marks it Secondary.
- Thread B receives a delayed State Change Error and the client marks the server Unknown again.
The core issue is that the client processes errors in arbitrary order and may overwrite fresh information about the server's status with stale information. Using topologyVersion avoids the race condition because the duplicate (or stale) State Change Errors can be identified (changes in bold):
- Two concurrent writes begin on application threads A and B.
- **The primary's ServerDescription.topologyVersion == tv1.**
- The primary steps down **and sets its topologyVersion to tv2**.
- Thread A receives the first State Change Error **containing tv2**, the client marks the server Unknown **(with topologyVersion: tv2)**.
- The client re-checks the server and marks it Secondary **(with topologyVersion: tv2)**.
- Thread B receives a delayed State Change Error **(with topologyVersion: tv2)** and the client ignores the error **because the error's topologyVersion (tv2) is not greater than the current ServerDescription's (tv2)**.
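A comparable sketch for State Change Errors, reusing the compareTopologyVersion function defined earlier (handler and field names assumed for illustration):

```python
# Illustrative handler: drop State Change Errors whose topologyVersion is
# not strictly greater than the one already recorded for the server.
def handle_state_change_error(client, current_description, error):
    if compareTopologyVersion(current_description.topologyVersion,
                              error.topologyVersion) >= 0:
        return  # Duplicate or stale error; ignore it.
    client.mark_server_unknown(current_description.address, error)
```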
Why mark a server Unknown after an auth error?
The Authentication spec requires that when authentication fails on a server, the driver MUST clear the server's connection pool. Clearing the pool without marking the server Unknown would leave the pool in the "paused" state while the server is still selectable. When auth fails due to invalid credentials, marking the server Unknown also serves to rate limit new connections; future operations will need to wait for the server to be rediscovered.
Note that authentication may fail for a variety of reasons, for example:
- A network error, or network timeout error may occur.
- The server may return a State Change Error.
- The server may return an AuthenticationFailed command error (error code 18) indicating that the provided credentials are invalid.
Does this mean that authentication failures due to invalid credentials will manifest as server selection timeout errors? No, authentication errors are still returned to the application immediately. A subsequent operation will block until the server is rediscovered and immediately attempt authentication on a new connection.
Clients use the hostnames listed in the replica set config, not the seed list
Very often users have DNS aliases they use in their seed list instead of the hostnames in the replica set config. For example, the name "host_alias" might refer to a server also known as "host1", and the URI is:
mongodb://host_alias/?replicaSet=rs
When the client connects to "host_alias", its hello or legacy hello response includes the list of hostnames from the replica set config, which does not include the seed:
```
{
    hosts: ["host1:27017", "host2:27017"],
    setName: "rs",
    ... other hello or legacy hello response fields ...
}
```
This spec requires clients to connect to the hostnames listed in the hello or legacy hello response. Furthermore, if the response is from a primary, the client MUST remove all hostnames not listed. In this case, the client disconnects from "host_alias" and tries "host1" and "host2". (See updateRSFromPrimary.)
Thus, replica set members must be reachable from the client by the hostnames listed in the replica set config.
An alternative proposal is for clients to continue using the hostnames in the seed list. It could add new hosts from the hello or legacy hello response, and where a host is known by two names, the client can deduplicate them using the "me" field and prefer the name in the seed list.
This proposal was rejected because it does not support key features of replica sets: failover and zero-downtime reconfiguration.
In our example, if "host1" and "host2" are not reachable from the client, the client continues to use "host_alias" only. If that server goes down or is removed by a replica set reconfig, the client is suddenly unable to reach the replica set at all: by allowing the client to use the alias, we have hidden the fact that the replica set's failover feature will not work in a crisis or during a reconfig.
In conclusion, to support key features of replica sets, we require that the hostnames used in a replica set config are reachable from the client.
Backwards Compatibility
The Java driver 2.12.1 has a "heartbeatConnectRetryFrequency". Since this spec recommends the option be named "minHeartbeatFrequencyMS", the Java driver must deprecate its old option and rename it minHeartbeatFrequency (for consistency with its other options which also lack the "MS" suffix).
Reference Implementation
- Java driver 3.x
- PyMongo 3.x
- Perl driver 1.0.0 (in progress)
Future Work
MongoDB is likely to add some of the following features, which will require updates to this spec:
- Eventually consistent collections (SERVER-2956)
- Mongos discovery (SERVER-1834)
- Put individual databases into maintenance mode, instead of the whole server (SERVER-7826)
- Put setVersion in write-command responses (SERVER-13909)
Questions and Answers
When does a client see "not writable primary" or "node is recovering"?
These errors indicate one of these:
- A write was attempted on an unwritable server (arbiter, secondary, ghost, or recovering).
- A read was attempted on an unreadable server (arbiter, ghost, or recovering) or a read was attempted on a read-only server without the secondaryOk bit set.
- An operation was attempted on a server that is now shutting down.
In any case the error is a symptom that a ServerDescription's type no longer reflects reality.
On MongoDB 4.0 and earlier, a primary closes its connections when it steps down, so in many cases the next operation causes a network error rather than "not writable primary". The driver can see a "not writable primary" error in the following scenario:
- The client discovers the primary.
- The primary steps down.
- Before the client checks the server and discovers the stepdown, the application attempts an operation.
- The client's connection pool is empty, either because it has never attempted an operation on this server, or because all connections are in use by other threads.
- The client creates a connection to the old primary.
- The client attempts to write, or to read without the secondaryOk bit, and receives "not writable primary".
See "not writable primary" and "node is recovering", and the test scenario called "parsing 'not writable primary' and 'node is recovering' errors".
Why close connections when a node is shutting down?
When a server shuts down, it will return one of the "node is shutting down" errors for each attempted operation and eventually will close all connections. Keeping open a connection to a server that is shutting down would only produce errors on that connection; such a connection will never be usable for any operations. In contrast, when a server running version 4.2 or later returns a "not writable primary" error, the connection may be usable for other operations (such as secondary reads).
What's the point of periodic monitoring?
Why not just wait until a "not writable primary" error or "node is recovering" error informs the client that its TopologyDescription is wrong? Or wait until server selection fails to find a suitable server, and only scan all servers then?
Periodic monitoring accomplishes three objectives:
- Update each server's type, tags, and round trip time. Read preferences and the mongos selection algorithm require that this information remain up to date.
- Discover new secondaries so that secondary reads are evenly spread.
- Detect incremental changes to the replica set configuration, so that the client remains connected to the set even while it is migrated to a completely new set of hosts.
If the application uses some servers very infrequently, monitoring can also proactively detect state changes (primary stepdown, server becoming unavailable) that would otherwise cause future errors.
Why is auto-discovery the preferred default?
Auto-discovery is most resilient and is therefore preferred.
Why is it possible for maxSetVersion to go down?
maxElectionId and maxSetVersion are actually considered as a pair of values. Drivers MAY implement the comparison in code as a tuple of the two to ensure they're always updated together:

```
// New tuple                          old tuple
{ electionId: 2, setVersion: 1 } > { electionId: 1, setVersion: 50 }
```

In this scenario, the maxSetVersion goes from 50 to 1, but the maxElectionId is raised to 2.
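Expressed as Python tuples (treating electionIds as plain integers for brevity), the lexicographic comparison makes the outcome explicit:

```python
# (electionId, setVersion): the new pair wins even though its setVersion is
# lower, so maxSetVersion legitimately decreases from 50 to 1.
assert (2, 1) > (1, 50)
```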
Acknowledgments
Jeff Yemin's code for the Java driver 2.12, and his patient explanation thereof, is the major inspiration for this spec. Mathias Stearn's beautiful design for replica set monitoring in mongos 2.6 contributed as well. Bernie Hackett gently oversaw the specification process.
Changelog
- 2015-06-16: Added cooldownMS.
- 2015-10-09: Specify electionID comparison method.
- 2015-12-17: Require clients to compare (setVersion, electionId) tuples.
- 2016-05-04: Added link to SDAM monitoring.
- 2016-07-18: Replace mentions of the "Read Preferences Spec" with "Server Selection Spec", and "secondaryAcceptableLatencyMS" with "localThresholdMS".
- 2016-07-21: Updated for Max Staleness support.
- 2016-08-04: Explain better why clients use the hostnames in RS config, not URI.
- 2016-08-31: Multi-threaded clients SHOULD use hello or legacy hello replies to update the topology when they handshake application connections.
- 2016-10-06: In updateRSWithoutPrimary the hello or legacy hello response's "primary" field should be used to update the topology description, even if address != me.
- 2016-10-29: Allow for idleWritePeriodMS to change someday.
- 2016-11-01: "Unknown" is no longer the default TopologyType, the default is now explicitly unspecified. Update instructions for setting the initial TopologyType when running the spec tests.
- 2016-11-21: Revert changes that would allow idleWritePeriodMS to change in the future.
- 2017-02-28: Update "network error when reading or writing": timeout while connecting does mark a server Unknown, unlike a timeout while reading or writing. Justify the different behaviors, and also remove obsolete reference to auto-retry.
- 2017-06-13: Move socketCheckIntervalMS to Server Selection Spec.
- 2017-08-01: Parse logicalSessionTimeoutMinutes from hello or legacy hello reply.
- 2017-08-11: Clearer specification of "incompatible" logic.
- 2017-09-01: Improved incompatibility error messages.
- 2018-03-28: Specify that monitoring must not do mechanism negotiation or authentication.
- 2019-05-29: Renamed InterruptedDueToStepDown to InterruptedDueToReplStateChange.
- 2020-02-13: Drivers must run SDAM flow even when server description is equal to the last one.
- 2020-03-31: Add topologyVersion to ServerDescription. Add rules for ignoring stale application errors.
- 2020-05-07: Include error field in ServerDescription equality comparison.
- 2020-06-08: Clarify reasoning behind how SDAM determines if a topologyVersion is stale.
- 2020-12-17: Mark the pool for a server as "ready" after performing a successful check. Synchronize pool clearing with SDAM updates.
- 2021-01-17: Require clients to compare (electionId, setVersion) tuples.
- 2021-02-11: Errors encountered during auth are handled by SDAM. Auth errors mark the server Unknown and clear the pool.
- 2021-04-12: Adding in behaviour for load balancer mode.
- 2021-05-03: Require parsing "isWritablePrimary" field in responses.
- 2021-06-09: Connection pools must be created and eventually marked ready for any server if a direct connection is used.
- 2021-06-29: Updated to use modern terminology.
- 2022-01-19: Add iscryptd and 90th percentile RTT fields to ServerDescription.
- 2022-07-11: Convert integration tests to the unified format.
- 2022-09-30: Update updateRSFromPrimary to include logic before and after 6.0 servers.
- 2022-10-05: Remove spec front matter, move footnote, and reformat changelog.
- 2022-11-17: Add minimum RTT tracking and remove 90th percentile RTT.
- 2024-01-17: Add section on expected client close behaviour.
- 2024-05-08: Migrated from reStructuredText to Markdown.
- 2024-08-09: Updated wire versions in tests to 4.0+.
- 2024-08-16: Updated host b wire versions in too_new and too_old tests.
- 2024-11-04: Make the description of TopologyDescription.servers consistent with the spec tests.
- 2024-11-11: Removed references to getLastError.
- 2025-01-22: Add error messages when a new primary is elected or a primary with a stale electionId or setVersion is discovered.
"localThresholdMS" was called "secondaryAcceptableLatencyMS" in the Read Preferences Spec, before it was superseded by the Server Selection Spec.
Connection Monitoring and Pooling
- Status: Accepted
- Minimum Server Version: N/A
Abstract
Drivers currently support a variety of options that allow users to configure connection pooling behavior. Users are confused by drivers supporting different subsets of these options. Additionally, drivers implement their connection pools differently, making it difficult to design cross-driver pool functionality. By unifying and codifying pooling options and behavior across all drivers, we will increase user comprehension and code base maintainability.
This specification does not apply to drivers that do not support multitasking.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Definitions
Connection
A Connection (when linked) refers to the Connection type defined in the Connection Pool Members section of this specification. It does not refer to an actual TCP connection to an Endpoint. A Connection will attempt to create and wrap such a TCP connection over the course of its existence, but it is not equivalent to one nor does it wrap an active one at all times.

For the purposes of testing, a mocked Connection type could be used with the pool that never actually creates a TCP connection or performs any I/O.
Endpoint
For convenience, an Endpoint refers to either a mongod or mongos instance.
Thread
For convenience, a Thread refers to:
- A shared-address-space process (a.k.a. a thread) in multi-threaded drivers
- An Execution Frame / Continuation in asynchronous drivers
- A goroutine in Go
Behavioral Description
Which Drivers this applies to
This specification is solely concerned with drivers that implement a connection pool. A driver SHOULD implement a connection pool, but is not required to.
Connection Pool Options
All drivers that implement a connection pool MUST implement and conform to the same MongoClient options. There can be slight deviation in naming to make the options idiomatic to the driver language.
Connection Pool Behaviors
All driver connection pools MUST provide an API that allows the driver to check out a connection, check in a connection back to the pool, and clear all connections in the pool. This API is for internal use only, and SHOULD NOT be documented as a public API.
Connection Pool Monitoring
All drivers that implement a connection pool MUST provide an API that allows users to subscribe to events emitted from the pool. Conceptually, event emission is instantaneous: one may speak of the instant an event is emitted, and that instant represents the start of the activity of delivering the event to a subscribed user.
Detailed Design
Connection Pool Options
Drivers that implement a Connection Pool MUST support the following ConnectionPoolOptions:
```typescript
interface ConnectionPoolOptions {
  /**
   * The maximum number of Connections that may be associated
   * with a pool at a given time. This includes in use and
   * available connections.
   * If specified, MUST be an integer >= 0.
   * A value of 0 means there is no limit.
   * Defaults to 100.
   */
  maxPoolSize?: number;

  /**
   * The minimum number of Connections that MUST exist at any moment
   * in a single connection pool.
   * If specified, MUST be an integer >= 0. If maxPoolSize is > 0
   * then minPoolSize must be <= maxPoolSize.
   * Defaults to 0.
   */
  minPoolSize?: number;

  /**
   * The maximum amount of time a Connection should remain idle
   * in the connection pool before being considered perished.
   * If specified, MUST be a number >= 0.
   * A value of 0 means there is no limit.
   * Defaults to 0.
   */
  maxIdleTimeMS?: number;

  /**
   * The maximum number of Connections a Pool may be establishing concurrently.
   * Establishment of a Connection is a part of its life cycle
   * starting after a ConnectionCreatedEvent and ending before a ConnectionReadyEvent.
   * If specified, MUST be a number > 0.
   * Defaults to 2.
   */
  maxConnecting?: number;
}
```
Additionally, Drivers that implement a Connection Pool MUST support the following ConnectionPoolOptions UNLESS that driver meets ALL of the following conditions:
- The driver/language currently has an idiomatic timeout mechanism implemented
- The timeout mechanism conforms to the aggressive requirement of timing out a thread in the WaitQueue
```typescript
interface ConnectionPoolOptions {
  /**
   * NOTE: This option has been deprecated in favor of timeoutMS.
   *
   * The maximum amount of time a thread can wait for
   * either an available non-perished connection (limited by `maxPoolSize`),
   * or a pending connection (limited by `maxConnecting`).
   * If specified, MUST be a number >= 0.
   * A value of 0 means there is no limit.
   * Defaults to 0.
   */
  waitQueueTimeoutMS?: number;
}
```
These options MUST be specified at the MongoClient level, and SHOULD be named in a manner idiomatic to the driver's language. All connection pools created by a MongoClient MUST use the same ConnectionPoolOptions.
When parsing a mongodb connection string, a user MUST be able to specify these options using the default names specified above.
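For example, a user could set every one of these options in a connection string such as the following (host name illustrative):

mongodb://host1:27017/?maxPoolSize=50&minPoolSize=5&maxIdleTimeMS=60000&maxConnecting=2&waitQueueTimeoutMS=10000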
Deprecated Options
The following ConnectionPoolOptions are considered deprecated. They MUST NOT be implemented if they do not already exist in a driver, and they SHOULD be deprecated and removed from drivers that implement them as early as possible:
```typescript
interface ConnectionPoolOptions {
  /**
   * The maximum number of threads that can simultaneously wait
   * for a Connection to become available.
   */
  waitQueueSize?: number;

  /**
   * An alternative way of setting waitQueueSize, it specifies
   * the maximum number of threads that can wait per connection.
   * waitQueueSize === waitQueueMultiple * maxPoolSize
   */
  waitQueueMultiple?: number;
}
```
Connection Pool Members
Connection
A driver-defined wrapper around a single TCP connection to an Endpoint. A Connection has the following properties:
- Single Endpoint: A Connection MUST be associated with a single Endpoint. A Connection MUST NOT be associated with multiple Endpoints.
- Single Lifetime: A Connection MUST NOT be used after it is closed.
- Single Owner: A Connection MUST belong to exactly one Pool, and MUST NOT be shared across multiple pools
- Single Track: A Connection MUST limit itself to one request / response at a time. A Connection MUST NOT multiplex/pipeline requests to an Endpoint.
- Monotonically Increasing ID: A Connection MUST have an ID number associated with it. Connection IDs within a Pool MUST be assigned in order of creation, starting at 1 and increasing by 1 for each new Connection.
- Valid Connection: A connection MUST NOT be checked out of the pool until it has successfully and fully completed a MongoDB Handshake and Authentication as specified in the Handshake, OP_COMPRESSED, and Authentication specifications.
- Perishable: it is possible for a Connection to become Perished. A Connection is considered perished if any of the following are true:
  - Stale: The Connection's generation does not match the generation of the parent pool.
  - Idle: The Connection is currently "available" (as defined below) and has been for longer than maxIdleTimeMS.
  - Errored: The Connection has experienced an error that indicates it is no longer recommended for use. Examples include, but are not limited to:
    - Network Error
    - Network Timeout
    - Endpoint closing the connection
    - Driver-Side Timeout
    - Wire-Protocol Error
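For illustration only, the perished check could be sketched as follows, assuming a Connection that records its generation, state, an errored flag, and the instant it last became available (all field names hypothetical):

```python
import time

# Hypothetical field names; a maxIdleTimeMS of 0 disables the idle check.
def is_perished(pool, conn, max_idle_time_ms):
    stale = conn.generation != pool.generation
    idle = (max_idle_time_ms > 0
            and conn.state == "available"
            and (time.monotonic() - conn.available_since) * 1000 > max_idle_time_ms)
    return stale or idle or conn.errored
```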
```typescript
interface Connection {
  /**
   * An id number associated with the Connection
   */
  id: number;

  /**
   * The address of the pool that owns this Connection
   */
  address: string;

  /**
   * An integer representing the "generation" of the pool
   * when this Connection was created.
   */
  generation: number;

  /**
   * The current state of the Connection.
   *
   * Possible values are the following:
   * - "pending": The Connection has been created but has not yet been established. Contributes to
   *   totalConnectionCount and pendingConnectionCount.
   *
   * - "available": The Connection has been established and is waiting in the pool to be checked
   *   out. Contributes to both totalConnectionCount and availableConnectionCount.
   *
   * - "in use": The Connection has been established, checked out from the pool, and has yet
   *   to be checked back in. Contributes to totalConnectionCount.
   *
   * - "closed": The Connection has had its socket closed and cannot be used for any future
   *   operations. Does not contribute to any connection counts.
   *
   * Note: this field is mainly used for the purposes of describing state
   * in this specification. It is not required that drivers
   * actually include this field in their implementations of Connection.
   */
  state: "pending" | "available" | "in use" | "closed";
}
```
WaitQueue
A concept that represents pending requests for Connections. When a thread requests a Connection from a Pool, the thread enters the Pool's WaitQueue. A thread stays in the WaitQueue until it either receives a Connection or times out. A WaitQueue has the following traits:
- Thread-Safe: When multiple threads attempt to enter or exit a WaitQueue, they do so in a thread-safe manner.
- Ordered/fair: When Connections are made available, they SHOULD be issued out to threads in the order that the threads entered the WaitQueue. If this behavior poses too much of an implementation burden, then at the very least threads that have entered the queue more recently MUST NOT be intentionally prioritized over those that entered it earlier.
- Timeout aggressively: Members of a WaitQueue MUST timeout if they are enqueued for longer than the computed timeout and MUST leave the WaitQueue immediately in this case.
The implementation details of a WaitQueue are left to the driver. Example implementations include:
- A fair Semaphore
- A Queue of callbacks
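One possible FIFO implementation is sketched below; it is illustrative only, and a real driver must also integrate pool clearing and the computed checkOut timeout:

```python
import collections
import threading

class WaitQueue:
    """Minimal fair (FIFO) wait queue sketch; not production-ready."""

    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = collections.deque()

    def wait_turn(self, timeout_s):
        event = threading.Event()
        with self._lock:
            self._waiters.append(event)
        if event.wait(timeout_s):
            return  # Woken in FIFO order by notify_next().
        with self._lock:
            if event in self._waiters:
                # Timed out: leave the WaitQueue immediately, per the spec.
                self._waiters.remove(event)
                raise TimeoutError("timed out waiting in the WaitQueue")
        # A wake-up raced with the timeout; treat it as a successful turn.

    def notify_next(self):
        with self._lock:
            if self._waiters:
                self._waiters.popleft().set()
```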
Connection Pool
A driver-defined entity that encapsulates all non-monitoring Connections associated with a single Endpoint. The pool has the following properties:
- Thread Safe: All Pool behaviors MUST be thread safe.
- Not Fork-Safe: A Pool is explicitly not fork-safe. If a Pool detects that it is being used by a forked process, it MUST immediately clear itself and update its pid.
- Single Owner: A Pool MUST be associated with exactly one Endpoint, and MUST NOT be shared between Endpoints.
- Emit Events and Log Messages: A Pool MUST emit pool events and log messages when dictated by this spec (see Connection Pool Monitoring). Users MUST be able to subscribe to emitted events and log messages in a manner idiomatic to their language and driver.
- Closeable: A Pool MUST be able to be manually closed. When a Pool is closed, the following behaviors change:
  - Checking in a Connection to the Pool automatically closes the Connection
  - Attempting to check out a Connection from the Pool results in an Error
- Clearable: A Pool MUST be able to be cleared. Clearing the pool marks all pooled and checked out Connections as stale and lazily closes them as they are checkedIn or encountered in checkOut. Additionally, all requests are evicted from the WaitQueue and return errors that are considered non-timeout network errors.
- Pausable: A Pool MUST be able to be paused and resumed. A Pool is paused automatically when it is cleared, and it can be resumed by being marked as "ready". While the Pool is paused, it exhibits the following behaviors:
  - Attempting to check out a Connection from the Pool results in a non-timeout network error
  - Connections are not created in the background to satisfy minPoolSize
- Capped: a pool is capped if maxPoolSize is set to a non-zero value. If a pool is capped, then its total number of Connections (including available and in use) MUST NOT exceed maxPoolSize
- Rate-limited: A Pool MUST limit the number of Connections being established concurrently via the maxConnecting pool option.
```typescript
interface ConnectionPool {
  /**
   * The Queue of threads waiting for a Connection to be available
   */
  waitQueue: WaitQueue;

  /**
   * A generation number representing the SDAM generation of the pool.
   */
  generation: number;

  /**
   * A map representing the various generation numbers for various services
   * when in load balancer mode.
   */
  serviceGenerations: Map<ObjectId, [number, number]>;

  /**
   * The state of the pool.
   *
   * Possible values are the following:
   * - "paused": The initial state of the pool. Connections may not be checked out nor can they
   *   be established in the background to satisfy minPoolSize. Clearing a pool
   *   transitions it to this state.
   *
   * - "ready": The healthy state of the pool. It can service checkOut requests and create
   *   connections in the background. The pool can be set to this state via the
   *   ready() method.
   *
   * - "closed": The pool is destroyed. No more Connections may ever be checked out nor any
   *   created in the background. The pool can be set to this state via the close()
   *   method. The pool cannot transition to any other state after being closed.
   */
  state: "paused" | "ready" | "closed";

  // Any of the following connection counts may be computed rather than
  // actually stored on the pool.

  /**
   * An integer expressing how many total Connections
   * ("pending" + "available" + "in use") the pool currently has
   */
  totalConnectionCount: number;

  /**
   * An integer expressing how many Connections are currently
   * available in the pool.
   */
  availableConnectionCount: number;

  /**
   * An integer expressing how many Connections are currently
   * being established.
   */
  pendingConnectionCount: number;

  /**
   * Returns a Connection for use
   */
  checkOut(): Connection;

  /**
   * Check in a Connection back to the Connection pool
   */
  checkIn(connection: Connection): void;

  /**
   * Mark all current Connections as stale, clear the WaitQueue, and mark the pool as "paused".
   * No connections may be checked out or created in this pool until ready() is called again.
   * interruptInUseConnections specifies whether the pool will force interrupt "in use" connections as part of the clear.
   * Default false.
   */
  clear(interruptInUseConnections: Optional<Boolean>): void;

  /**
   * Mark the pool as "ready", allowing checkOuts to resume and connections to be created in the background.
   * A pool can only transition from "paused" to "ready". A "closed" pool
   * cannot be marked as "ready" via this method.
   */
  ready(): void;

  /**
   * Marks the pool as "closed", preventing the pool from creating and returning new Connections
   */
  close(): void;
}
```
Connection Pool Behaviors
Creating a Connection Pool
This specification does not define how a pool is to be created, leaving it up to the driver. Creation of a connection pool is generally an implementation detail of the driver, i.e., is not a part of the public API of the driver. The SDAM specification defines when the driver should create connection pools.
When a pool is created, its state MUST initially be set to "paused". Even if minPoolSize is set, the pool MUST NOT begin being populated with Connections until it has been marked as "ready". SDAM will mark the pool as "ready" on each successful check. See Connection Pool Management section in the SDAM specification for more information.
```
set generation to 0
set state to "paused"
emit PoolCreatedEvent and equivalent log message
```
Closing a Connection Pool
When a pool is closed, it MUST first close all available Connections in that pool. This results in the following behavior changes:
- In use Connections MUST be closed when they are checked in to the closed pool.
- Attempting to check out a Connection MUST result in an error.
```
mark pool as "closed"
for connection in availableConnections:
    close connection
emit PoolClosedEvent and equivalent log message
```
Marking a Connection Pool as Ready
Connection Pools start off as "paused", and they are marked as "ready" by monitors after they perform successful server checks. Once a pool is "ready", it can start checking out Connections and populating itself with new Connections in the background.
If the pool is already "ready" when this method is invoked, then this method MUST immediately return and MUST NOT emit a PoolReadyEvent.
```
mark pool as "ready"
emit PoolReadyEvent and equivalent log message
allow background thread to create connections
```
Note that the PoolReadyEvent MUST be emitted before the background thread is allowed to resume creating new connections, and it must be the case that no observer is able to observe actions of the background thread related to creating new connections before observing the PoolReadyEvent event.
Creating a Connection (Internal Implementation)
When creating a Connection, the initial Connection is in a "pending" state. This only creates a "virtual" Connection, and performs no I/O.
```
connection = new Connection()
increment totalConnectionCount
increment pendingConnectionCount
set connection state to "pending"
tConnectionCreated = current instant (use a monotonic clock if possible)
emit ConnectionCreatedEvent and equivalent log message
return connection
```
Establishing a Connection (Internal Implementation)
Before a Connection can be marked as either "available" or "in use", it must be established. This process involves performing the initial handshake, handling OP_COMPRESSED, and performing authentication.
```
try:
    connect connection via TCP / TLS
    perform connection handshake
    handle OP_COMPRESSED
    perform connection authentication
    tConnectionReady = current instant (use a monotonic clock if possible)
    emit ConnectionReadyEvent(duration = tConnectionReady - tConnectionCreated) and equivalent log message
    return connection
except error:
    close connection
    throw error  # Propagate error in manner idiomatic to language.
```
Closing a Connection (Internal Implementation)
When a Connection is closed, it MUST first be marked as "closed", removing it from being counted as "available" or "in use". Once that is complete, the Connection can perform whatever teardown is necessary to close its underlying socket. The Driver SHOULD perform this teardown in a non-blocking manner, such as via the use of a background thread or async I/O.
```
original state = connection state
set connection state to "closed"

if original state is "available":
    decrement availableConnectionCount
else if original state is "pending":
    decrement pendingConnectionCount

decrement totalConnectionCount

emit ConnectionClosedEvent and equivalent log message

# The following can happen at a later time (i.e. in background
# thread) or via non-blocking I/O.
connection.socket.close()
```
Marking a Connection as Available (Internal Implementation)
A Connection is "available" if it is able to be checked out. A Connection MUST NOT be marked as "available" until it has been established. The pool MUST keep track of the number of currently available Connections.
```
increment availableConnectionCount
set connection state to "available"
add connection to availableConnections
```
Populating the Pool with a Connection (Internal Implementation)
"Populating" the pool involves preemptively creating and establishing a Connection which is marked as "available" for use in future operations.
Populating the pool MUST NOT block any application threads. For example, it could be performed on a background thread or via the use of non-blocking/async I/O. Populating the pool MUST NOT be performed unless the pool is "ready".
If an error is encountered while populating a connection, it MUST be handled via the SDAM machinery according to the Application Errors section in the SDAM specification.
If minPoolSize is set, the Connection Pool MUST be populated until it has at least minPoolSize total Connections. This MUST occur only while the pool is "ready". If the pool implements a background thread, it can be used for this. If the pool does not implement a background thread, the checkOut method is responsible for ensuring this requirement is met.
When populating the Pool, pendingConnectionCount has to be decremented after establishing a Connection similarly to how it is done in Checking Out a Connection to signal that another Connection is allowed to be established. Such a signal MUST become observable to any Thread after the action that marks the established Connection as "available" becomes observable to the Thread. Informally, this order guarantees that no Thread tries to start establishing a Connection when there is an "available" Connection established as a result of populating the Pool.
```
wait until pendingConnectionCount < maxConnecting and pool is "ready"
create connection
try:
    establish connection
    mark connection as available
except error:
    # Defer error handling to SDAM.
    topology.handle_pre_handshake_error(error)
```
Checking Out a Connection
A Pool MUST have a method that allows the driver to check out a Connection. Checking out a Connection involves submitting a request to the WaitQueue and, once that request reaches the front of the queue, having the Pool find or create a Connection to fulfill that request. Requests MUST be subject to a timeout which is computed per the rules in Client Side Operations Timeout: Server Selection.
To service a request for a Connection, the Pool MUST first iterate over the list of available Connections, searching for a non-perished one to be returned. If a perished Connection is encountered, such a Connection MUST be closed (as described in Closing a Connection) and the iteration of available Connections MUST continue until either a non-perished available Connection is found or the list of available Connections is exhausted.
If the list is exhausted, the total number of Connections is less than maxPoolSize, and pendingConnectionCount < maxConnecting, the pool MUST create a Connection, establish it, mark it as "in use" and return it. If totalConnectionCount == maxPoolSize or pendingConnectionCount == maxConnecting, then the pool MUST wait to service the request until neither of those two conditions is met or until a Connection becomes available, re-entering the checkOut loop in either case. This waiting MUST NOT prevent Connections from being checked into the pool. Additionally, the Pool MUST NOT service any newer checkOut requests before fulfilling the original one which could not be fulfilled. For drivers that implement the WaitQueue via a fair semaphore, a condition variable may also be needed to meet this requirement. Waiting on the condition variable SHOULD also be limited by the WaitQueueTimeout, if the driver supports one and it was specified by the user.
If the pool is "closed" or "paused", any attempt to check out a Connection MUST throw an Error. The error thrown as a result of the pool being "paused" MUST be considered a retryable error and MUST NOT be an error that marks the SDAM state unknown.
If the pool does not implement a background thread, the checkOut method is responsible for ensuring that the pool is populated with at least minPoolSize Connections.
A Connection MUST NOT be checked out until it is established. In addition, the Pool MUST NOT prevent other threads from checking out Connections while establishing a Connection.
Before a given Connection is returned from checkOut, it must be marked as "in use", and the pool's availableConnectionCount MUST be decremented.
```
connection = Null
tConnectionCheckOutStarted = current instant (use a monotonic clock if possible)
emit ConnectionCheckOutStartedEvent and equivalent log message
try:
    enter WaitQueue
    wait until at top of wait queue
    # Note that a lock-based implementation of the wait queue would
    # only allow one thread in the following block at a time.
    while connection is Null:
        if a connection is available:
            while connection is Null and a connection is available:
                connection = next available connection
                if connection is perished:
                    close connection
                    connection = Null
        else if totalConnectionCount < maxPoolSize:
            if pendingConnectionCount < maxConnecting:
                connection = create connection
            else:
                # This waiting MUST NOT prevent other threads from checking
                # Connections back in to the pool.
                wait until pendingConnectionCount < maxConnecting or a connection is available
                continue
except pool is "closed":
    tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
    emit ConnectionCheckOutFailedEvent(reason="poolClosed", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
    throw PoolClosedError
except pool is "paused":
    tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
    emit ConnectionCheckOutFailedEvent(reason="connectionError", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
    throw PoolClearedError
except timeout:
    tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
    emit ConnectionCheckOutFailedEvent(reason="timeout", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
    throw WaitQueueTimeoutError
finally:
    # This must be done in all drivers.
    leave wait queue

# If the Connection has not been established yet (TCP, TLS,
# handshake, compression, and auth), it must be established
# before it is returned.
# This MUST NOT block other threads from acquiring connections.
if connection state is "pending":
    try:
        establish connection
    except connection establishment error:
        tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
        emit ConnectionCheckOutFailedEvent(reason="connectionError", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
        decrement totalConnectionCount
        throw
    finally:
        decrement pendingConnectionCount
else:
    decrement availableConnectionCount

set connection state to "in use"

# If there is no background thread, the pool MUST ensure that
# there are at least minPoolSize total connections.
do asynchronously:
    while totalConnectionCount < minPoolSize:
        populate the pool with a connection

tConnectionCheckedOut = current instant (use a monotonic clock if possible)
emit ConnectionCheckedOutEvent(duration = tConnectionCheckedOut - tConnectionCheckOutStarted) and equivalent log message
return connection
```
Checking In a Connection
A Pool MUST have a method of allowing the driver to check in a Connection. The driver MUST NOT be allowed to check in a Connection to a Pool that did not create that Connection, and MUST throw an Error if this is attempted.
When the Connection is checked in, it MUST be closed if any of the following are true:
- The Connection is perished.
- The pool has been closed.
Otherwise, the Connection is marked as available.
```
emit ConnectionCheckedInEvent and equivalent log message

if connection is perished OR pool is closed:
    close connection
else:
    mark connection as available
```
Clearing a Connection Pool
Clearing the pool involves different steps depending on whether the pool is in load balanced mode or not. The traditional / non-load balanced clearing behavior MUST NOT be used by pools in load balanced mode, and the load balanced pool clearing behavior MUST NOT be used in non-load balanced pools.
Clearing a non-load balanced pool
A Pool MUST have a method of clearing all Connections when instructed. Rather than iterating through every Connection, this method should simply increment the generation of the Pool, implicitly marking all current Connections as stale. It should also transition the pool's state to "paused" to halt the creation of new connections until it is marked as "ready" again. The checkOut and checkIn algorithms will handle clearing out stale Connections. If a user is subscribed to Connection Monitoring events and/or connection log messages, a PoolClearedEvent and log message MUST be emitted after incrementing the generation / marking the pool as "paused". If the pool is already "paused" when it is cleared, then the pool MUST NOT emit a PoolCleared event or log message.
As part of clearing the pool, the WaitQueue MUST also be cleared, meaning all requests in the WaitQueue MUST fail with errors indicating that the pool was cleared while the checkOut was being performed. The error returned as a result of the pool being cleared MUST be considered a retryable error and MUST NOT be an error that marks the SDAM state unknown. Clearing the WaitQueue MUST happen eagerly so that any operations waiting on Connections can retry as soon as possible. The pool MUST NOT rely on WaitQueueTimeoutMS to clear requests from the WaitQueue.
The clearing method MUST provide the option to interrupt any in-use connections as part of the clearing (henceforth referred to as the interruptInUseConnections flag in this specification). "Interrupting a Connection" is defined as canceling whatever task the Connection is currently performing and marking the Connection as perished (e.g. by closing its underlying socket). The interrupting of these Connections MUST be performed as soon as possible but MUST NOT block the pool or prevent it from processing further requests. If the pool has a background thread, and it is responsible for interrupting in-use connections, its next run MUST be scheduled as soon as possible.
The pool MUST only interrupt in-use Connections whose generation is less than or equal to the generation of the pool at the moment of the clear (before the increment) that used the interruptInUseConnections flag. Any operations that have their Connections interrupted in this way MUST fail with a retryable error. If possible, the error SHOULD be a PoolClearedError with the following message: "Connection to <pool address> interrupted due to server monitor timeout".
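To illustrate the ordering of these steps, here is a minimal Python sketch; the `Pool` internals, `emit_pool_cleared_event`, and `request_background_run` are hypothetical stand-ins, not names mandated by this specification:

```python
import threading

class PoolClearedError(Exception):
    """Retryable error raised for requests evicted from the WaitQueue."""

class Pool:
    def __init__(self):
        self._lock = threading.Lock()
        self.generation = 0      # incremented on clear; marks connections stale
        self.state = "ready"     # "ready" | "paused" | "closed"
        self._waiters = []       # pending checkOut requests

    def clear(self, interrupt_in_use_connections=False):
        with self._lock:
            already_paused = self.state == "paused"
            self.generation += 1          # implicitly marks connections stale
            self.state = "paused"         # halt new connection creation
            if not already_paused:
                emit_pool_cleared_event(self)      # hypothetical helper
            # Fail waiters eagerly; do not rely on waitQueueTimeoutMS.
            waiters, self._waiters = self._waiters, []
        for waiter in waiters:
            waiter.fail(PoolClearedError("pool was cleared during checkOut"))
        if interrupt_in_use_connections:
            request_background_run(self)           # hypothetical helper
```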
Clearing a load balanced pool
A Pool MUST also have a method of clearing all Connections for a specific `serviceId` for use when in load balancer mode. This method increments the generation of the pool for that specific `serviceId` in the generation map. A PoolClearedEvent and log message MUST be emitted after incrementing the generation. Note that this method MUST NOT transition the pool to the "paused" state and MUST NOT clear the WaitQueue.
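A minimal sketch of the load balanced clear, assuming a hypothetical `service_generations` map and event helper:

```python
def clear_for_service(pool, service_id):
    """Load balanced clear: bump only this serviceId's generation."""
    generation, count = pool.service_generations.get(service_id, (0, 0))
    pool.service_generations[service_id] = (generation + 1, count)
    emit_pool_cleared_event(pool, service_id=service_id)  # hypothetical helper
    # Unlike the non-load-balanced path, the pool is NOT paused
    # and the WaitQueue is NOT cleared.
```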
Load Balancer Mode
For load-balanced deployments, pools MUST maintain a map from `serviceId` to a tuple of (generation, connection count) where the connection count refers to the total number of connections that exist for a specific `serviceId`. The pool MUST remove the entry for a `serviceId` once the connection count reaches 0. Once the MongoDB handshake is done, the connection MUST get the generation number that applies to its `serviceId` from the map and update the map to increment the connection count for this `serviceId`.
See the Load Balancer Specification for details.
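For illustration, a sketch of this bookkeeping with a hypothetical `service_generations` dict; all updates are assumed to happen under the pool's lock:

```python
def on_handshake_complete(pool, connection, service_id):
    # Stamp the connection with the current generation for its serviceId
    # and count it against that serviceId.
    generation, count = pool.service_generations.get(service_id, (0, 0))
    connection.generation = generation
    connection.service_id = service_id
    pool.service_generations[service_id] = (generation, count + 1)

def on_connection_closed(pool, connection):
    generation, count = pool.service_generations[connection.service_id]
    if count == 1:
        # The entry is removed once its connection count reaches 0.
        del pool.service_generations[connection.service_id]
    else:
        pool.service_generations[connection.service_id] = (generation, count - 1)
```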
Forking
A Connection is explicitly not fork-safe. The proper behavior in the case of a fork is to ResetAfterFork by:
- clearing all Connection Pools in the child process
- closing all Connections in the child process
Drivers that support forking MUST document that Connections to an Endpoint are not fork-safe, and document the proper way to ResetAfterFork in the driver.
Drivers MAY aggressively ResetAfterFork if the driver detects it has been forked.
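As a sketch of aggressive ResetAfterFork in a Python-flavored driver, one could register a fork handler; `all_pools` and the pool methods here are hypothetical stand-ins:

```python
import os

def reset_after_fork():
    # Runs in the child process only: sockets inherited from the parent
    # must not be reused.
    for pool in all_pools():          # hypothetical registry of a client's pools
        pool.clear()                  # mark every inherited connection stale
        pool.close_all_connections()  # close them in the child process

# Child-side hook; supported on POSIX platforms (Python >= 3.7).
os.register_at_fork(after_in_child=reset_after_fork)
```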
Optional Behaviors
The following features of a Connection Pool SHOULD be implemented if they make sense in the driver and the driver's language.
Background Thread
A Pool SHOULD have a background Thread that is responsible for monitoring the state of all available Connections. This background thread SHOULD:
- Populate Connections to ensure that the pool always satisfies minPoolSize.
- Remove and close perished available Connections, including "in use" Connections if the interruptInUseConnections option was set to true in the most recent pool clear.
- Apply timeouts to connection establishment per Client Side Operations Timeout: Background Connection Pooling.
A pool SHOULD allow immediate scheduling of the next background thread iteration after a clear is performed.
Conceptually, the aforementioned activities are organized into sequential Background Thread Runs. A Run MUST do as much work as readily available and then end instead of waiting for more work. For example, instead of waiting for pendingConnectionCount to become less than maxConnecting when satisfying minPoolSize, a Run MUST either proceed with the rest of its duties, e.g., closing available perished connections, or end.
The duration of intervals between the end of one Run and the beginning of the next Run is not specified, but the Test Format and Runner Specification may restrict this duration, or introduce other restrictions to facilitate testing.
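A minimal sketch of one such Run, under the assumption of hypothetical pool helpers (`perished_available_connections`, `populate_one`, and the counters):

```python
def background_run(pool):
    # One Run: do as much work as is readily available, then end.
    for connection in pool.perished_available_connections():
        pool.close_connection(connection)
    if pool.interrupt_in_use_requested:
        pool.interrupt_stale_in_use_connections()
    while (pool.total_connection_count < pool.min_pool_size
           and pool.pending_connection_count < pool.max_connecting):
        pool.populate_one()   # establishment timeouts applied per CSOT
    # Do not wait for pendingConnectionCount to drop below maxConnecting;
    # the next Run picks up where this one left off.
```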
withConnection
A Pool SHOULD implement a scoped resource management mechanism idiomatic to their language to prevent Connections from not being checked in. Examples include Python's "with" statement and C#'s "using" statement. If implemented, drivers SHOULD use this method as the default method of checking out and checking in Connections.
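A minimal Python sketch of such a scoped mechanism, assuming hypothetical `pool.check_out` and `pool.check_in` methods:

```python
from contextlib import contextmanager

@contextmanager
def with_connection(pool):
    """Scoped checkOut/checkIn so a Connection cannot leak on error paths."""
    connection = pool.check_out()
    try:
        yield connection
    finally:
        pool.check_in(connection)

# Usage: the Connection is checked in even if the operation raises.
# with with_connection(pool) as conn:
#     run_command(conn, {"ping": 1})
```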
Connection Pool Monitoring
All drivers that implement a connection pool MUST provide an API that allows users to subscribe to events emitted from the pool. If a user subscribes to Connection Monitoring events, these events MUST be emitted when specified in "Connection Pool Behaviors". Events SHOULD be created and subscribed to in a manner idiomatic to their language and driver.
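For illustration, a hypothetical Python subscriber might look like the following; the listener interface and registration API are assumptions, not part of this specification:

```python
class PoolEventListener:
    """Hypothetical subscriber; real drivers expose an idiomatic equivalent."""

    def on_connection_checked_out(self, event):
        print(f"{event.address}: checked out connection {event.connection_id}")

    def on_pool_cleared(self, event):
        print(f"{event.address}: pool cleared (serviceId={event.service_id})")

# client.add_pool_listener(PoolEventListener())  # hypothetical registration API
```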
Events
See the Load Balancer Specification for details on the serviceId
field.
/**
* Emitted when a Connection Pool is created
*/
interface PoolCreatedEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting to connect to.
*/
address: string;
/**
* Any non-default pool options that were set on this Connection Pool.
*/
options: {...}
}
/**
* Emitted when a Connection Pool is marked as ready.
*/
interface PoolReadyEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting to connect to.
*/
address: string;
}
/**
* Emitted when a Connection Pool is cleared
*/
interface PoolClearedEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting to connect to.
*/
address: string;
/**
* The service ID for which the pool was cleared in load balancing mode.
* See load balancer specification for more information about this field.
*/
serviceId: Optional<ObjectId>;
/**
* A flag indicating whether the pool interrupted "in use" connections as part of the clear.
*/
interruptInUseConnections: Optional<Boolean>;
}
/**
* Emitted when a Connection Pool is closed
*/
interface PoolClosedEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting to connect to.
*/
address: string;
}
/**
* Emitted when a Connection Pool creates a Connection object.
* NOTE: This does not mean that the Connection is ready for use.
*/
interface ConnectionCreatedEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting to connect to.
*/
address: string;
/**
* The ID of the Connection
*/
connectionId: int64;
}
/**
* Emitted when a Connection has finished its setup, and is now ready to use
*/
interface ConnectionReadyEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting to connect to.
*/
address: string;
/**
* The ID of the Connection
*/
connectionId: int64;
/**
* The time it took to establish the connection.
* In accordance with the definition of establishment of a connection
* specified by `ConnectionPoolOptions.maxConnecting`,
* it is the time elapsed between emitting a `ConnectionCreatedEvent`
* and emitting this event as part of the same checking out.
*
* Naturally, when establishing a connection is part of checking out,
* this duration is not greater than
* `ConnectionCheckedOutEvent`/`ConnectionCheckOutFailedEvent.duration`.
*
* A driver MAY choose the type idiomatic to the driver.
* If the type chosen does not convey units, e.g., `int64`,
* then the driver MAY include units in the name, e.g., `durationMS`.
*/
duration: Duration;
}
/**
* Emitted when a Connection Pool closes a Connection
*/
interface ConnectionClosedEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting to connect to.
*/
address: string;
/**
* The ID of the Connection
*/
connectionId: int64;
/**
* A reason explaining why this Connection was closed.
* Can be implemented as a string or enum.
* Current valid values are:
* - "stale": The pool was cleared, making the Connection no longer valid
* - "idle": The Connection became stale by being available for too long
* - "error": The Connection experienced an error, making it no longer valid
* - "poolClosed": The pool was closed, making the Connection no longer valid
*/
reason: string|Enum;
}
/**
* Emitted when the driver starts attempting to check out a Connection
*/
interface ConnectionCheckOutStartedEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting
* to connect to.
*/
address: string;
}
/**
* Emitted when the driver's attempt to check out a Connection fails
*/
interface ConnectionCheckOutFailedEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting to connect to.
*/
address: string;
/**
* A reason explaining why Connection check out failed.
* Can be implemented as a string or enum.
* Current valid values are:
* - "poolClosed": The pool was previously closed, and cannot provide new Connections
* - "timeout": The Connection check out attempt exceeded the specified timeout
* - "connectionError": The Connection check out attempt experienced an error while setting up a new Connection
*/
reason: string|Enum;
/**
* See `ConnectionCheckedOutEvent.duration`.
*/
duration: Duration;
}
/**
* Emitted when the driver successfully checks out a Connection
*/
interface ConnectionCheckedOutEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting to connect to.
*/
address: string;
/**
* The ID of the Connection
*/
connectionId: int64;
/**
* The time it took to check out the connection.
* More specifically, the time elapsed between
* emitting a `ConnectionCheckOutStartedEvent`
* and emitting this event as part of the same checking out.
*
* Naturally, if a new connection was not created (`ConnectionCreatedEvent`)
* and established (`ConnectionReadyEvent`) as part of checking out,
* this duration is usually
* not greater than `ConnectionPoolOptions.waitQueueTimeoutMS`,
* but MAY occasionally be greater than that,
* because a driver does not provide hard real-time guarantees.
*
* A driver MAY choose the type idiomatic to the driver.
* If the type chosen does not convey units, e.g., `int64`,
* then the driver MAY include units in the name, e.g., `durationMS`.
*/
duration: Duration;
}
/**
* Emitted when the driver checks in a Connection back to the Connection Pool
*/
interface ConnectionCheckedInEvent {
/**
* The ServerAddress of the Endpoint the pool is attempting to connect to.
*/
address: string;
/**
* The ID of the Connection
*/
connectionId: int64;
}
Connection Pool Logging
Please refer to the logging specification for details on logging implementations in general, including log levels, log components, handling of null values in log messages, and structured versus unstructured logging.
Drivers MUST support logging of connection pool information via the following types of log messages. These messages MUST
be logged at Debug
level and use the connection
log component. These messages MUST be emitted when specified in
"Connection Pool Behaviors".
The log messages are intended to match the information contained in the events above. Drivers MAY implement connection logging support via an event subscriber if it is convenient to do so.
The types used in the structured message definitions below are demonstrative, and drivers MAY use similar types instead so long as the information is present (e.g. a double instead of an integer, or a string instead of an integer if the structured logging framework does not support numeric types).
Common Fields
All connection log messages MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
serverHost | String | the hostname, IP address, or Unix domain socket path for the endpoint the pool is for. |
serverPort | Int | The port for the endpoint the pool is for. Optional; not present for Unix domain sockets. When the user does not specify a port and the default (27017) is used, the driver SHOULD include it here. |
Pool Created Message
In addition to the common fields defined above, this message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Connection pool created" |
maxIdleTimeMS | Int | The maxIdleTimeMS value for this pool. Optional; only required to include if the user specified a value. |
minPoolSize | Int | The minPoolSize value for this pool. Optional; only required to include if the user specified a value. |
maxPoolSize | Int | The maxPoolSize value for this pool. Optional; only required to include if the user specified a value. |
maxConnecting | Int | The maxConnecting value for this pool. Optional; only required to include if the driver supports this option and the user specified a value. |
waitQueueTimeoutMS | Int | The waitQueueTimeoutMS value for this pool. Optional; only required to include if the driver supports this option and the user specified a value. |
waitQueueSize | Int | The waitQueueSize value for this pool. Optional; only required to include if the driver supports this option and the user specified a value. |
waitQueueMultiple | Int | The waitQueueMultiple value for this pool. Optional; only required to include if the driver supports this option and the user specified a value. |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Connection pool created for {{serverHost}}:{{serverPort}} using options maxIdleTimeMS={{maxIdleTimeMS}}, minPoolSize={{minPoolSize}}, maxPoolSize={{maxPoolSize}}, maxConnecting={{maxConnecting}}, waitQueueTimeoutMS={{waitQueueTimeoutMS}}, waitQueueSize={{waitQueueSize}}, waitQueueMultiple={{waitQueueMultiple}}
Pool Ready Message
In addition to the common fields defined above, this message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Connection pool ready" |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Connection pool ready for {{serverHost}}:{{serverPort}}
Pool Cleared Message
In addition to the common fields defined above, this message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Connection pool cleared" |
serviceId | String | The hex string representation of the service ID which the pool was cleared for. Optional; only present in load balanced mode. |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Connection pool for {{serverHost}}:{{serverPort}} cleared for serviceId {{serviceId}}
Pool Closed Message
In addition to the common fields defined above, this message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Connection pool closed" |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Connection pool closed for {{serverHost}}:{{serverPort}}
Connection Created Message
In addition to the common fields defined above, this message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Connection created" |
driverConnectionId | Int64 | The driver-generated ID for the connection as defined in Connection. |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Connection created: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}
Connection Ready Message
In addition to the common fields defined above, this message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Connection ready" |
driverConnectionId | Int64 | The driver-generated ID for the connection as defined in Connection. |
durationMS | Int32/Int64/Double | ConnectionReadyEvent.duration converted to milliseconds. |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Connection ready: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}, established in={{durationMS}} ms
Connection Closed Message
In addition to the common fields defined above, this message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Connection closed" |
driverConnectionId | Int64 | The driver-generated ID for the connection as defined in Connection. |
reason | String | A string describing the reason the connection was closed. The following strings MUST be used for each possible reason as defined in Events above: - Stale: "Connection became stale because the pool was cleared" - Idle: "Connection has been available but unused for longer than the configured max idle time" - Error: "An error occurred while using the connection" - Pool closed: "Connection pool was closed" |
error | Flexible | If reason is Error, the associated error. The type and format of this value is flexible; see the logging specification for details on representing errors in log messages. |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Connection closed: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}. Reason: {{reason}}. Error: {{error}}
Connection Checkout Started Message
In addition to the common fields defined above, this message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Connection checkout started" |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Checkout started for connection to {{serverHost}}:{{serverPort}}
Connection Checkout Failed Message
In addition to the common fields defined above, this message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Connection checkout failed" |
reason | String | A string describing the reason checkout failed. The following strings MUST be used for each possible reason as defined in Events above: - Timeout: "Wait queue timeout elapsed without a connection becoming available" - ConnectionError: "An error occurred while trying to establish a new connection" - Pool closed: "Connection pool was closed" |
error | Flexible | If reason is ConnectionError, the associated error. The type and format of this value is flexible; see the logging specification for details on representing errors in log messages. |
durationMS | Int32/Int64/Double | ConnectionCheckOutFailedEvent.duration converted to milliseconds. |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Checkout failed for connection to {{serverHost}}:{{serverPort}}. Reason: {{reason}}. Error: {{error}}. Duration: {{durationMS}} ms
Connection Checked Out
In addition to the common fields defined above, this message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Connection checked out" |
driverConnectionId | Int64 | The driver-generated ID for the connection as defined in Connection. |
durationMS | Int32/Int64/Double | ConnectionCheckedOutEvent.duration converted to milliseconds. |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Connection checked out: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}, duration={{durationMS}} ms
Connection Checked In
In addition to the common fields defined above, this message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Connection checked in" |
driverConnectionId | Int64 | The driver-generated ID for the connection as defined in Connection. |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Connection checked in: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}
Connection Pool Errors
A connection pool throws errors in specific circumstances. These Errors MUST be emitted by the pool. Errors SHOULD be created and dispatched in a manner idiomatic to the Driver and Language.
/**
* Thrown when the driver attempts to check out a
* Connection from a closed Connection Pool
*/
interface PoolClosedError {
message: 'Attempted to check out a Connection from closed connection pool';
address: <pool address>;
}
/**
* Thrown when the driver attempts to check out a
* Connection from a paused Connection Pool
*/
interface PoolClearedError extends RetryableError {
message: 'Connection pool for <pool address> was cleared because another operation failed with: <original error which cleared the pool>';
address: <pool address>;
}
/**
* Thrown when a driver times out when attempting to check out
* a Connection from a Pool
*/
interface WaitQueueTimeoutError {
message: 'Timed out while checking out a Connection from connection pool';
address: <pool address>;
}
Test Plan
See tests/README
Design Rationale
Why do we set minPoolSize across all members of a replicaSet, when most traffic will be against a Primary?
Currently, we are attempting to codify our current pooling behavior with minimal changes, and minPoolSize is currently uniform across all members of a replicaSet. This has the benefit of offsetting connection swarming during a Primary Step-Down, which will be further addressed in our Advanced Pooling Behaviors.
Why do we have separate ConnectionCreated and ConnectionReady events, but only one ConnectionClosed event?
ConnectionCreated and ConnectionReady each involve different state changes in the pool.
- ConnectionCreated adds a new "pending" Connection, meaning the totalConnectionCount and pendingConnectionCount increase by one
- ConnectionReady establishes that the Connection is ready for use, meaning the availableConnectionCount increases by one
ConnectionClosed indicates that the Connection is no longer a member of the pool, decrementing totalConnectionCount and potentially availableConnectionCount. After this point, the Connection is no longer a part of the pool. Further hypothetical events would not indicate a change to the state of the pool, so they are not specified here.
Why are waitQueueSize and waitQueueMultiple deprecated?
These options were originally only implemented in three drivers (Java, C#, and Python), and provided little value. While these fields would allow for faster diagnosis of issues in the connection pool, they would not actually prevent an error from occurring.
Additionally, these options have the effect of prioritizing newer requests over older requests, which is not necessarily the behavior that users want. For example, consider a situation when a WaitQueue is full, and a request for Connection gets rejected. Then a spot opens in the WaitQueue, and a newer request gets accepted. One may say that the newer request was prioritized over the older one, which violates the fairness recommendation that the WaitQueue normally adheres to.
Because of these issues, it does not make sense to go against driver mantras and provide an additional knob. We may eventually pursue an alternative configuration to address wait queue size in Advanced Pooling Behaviors.
Users that wish to have this functionality can achieve similar results by utilizing other methods to limit concurrency.
Examples include implementing either a thread pool or an operation queue with a capped size in the user application.
Drivers that need to deprecate `waitQueueSize` and/or `waitQueueMultiple` SHOULD refer users to these examples.
Why is waitQueueTimeoutMS optional for some drivers?
We are anticipating eventually introducing a single client-side timeout mechanism, making us hesitant to introduce another granular timeout control. Therefore, if a driver/language already has an idiomatic way to implement their timeouts, they should leverage that mechanism over implementing waitQueueTimeoutMS.
Why must populating the pool require the use of a background thread or async I/O?
Without the use of a background thread, the pool is populated with enough connections to satisfy minPoolSize during checkOut. Connections are established as part of populating the pool though, so if Connection establishment were done in a blocking fashion, the first operations after a clearing of the pool would experience unacceptably high latency, especially for larger values of minPoolSize. Thus, populating the pool must occur on a background thread (which is acceptable to block) or via the usage of non-blocking (async) I/O.
Why should closing a connection be non-blocking?
Because idle and perished Connections are cleaned up as part of checkOut, performing blocking I/O while closing such Connections would block application threads, introducing unnecessary latency. Once a Connection is marked as "closed", it will not be checked out again, so ensuring the socket is torn down does not need to happen immediately and can happen at a later time, either via async I/O or a background thread.
Why can the pool be paused?
The distinction between the "paused" state and the "ready" state allows the pool to determine whether or not the endpoint it is associated with is available or not. This enables the following behaviors:
- The pool can halt the creation of background connection establishments until the endpoint becomes available again. Without the "paused" state, the pool would have no way of determining when to begin establishing background connections again, so it would just continually attempt, and often fail, to create connections until minPoolSize was satisfied, even after repeated failures. This could unnecessarily waste resources both server and driver side.
- The pool can evict requests that enter the WaitQueue after the pool was cleared but before the server was in a known state again. Such requests can occur when a server is selected at the same time as it becomes marked as Unknown in highly concurrent workloads. Without the "paused" state, the pool would attempt to service these requests, since it would assume they were routed to the pool because its endpoint was available, not because of a race between SDAM and Server Selection. These requests would then likely fail with potentially high latency, again wasting resources both server and driver side.
Why not emit PoolCleared events and log messages when clearing a paused pool?
If a pool is already paused when it is cleared, that means it was previously cleared and no new connections have been created since then. Thus, clearing the pool in this case is essentially a no-op, so there is no need to notify any listeners that it has occurred. The generation is still incremented, however, to ensure future errors that caused the duplicate clear will stop attempting to clear the pool again. This situation is possible if the pool is cleared by the background thread after it encounters an error establishing a connection, but the ServerDescription for the endpoint was not updated accordingly yet.
Why does the pool need to support interrupting in use connections as part of its clear logic?
If an SDAM monitor has observed a network timeout, we assume that all connections, including "in use" connections, are no longer healthy. In some cases connections will fail to detect the network timeout fast enough. For example, a server request can hang at the OS level in a TCP retry loop for up to 17 minutes before failing. Therefore these connections MUST be proactively interrupted in the case of a server monitor network timeout. Requesting an immediate background thread run will speed up this process.
Why don't we configure TCP_USER_TIMEOUT?
Ideally, a reasonable TCP_USER_TIMEOUT could help detect stale connections as an alternative to `interruptInUseConnections` in Clear. Unfortunately this approach is platform dependent, and not every driver can configure it easily. For example, the C# driver can configure this socket option on Linux only with target frameworks greater than or equal to .NET 5.0. On macOS there is no direct equivalent for this option; it may be possible to find an equivalent configuration, but it would also require target frameworks greater than or equal to .NET 5.0. The advantage of using the Background Thread to manage perished connections is that it works regardless of environment setup.
Backwards Compatibility
As mentioned in Deprecated Options, some drivers currently implement the options `waitQueueSize` and/or `waitQueueMultiple`. These options will need to be deprecated and phased out of the drivers that have implemented them.
Reference Implementations
- JAVA (JAVA-3079)
- RUBY (RUBY-1560)
Future Development
SDAM
This specification does not dictate how SDAM Monitoring connections are managed. SDAM specifies that "A monitor SHOULD NOT use the client's regular Connection pool". Some possible solutions for this include:
- Having each Endpoint representation in the driver create and manage a separate dedicated Connection for monitoring purposes
- Having each Endpoint representation in the driver maintain a separate pool of maxPoolSize 1 for monitoring purposes.
- Having each Pool maintain a dedicated Connection for monitoring purposes, with an API to expose that Connection.
Advanced Pooling Behaviors
This spec does not address all advanced pooling behaviors like predictive pooling or aggressive Connection creation. Future work may address this.
Add support for OP_MSG exhaustAllowed
Exhaust Cursors may require changes to how we close Connections in the future, specifically to add a way to close and remove from its pool a Connection which has unread exhaust messages.
Changelog
- 2025-01-22: Clarify durationMS in logs may be Int32/Int64/Double.
- 2024-11-27: Relaxed the WaitQueue fairness requirement.
- 2024-11-01: Fixed race condition in pool-checkout-returned-connection-maxConnecting.yml test.
- 2024-01-23: Migrated from reStructuredText to Markdown.
- 2023-10-04: Commit to the currently specified requirements regarding durations in events.
- 2023-08-04: Add durations to connection pool events.
- 2023-04-17: Fix duplicate logging test description.
- 2022-10-14: Add connection pool log messages and associated tests.
- 2022-10-05: Remove spec front matter and reformat changelog.
- 2022-04-05: Preemptively cancel in progress operations when SDAM heartbeats timeout.
- 2021-11-08: Make maxConnecting configurable.
- 2021-06-02: Formalize the behavior of a Background Thread.
- 2021-04-12: Adding in behaviour for load balancer mode.
- 2021-01-19: Require that timeouts be applied per the client-side operations timeout specification.
- 2021-01-12: Clarify "clear" method behavior in load balancer mode.
- 2020-12-17: Introduce "paused" and "ready" states. Clear WaitQueue on pool clear.
- 2020-09-24: Introduce maxConnecting requirement.
- 2020-09-03: Clarify Connection states and definition. Require the use of a background thread and/or async I/O. Add tests to ensure ConnectionReadyEvents are fired after ConnectionCreatedEvents.
- 2019-06-06: Add "connectionError" as a valid reason for ConnectionCheckOutFailedEvent.
Load Balancer Support
- Status: Accepted
- Minimum Server Version: 5.0
Abstract
This specification defines driver behaviour when connected to MongoDB services through a load balancer.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Terms
SDAM
An abbreviated form of "Server Discovery and Monitoring", specification defined in Server Discovery and Monitoring Specification.
Service
Any MongoDB service that can run behind a load balancer.
MongoClient Configuration
loadBalanced
To specify to the driver to operate in load balancing mode, a connection string option of `loadBalanced=true` MUST be added to the connection string. This boolean option specifies whether or not the driver is connecting to a MongoDB cluster through a load balancer. The default value MUST be false. This option MUST only be configurable at the level of a `MongoClient`.
URI Validation
When `loadBalanced=true` is provided in the connection string, the driver MUST throw an exception in the following cases:
- The connection string contains more than one host/port.
- The connection string contains a `replicaSet` option.
- The connection string contains a `directConnection` option with a value of `true`.
- The connection string contains an `srvMaxHosts` option with a positive integer value.

If a URI is provided with the `mongodb+srv` scheme, the driver MUST first do the SRV and TXT lookup and then perform the validation. For drivers that do SRV lookup asynchronously this may result in a `MongoClient` being instantiated but erroring later during operation execution.
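A sketch of these checks in Python, assuming hosts and options have already been parsed (and SRV/TXT lookup performed where applicable):

```python
def validate_load_balanced_uri(hosts, options):
    """Raise if loadBalanced=true conflicts with other URI contents."""
    if not options.get("loadBalanced"):
        return
    if len(hosts) > 1:
        raise ValueError("loadBalanced=true allows only one host/port")
    if "replicaSet" in options:
        raise ValueError("loadBalanced=true cannot be combined with replicaSet")
    if options.get("directConnection") is True:
        raise ValueError(
            "loadBalanced=true cannot be combined with directConnection=true")
    if options.get("srvMaxHosts", 0) > 0:
        raise ValueError(
            "loadBalanced=true cannot be combined with a positive srvMaxHosts")
```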
DNS Seedlist Discovery
The connection string option for `loadBalanced=true` MUST be valid in a TXT record and when present MUST be validated as defined in the URI Validation section.

When a MongoClient is configured with an SRV URI and `loadBalanced=true`, the driver MUST NOT poll for changes in the SRV record as is done for non-load balanced sharded clusters.
Server Discovery Logging and Monitoring
Monitoring
When `loadBalanced=true` is specified in the URI the topology MUST start in type `LoadBalanced` and MUST remain as `LoadBalanced` indefinitely. The topology MUST contain 1 `ServerDescription` with a `ServerType` of `LoadBalancer`. The "address" field of the `ServerDescription` MUST be set to the address field of the load balancer. All other fields in the `ServerDescription` MUST remain unset. In this mode the driver MUST NOT start a monitoring connection. The `TopologyDescription`'s `compatible` field MUST always be `true`.
Although there is no monitoring connection in load balanced mode, drivers MUST emit the following series of SDAM events:
- `TopologyOpeningEvent` when the topology is created.
- `TopologyDescriptionChangedEvent`. The `previousDescription` field MUST have `TopologyType` `Unknown` and no servers. The `newDescription` MUST have `TopologyType` `LoadBalanced` and one server with `ServerType` `Unknown`.
- `ServerOpeningEvent` when the server representing the load balancer is created.
- `ServerDescriptionChangedEvent`. The `previousDescription` MUST have `ServerType` `Unknown`. The `newDescription` MUST have `ServerType` `LoadBalancer`.
- `TopologyDescriptionChangedEvent`. The `newDescription` MUST have `TopologyType` `LoadBalanced` and one server with `ServerType` `LoadBalancer`.
Drivers MUST also emit a `ServerClosedEvent` followed by a `TopologyDescriptionChangedEvent` that transitions the `Topology` to the `UNKNOWN` state and a `TopologyClosedEvent` when the topology is closed, and MUST NOT emit any other events when operating in this mode.
Log Messages
The SDAM event details described in Monitoring apply to the corresponding log messages. Please refer to the SDAM logging specification for details on SDAM logging. Drivers MUST emit the relevant SDAM log messages, such as:
- Starting Topology Monitoring
- Stopped Topology Monitoring
- Starting Server Monitoring
- Stopped Server Monitoring
- Topology Description Changed
Driver Sessions
Session Support
When the `TopologyType` is `LoadBalanced`, sessions are always supported.
Session Expiration
When in load balancer mode, drivers MUST ignore `logicalSessionTimeoutMinutes` and MUST NOT prune client sessions from the session pool when implemented by the driver.
Data-Bearing Server Type
A `ServerType` of `LoadBalancer` MUST be considered a data-bearing server.
Server Selection
A deployment of topology type Load Balanced contains one server of type `LoadBalancer`.

For read and write operations, the single server in the topology MUST always be selected.

During command construction, the `LoadBalancer` server MUST be treated like a mongos and drivers MUST add a `$readPreference` field to the command when required by Passing read preference to mongos and load balancers.
Connection Pooling
Connection Establishment
In the case of the driver having the `loadBalanced=true` connection string option specified, every pooled connection MUST add a `loadBalanced` field to the `hello` command in its handshake. The value of the field MUST be `true`. If `loadBalanced=true` is specified then the `OP_MSG` protocol MUST be used for all steps of the connection handshake.

Example:

Driver connection string contains `loadBalanced=true`:

{ hello: 1, loadBalanced: true }

Driver connection string contains `loadBalanced=false` or no `loadBalanced` option:

{ hello: 1 }

When the server's hello response does not contain a `serviceId` field, the driver MUST throw an exception with the message "Driver attempted to initialize in load balancing mode, but the server does not support this mode."
For single threaded drivers that do not use a connection pool, the driver MUST have only 1 socket connection to the load balancer in load balancing mode.
Connection Pinning
Some features in MongoDB such as cursors and transactions require sending multiple commands to the same mongos in a sharded cluster. In load balanced mode, it is not possible to target the same mongos behind a load balancer when pooling connections. To account for this, drivers MUST pin to a single connection for these features. When using a pinned connection, the driver MUST emit only 1 `ConnectionCheckOutStartedEvent`, and only 1 `ConnectionCheckedOutEvent` or `ConnectionCheckOutFailedEvent`. Similarly, the driver MUST only publish 1 `ConnectionCheckedInEvent`.
Behaviour With Cursors
When the driver is in load balancing mode and executing any cursor-initiating command, the driver MUST NOT check the connection back into the pool unless the command fails or the server returns a cursor ID of `0` (i.e. all documents are returned in a single batch). Otherwise, the driver MUST continue to use the same connection for all subsequent `getMore` commands for the cursor. The driver MUST check the connection back into the pool if the server returns a cursor ID of `0` in a `getMore` response (i.e. the cursor is drained). When the cursor's `close` method is invoked, either explicitly or via an implicit resource cleanup mechanism, the driver MUST use the same connection to execute a `killCursors` command if necessary and then check the connection back into the pool regardless of the result.

For multi-threaded drivers, cursors with pinned connections MUST either document to the user that calling `next()` and `close()` operations on the cursor concurrently is not permitted, or explicitly prevent cursors from executing those operations simultaneously.

If a `getMore` fails with a network error, drivers MUST leave the connection pinned to the cursor. When the cursor's `close` method is invoked, drivers MUST NOT execute a `killCursors` command because the pinned connection is no longer valid, and MUST return the connection back to the pool.
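A sketch of the `getMore` path under these rules; the cursor and connection attributes shown are hypothetical:

```python
class NetworkError(Exception):
    """Stands in for the driver's network error type."""

def get_more(cursor):
    # The cursor holds its pinned connection until drained or closed.
    try:
        reply = cursor.pinned_connection.run_command(
            {"getMore": cursor.cursor_id, "collection": cursor.collection})
    except NetworkError:
        cursor.network_errored = True  # keep the pin; close() skips killCursors
        raise
    if reply["cursor"]["id"] == 0:     # drained: check the connection back in
        cursor.pool.check_in(cursor.pinned_connection)
        cursor.pinned_connection = None
    return reply
```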
Behaviour With Transactions
When executing a transaction in load balancing mode, drivers MUST follow the rules outlined in Sharded Transactions with one exception: drivers MUST use the same connection for all commands in the transaction (excluding retries of commitTransaction and abortTransaction in some cases). Pinning to a single connection ensures that all commands in the transaction target the same service behind the load balancer. The rules for pinning to a connection and releasing a pinned connection are the same as those for server pinning in non-load balanced sharded transactions as described in When to unpin. Drivers MUST NOT use the same connection for two concurrent transactions run under different sessions from the same client.
Connection Tracking
The driver connection pool MUST track the purpose for which connections are checked out in the following 3 categories:
- Connections checked out for cursors
- Connections checked out for transactions
- Connections checked out for operations not falling under the previous 2 categories
When the connection pool's `maxPoolSize` is reached and the pool times out waiting for a new connection, the `WaitQueueTimeoutError` MUST include a new detailed message, "Timeout waiting for connection from the connection pool. maxPoolSize: n, connections in use by cursors: n, connections in use by transactions: n, connections in use by other operations: n".
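For illustration, a sketch producing the required message from hypothetical per-purpose counters:

```python
class WaitQueueTimeoutError(Exception):
    pass

def raise_wait_queue_timeout(pool):
    counts = pool.checkout_counts   # hypothetical per-purpose counters
    raise WaitQueueTimeoutError(
        "Timeout waiting for connection from the connection pool. "
        f"maxPoolSize: {pool.max_pool_size}, "
        f"connections in use by cursors: {counts['cursor']}, "
        f"connections in use by transactions: {counts['transaction']}, "
        f"connections in use by other operations: {counts['other']}")
```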
Error Handling
Initial Handshake Errors
When establishing a new connection in load balanced mode, drivers MUST NOT perform SDAM error handling for any errors that occur before the MongoDB Handshake (i.e. `hello` command) is complete. Errors during the MongoDB Handshake MUST also be ignored for SDAM error handling purposes. Once the initial handshake is complete, the connection MUST determine its generation number based on the `serviceId` field in the handshake response. Any errors that occur during the rest of connection establishment (e.g. errors during authentication commands) MUST go through the SDAM error handling flow but MUST NOT mark the server as `Unknown` and, when requiring the connection pool to be cleared, MUST only clear connections for the `serviceId`.
Post-Handshake Errors
When the driver is operating in load balanced mode and an application operation receives a state change error, the driver MUST NOT make any changes to the `TopologyDescription` or the `ServerDescription` of the load balancer (i.e. it MUST NOT mark the load balancer as `Unknown`). If the error requires the connection pool to be cleared, the driver MUST only clear connections with the same `serviceId` as the connection which errored.
Events
When in load balancer mode the driver MUST now include the `serviceId` in the `CommandStartedEvent`, `CommandSucceededEvent`, and `CommandFailedEvent`. The driver MAY decide how to expose this information. Drivers that have a `ConnectionId` object, for example, MAY choose to provide a `serviceId` in that object. The `serviceId` field is only present when in load balancer mode and connected to a service that is behind a load balancer.

Additionally the `PoolClearedEvent` MUST also contain a `serviceId` field.
Downstream Visible Behavioral Changes
Services MAY add a command line option or other configuration parameter, that tells the service it is running behind a load balancer. Services MAY also dynamically determine whether they are behind a load balancer.
All services which terminate TLS MUST be configured to return a TLS certificate for a hostname which matches the hostname the client is connecting to.
All services behind a load balancer that have been started with the aforementioned option MUST add a top level `serviceId` field to their response to the `hello` command. This field MUST be a BSON `ObjectId` and SHOULD NOT change while the service is running. When a driver is configured to not be in load balanced mode and the service is configured behind a load balancer, the service MAY return an error from the driver's `hello` command that the driver is not configured to use it properly.
All services that have the behaviour of reaping idle cursors after a specified period of time MAY also close the connection associated with the cursor when the cursor is reaped. Conversely, those services MAY reap a cursor when the connection associated with the cursor is closed.
All services that have the behaviour of reaping idle transactions after a specified period of time MAY also close the connection associated with the transaction when the transaction is reaped. Conversely, those services MUST abort a transaction when the connection associated with the transaction is closed.
Any applications that connect directly to services and not through the load balancer MUST connect via the regular service port as they normally would and not the port specified by the `loadBalancerPort` option. The `loadBalanced=true` URI option MUST be omitted in this case.
Q&A
Why use a connection string option instead of a new URI scheme?
Use of a connection string option would allow the driver to continue to use SRV records that pointed at a load balancer instead of a replica set without needing to change the URI provided to the `MongoClient`. The SRV records could also provide the default `loadBalanced=true` in the TXT records.
Why explicitly opt-in to this behaviour instead of letting mongos inform the driver of the load balancer?
Other versions of this design proposed a scheme in which the application does not have to opt-in to load balanced mode. Instead, the server would send a special field in `hello` command responses to indicate that it was running behind a load balancer, and the driver would change its behavior accordingly. We opted to take an approach that required code changes instead because load balancing changes driver behavior in ways that could cause unexpected application errors, so it made sense to have applications consciously opt-in to this mode. For example, connection pinning creates new stresses on connection pools because we go from a total of `numMongosServers * maxPoolSize` connections to simply `maxPoolSize`. Furthermore, connections get pinned to open cursors and transactions, further straining resource availability. Due to this change, applications may also need to increase the configured `maxPoolSize` when opting into this mode.
Why does this specification instruct drivers to not check connections back into the connection pool in some circumstances?
In the case of a load balancer fronting multiple services, it is possible that a connection to the load balancer could result in a connection behind the load balancer to a different service. In order to guarantee these operations execute on the same service they need to be executed on the same socket - not checking a connection back into the pool for the entire operation guarantees this.
Why has a client-side connection reaper for idle cursors not been put into this specification?
It was discussed as a potential solution for maxed out connection pools that the drivers could potentially behave similar to the server and close long running cursors after a specified time period and return their connections to the pool. Due to the high complexity of that solution it was determined that better error messaging when the connection pool was maxed out would suffice in order for users to easily debug when the pool ran out of connections and fix their applications or adjust their pool options accordingly.
Why are we requiring mongos servers to add a new serviceId field in hello responses rather than reusing the existing topologyVersion.processId?
This option was previously discussed, but we opted to add a new `hello` response field in order to not mix intentions.
Why does this specification not address load balancer restarts or maintenance?
The Layer 4 load balancers that would be in use for this feature lack the ability that a layer 7 load balancer could potentially have to be able to understand the MongoDB wire protocol and respond to monitoring requests.
Design Rationales
Services cannot dynamically switch from running behind a load balancer and not running behind a load balancer. Based on that, this design forces the application to opt-in to this behaviour and make potential changes that require restarts to their applications. If this were to change, see alternative designs below.
Alternative Designs
Service PROXY Detection
An alternative to the driver using a connection string option to put it into load balancing mode would be for the service the driver is connected to to inform the driver it is behind a load balancer. A possible solution for this would be for all services to understand the PROXY protocol such as Data Lake does, and to alter their hello responses to inform the driver they are behind a load balancer, potentially with the IP address of the load balancer itself.
The benefit of this solution would be that no changes would be required from the application side, and could also not require a restart of any application. A single request to the service through the load balancer could automatically trigger the change in the hello response and cause the driver to switch into load balancing mode pointing at the load balancer's IP address. Also with this solution it would provide services the ability to record the original IP addresses of the application that was connecting to it as they are provided the PROXY protocol's header bytes.
The additional complexity of this alternative on the driver side is that instead of starting in a single mode and remaining there for the life of the application, the driver would need to deal with additional state changes based on the results of the server monitors. From a service perspective, every service would need to be updated to understand the PROXY protocol header bytes prepended to the initial connection and modify their states and hello responses accordingly. Additionally load balancers would need to have additional configuration as noted in the reference section below, and only load balancers that support the PROXY protocol would be supported.
Changelog
- 2024-04-25: Clarify that `TopologyDescriptionChangedEvent` must be emitted on topology close.
- 2024-03-06: Migrated from reStructuredText to Markdown.
- 2022-10-05: Remove spec front matter and reformat changelog.
- 2022-01-18: Clarify that `OP_MSG` must be used in load balanced mode.
- 2021-12-22: Clarify that pinned connections in transactions are exclusive.
- 2021-10-14: Note that `loadBalanced=true` conflicts with `srvMaxHosts`.
Authentication
- Status: Accepted
- Minimum Server Version: 2.6
Abstract
MongoDB supports various authentication strategies across various versions. When authentication is turned on in the database, a driver must authenticate before it is allowed to communicate with the server. This spec defines when and how a driver performs authentication with a MongoDB server.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
References
Server Discovery and Monitoring
Specification
Definitions
- Credential: The pieces of information used to establish the authenticity of a user. This is composed of an identity and some form of evidence such as a password or a certificate.
- FQDN: Fully Qualified Domain Name
- Mechanism: A SASL implementation of a particular type of credential negotiation.
- Source: The authority used to establish credentials and/or privileges in reference to a mongodb server. In practice, it is the database to which sasl authentication commands are sent.
- Realm: The authority used to establish credentials and/or privileges in reference to GSSAPI.
- SASL: Simple Authentication and Security Layer - RFC 4422
Client Implementation
MongoCredential
Drivers SHOULD contain a type called `MongoCredential`. It SHOULD contain some or all of the following information.
- username (string)
  - Applies to all mechanisms.
  - Optional for MONGODB-X509, MONGODB-AWS, and MONGODB-OIDC.
- source (string)
  - Applies to all mechanisms.
  - Always '$external' for GSSAPI and MONGODB-X509.
  - This is the database to which the authenticate command will be sent.
  - This is the database to which sasl authentication commands will be sent.
- password (string)
  - Does not apply to all mechanisms.
- mechanism (string)
  - Indicates which mechanism to use with the credential.
- mechanism_properties
  - Includes additional properties for the given mechanism.
Each mechanism requires certain properties to be present in a MongoCredential for authentication to occur. See the individual mechanism definitions in the "MongoCredential Properties" section. All requirements listed for a mechanism must be met for authentication to occur.
Credential delimiter in URI implies authentication
The presence of a credential delimiter (i.e. @) in the URI connection string is evidence that the user has unambiguously specified user information and MUST be interpreted as a user configuring authentication credentials (even if the username and/or password are empty strings).
Authentication source and URI database do not imply authentication
The presence of a database name in the URI connection string MUST NOT be interpreted as a user configuring authentication credentials. The URI database name is only used as a default source for some mechanisms when authentication has been configured and a source is required but has not been specified. See individual mechanism definitions for details.
Similarly, the presence of the `authSource` option in the URI connection string without other credential data such as Userinfo or authentication parameters in connection options MUST NOT be interpreted as a request for authentication.
Errors
Drivers SHOULD raise an error as early as possible when detecting invalid values in a credential. For instance, if a `mechanism_property` is specified for MONGODB-CR, the driver should raise an error indicating that the property does not apply.

Drivers MUST raise an error if any required information for a mechanism is missing. For instance, if a `username` is not specified for SCRAM-SHA-256, the driver must raise an error indicating that the property is missing.
Naming
Naming of this information MUST be idiomatic to the driver's language/framework but still remain consistent. For instance, python would use "mechanism_properties" and .NET would use "MechanismProperties".
Naming of mechanism properties MUST be case-insensitive. For instance, SERVICE_NAME and service_name refer to the same property.
Authentication
A MongoClient instance MUST be considered a single logical connection to the server/deployment.
Socket connections from a MongoClient to deployment members can be one of two types:
- Monitoring-only socket: multi-threaded drivers maintain monitoring sockets separate from sockets in connection pools.
- General-use socket: for multi-threaded drivers, these are sockets in connection pools used for (non-monitoring) user operations; in single-threaded drivers, these are used for both monitoring and user operations.
Authentication (including mechanism negotiation) MUST NOT happen on monitoring-only sockets.
If one or more credentials are provided to a MongoClient, then whenever a general-use socket is opened, drivers MUST immediately conduct an authentication handshake over that socket.
Drivers SHOULD require all credentials to be specified upon construction of the MongoClient. This is defined as eager authentication and drivers MUST support this mode.
Authentication Handshake
An authentication handshake consists of an initial `hello` or legacy hello command, possibly followed by one or more authentication conversations.
Drivers MUST follow the following steps for an authentication handshake:
- Upon opening a general-use socket to a server for a given MongoClient, drivers MUST issue a MongoDB Handshake immediately. This allows a driver to determine the server type. If the `hello` or legacy hello of the MongoDB Handshake fails with an error, drivers MUST treat this as an authentication error.
- If the server is of type RSArbiter, no authentication is possible and the handshake is complete.
- Inspect the value of `maxWireVersion`. If the value is greater than or equal to `6`, then the driver MUST use `OP_MSG` for authentication. Otherwise, it MUST use `OP_QUERY`.
- If credentials exist:
  - A driver MUST authenticate with all credentials provided to the MongoClient.
  - A single invalid credential is the same as all credentials being invalid.
If the authentication handshake fails for a socket, drivers MUST mark the server Unknown and clear the server's connection pool. (See Q & A below and SDAM's Why mark a server Unknown after an auth error for rationale.)
All blocking operations executed as part of the authentication handshake MUST apply timeouts per the Client Side Operations Timeout specification.
Mechanism Negotiation via Handshake
- Since: 4.0
If an application provides a username but does not provide an authentication mechanism, drivers MUST negotiate a
mechanism via a hello
or legacy hello command requesting a user's supported SASL mechanisms:
{hello: 1, saslSupportedMechs: "<dbname>.<username>"}
In this example <dbname>
is the authentication database name that either SCRAM-SHA-1 or SCRAM-SHA-256 would use (they
are the same; either from the connection string or else defaulting to 'admin') and <username>
is the username provided
in the auth credential. The username MUST NOT be modified from the form provided by the user (i.e. do not normalize with
SASLprep), as the server uses the raw form to look for conflicts with legacy credentials.
If the handshake response includes a saslSupportedMechs
field, then drivers MUST use the contents of that field to
select a default mechanism as described later. If the command succeeds and the response does not include a
saslSupportedMechs
field, then drivers MUST use the legacy default mechanism rules for servers older than 4.0.
Drivers MUST NOT validate the contents of the saslSupportedMechs
attribute of the initial handshake reply. Drivers
MUST NOT raise an error if the saslSupportedMechs
attribute of the reply includes an unknown mechanism.
Single-credential drivers
When the authentication mechanism is not specified, drivers that allow only a single credential per client MUST perform mechanism negotiation as part of the MongoDB Handshake portion of the authentication handshake. This lets authentication proceed without a separate negotiation round-trip exchange with the server.
Multi-credential drivers
The use of multiple credentials within a driver is discouraged, but some legacy drivers still allow this. Such drivers may not have user credentials when connections are opened and thus will not be able to do negotiation.
Drivers with a list of credentials at the time a connection is opened MAY do mechanism negotiation on the initial handshake, but only for the first credential in the list of credentials.
When authenticating each credential, if the authentication mechanism is not specified and has not been negotiated for that credential:
- If the connection handshake results indicate the server version is 4.0 or later, drivers MUST send a new hello or legacy hello negotiation command for the credential to determine the default authentication mechanism.
- Otherwise, when the server version is earlier than 4.0, the driver MUST select a default authentication mechanism for the credential following the instructions for when the saslSupportedMechs field is not present in a legacy hello response.
Caching credentials in SCRAM
In the implementation of SCRAM authentication mechanisms (e.g. SCRAM-SHA-1 and SCRAM-SHA-256), drivers MUST maintain a cache of computed SCRAM credentials. The cache entries SHOULD be identified by the password, salt, iteration count, and a value that uniquely identifies the authentication mechanism (e.g. "SHA1" or "SCRAM-SHA-256").
The cache entry value MUST be either the saltedPassword parameter or the combination of the clientKey and serverKey parameters.
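For illustration, a minimal sketch of such a cache in Python (names and shapes are illustrative, not prescribed by this specification):

```python
# Cache of expensive-to-compute SCRAM credentials, keyed by password, salt,
# iteration count, and a mechanism identifier (e.g. "SCRAM-SHA-256").
scram_cache = {}

def cached_keys(password, salt, iterations, mechanism, derive_keys):
    """Return (clientKey, serverKey), computing them only on a cache miss."""
    key = (password, salt, iterations, mechanism)
    if key not in scram_cache:
        scram_cache[key] = derive_keys(password, salt, iterations)  # runs Hi()
    return scram_cache[key]
```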
Reauthentication
On any operation that requires authentication, the server may raise the error ReauthenticationRequired
(391),
typically if the user's credential has expired. Drivers MUST immediately attempt a reauthentication on the connection
using suitable credentials, as specified by the particular authentication mechanism when this error is raised, and then
re-attempt the operation. This attempt MUST be irrespective of whether the operation is considered retryable. Drivers
MUST NOT resend a hello
message during reauthentication, instead using SASL messages directly. Any errors that could
not be recovered from during reauthentication, or that were encountered during the subsequent re-attempt of the
operation MUST be raised to the user.
Currently the only authentication mechanism on the server that supports reauthentication is MONGODB-OIDC. See the MONGODB-OIDC section on reauthentication for more details. Note that in order to implement the unified spec tests for reauthentication, it may be necessary to add reauthentication support for whichever auth mechanism is used when running the authentication spec tests.
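As a rough sketch of this flow, the following retries a single operation after reauthenticating; run_command and reauthenticate are hypothetical driver internals, and the exception class is a stand-in for the driver's server-error type:

```python
REAUTHENTICATION_REQUIRED = 391

class OperationFailure(Exception):
    # Stand-in for the driver's server-error type (hypothetical).
    def __init__(self, message, code):
        super().__init__(message)
        self.code = code

def run_with_reauthentication(connection, command, credential):
    try:
        return connection.run_command(command)
    except OperationFailure as exc:
        if exc.code != REAUTHENTICATION_REQUIRED:
            raise
        # Reauthenticate using SASL messages directly (no new hello), then
        # re-attempt the operation once, regardless of retryability.
        connection.reauthenticate(credential)
        return connection.run_command(command)
```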
Default Authentication Methods
- Since: 3.0
- Revised: 4.0
If the user did not provide a mechanism via the connection string or via code, the following logic describes how to select a default.
If a saslSupportedMechs
field was present in the handshake response for mechanism negotiation, then it MUST be
inspected to select a default mechanism:
{
"hello" : true,
"saslSupportedMechs": ["SCRAM-SHA-1", "SCRAM-SHA-256"],
...
"ok" : 1
}
If SCRAM-SHA-256 is present in the list of mechanisms, then it MUST be used as the default; otherwise, SCRAM-SHA-1 MUST be used as the default, regardless of whether SCRAM-SHA-1 is in the list. Drivers MUST NOT attempt to use any other mechanism (e.g. PLAIN) as the default.
If saslSupportedMechs
is not present in the handshake response for mechanism negotiation, then SCRAM-SHA-1 MUST be
used when talking to servers >= 3.0. Prior to server 3.0, MONGODB-CR MUST be used.
When a user has specified a mechanism, regardless of the server version, the driver MUST honor this.
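A compact sketch of this selection logic, assuming the wire-version mapping in which server 3.0 reports maxWireVersion 3:

```python
def select_default_mechanism(sasl_supported_mechs, max_wire_version):
    # saslSupportedMechs, when present, decides between the SCRAM variants.
    if sasl_supported_mechs is not None:
        if "SCRAM-SHA-256" in sasl_supported_mechs:
            return "SCRAM-SHA-256"
        return "SCRAM-SHA-1"  # even if absent from the list
    # Legacy rules: SCRAM-SHA-1 for servers >= 3.0, MONGODB-CR before that.
    return "SCRAM-SHA-1" if max_wire_version >= 3 else "MONGODB-CR"
```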
Determining Server Version
Drivers SHOULD use the server's wire version ranges to determine the server's version.
MONGODB-CR
- Since: 1.4
- Deprecated: 3.0
- Removed: 4.0
MongoDB Challenge Response is a nonce and MD5 based system. The driver sends a getnonce
command, encodes and hashes
the password using the returned nonce, and then sends an authenticate
command.
Conversation
- Send getnonce command

  CMD = { getnonce: 1 }
  RESP = { nonce: <nonce> }

- Compute key

  passwordDigest = HEX( MD5( UTF8( username + ':mongo:' + password )))
  key = HEX( MD5( UTF8( nonce + username + passwordDigest )))

- Send authenticate command

  CMD = { authenticate: 1, nonce: nonce, user: username, key: key }
As an example, given a username of "user" and a password of "pencil", the conversation would appear as follows:
CMD = {getnonce : 1}
RESP = {nonce: "2375531c32080ae8", ok: 1}
CMD = {authenticate: 1, user: "user", nonce: "2375531c32080ae8", key: "21742f26431831d5cfca035a08c5bdf6"}
RESP = {ok: 1}
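The computation above can be reproduced with a few lines of Python; hashlib's hexdigest corresponds to HEX() here:

```python
from hashlib import md5

def mongodb_cr_key(username, password, nonce):
    password_digest = md5(f"{username}:mongo:{password}".encode("utf-8")).hexdigest()
    return md5(f"{nonce}{username}{password_digest}".encode("utf-8")).hexdigest()

# Reproduces the conversation above:
# mongodb_cr_key("user", "pencil", "2375531c32080ae8")
#   == "21742f26431831d5cfca035a08c5bdf6"
```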
MongoCredential Properties
- username: MUST be specified and non-zero length.
- source: MUST be specified. Defaults to the database name if supplied on the connection string or admin.
- password: MUST be specified.
- mechanism: MUST be "MONGODB-CR"
- mechanism_properties: MUST NOT be specified.
MONGODB-X509
- Since: 2.6
- Changed: 3.4
MONGODB-X509 is the usage of X.509 certificates to validate a client where the distinguished subject name of the client certificate acts as the username.
When connected to MongoDB 3.4:
- You MUST NOT raise an error when the application only provides an X.509 certificate and no username.
- If the application does not provide a username you MUST NOT send a username to the server.
- If the application provides a username you MUST send that username to the server.
When connected to MongoDB 3.2 or earlier:
- You MUST send a username to the server.
- If no username is provided by the application, you MAY extract the username from the X.509 certificate instead of requiring the application to provide it.
- If you choose not to automatically extract the username from the certificate you MUST error when no username is provided by the application.
Conversation
- Send authenticate command (MongoDB 3.4+)

  CMD = {"authenticate": 1, "mechanism": "MONGODB-X509"}
  RESP = {"dbname" : "$external", "user" : "C=IS,ST=Reykjavik,L=Reykjavik,O=MongoDB,OU=Drivers,CN=client", "ok" : 1}

- Send authenticate command with username

  username = $(openssl x509 -subject -nameopt RFC2253 -noout -inform PEM -in my-cert.pem)
  CMD = {authenticate: 1, mechanism: "MONGODB-X509", user: "C=IS,ST=Reykjavik,L=Reykjavik,O=MongoDB,OU=Drivers,CN=client"}
  RESP = {"dbname" : "$external", "user" : "C=IS,ST=Reykjavik,L=Reykjavik,O=MongoDB,OU=Drivers,CN=client", "ok" : 1}
MongoCredential Properties
- username: SHOULD NOT be provided for MongoDB 3.4+. MUST be specified and non-zero length for MongoDB prior to 3.4.
- source: MUST be "$external". Defaults to $external.
- password: MUST NOT be specified.
- mechanism: MUST be "MONGODB-X509"
- mechanism_properties: MUST NOT be specified.
TODO: Errors
SASL Mechanisms
- Since: 2.4 Enterprise
SASL mechanisms are all implemented using the same SASL commands and interpreted as defined by the SASL specification RFC 4422.
- Send the saslStart command.

  CMD = { saslStart: 1, mechanism: <mechanism_name>, payload: BinData(...), autoAuthorize: 1 }
  RESP = { conversationId: <number>, code: <code>, done: <boolean>, payload: <payload> }

  - conversationId: the conversation identifier. This will need to be remembered and used for the duration of the conversation.
  - code: a response code that will indicate failure. This field is not included when the command was successful.
  - done: a boolean value indicating whether or not the conversation has completed.
  - payload: a sequence of bytes or a base64 encoded string (depending on input) to pass into the SASL library to transition the state machine.

- Continue with the saslContinue command while done is false.

  CMD = { saslContinue: 1, conversationId: conversationId, payload: BinData(...) }
  RESP = { conversationId: <number>, code: <code>, done: <boolean>, payload: <payload> }
Many languages will have the ability to utilize third-party libraries. The server uses cyrus-sasl and it would make sense for drivers with a choice to also choose Cyrus. However, it is important to ensure that any third-party library implements the mechanism on all supported OS versions and that it interoperates with the server. For instance, the Cyrus SASL library offered on RHEL 6 does not implement SCRAM-SHA-1. As such, if your driver supports RHEL 6, you'll need to implement SCRAM-SHA-1 from scratch.
GSSAPI
- Since: 2.4 Enterprise (2.6 Enterprise on Windows)
GSSAPI is kerberos authentication as defined in RFC 4752. Microsoft has a proprietary implementation called SSPI which is compatible with both Windows and Linux clients.
MongoCredential properties:
- username: MUST be specified and non-zero length.
- source: MUST be "$external". Defaults to $external.
- password: MAY be specified. If omitted, drivers MUST NOT pass the username without password to SSPI on Windows and instead use the default credentials.
- mechanism: MUST be "GSSAPI"
- mechanism_properties:
  - SERVICE_NAME: Drivers MUST allow the user to specify a different service name. The default is "mongodb".
  - CANONICALIZE_HOST_NAME: Drivers MAY allow the user to request canonicalization of the hostname. This might be required when the hosts report different hostnames than what is used in the kerberos database. The value is a string of either "none", "forward", or "forwardAndReverse". "none" is the default and performs no canonicalization. "forward" performs a forward DNS lookup to canonicalize the hostname. "forwardAndReverse" performs a forward DNS lookup and then a reverse lookup on that value to canonicalize the hostname. The driver MUST fall back to the provided host if any lookup errors or returns no results. Drivers MAY decide to also keep the legacy boolean values, where true equals the "forwardAndReverse" behaviour and false equals "none".
  - SERVICE_REALM: Drivers MAY allow the user to specify a different realm for the service. This might be necessary to support cross-realm authentication where the user exists in one realm and the service in another.
  - SERVICE_HOST: Drivers MAY allow the user to specify a different host for the service. This is stored in the service principal name instead of the standard host name. This is generally used for cases where the initial role is being created from localhost but the actual service host would differ.
Hostname Canonicalization
Valid values for CANONICALIZE_HOST_NAME are true, false, "none", "forward", and "forwardAndReverse". If a value is provided that does not match one of these, the driver MUST raise an error.

If CANONICALIZE_HOST_NAME is false, "none", or not provided, the driver MUST NOT canonicalize the host name.
If CANONICALIZE_HOST_NAME is true
, "forward", or "forwardAndReverse", the client MUST canonicalize the name of each
host it uses for authentication. There are two options. First, if the client's underlying GSSAPI library provides
hostname canonicalization, the client MAY rely on it. For example, MIT Kerberos has
a configuration option for canonicalization.
Second, the client MAY implement its own canonicalization. If so, the canonicalization algorithm MUST be:
addresses = fetch addresses for host
if no addresses:
    throw error
address = first result in addresses

while true:
    cnames = fetch CNAME records for host
    if no cnames:
        break
    # Unspecified which CNAME is used if > 1.
    host = one of the records in cnames

if forwardAndReverse or true:
    reversed = do a reverse DNS lookup for address
    canonicalized = lowercase(reversed)
else:
    canonicalized = lowercase(host)
For example, here is a Python implementation of this algorithm using getaddrinfo
(for address and CNAME resolution)
and getnameinfo
(for reverse DNS).
from socket import *
import sys

def canonicalize(host, mode):
    # Get a CNAME for host, if any.
    af, socktype, proto, canonname, sockaddr = getaddrinfo(
        host, None, 0, 0, IPPROTO_TCP, AI_CANONNAME)[0]
    print('address from getaddrinfo: [%s]' % (sockaddr[0],))
    print('canonical name from getaddrinfo: [%s]' % (canonname,))
    if mode is True or mode == 'forwardAndReverse':
        try:
            # NI_NAMEREQD requests an error if getnameinfo fails.
            name = getnameinfo(sockaddr, NI_NAMEREQD)
        except gaierror as exc:
            print('getnameinfo failed: "%s"' % (exc,))
            return canonname.lower()
        return name[0].lower()
    else:
        return canonname.lower()

mode = sys.argv[2] if len(sys.argv) > 2 else 'forward'
canonicalized = canonicalize(sys.argv[1], mode)
print('canonicalized: [%s]' % (canonicalized,))
Beware of a bug in older glibc where getaddrinfo
uses PTR records instead of CNAMEs if the address family hint is
AF_INET6, and beware of a bug in older MIT Kerberos that causes it to always do reverse DNS lookup even if the rdns
configuration option is set to false
.
PLAIN
- Since: 2.6 Enterprise
The PLAIN mechanism, as defined in RFC 4616, is used in MongoDB to perform LDAP
authentication. It cannot be used to perform any other type of authentication. Since the credentials are stored outside
of MongoDB, the $external
database must be used for authentication.
Conversation
As an example, given a username of "user" and a password of "pencil", the conversation would appear as follows:
CMD = {saslStart: 1, mechanism: "PLAIN", payload: BinData(0, "AHVzZXIAcGVuY2ls")}
RESP = {conversationId: 1, payload: BinData(0,""), done: true, ok: 1}
If your sasl client is also sending the authzid, it would be "user" and the conversation would appear as follows:
CMD = {saslStart: 1, mechanism: "PLAIN", payload: BinData(0, "dXNlcgB1c2VyAHBlbmNpbA==")}
RESP = {conversationId: 1, payload: BinData(0,""), done: true, ok: 1}
MongoDB supports either of these forms.
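A short Python sketch (helper name illustrative) that produces both payloads shown above:

```python
from base64 import b64encode

def plain_payload(username, password, authzid=""):
    # RFC 4616 message: [authzid] NUL authcid NUL password
    message = f"{authzid}\x00{username}\x00{password}".encode("utf-8")
    return b64encode(message)

# plain_payload("user", "pencil")         == b"AHVzZXIAcGVuY2ls"
# plain_payload("user", "pencil", "user") == b"dXNlcgB1c2VyAHBlbmNpbA=="
```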
MongoCredential Properties

- username: MUST be specified and non-zero length.
- source: MUST be specified. Defaults to the database name if supplied on the connection string or $external.
- password: MUST be specified.
- mechanism: MUST be "PLAIN"
- mechanism_properties: MUST NOT be specified.
SCRAM-SHA-1
- Since: 3.0
SCRAM-SHA-1 is defined in RFC 5802.
Page 11 of the RFC specifies that user names be prepared with SASLprep, but drivers MUST NOT do so.
Page 8 of the RFC identifies the "SaltedPassword" as := Hi(Normalize(password), salt, i). The password variable MUST be the MongoDB hashed variant, computed as hash = HEX( MD5( UTF8( username + ':mongo:' + plain_text_password ))), where plain_text_password is actually plain text. The username and password MUST NOT be prepared with SASLprep before hashing.
For example, to compute the ClientKey according to the RFC:
// note that "salt" and "i" have been provided by the server
function computeClientKey(username, plain_text_password) {
mongo_hashed_password = HEX( MD5( UTF8( username + ':mongo:' + plain_text_password )));
saltedPassword = Hi(Normalize(mongo_hashed_password), salt, i);
clientKey = HMAC(saltedPassword, "Client Key");
}
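A runnable Python equivalent of the pseudocode above; Hi() from RFC 5802 is PBKDF2 with HMAC as the pseudo-random function, which the standard library provides as hashlib.pbkdf2_hmac:

```python
import hashlib
import hmac

def scram_sha1_client_key(username, plain_text_password, salt, iterations):
    # MongoDB's SCRAM-SHA-1 password is the MD5 "mongo hashed" variant.
    mongo_hashed = hashlib.md5(
        f"{username}:mongo:{plain_text_password}".encode("utf-8")).hexdigest()
    # Hi(Normalize(password), salt, i) == PBKDF2-HMAC-SHA-1.
    salted_password = hashlib.pbkdf2_hmac(
        "sha1", mongo_hashed.encode("utf-8"), salt, iterations)
    return hmac.new(salted_password, b"Client Key", hashlib.sha1).digest()
```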
In addition, SCRAM-SHA-1 requires that a client create a randomly generated nonce. It is imperative, for security's sake, that this be as secure and truly random as possible. For instance, Java provides both a Random class as well as a SecureRandom class. SecureRandom is cryptographically strong, while Random is just a pseudo-random generator with predictable outcomes.
Additionally, drivers MUST enforce a minimum iteration count of 4096 and MUST error if the authentication conversation specifies a lower count. This mitigates downgrade attacks by a man-in-the-middle attacker.
Drivers MUST NOT advertise support for channel binding, as the server does not support it and legacy servers may fail authentication if drivers advertise support. That is, the client-first-message MUST start with "n,".
Drivers MUST add a top-level options field to the saslStart command, whose value is a document containing a field named skipEmptyExchange whose value is true. Older servers will ignore the options field and continue with the longer conversation as shown in the "Backwards Compatibility" section. Newer servers will set the done field to true when they respond to the client at the end of the second round trip, showing proof that they know the password. This shortens the conversation by one round trip.
Conversation
As an example, given a username of "user" and a password of "pencil" and an r value of "fyko+d2lbbFgONRv9qkxdawL", a SCRAM-SHA-1 conversation would appear as follows:
CMD = "n,,n=user,r=fyko+d2lbbFgONRv9qkxdawL"
RESP = "r=fyko+d2lbbFgONRv9qkxdawLHo+Vgk7qvUOKUwuWLIWg4l/9SraGMHEE,s=rQ9ZY3MntBeuP3E1TDVC4w==,i=10000"
CMD = "c=biws,r=fyko+d2lbbFgONRv9qkxdawLHo+Vgk7qvUOKUwuWLIWg4l/9SraGMHEE,p=MC2T8BvbmWRckDw8oWl5IVghwCY="
RESP = "v=UMWeI25JD1yNYZRMpZ4VHvhZ9e0="
This same conversation over MongoDB's SASL implementation would appear as follows:
CMD = {saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, "biwsbj11c2VyLHI9ZnlrbytkMmxiYkZnT05Sdjlxa3hkYXdM"), options: { skipEmptyExchange: true }}
RESP = {conversationId : 1, payload: BinData(0,"cj1meWtvK2QybGJiRmdPTlJ2OXFreGRhd0xIbytWZ2s3cXZVT0tVd3VXTElXZzRsLzlTcmFHTUhFRSxzPXJROVpZM01udEJldVAzRTFURFZDNHc9PSxpPTEwMDAw"), done: false, ok: 1}
CMD = {saslContinue: 1, conversationId: 1, payload: BinData(0, "Yz1iaXdzLHI9ZnlrbytkMmxiYkZnT05Sdjlxa3hkYXdMSG8rVmdrN3F2VU9LVXd1V0xJV2c0bC85U3JhR01IRUUscD1NQzJUOEJ2Ym1XUmNrRHc4b1dsNUlWZ2h3Q1k9")}
RESP = {conversationId: 1, payload: BinData(0,"dj1VTVdlSTI1SkQxeU5ZWlJNcFo0Vkh2aFo5ZTA9"), done: true, ok: 1}
MongoCredential Properties
- username: MUST be specified and non-zero length.
- source: MUST be specified. Defaults to the database name if supplied on the connection string or admin.
- password: MUST be specified.
- mechanism: MUST be "SCRAM-SHA-1"
- mechanism_properties: MUST NOT be specified.
SCRAM-SHA-256
- Since: 4.0
SCRAM-SHA-256 extends RFC 5802 and is formally defined in RFC 7677.
The MongoDB SCRAM-SHA-256 mechanism works similarly to the SCRAM-SHA-1 mechanism, with the following changes:
- The SCRAM algorithm MUST use SHA-256 as the hash function instead of SHA-1.
- User names MUST NOT be prepared with SASLprep. This intentionally contravenes the "SHOULD" provision of RFC 5802.
- Passwords MUST be prepared with SASLprep, per RFC 5802. Passwords are used directly for key derivation; they MUST NOT be digested as they are in SCRAM-SHA-1.
Additionally, drivers MUST enforce a minimum iteration count of 4096 and MUST error if the authentication conversation specifies a lower count. This mitigates downgrade attacks by a man-in-the-middle attacker.
Drivers MUST add a top-level options field to the saslStart command, whose value is a document containing a field named skipEmptyExchange whose value is true. Older servers will ignore the options field and continue with the longer conversation as shown in the "Backwards Compatibility" section. Newer servers will set the done field to true when they respond to the client at the end of the second round trip, showing proof that they know the password. This shortens the conversation by one round trip.
Conversation
As an example, given a username of "user" and a password of "pencil" and an r value of "rOprNGfwEbeRWgbNEkqO", a SCRAM-SHA-256 conversation would appear as follows:
CMD = "n,,n=user,r=rOprNGfwEbeRWgbNEkqO"
RESP = "r=rOprNGfwEbeRWgbNEkqO%hvYDpWUa2RaTCAfuxFIlj)hNlF$k0,s=W22ZaJ0SNY7soEsUEjb6gQ==,i=4096"
CMD = "c=biws,r=rOprNGfwEbeRWgbNEkqO%hvYDpWUa2RaTCAfuxFIlj)hNlF$k0,p=dHzbZapWIk4jUhN+Ute9ytag9zjfMHgsqmmiz7AndVQ="
RESP = "v=6rriTRBi23WpRR/wtup+mMhUZUn/dB5nLTJRsjl95G4="
This same conversation over MongoDB's SASL implementation would appear as follows:
CMD = {saslStart: 1, mechanism:"SCRAM-SHA-256", options: {skipEmptyExchange: true}, payload: BinData(0, "biwsbj11c2VyLHI9ck9wck5HZndFYmVSV2diTkVrcU8=")}
RESP = {conversationId: 1, payload: BinData(0, "cj1yT3ByTkdmd0ViZVJXZ2JORWtxTyVodllEcFdVYTJSYVRDQWZ1eEZJbGopaE5sRiRrMCxzPVcyMlphSjBTTlk3c29Fc1VFamI2Z1E9PSxpPTQwOTY="), done: false, ok: 1}
CMD = {saslContinue: 1, conversationId: 1, payload: BinData(0, "Yz1iaXdzLHI9ck9wck5HZndFYmVSV2diTkVrcU8laHZZRHBXVWEyUmFUQ0FmdXhGSWxqKWhObEYkazAscD1kSHpiWmFwV0lrNGpVaE4rVXRlOXl0YWc5empmTUhnc3FtbWl6N0FuZFZRPQ==")}
RESP = {conversationId: 1, payload: BinData(0, "dj02cnJpVFJCaTIzV3BSUi93dHVwK21NaFVaVW4vZEI1bkxUSlJzamw5NUc0PQ=="), done: true, ok: 1}
MongoCredential Properties
- username: MUST be specified and non-zero length.
- source: MUST be specified. Defaults to the database name if supplied on the connection string or admin.
- password: MUST be specified.
- mechanism: MUST be "SCRAM-SHA-256"
- mechanism_properties: MUST NOT be specified.
MONGODB-AWS
- Since: 4.4
MONGODB-AWS authenticates using AWS IAM credentials (an access key ID and a secret access key), temporary AWS IAM credentials obtained from an AWS Security Token Service (STS) Assume Role request, an OpenID Connect ID token that supports AssumeRoleWithWebIdentity, or temporary AWS IAM credentials assigned to an EC2 instance or ECS task. Temporary credentials, in addition to an access key ID and a secret access key, include a security (or session) token.
MONGODB-AWS requires that a client create a randomly generated nonce. It is imperative, for security's sake, that this be as secure and truly random as possible. Additionally, the secret access key (and only the secret access key) is sensitive; drivers MUST take proper precautions to ensure they do not leak this information.
All messages between MongoDB clients and servers are sent as BSON V1.1 Objects in the payload field of saslStart and saslContinue. All fields in these messages have a "short name" which is used in the serialized BSON representation and a human-readable "friendly name" which is used in this specification. They are as follows:
| Name | Friendly Name | Type | Description |
|------|---------------|------|-------------|
| r | client nonce | BinData Subtype 0 | 32 byte cryptographically secure random number |
| p | gs2-cb-flag | int32 | The integer representation of the ASCII character 'n' or 'y', i.e., 110 or 121 |
| s | server nonce | BinData Subtype 0 | 64 bytes total, 32 bytes from the client first message and a 32 byte cryptographically secure random number generated by the server |
| h | sts host | string | FQDN of the STS service |
| a | authorization header | string | Authorization header for AWS Signature Version 4 |
| d | X-AMZ-Date | string | Current date in UTC. See AWS Signature Version 4 |
| t | X-AMZ-Security-Token | string | Optional AWS security token |
Drivers MUST NOT advertise support for channel binding, as the server does not support it and legacy servers may fail authentication if drivers advertise support. The client-first-message MUST set the gs2-cb-flag to the integer representation of the ASCII character n, i.e., 110.
Conversation
The first message sent by drivers MUST contain a client nonce and gs2-cb-flag. In response, the server will send a server nonce and sts host. Drivers MUST validate that the server nonce is exactly 64 bytes and the first 32 bytes are the same as the client nonce. Drivers MUST also validate that the length of the host is greater than 0 and less than or equal to 255 bytes per RFC 1035. Drivers MUST reject FQDN names with empty labels (e.g., "abc..def"), names that start with a period (e.g., ".abc.def") and names that end with a period (e.g., "abc.def."). Drivers MUST respond to the server's message with an authorization header and a date.
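A minimal sketch of these validation rules (function name illustrative):

```python
def validate_server_first(client_nonce, server_nonce, sts_host):
    if len(server_nonce) != 64 or server_nonce[:32] != client_nonce:
        raise ValueError("invalid server nonce")
    if not 0 < len(sts_host) <= 255:
        raise ValueError("STS host length must be in (0, 255] bytes")
    # An empty label catches "abc..def", ".abc.def", and "abc.def.".
    if any(label == "" for label in sts_host.split(".")):
        raise ValueError("invalid FQDN")
```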
As an example, given a client nonce value of "dzw1U2IwSEtgaWI0IUxZMVJqc2xuQzNCcUxBc05wZjI=", a MONGODB-AWS conversation decoded from BSON to JSON would appear as follows:
Client First
{
"r" : new BinData(0, "dzw1U2IwSEtgaWI0IUxZMVJqc2xuQzNCcUxBc05wZjI="),
"p" : 110
}
Server First
{
"s" : new BinData(0, "dzw1U2IwSEtgaWI0IUxZMVJqc2xuQzNCcUxBc05wZjIGS0J9EgLwzEZ9dIzr/hnnK2mgd4D7F52t8g9yTC5cIA=="),
"h" : "sts.amazonaws.com"
}
Client Second
{
"a" : "AWS4-HMAC-SHA256 Credential=AKIAICGVLKOKZVY3X3DA/20191107/us-east-1/sts/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date;x-mongodb-gs2-cb-flag;x-mongodb-server-nonce, Signature=ab62ce1c75f19c4c8b918b2ed63b46512765ed9b8bb5d79b374ae83eeac11f55",
"d" : "20191107T002607Z"
"t" : "<security_token>"
}
Note that X-AMZ-Security-Token
is required when using temporary credentials. When using regular credentials, it MUST
be omitted. Each message above will be encoded as BSON V1.1 objects and sent to the peer as the value of payload
.
Therefore, the SASL conversation would appear as:
Client First
{
"saslStart" : 1,
"mechanism" : "MONGODB-AWS"
"payload" : new BinData(0, "NAAAAAVyACAAAAAAWj0lSjp8M0BMKGU+QVAzRSpWfk0hJigqO1V+b0FaVz4QcABuAAAAAA==")
}
Server First
{
"conversationId" : 1,
"done" : false,
"payload" : new BinData(0, "ZgAAAAVzAEAAAAAAWj0lSjp8M0BMKGU+QVAzRSpWfk0hJigqO1V+b0FaVz5Rj7x9UOBHJLvPgvgPS9sSzZUWgAPTy8HBbI1cG1WJ9gJoABIAAABzdHMuYW1hem9uYXdzLmNvbQAA"),
"ok" : 1.0
}
Client Second:
{
"saslContinue" : 1,
"conversationId" : 1,
"payload" : new BinData(0, "LQEAAAJhAAkBAABBV1M0LUhNQUMtU0hBMjU2IENyZWRlbnRpYWw9QUtJQUlDR1ZMS09LWlZZM1gzREEvMjAxOTExMTIvdXMtZWFzdC0xL3N0cy9hd3M0X3JlcXVlc3QsIFNpZ25lZEhlYWRlcnM9Y29udGVudC1sZW5ndGg7Y29udGVudC10eXBlO2hvc3Q7eC1hbXotZGF0ZTt4LW1vbmdvZGItZ3MyLWNiLWZsYWc7eC1tb25nb2RiLXNlcnZlci1ub25jZSwgU2lnbmF0dXJlPThhMTI0NGZjODYyZTI5YjZiZjc0OTFmMmYwNDE5NDY2ZGNjOTFmZWU1MTJhYTViM2ZmZjQ1NDY3NDEwMjJiMmUAAmQAEQAAADIwMTkxMTEyVDIxMDEyMloAAA==")
}
In response to the Server First message, drivers MUST send an authorization header. Drivers MUST follow the Signature Version 4 Signing Process to calculate the signature for the authorization header. The required and optional headers and their associated values drivers MUST use for the canonical request (see Summary of Signing Steps) are specified in the table below. The following pseudocode shows the construction of the Authorization header.
Authorization: algorithm Credential=access key ID/credential scope, SignedHeaders=SignedHeaders, Signature=signature
The following example shows a finished Authorization header.
Authorization: AWS4-HMAC-SHA256 Credential=AKIDEXAMPLE/20150830/us-east-1/iam/aws4_request, SignedHeaders=content-type;host;x-amz-date, Signature=5d672d79c15b13162d9279b0855cfba6789a8edb4c82c400e06b5924a6f2b5d7
The following table summarizes the values drivers MUST use to calculate the signature.
| Name | Value |
|------|-------|
| HTTP Request Method | POST |
| URI | / |
| Content-Type* | application/x-www-form-urlencoded |
| Content-Length* | 43 |
| Host* | Host field from Server First Message |
| Region | Derived from Host - see Region Calculation below |
| X-Amz-Date* | See Amazon Documentation |
| X-Amz-Security-Token* | Optional, see Amazon Documentation |
| X-MongoDB-Server-Nonce* | Base64 string of server nonce |
| X-MongoDB-GS2-CB-Flag* | ASCII lower-case character 'n' or 'y' or 'p' |
| X-MongoDB-Optional-Data* | Optional data, base64 encoded representation of the optional object provided by the client |
| Body | Action=GetCallerIdentity&Version=2011-06-15 |
[!NOTE]
* denotes a header that MUST be included in SignedHeaders, if present.
Region Calculation
To get the region from the host, the driver MUST follow the algorithm expressed in pseudocode below:

if the host is invalid according to the rules described earlier
    the region is undefined and the driver must raise an error
else if the host is "sts.amazonaws.com"
    the region is "us-east-1"
else if the host contains the character '.' (a period)
    split the host by its periods. The region is the second label
else // the valid host string contains no periods and is not "sts.amazonaws.com"
    the region is "us-east-1"
Examples are provided below.
| Host | Region | Notes |
|------|--------|-------|
| sts.amazonaws.com | us-east-1 | the host is "sts.amazonaws.com"; use us-east-1 |
| sts.us-west-2.amazonaws.com | us-west-2 | use the second label |
| sts.us-west-2.amazonaws.com.ch | us-west-2 | use the second label |
| example.com | com | use the second label |
| localhost | us-east-1 | no "." character; use the default region |
| sts..com | < Error > | second label is empty |
| .amazonaws.com | < Error > | starts with a period |
| sts.amazonaws. | < Error > | ends with a period |
| "" | < Error > | empty string |
| "string longer than 255" | < Error > | string longer than 255 bytes |
MongoCredential Properties
- username: MAY be specified. The non-sensitive AWS access key.
- source: MUST be "$external". Defaults to $external.
- password: MAY be specified. The sensitive AWS secret key.
- mechanism: MUST be "MONGODB-AWS"
- mechanism_properties:
  - AWS_SESSION_TOKEN: Drivers MUST allow the user to specify an AWS session token for authentication with temporary credentials.
Obtaining Credentials
Drivers will need AWS IAM credentials (an access key, a secret access key and optionally a session token) to complete the steps in the Signature Version 4 Signing Process. If a username and password are provided, drivers MUST use these for the AWS IAM access key and AWS IAM secret key, respectively. If, additionally, a session token is provided, Drivers MUST use it as well. If a username is provided without a password (or vice versa), or if only a session token is provided, Drivers MUST raise an error. In other words, regardless of how Drivers obtain credentials, the only valid combinations of credentials are an access key ID and a secret access key, or an access key ID, a secret access key and a session token.
AWS recommends using an SDK to "take care of some of the heavy lifting necessary in successfully making API calls, including authentication, retry behavior, and more".
A recommended pattern for drivers with existing custom implementation is to not further enhance existing implementations, and take an optional dependency on the AWS SDK. If the SDK is available, use it, otherwise fallback to the existing implementation.
One thing to be mindful of when adopting an AWS SDK is that they typically will check for credentials in a shared AWS credentials file when one is present, which may be confusing for users relying on the previous authentication handling behavior. It would be helpful to include a note like the following:
"Because we are now using the AWS SDK to handle credentials, if you have a shared AWS credentials or config file, then
those credentials will be used by default if AWS auth environment variables are not set. To override this behavior, set
AWS_SHARED_CREDENTIALS_FILE=""
in your shell or set the equivalent environment variable value in your script or
application. Alternatively, you can create an AWS profile specifically for your MongoDB credentials and set the
AWS_PROFILE
environment variable to that profile name."
The order in which Drivers MUST search for credentials is:

- The URI
- Environment variables
- Using AssumeRoleWithWebIdentity if AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN are set.
- The ECS endpoint if AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is set. Otherwise, the EC2 endpoint.
[!NOTE]
See Should drivers support accessing Amazon EC2 instance metadata in Amazon ECS in Q & A. Drivers are not expected to handle AssumeRole requests directly. See the description of AssumeRole below, which is distinct from AssumeRoleWithWebIdentity requests that are meant to be handled directly by the driver.
URI
An example URI for authentication with MONGODB-AWS using AWS IAM credentials passed through the URI is as follows:
"mongodb://<access_key>:<secret_key>@mongodb.example.com/?authMechanism=MONGODB-AWS"
Users MAY have obtained temporary credentials through an
AssumeRole request. If so, then in addition
to a username and password, users MAY also provide an AWS_SESSION_TOKEN
as a mechanism_property
.
"mongodb://<access_key>:<secret_key>@mongodb.example.com/?authMechanism=MONGODB-AWS&authMechanismProperties=AWS_SESSION_TOKEN:<security_token>"
Environment variables
AWS Lambda runtimes set several environment variables during initialization. To support AWS Lambda runtimes, Drivers MUST check a subset of these variables, i.e., AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN, for the access key ID, secret access key and session token, respectively, if AWS credentials are not explicitly provided in the URI. The AWS_SESSION_TOKEN may or may not be set. However, if AWS_SESSION_TOKEN is set, Drivers MUST use its value as the session token. Drivers implemented in programming languages that support altering environment variables MUST always read environment variables dynamically during authorization, to handle the case where another part of the application has refreshed the credentials.

However, if environment variables are not present during initial authorization, credentials may be fetched from another source and cached. Even if the environment variables are present in subsequent authorization attempts, the driver MUST use the cached credentials, or refresh them if applicable. This behavior is consistent with how the AWS SDKs behave.
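A minimal sketch of the dynamic environment read (names illustrative):

```python
import os

def credentials_from_environment():
    # Read on every authorization attempt, not once at client startup, so a
    # refresh by another part of the application is picked up.
    access_key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY")
    session_token = os.environ.get("AWS_SESSION_TOKEN")  # may be absent
    if access_key and secret_key:
        return access_key, secret_key, session_token
    return None  # fall through to AssumeRoleWithWebIdentity / ECS / EC2
```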
AssumeRoleWithWebIdentity
AWS EKS clusters can be configured to automatically provide a valid OpenID Connect ID token and associated role ARN. These can be exchanged for temporary credentials using an AssumeRoleWithWebIdentity request.
If the AWS_WEB_IDENTITY_TOKEN_FILE
and AWS_ROLE_ARN
environment variables are set, drivers MUST make an
AssumeRoleWithWebIdentity
request to obtain temporary credentials. AWS recommends using an AWS Software Development
Kit (SDK) to make STS requests.
The WebIdentityToken
value is obtained by reading the contents of the file given by AWS_WEB_IDENTITY_TOKEN_FILE
. The
RoleArn
value is obtained from AWS_ROLE_ARN
. If AWS_ROLE_SESSION_NAME
is set, it MUST be used for the
RoleSessionName
parameter, otherwise a suitable random name can be chosen. No other request parameters need to be set
if using an SDK.
If not using an AWS SDK, the request must be made manually. If making a manual request, the Version
should be
specified as well. An example manual POST request looks like the following:
https://sts.amazonaws.com/
?Action=AssumeRoleWithWebIdentity
&RoleSessionName=app1
&RoleArn=<role_arn>
&WebIdentityToken=<token_file_contents>
&Version=2011-06-15
with the header:
Accept: application/json
The JSON response from the STS endpoint will contain credentials in this format:
{
"Credentials": {
"AccessKeyId": <access_key>,
"Expiration": <date>,
"RoleArn": <assumed_role_arn>,
"SecretAccessKey": <secret_access_key>,
"SessionToken": <session_token>
}
}
Note that the token is called SessionToken
and not Token
as it would be with other credential responses.
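A sketch of a manual request using only the standard library, assuming the request and response shapes shown above (an AWS SDK, when available, is the preferred route):

```python
import json
import os
import urllib.parse
import urllib.request
import uuid

def assume_role_with_web_identity():
    with open(os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"]) as f:
        token = f.read().strip()
    params = {
        "Action": "AssumeRoleWithWebIdentity",
        "RoleSessionName": os.environ.get("AWS_ROLE_SESSION_NAME") or uuid.uuid4().hex,
        "RoleArn": os.environ["AWS_ROLE_ARN"],
        "WebIdentityToken": token,
        "Version": "2011-06-15",
    }
    url = "https://sts.amazonaws.com/?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(url, data=b"", headers={"Accept": "application/json"})
    with urllib.request.urlopen(request) as response:  # POST, as described above
        credentials = json.load(response)["Credentials"]
    # Note the key is "SessionToken", unlike the ECS and EC2 responses.
    return (credentials["AccessKeyId"], credentials["SecretAccessKey"],
            credentials["SessionToken"])
```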
ECS endpoint
If a username and password are not provided and the aforementioned environment variables are not set, drivers MUST query
a link-local AWS address for temporary credentials. If temporary credentials cannot be obtained then drivers MUST fail
authentication and raise an error. Drivers SHOULD enforce a 10 second read timeout while waiting for incoming content
from both the ECS and EC2 endpoints. If the environment variable AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
is set then
drivers MUST assume that it was set by an AWS ECS agent and use the URI
http://169.254.170.2/$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
to obtain temporary credentials. Querying the URI will
return the JSON response:
{
"AccessKeyId": <access_key>,
"Expiration": <date>,
"RoleArn": <task_role_arn>,
"SecretAccessKey": <secret_access_key>,
"Token": <security_token>
}
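A sketch of the ECS query using only the standard library (function name illustrative):

```python
import json
import os
import urllib.request

def credentials_from_ecs():
    # AWS_CONTAINER_CREDENTIALS_RELATIVE_URI conventionally begins with "/".
    url = "http://169.254.170.2" + os.environ["AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"]
    # Drivers SHOULD enforce a 10 second read timeout on this request.
    with urllib.request.urlopen(url, timeout=10) as response:
        body = json.load(response)
    return body["AccessKeyId"], body["SecretAccessKey"], body["Token"]
```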
EC2 endpoint
If the environment variable AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is unset, drivers MUST use the EC2 endpoint

http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-name>

with the required header

X-aws-ec2-metadata-token: <secret-token>

to access the EC2 instance's metadata. Drivers MUST obtain the role name from querying the URI

http://169.254.169.254/latest/meta-data/iam/security-credentials/

The role name request also requires the header X-aws-ec2-metadata-token. Drivers MUST use v2 of the EC2 Instance Metadata Service (IMDSv2) to access the secret token. In other words, Drivers MUST:
- Start a session with a simple HTTP PUT request to IMDSv2.
  - The URL is http://169.254.169.254/latest/api/token.
  - The required header is X-aws-ec2-metadata-token-ttl-seconds. Its value is the number of seconds the secret token should remain valid, with a max of six hours (21600 seconds).
- Capture the secret token IMDSv2 returned as a response to the PUT request. This token is the value for the header X-aws-ec2-metadata-token.
The curl recipe below demonstrates the above. It retrieves a secret token that's valid for 30 seconds. It then uses that token to access the EC2 instance's credentials:
$ TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30"`
$ ROLE_NAME=`curl http://169.254.169.254/latest/meta-data/iam/security-credentials/ -H "X-aws-ec2-metadata-token: $TOKEN"`
$ curl http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE_NAME -H "X-aws-ec2-metadata-token: $TOKEN"
Drivers can test this process using the mock EC2 server in
mongo-enterprise-modules.
The script must be run with python3
:
python3 ec2_metadata_http_server.py
To re-direct queries from the EC2 endpoint to the mock server, replace the link-local address (http://169.254.169.254
)
with the IP and port of the mock server (by default, http://localhost:8000
). For example, the curl script above
becomes:
$ TOKEN=`curl -X PUT "http://localhost:8000/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30"`
$ ROLE_NAME=`curl http://localhost:8000/latest/meta-data/iam/security-credentials/ -H "X-aws-ec2-metadata-token: $TOKEN"`
$ curl http://localhost:8000/latest/meta-data/iam/security-credentials/$ROLE_NAME -H "X-aws-ec2-metadata-token: $TOKEN"
The JSON response from both the actual and mock EC2 endpoint will be in this format:
{
"Code": "Success",
"LastUpdated" : <date>,
"Type": "AWS-HMAC",
"AccessKeyId" : <access_key>,
"SecretAccessKey": <secret_access_key>,
"Token" : <security_token>,
"Expiration": <date>
}
From the JSON response drivers MUST obtain the access_key
, secret_key
and security_token
which will be used during
the
Signature Version 4 Signing Process.
Caching Credentials
Credentials fetched by the driver using AWS endpoints MUST be cached and reused to avoid hitting AWS rate limitations. AWS recommends using a suitable Software Development Kit (SDK) for your language. If that SDK supports credential fetch and automatic refresh/caching, then that mechanism can be used in lieu of manual caching.
If using manual caching, the "Expiration" field MUST be stored and used to determine when to clear the cache. Credentials are considered valid if they are more than five minutes away from expiring, to reduce the chance of expiration before they are validated by the server. Credentials that are retrieved from environment variables MUST NOT be cached.
If there are no current valid cached credentials, the driver MUST initiate a credential request. To avoid adding a bottleneck that would override the maxConnecting setting, the driver MUST NOT place a lock on making a request. The cache MUST be written atomically.
If AWS authentication fails for any reason, the cache MUST be cleared.
[!NOTE]
Five minutes was chosen based on the AWS documentation for IAM roles for EC2: "We make new credentials available at least five minutes before the expiration of the old credentials". The intent is to have some buffer between when the driver fetches the credentials and when the server verifies them.
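A minimal sketch of a manual cache honoring the five-minute window (names illustrative; a single reference assignment keeps the write atomic in this sketch):

```python
import datetime

_cached = None  # (access_key, secret_key, session_token, expiration) or None

def get_cached_credentials():
    global _cached
    if _cached is not None:
        now = datetime.datetime.now(datetime.timezone.utc)
        # Valid only while more than five minutes away from expiring.
        if _cached[3] - now > datetime.timedelta(minutes=5):
            return _cached[:3]
        _cached = None
    return None

def store_credentials(credentials):
    # No lock is taken around the fetch itself, so maxConnecting is not
    # bottlenecked; the cache write is a single atomic assignment.
    global _cached
    _cached = credentials

def clear_cache():
    # Called whenever AWS authentication fails for any reason.
    global _cached
    _cached = None
```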
MONGODB-OIDC
- Since: 7.0 Enterprise
MONGODB-OIDC authenticates using an OpenID Connect (OIDC) access token.
There are two OIDC authentication flows that drivers can support: machine-to-machine ("machine") and human-in-the-loop ("human"). Drivers MUST support the machine authentication flow. Drivers MAY support the human authentication flow.
The MONGODB-OIDC specification refers to the following OIDC concepts:
- Identity Provider (IdP): A service that manages user accounts and authenticates users or applications, such as Okta or OneLogin. In the Human Authentication Flow, the OIDC Human Callback interacts directly with the IdP. In the Machine Authentication Flow, only the MongoDB server interacts directly with the IdP.
- Access token: Used to authenticate requests to protected resources. OIDC access tokens are signed JWT strings.
- Refresh token: Some OIDC providers may return a refresh token in addition to an access token. A refresh token can be used to retrieve new access tokens without requiring a human to re-authorize the application. Refresh tokens are typically only supported by the Human Authentication Flow.
Machine Authentication Flow
The machine authentication flow is intended to be used in cases where human interaction is not necessary or practical, such as to authenticate database access for a web service. Some OIDC documentation refers to the machine authentication flow as "workload authentication".
Drivers MUST implement all behaviors described in the MONGODB-OIDC specification, unless the section or block specifically says that it only applies to the Human Authentication Flow.
Human Authentication Flow
The human authentication flow is intended to be used for applications that involve direct human interaction, such as database tools or CLIs. Some OIDC documentation refers to the human authentication flow as "workforce authentication".
Drivers that support the Human Authentication Flow MUST implement all behaviors described in the MONGODB-OIDC specification, including sections or blocks that specifically say that they only apply to the Human Authentication Flow.
MongoCredential Properties
- username: MAY be specified. Its meaning varies depending on the OIDC provider integration used.
- source: MUST be "$external". Defaults to $external.
- password: MUST NOT be specified.
- mechanism: MUST be "MONGODB-OIDC"
- mechanism_properties:
  - ENVIRONMENT: Drivers MUST allow the user to specify the name of a built-in OIDC application environment integration to use to obtain credentials. If provided, the value MUST be one of ["test", "azure", "gcp", "k8s"]. If both ENVIRONMENT and an OIDC Callback or OIDC Human Callback are provided for the same MongoClient, the driver MUST raise an error.
  - TOKEN_RESOURCE: The URI of the target resource. If TOKEN_RESOURCE is provided and ENVIRONMENT is not one of ["azure", "gcp"], or TOKEN_RESOURCE is not provided and ENVIRONMENT is one of ["azure", "gcp"], the driver MUST raise an error. Note: because the TOKEN_RESOURCE is often itself a URL, drivers MUST document that a TOKEN_RESOURCE containing a comma (,) must be given as a MongoClient configuration and not as part of the connection string, and that the TOKEN_RESOURCE value can contain a colon (:) character.
  - OIDC_CALLBACK: An OIDC Callback that returns OIDC credentials. Drivers MAY allow the user to specify an OIDC Callback using a MongoClient configuration instead of a mechanism property, depending on what is idiomatic for the driver. Drivers MUST NOT support both the OIDC_CALLBACK mechanism property and a MongoClient configuration.
  - OIDC_HUMAN_CALLBACK: An OIDC Human Callback that returns OIDC credentials. Drivers MAY allow the user to specify an OIDC Human Callback using a MongoClient configuration instead of a mechanism property, depending on what is idiomatic for the driver. Drivers MUST NOT support both the OIDC_HUMAN_CALLBACK mechanism property and a MongoClient configuration. Drivers MUST return an error if both an OIDC Callback and OIDC Human Callback are provided for the same MongoClient. This property is only required for drivers that support the Human Authentication Flow.
  - ALLOWED_HOSTS: The list of allowed hostnames or IP addresses (ignoring ports) for MongoDB connections. The hostnames may include a leading "*." wildcard, which allows for matching (potentially nested) subdomains. ALLOWED_HOSTS is a security feature and MUST default to ["*.mongodb.net", "*.mongodb-qa.net", "*.mongodb-dev.net", "*.mongodbgov.net", "localhost", "127.0.0.1", "::1"]. When MONGODB-OIDC authentication using an OIDC Human Callback is attempted against a hostname that does not match any of the list of allowed hosts, the driver MUST raise a client-side error without invoking any user-provided callbacks. This value MUST NOT be allowed in the URI connection string. The hostname check MUST be performed after SRV record resolution, if applicable. This property is only required for drivers that support the Human Authentication Flow. A sketch of the wildcard matching appears after this list.
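A minimal sketch of the wildcard host check, assuming the default list above (function name illustrative):

```python
DEFAULT_ALLOWED_HOSTS = ["*.mongodb.net", "*.mongodb-qa.net", "*.mongodb-dev.net",
                         "*.mongodbgov.net", "localhost", "127.0.0.1", "::1"]

def host_allowed(hostname, allowed_hosts=DEFAULT_ALLOWED_HOSTS):
    # Ports are ignored; the hostname is checked after SRV resolution.
    for pattern in allowed_hosts:
        if pattern.startswith("*."):
            # "*.mongodb.net" matches "a.mongodb.net" and "a.b.mongodb.net",
            # but not "mongodb.net" itself.
            if hostname.endswith(pattern[1:]):
                return True
        elif hostname == pattern:
            return True
    return False
```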
Built-in OIDC Environment Integrations
Drivers MUST support all of the following built-in OIDC application environment integrations.
Test
The test integration is enabled by setting auth mechanism property ENVIRONMENT:test
. It is meant for driver testing
purposes, and is not meant to be documented as a user-facing feature.
If enabled, drivers MUST generate a token using a script in the auth_oidc folder in Drivers Evergreen Tools. The driver MUST then set the OIDC_TOKEN_FILE environment variable to the path to that token file. At runtime, the driver MUST use the OIDC_TOKEN_FILE environment variable and read the OIDC access token from that path. The driver MUST use the contents of that file as the value in the jwt field of the saslStart payload.
Drivers MAY implement the "test" integration so that it conforms to the function signature of the OIDC Callback to prevent having to re-implement the "test" integration logic in the OIDC prose tests.
Azure
The Azure provider integration is enabled by setting auth mechanism property ENVIRONMENT:azure
.
If enabled, drivers MUST use an internal machine callback that calls the Azure Instance Metadata Service and parse the JSON response body, as follows:
Make an HTTP GET request to
http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=<resource>&client_id=<client_id>
with headers
Accept: application/json
Metadata: true
where <resource>
is the url-encoded value of the TOKEN_RESOURCE
mechanism property and <client_id>
is the
username
from the connection string. If a username
is not provided, the client_id
query parameter should be
omitted. The timeout should equal the callbackTimeoutMS
parameter given to the callback.
curl -X GET \
-H "Accept: application/json" \
-H "Metadata: true" \
--max-time $CALLBACK_TIMEOUT_MS \
"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=$ENCODED_TOKEN_RESOURCE"
The JSON response will be in this format:
{
"access_token": "eyJ0eXAi...",
"refresh_token": "",
"expires_in": "3599",
"expires_on": "1506484173",
"not_before": "1506480273",
"resource": "https://management.azure.com/",
"token_type": "Bearer"
}
The driver MUST use the returned "access_token"
value as the access token in a JwtStepRequest
. If the response does
not return a status code of 200, the driver MUST raise an error including the HTTP response body.
For more details, see How to use managed identities for Azure resources on an Azure VM to acquire an access token.
The callback itself MUST NOT perform any caching, and the driver MUST cache its tokens in the same way as if a custom callback had been provided by the user.
For details on test environment setup, see the README in Drivers-Evergreen-Tools.
GCP
The GCP provider integration is enabled by setting auth mechanism property ENVIRONMENT:gcp
.
If enabled, drivers MUST use an internal machine callback that calls the Google Cloud VM metadata endpoint and parse the JSON response body, as follows:
Make an HTTP GET request to
http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=<resource>
with headers
Metadata-Flavor: Google
where <resource>
is the url-encoded value of the TOKEN_RESOURCE
mechanism property. The timeout should equal the
callbackTimeoutMS
parameter given to the callback.
curl -X GET \
-H "Metadata-Flavor: Google" \
--max-time $CALLBACK_TIMEOUT_MS \
"http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=$ENCODED_TOKEN_RESOURCE"
The response body will be the access token itself.
The driver MUST use the returned value as the access token in a JwtStepRequest
. If the response does not return a
status code of 200, the driver MUST raise an error including the HTTP response body.
For more details, see View and query VM metadata.
The callback itself MUST NOT perform any caching, and the driver MUST cache its tokens in the same way as if a custom callback had been provided by the user.
For details on test environment setup, see the README in Drivers-Evergreen-Tools.
Kubernetes
The Kubernetes integration is enabled by setting auth mechanism property ENVIRONMENT:k8s
. In this configuration, the
driver is expected to be running inside a Kubernetes environment with a configured
ServiceAccount.
If enabled, drivers MUST read the contents of the token from the local file path found using the following algorithm:
import os

if 'AZURE_FEDERATED_TOKEN_FILE' in os.environ:
    fname = os.environ['AZURE_FEDERATED_TOKEN_FILE']
elif 'AWS_WEB_IDENTITY_TOKEN_FILE' in os.environ:
    fname = os.environ['AWS_WEB_IDENTITY_TOKEN_FILE']
else:
    fname = '/var/run/secrets/kubernetes.io/serviceaccount/token'
Where AZURE_FEDERATED_TOKEN_FILE
contains the file path on Azure Kubernetes Service (AKS),
AWS_WEB_IDENTITY_TOKEN_FILE
contains the file path on Elastic Kubernetes Service (EKS), and
/var/run/secrets/kubernetes.io/serviceaccount/token
is the default path for a Kubernetes
ServiceAccount token,
which is used by Google Kubernetes Engine (GKE).
The callback itself MUST NOT perform any caching, and the driver MUST cache its tokens in the same way as if a custom callback had been provided by the user.
For details on test environment setup, see the README in Drivers-Evergreen-Tools.
OIDC Callback
Drivers MUST allow users to provide a callback that returns an OIDC access token. The purpose of the callback is to allow users to integrate with OIDC providers not supported by the Built-in Provider Integrations. Callbacks can be synchronous or asynchronous, depending on the driver and/or language. Asynchronous callbacks should be preferred when other operations in the driver use asynchronous functions.
Drivers MUST provide a way for the callback to be either automatically canceled, or to cancel itself. This can be a timeout argument to the callback, a cancellation context passed to the callback, or some other language-appropriate mechanism. The timeout value MUST be min(remaining connectTimeoutMS, remaining timeoutMS) as described in the Server Selection section of the CSOT spec. If CSOT is not applied, then the driver MUST use 1 minute as the timeout.
The driver MUST pass the following information to the callback:
- timeout: A timeout, in milliseconds, a deadline, or a timeoutContext.
- username: The username given as part of the connection string or MongoClient parameter.
- version: The callback API version number. The version number is used to communicate callback API changes that are not breaking but that users may want to know about and review their implementation. Drivers MUST pass 1 for the initial callback API version number and increment the version number anytime the API changes. Note that this may eventually lead to some drivers having different callback version numbers. For example, users may add the following check in their callback:

  if(params.version > 1) { throw new Error("OIDC callback API has changed!"); }
The callback MUST be able to return the following information:
- accessToken: An OIDC access token string. The driver MUST NOT attempt to validate accessToken directly.
- expiresIn: An optional expiry duration for the access token. Drivers with optional parameters MAY interpret a missing value as infinite. Drivers MUST error if a negative value is returned. Drivers SHOULD use the most idiomatic type for representing a duration in the driver's language. Note that the access token expiry value is currently not used in Credential Caching, but is intended to support future caching optimizations.
The signature and naming of the callback API is up to the driver's discretion. Drivers MUST ensure that additional optional input parameters and return values can be added to the callback signature in the future without breaking backward compatibility.
An example callback API might look like:
interface OIDCCallbackParams {
callbackTimeoutMS: int;
username: str;
version: int;
}
interface OIDCCredential {
accessToken: string;
expiresInSeconds: Optional<int>;
}
function oidcCallback(params: OIDCCallbackParams): OIDCCredential
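A user-provided machine callback in Python might then look like the following; the parameter and field names follow the hypothetical API above, and the token source (a pre-loaded file path in OIDC_TOKEN_FILE) is just one possible arrangement:

```python
import os

def oidc_callback(params):
    # params carries callbackTimeoutMS, username, and version (unused here).
    with open(os.environ["OIDC_TOKEN_FILE"]) as f:
        access_token = f.read().strip()
    # The callback MUST NOT attempt to validate the token itself.
    return {"accessToken": access_token, "expiresInSeconds": None}
```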
OIDC Human Callback
The human callback is an OIDC callback that includes additional information that is required when using the Human Authentication Flow. Drivers that support the Human Authentication Flow MUST implement the human callback.
In addition to the information described in the OIDC Callback section, drivers MUST be able to pass the following information to the callback:
- idpInfo: Information used to authenticate with the IdP.
  - issuer: A URL which describes the Authentication Server. This identifier should be the iss of provided access tokens, and be viable for RFC8414 metadata discovery and RFC9207 identification.
  - clientId: A unique client ID for this OIDC client.
  - requestScopes: A list of additional scopes to request from IdP.
- refreshToken: The refresh token, if applicable, to be used by the callback to request a new token from the issuer.
In addition to the information described in the OIDC Callback section, the callback MUST be able to return the following information:
- refreshToken: An optional refresh token that can be used to fetch new access tokens.
The signature and naming of the callback API is up to the driver's discretion. Drivers MAY use a single callback API for both callback types or separate callback APIs for each callback type. Drivers MUST ensure that additional optional input parameters and return values can be added to the callback signature in the future without breaking backward compatibility.
An example human callback API might look like:
interface IdpInfo {
issuer: string;
clientId: Optional<string>;
requestScopes: Optional<Array<string>>;
}
interface OIDCCallbackParams {
username: str;
callbackTimeoutMS: int;
version: int;
idpInfo: Optional<IdpInfo>;
refreshToken: Optional<any>;
}
interface OIDCCredential {
accessToken: string;
expiresInSeconds: Optional<int>;
refreshToken: Optional<any>;
}
function oidcCallback(params: OIDCCallbackParams): OIDCCredential
When a human callback is provided, drivers MUST use the following behaviors when calling the callback:
- The driver MUST pass the IdpInfo and the refresh token (if available) to the callback.
  - If there is no cached IdpInfo, drivers MUST start a Two-Step conversation before calling the human callback. See the Conversation and Credential Caching sections for more details.
- The timeout duration MUST be 5 minutes. This is to account for the human interaction required to complete the callback. In this case, the callback is not subject to CSOT.
Conversation
OIDC supports two conversation styles: one-step and two-step. The server detects whether the driver is using a one-step
or two-step conversation based on the structure of the saslStart
payload.
One-Step
A one-step conversation is used for OIDC providers that allow direct access to an access token. For example, an OIDC provider configured for machine-to-machine authentication may provide an access token via a local file pre-loaded on an application host.
Drivers MUST use a one-step conversation when using a cached access token, one of the Built-in Provider Integrations, or an OIDC Callback (not an OIDC Human Callback).
The one-step conversation starts with a saslStart
containing a JwtStepRequest
payload. The value of jwt
is the
OIDC access token string.
interface JwtStepRequest {
// Compact serialized JWT with signature.
jwt: string;
}
An example OIDC one-step SASL conversation with access token string "abcd1234" looks like:
// Client:
{
saslStart: 1,
mechanism: "MONGODB-OIDC",
    db: "$external",
// payload is a BSON generic binary field containing a JwtStepRequest BSON
// document: {"jwt": "abcd1234"}
payload: BinData(0, "FwAAAAJqd3QACQAAAGFiY2QxMjM0AAA=")
}
// Server:
{
conversationId : 1,
payload: BinData(0, ""),
done: true,
ok: 1
}
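The base64 payloads in these examples can be reproduced with any BSON encoder. A minimal sketch using Python's standard base64 module and PyMongo's bson package (an assumption; any BSON library works):

```python
import base64

import bson  # PyMongo's BSON package, assumed here for illustration

# Encode the JwtStepRequest document {"jwt": "abcd1234"} and base64 it,
# reproducing the BinData payload shown in the conversation above.
payload = bson.encode({"jwt": "abcd1234"})
print(base64.b64encode(payload).decode())
# FwAAAAJqd3QACQAAAGFiY2QxMjM0AAA=
```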
Two-Step
A two-step conversation is used for OIDC providers that require an extra authorization step before issuing a credential. For example, an OIDC provider configured for end-user authentication may require redirecting the user to a webpage so they can authorize the request.
Drivers that support the Human Authentication Flow MUST implement the two-step conversation. Drivers MUST use a two-step conversation when using an OIDC Human Callback and when there is no cached access token.
The two-step conversation starts with a saslStart
containing a PrincipalStepRequest
payload. The value of n
is the
username
from the connection string. If a username
is not provided, field n
should be omitted.
interface PrincipalStepRequest {
// Name of the OIDC user principal.
n: Optional<string>;
}
The server uses n
(if provided) to select an appropriate IdP. Note that the principal name is optional as it may be
provided by the IdP in environments where only one IdP is used.
The server responds to the PrincipalStepRequest
with IdpInfo
for the selected IdP:
interface IdpInfo {
// A URL which describes the Authentication Server. This identifier should
// be the iss of provided access tokens, and be viable for RFC8414 metadata
// discovery and RFC9207 identification.
issuer: string;
// A unique client ID for this OIDC client.
clientId: string;
// A list of additional scopes to request from IdP.
requestScopes: Optional<Array<string>>;
}
The driver passes the IdP information to the OIDC Human Callback, which should return an OIDC credential containing an access token and, optionally, a refresh token.
The driver then sends a saslContinue
with a JwtStepRequest
payload to complete authentication. The value of jwt
is
the OIDC access token string.
interface JwtStepRequest {
// Compact serialized JWT with signature.
jwt: string;
}
An example OIDC two-step SASL conversation with username "myidp" and access token string "abcd1234" looks like:
// Client:
{
saslStart: 1,
mechanism: "MONGODB-OIDC",
db: "$external",
// payload is a BSON generic binary field containing a PrincipalStepRequest
// BSON document: {"n": "myidp"}
payload: BinData(0, "EgAAAAJuAAYAAABteWlkcAAA")
}
// Server:
{
conversationId : 1,
// payload is a BSON generic binary field containing an IdpInfo BSON document:
// {"issuer": "https://issuer", "clientId": "abcd", "requestScopes": ["a","b"]}
payload: BinData(0, "WQAAAAJpc3N1ZXIADwAAAGh0dHBzOi8vaXNzdWVyAAJjbGllbnRJZAAFAAAAYWJjZAAEcmVxdWVzdFNjb3BlcwAXAAAAAjAAAgAAAGEAAjEAAgAAAGIAAAA="),
done: false,
ok: 1
}
// Client:
{
saslContinue: 1,
conversationId: 1,
// payload is a BSON generic binary field containing a JwtStepRequest BSON
// document: {"jwt": "abcd1234"}
payload: BinData(0, "FwAAAAJqd3QACQAAAGFiY2QxMjM0AAA=")
}
// Server:
{
conversationId: 1,
payload: BinData(0, ""),
done: true,
ok: 1
}
Credential Caching
Some OIDC providers may impose rate limits, incur per-request costs, or be slow to return. To minimize those issues, drivers MUST cache and reuse access tokens returned by OIDC providers.
Drivers MUST cache the most recent access token per MongoClient
(henceforth referred to as the Client Cache).
Drivers MAY store the Client Cache on the MongoClient
object or any object that guarantees exactly 1 cached access
token per MongoClient
. Additionally, drivers MUST cache the access token used to authenticate a connection on the
connection object (henceforth referred to as the Connection Cache).
Drivers MUST ensure that only one call to the configured provider or OIDC callback can happen at a time. To avoid adding
a bottleneck that would override the maxConnecting
setting, the driver MUST NOT hold an exclusive lock while running
saslStart
or saslContinue
.
Example code for credential caching using the read-through cache pattern:
def get_access_token():
# Lock the OIDC authenticator so that only one caller can modify the cache
# and call the configured OIDC provider at a time.
client.oidc_cache.lock()
# Check if we can use the access token from the Client Cache or if we need
# to fetch and cache a new access token from the OIDC provider.
access_token = client.oidc_cache.access_token
is_cache = True
    if access_token is None:
        credential = oidc_provider()
        is_cache = False
        access_token = credential.access_token
        client.oidc_cache.access_token = access_token
client.oidc_cache.unlock()
return access_token, is_cache
Drivers MUST have a way to invalidate a specific access token from the Client Cache. Invalidation MUST only clear the cached access token if it is the same as the invalid access token and MUST be an atomic operation (e.g. using a mutex or a compare-and-swap operation).
Example code for invalidation:
def invalidate(access_token):
client.oidc_cache.lock()
if client.oidc_cache.access_token == access_token:
client.oidc_cache.access_token = None
client.oidc_cache.unlock()
Drivers that support the Human Authentication Flow MUST also cache the `IdpInfo` and refresh token in the Client Cache when an OIDC Human Callback is configured.
Authentication
Use the following algorithm to authenticate a new connection:
- Check if the Client Cache has an access token.
  - If it does, cache the access token in the Connection Cache and perform a One-Step SASL conversation using the access token in the Client Cache. If the server returns an Authentication error (18), invalidate that access token. Raise any other errors to the user. On success, exit the algorithm.
- Call the configured built-in provider integration or the OIDC callback to retrieve a new access token. Wait until it has been at least 100ms since the last callback invocation, to avoid overloading the callback.
- Cache the new access token in the Client Cache and Connection Cache.
- Perform a One-Step SASL conversation using the new access token. Raise any errors to the user.
Example code to authenticate a connection using the get_access_token
and invalidate
functions described above:
def auth(connection):
access_token, is_cache = get_access_token()
# If there is a cached access token, try to authenticate with it. If
# authentication fails with an Authentication error (18),
# invalidate the access token, fetch a new access token, and try
# to authenticate again.
# If the server fails for any other reason, do not clear the cache.
if is_cache:
try:
connection.oidc_cache.access_token = access_token
sasl_start(connection, payload={"jwt": access_token})
return
except ServerError as e:
if e.code == 18:
invalidate(access_token)
access_token, _ = get_access_token()
connection.oidc_cache.access_token = access_token
sasl_start(connection, payload={"jwt": access_token})
For drivers that support the Human Authentication Flow, use the following algorithm to authenticate a new connection when an OIDC Human Callback is configured:

- Check if the Client Cache has an access token.
  - If it does, cache the access token in the Connection Cache and perform a One-Step SASL conversation using the access token. If the server returns an Authentication error (18), invalidate the access token from the Client Cache, clear the Connection Cache, and restart the authentication flow. Raise any other errors to the user. On success, exit the algorithm.
- Check if the Client Cache has a refresh token.
  - If it does, call the OIDC Human Callback with the cached refresh token and `IdpInfo` to get a new access token. Cache the new access token in the Client Cache and Connection Cache. Perform a One-Step SASL conversation using the new access token. If the server returns an Authentication error (18), clear the refresh token, invalidate the access token from the Client Cache, clear the Connection Cache, and restart the authentication flow. Raise any other errors to the user. On success, exit the algorithm.
- Start a new Two-Step SASL conversation:
  - Run a `PrincipalStepRequest` to get the `IdpInfo`.
  - Call the OIDC Human Callback with the new `IdpInfo` to get a new access token and optional refresh token. Drivers MUST NOT pass a cached refresh token to the callback when performing a new Two-Step conversation.
  - Cache the new `IdpInfo` and refresh token in the Client Cache and the new access token in the Client Cache and Connection Cache.
  - Attempt to authenticate using a `JwtStepRequest` with the new access token. Raise any errors to the user.
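Example code for the human flow, in the style of the `auth` function above. This is a minimal sketch: `human_callback`, `sasl_continue`, `client.username`, and the `cache_tokens` helper are hypothetical, and the "restart the authentication flow" steps are approximated by falling through to the next step:

```python
def cache_tokens(connection, credential):
    # Hypothetical helper: cache new tokens per the algorithm above.
    client.oidc_cache.access_token = credential.access_token
    client.oidc_cache.refresh_token = credential.refresh_token
    connection.oidc_cache.access_token = credential.access_token

def auth_human(connection):
    # 1. Try the cached access token, if any.
    access_token = client.oidc_cache.access_token
    if access_token is not None:
        try:
            connection.oidc_cache.access_token = access_token
            sasl_start(connection, payload={"jwt": access_token})
            return
        except ServerError as e:
            if e.code != 18:
                raise
            invalidate(access_token)
            connection.oidc_cache.access_token = None
    # 2. Try the cached refresh token, if any.
    if client.oidc_cache.refresh_token is not None:
        credential = human_callback(
            idp_info=client.oidc_cache.idp_info,
            refresh_token=client.oidc_cache.refresh_token)
        try:
            cache_tokens(connection, credential)
            sasl_start(connection, payload={"jwt": credential.access_token})
            return
        except ServerError as e:
            if e.code != 18:
                raise
            client.oidc_cache.refresh_token = None
            invalidate(credential.access_token)
            connection.oidc_cache.access_token = None
    # 3. Start a new Two-Step conversation to obtain fresh IdpInfo.
    idp_info = sasl_start(connection, payload={"n": client.username})
    client.oidc_cache.idp_info = idp_info
    # A cached refresh token MUST NOT be reused in a new Two-Step conversation.
    credential = human_callback(idp_info=idp_info, refresh_token=None)
    cache_tokens(connection, credential)
    sasl_continue(connection, payload={"jwt": credential.access_token})
```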
Speculative Authentication
Drivers MUST implement speculative authentication for MONGODB-OIDC during the hello
handshake. Drivers MUST NOT
attempt speculative authentication if the Client Cache does not have a cached access token. Drivers MUST NOT
invalidate tokens from the Client Cache if speculative authentication does not succeed.
Use the following algorithm to perform speculative authentication:
- Check if the Client Cache has an access token.
  - If it does, cache the access token in the Connection Cache and send a `JwtStepRequest` with the cached access token in the speculative authentication SASL payload. If the response is missing a speculative authentication document or the speculative authentication document indicates authentication was not successful, clear the Connection Cache and proceed to the next step.
- Authenticate with the standard authentication handshake.
Example code for speculative authentication using the auth
function described above:
def speculative_auth(connection):
access_token = client.oidc_cache.access_token
    if access_token is not None:
connection.oidc_cache.access_token = access_token
res = hello(connection, payload={"jwt": access_token})
if res.speculative_authenticate.done:
return
connection.oidc_cache.access_token = None
auth(connection)
Reauthentication
If any operation fails with ReauthenticationRequired
(error code 391) and MONGODB-OIDC is in use, the driver MUST
reauthenticate the connection. Drivers MUST NOT resend a hello
message during reauthentication, instead using SASL
messages directly. Drivers MUST NOT try to use Speculative Authentication during reauthentication. See the main
reauthentication section for more information.
To reauthenticate a connection, invalidate the access token stored on the connection (i.e. the Connection Cache) from the Client Cache, fetch a new access token, and re-run the SASL conversation.
Example code for reauthentication using the auth
function described above:
def reauth(connection):
invalidate(connection.oidc_cache.access_token)
connection.oidc_cache.access_token = None
auth(connection)
Connection String Options
mongodb://[username[:password]@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/database][?options]
Auth Related Options
- `authMechanism`: MONGODB-CR, MONGODB-X509, GSSAPI, PLAIN, SCRAM-SHA-1, SCRAM-SHA-256, MONGODB-AWS

  Sets the Mechanism property on the MongoCredential. When not set, the default will be one of SCRAM-SHA-256, SCRAM-SHA-1 or MONGODB-CR, following the auth spec default mechanism rules.

- `authSource`: Sets the Source property on the MongoCredential.

  For GSSAPI, MONGODB-X509 and MONGODB-AWS authMechanisms the authSource defaults to `$external`. For PLAIN the authSource defaults to the database name if supplied on the connection string or `$external`. For MONGODB-CR, SCRAM-SHA-1 and SCRAM-SHA-256 authMechanisms, the authSource defaults to the database name if supplied on the connection string or `admin`.

- `authMechanismProperties=PROPERTY_NAME:PROPERTY_VALUE,PROPERTY_NAME2:PROPERTY_VALUE2`: A generic method to set mechanism properties in the connection string. For example, to set SERVICE_REALM and CANONICALIZE_HOST_NAME, the option would be `authMechanismProperties=CANONICALIZE_HOST_NAME:forward,SERVICE_REALM:AWESOME` (see the parsing sketch after this list).

- `gssapiServiceName` (deprecated): An alias for `authMechanismProperties=SERVICE_NAME:mongodb`.
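A minimal sketch of parsing this option, splitting pairs on commas and each pair on its first colon only (property values such as URLs may themselves contain colons); the helper name is illustrative, not part of this spec:

```python
def parse_auth_mechanism_properties(value: str) -> dict:
    properties = {}
    for pair in value.split(","):
        # Split on the first colon only: the value itself may contain colons.
        name, _, prop_value = pair.partition(":")
        properties[name] = prop_value
    return properties

parse_auth_mechanism_properties("CANONICALIZE_HOST_NAME:forward,SERVICE_REALM:AWESOME")
# {'CANONICALIZE_HOST_NAME': 'forward', 'SERVICE_REALM': 'AWESOME'}
```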
Errors
Drivers MUST raise an error if the authSource
option is specified in the connection string with an empty value, e.g.
mongodb://localhost/admin?authSource=
.
Implementation
- Credentials MAY be specified in the connection string immediately after the scheme separator "//".
- A realm MAY be passed as a part of the username in the url. It would be something like dev@MONGODB.COM, where dev is the username and MONGODB.COM is the realm. Per the RFC, the @ symbol should be url encoded using %40.
  - When GSSAPI is specified, this should be interpreted as the realm.
  - When non-GSSAPI is specified, this should be interpreted as part of the username.
- It is permissible for only the username to appear in the connection string. This would be identified by having no colon follow the username before the '@' hostname separator.
- The source is determined by the following rules (sketched in code below):
  - if authSource is specified, it is used.
  - otherwise, if database is specified, it is used.
  - otherwise, the admin database is used.
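A minimal sketch of these defaulting rules, folded together with the mechanism-specific defaults from the Auth Related Options section (the function name is illustrative, not part of this spec):

```python
def default_auth_source(mechanism, auth_source, database):
    # authSource, when specified, always wins (an empty value is an error;
    # see the Errors section above).
    if auth_source is not None:
        return auth_source
    if mechanism in ("GSSAPI", "MONGODB-X509", "MONGODB-AWS"):
        return "$external"
    if mechanism == "PLAIN":
        return database or "$external"
    # MONGODB-CR, SCRAM-SHA-1, SCRAM-SHA-256, or a negotiated default.
    return database or "admin"
```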
Test Plan
Connection string tests have been defined in the associated test files.
SCRAM-SHA-256 and mechanism negotiation
Testing SCRAM-SHA-256 requires server version 3.7.3 or later with featureCompatibilityVersion
of "4.0" or later.
Drivers that allow specifying auth parameters in code as well as via connection string should test both for the test cases described below.
Step 1
Create three test users, one with only SHA-1, one with only SHA-256 and one with both. For example:
db.runCommand({createUser: 'sha1', pwd: 'sha1', roles: ['root'], mechanisms: ['SCRAM-SHA-1']})
db.runCommand({createUser: 'sha256', pwd: 'sha256', roles: ['root'], mechanisms: ['SCRAM-SHA-256']})
db.runCommand({createUser: 'both', pwd: 'both', roles: ['root'], mechanisms: ['SCRAM-SHA-1', 'SCRAM-SHA-256']})
Step 2
For each test user, verify that you can connect and run a command requiring authentication for the following cases:
- Explicitly specifying each mechanism the user supports.
- Specifying no mechanism and relying on mechanism negotiation.
For the example users above, the dbstats
command could be used as a test command.
For a test user supporting both SCRAM-SHA-1 and SCRAM-SHA-256, drivers should verify that negotiation selects SCRAM-SHA-256. This may require monkey patching, manual log analysis, etc.
Step 3
For test users that support only one mechanism, verify that explicitly specifying the other mechanism fails.
For a non-existent username, verify that not specifying a mechanism when connecting fails with the same error type that
would occur with a correct username but incorrect password or mechanism. (Because negotiation with a non-existent user
name at one point during server development caused a handshake error, we want to verify this is seen by users as similar
to other authentication errors, not as a network or database command error on the hello
or legacy hello commands
themselves.)
Step 4
To test SASLprep behavior, create two users:
- username: "IX", password "IX"
- username: "\u2168" (ROMAN NUMERAL NINE), password "\u2163" (ROMAN NUMERAL FOUR)
To create the users, use the exact bytes for username and password without SASLprep or other normalization and specify SCRAM-SHA-256 credentials:
db.runCommand({createUser: 'IX', pwd: 'IX', roles: ['root'], mechanisms: ['SCRAM-SHA-256']})
db.runCommand({createUser: '\u2168', pwd: '\u2163', roles: ['root'], mechanisms: ['SCRAM-SHA-256']})
For each user, verify that the driver can authenticate with the password in both SASLprep normalized and non-normalized forms:
- User "IX": use password forms "IX" and "I\u00ADX"
- User "\u2168": use password forms "IV" and "I\u00ADV"
As a URI, those have to be UTF-8 encoded and URL-escaped, e.g.:
mongodb://IX:IX@mongodb.example.com/admin
mongodb://IX:I%C2%ADX@mongodb.example.com/admin
mongodb://%E2%85%A8:IV@mongodb.example.com/admin
mongodb://%E2%85%A8:I%C2%ADV@mongodb.example.com/admin
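These escaped forms can be checked with any standard URL-encoding routine; for example, with Python's urllib (shown purely as an illustration):

```python
from urllib.parse import quote

quote("\u2168")    # '%E2%85%A8'  (ROMAN NUMERAL NINE, UTF-8 encoded)
quote("I\u00adX")  # 'I%C2%ADX'   (password containing a soft hyphen)
```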
Speculative Authentication
See the speculative authentication section in the MongoDB Handshake spec.
Minimum iteration count
For SCRAM-SHA-1 and SCRAM-SHA-256, test that the minimum iteration count is respected. This may be done via unit testing of an underlying SCRAM library.
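A minimal unit-test sketch, assuming a hypothetical `parse_server_first` helper and `AuthenticationError` type in the driver's SCRAM implementation, and the 4096 minimum iteration count required for these mechanisms:

```python
def test_minimum_iteration_count():
    # A SCRAM server-first message advertising a count below the minimum.
    server_first = "r=clientNonceServerNonce,s=c2FsdA==,i=4095"
    try:
        parse_server_first(server_first)  # hypothetical helper
    except AuthenticationError:  # hypothetical error type
        return  # expected: the conversation is rejected
    raise AssertionError("iteration count below 4096 was accepted")
```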
Backwards Compatibility
Drivers may need to remove support for association of more than one credential with a MongoClient, including:

- Deprecation and removal of MongoClient constructors that take as an argument more than a single credential
- Deprecation and removal of methods that allow lazy authentication (i.e. post-MongoClient construction)
Drivers need to support both the shorter and longer SCRAM-SHA-1 and SCRAM-SHA-256 conversations over MongoDB's SASL implementation. Earlier versions of the server required an extra round trip due to an implementation decision. This was accomplished by sending no bytes back to the server, as seen in the following conversation (extra round trip emphasized):
CMD = {saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, "biwsbj11c2VyLHI9ZnlrbytkMmxiYkZnT05Sdjlxa3hkYXdM"), options: {skipEmptyExchange: true}}
RESP = {conversationId : 1, payload: BinData(0,"cj1meWtvK2QybGJiRmdPTlJ2OXFreGRhd0xIbytWZ2s3cXZVT0tVd3VXTElXZzRsLzlTcmFHTUhFRSxzPXJROVpZM01udEJldVAzRTFURFZDNHc9PSxpPTEwMDAw"), done: false, ok: 1}
CMD = {saslContinue: 1, conversationId: 1, payload: BinData(0, "Yz1iaXdzLHI9ZnlrbytkMmxiYkZnT05Sdjlxa3hkYXdMSG8rVmdrN3F2VU9LVXd1V0xJV2c0bC85U3JhR01IRUUscD1NQzJUOEJ2Ym1XUmNrRHc4b1dsNUlWZ2h3Q1k9")}
RESP = {conversationId: 1, payload: BinData(0,"dj1VTVdlSTI1SkQxeU5ZWlJNcFo0Vkh2aFo5ZTA9"), done: false, ok: 1}
# Extra round trip
CMD = {saslContinue: 1, conversationId: 1, payload: BinData(0, "")}
RESP = {conversationId: 1, payload: BinData(0,""), done: true, ok: 1}
The extra round trip will be removed in server version 4.4 when options: { skipEmptyExchange: true }
is specified
during saslStart
.
Reference Implementation
The Java and .NET drivers currently use eager authentication and abide by this specification.
Q & A
Q: According to Authentication Handshake, we are calling hello
or legacy hello for every
socket. Isn't this a lot?
Drivers should be pooling connections and, as such, new sockets getting opened should be relatively infrequent. It's simply part of the protocol for setting up a socket to be used.
Q: Where is information related to user management?
Not here currently. Should it be? This is about authentication, not user management. Perhaps a new spec is necessary.
Q: It's possible to continue using authenticated sockets even if new sockets fail authentication. Why can't we do that so that applications continue to work?
Yes, that's technically true. The issue with doing that is for drivers using connection pooling. An application would function normally until an operation needed additional connections during a spike. Each new connection would fail to authenticate, causing intermittent failures that would be very difficult for a user to understand.
Q: Should a driver support multiple credentials?
No.
Historically, the MongoDB server and drivers have supported multiple credentials, one per authSource, on a single connection. It was necessary because early versions of MongoDB allowed a user to be granted privileges to access the database in which the user was defined (or all databases in the special case of the "admin" database). But with the introduction of role-based access control in MongoDB 2.6, that restriction was removed and it became possible to create applications that access multiple databases with a single authenticated user.
Role-based access control also introduces the potential for accidental privilege escalation. An application may, for example, authenticate user A from authSource X, and user B from authSource Y, thinking that user A has privileges only on collections in X and user B has privileges only on collections in Y. But with role-based access control that restriction no longer exists, and it's possible that user B has, for example, more privileges on collections in X than user A does. Due to this risk it's generally safer to create a single user with only the privileges required for a given application, and authenticate only that one user in the application.
In addition, since only a single credential is supported per authSource, certain mechanisms are restricted to a single credential and some credentials cannot be used in conjunction (GSSAPI and X509 both use the "$external" database).
Finally, MongoDB 3.6 introduces sessions, and allows at most a single authenticated user on any connection which makes use of one. Therefore any application that requires multiple authenticated users will not be able to make use of any feature that builds on sessions (e.g. retryable writes).
Drivers should therefore guide application creators in the right direction by supporting the association of at most one credential with a MongoClient instance.
Q: Should a driver support lazy authentication?
No, for the same reasons as given in the previous section, as lazy authentication is another mechanism for allowing multiple credentials to be associated with a single MongoClient instance.
Q: Why does SCRAM sometimes SASLprep and sometimes not?
When MongoDB implemented SCRAM-SHA-1, it required drivers to NOT SASLprep usernames and passwords. The primary reason for this was to allow a smooth upgrade path from MongoDB-CR using existing usernames and passwords. Also, because MongoDB's SCRAM-SHA-1 passwords are hex characters of a digest, SASLprep of passwords was irrelevant.
With the introduction of SCRAM-SHA-256, MongoDB requires users to explicitly create new SCRAM-SHA-256 credentials distinct from those used for MONGODB-CR and SCRAM-SHA-1. This means SCRAM-SHA-256 passwords are not digested and any Unicode character could now appear in a password. Therefore, the SCRAM-SHA-256 mechanism requires passwords to be normalized with SASLprep, in accordance with the SCRAM RFC.
However, usernames must be unique, which creates a similar upgrade path problem. SASLprep maps multiple byte representations to a single normalized one. An existing database could have multiple existing users that map to the same SASLprep form, which makes it impossible to find the correct user document for SCRAM authentication given only a SASLprep username. After considering various options to address or workaround this problem, MongoDB decided that the best user experience on upgrade and lowest technical risk of implementation is to require drivers to continue to not SASLprep usernames in SCRAM-SHA-256.
Q: Should drivers support accessing Amazon EC2 instance metadata in Amazon ECS?
No. While it's possible to allow access to EC2 instance metadata in ECS, for security reasons, Amazon states it's best practice to avoid this. (See accessing EC2 metadata in ECS and IAM Roles for Tasks)
Changelog
- 2024-10-02: Add Kubernetes built-in OIDC provider integration.
- 2024-08-19: Clarify Reauthentication and Speculative Authentication combination behavior.
- 2024-05-29: Disallow comma character when `TOKEN_RESOURCE` is given in a connection string.
- 2024-05-03: Clarify timeout behavior for OIDC machine callback. Add `serverless:forbid` to OIDC unified tests. Add an additional prose test for the behavior of `ALLOWED_HOSTS`.
- 2024-04-24: Clarify that TOKEN_RESOURCE for MONGODB-OIDC must be url-encoded.
- 2024-04-22: Fix API description for GCP built-in OIDC provider.
- 2024-04-22: Updated OIDC authentication flow and prose tests.
- 2024-04-22: Clarify that driver should not validate `saslSupportedMechs` content.
- 2024-04-03: Added GCP built-in OIDC provider integration.
- 2024-03-29: Updated OIDC test setup and descriptions.
- 2024-03-21: Added Azure built-in OIDC provider integration.
- 2024-03-09: Rename OIDC integration name and values.
- 2024-01-31: Migrated from reStructuredText to Markdown.
- 2024-01-17: Added MONGODB-OIDC machine auth flow spec and combined it with the human auth flow specs.
- 2023-04-28: Added MONGODB-OIDC auth mechanism.
- 2022-11-02: Require environment variables to be read dynamically.
- 2022-10-28: Recommend the use of AWS SDKs where available.
- 2022-10-07: Require caching of AWS credentials fetched by the driver.
- 2022-10-05: Remove spec front matter and convert version history to changelog.
- 2022-09-07: Add support for AWS AssumeRoleWithWebIdentity.
- 2022-01-20: Require that timeouts be applied per the client-side operations timeout spec.
- 2022-01-14: Clarify that `OP_MSG` must be used for authentication when it is supported.
- 2021-04-23: Updated to use hello and legacy hello.
- 2021-03-04: Note that errors encountered during auth are handled by SDAM.
- 2020-03-06: Add reference to the speculative authentication section of the handshake spec.
- 2020-02-15: Rename MONGODB-IAM to MONGODB-AWS.
- 2020-02-04: Support shorter SCRAM conversation starting in version 4.4 of the server.
- 2020-01-31: Clarify that drivers must raise an error when a connection string has an empty value for authSource.
- 2020-01-23: Clarify when authentication will occur.
- 2020-01-22: Clarify that authSource in URI is not treated as a user configuring auth credentials.
- 2019-12-05: Added MONGODB-IAM auth mechanism.
- 2019-07-13: Clarify database to use for auth mechanism negotiation.
- 2019-04-26: Test format changed to improve specificity of behavior assertions.
  - Clarify that database name in URI is not treated as a user configuring auth credentials.
- 2018-08-08: Unknown users don't cause handshake errors. This was changed before server 4.0 GA in SERVER-34421, so the auth spec no longer refers to such a possibility.
- 2018-04-17: Clarify authSource defaults.
  - Fix PLAIN authSource rule to allow user provided values.
  - Change SCRAM-SHA-256 rules such that usernames are NOT normalized; this follows a change in the server design and should be available in server 4.0-rc0.
- 2018-03-29: Clarify auth handshake and that it only applies to non-monitoring sockets.
- 2018-03-15: Describe CANONICALIZE_HOST_NAME algorithm.
- 2018-03-02: Added SCRAM-SHA-256 and mechanism negotiation as provided by server 4.0.
  - Updated default mechanism determination.
  - Clarified SCRAM-SHA-1 rules around SASLprep.
  - Require SCRAM-SHA-1 and SCRAM-SHA-256 to enforce a minimum iteration count.
- 2017-11-10: Updated minimum server version to 2.6.
  - Updated the Q & A to recommend support for at most a single credential per MongoClient.
  - Removed lazy authentication section.
  - Changed the list of server types requiring authentication.
  - Made providing username for X509 authentication optional.
- 2015-02-04: Added SCRAM-SHA-1 sasl mechanism.
  - Added connection handshake.
  - Changed connection string to support mechanism properties in generic form.
  - Added example conversations for all mechanisms except GSSAPI.
  - Miscellaneous wording changes for clarification.
  - Added MONGODB-X509.
  - Added PLAIN sasl mechanism.
  - Added support for GSSAPI mechanism property gssapiServiceName.
Server Monitoring
- Status: Accepted
- Minimum Server Version: 2.4
Abstract
This spec defines how a driver monitors a MongoDB server. In summary, the client monitors each server in the topology. The scope of server monitoring is to provide the topology with updated ServerDescriptions based on hello or legacy hello command responses.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Terms
See the terms in the main SDAM spec.
check
The client checks a server by attempting to call hello or legacy hello on it, and recording the outcome.
client
A process that initiates a connection to a MongoDB server. This includes mongod and mongos processes in a replica set or sharded cluster, as well as drivers, the shell, tools, etc.
scan
The process of checking all servers in the deployment.
suitable
A server is judged "suitable" for an operation if the client can use it for a particular operation. For example, a write requires a standalone, primary, or mongos. Suitability is fully specified in the Server Selection Spec.
significant topology change
A change in the server's state that is relevant to the client's view of the server, e.g. a change in the server's replica set member state, or its replica set tags. In SDAM terms, a significant topology change on the server means the client's ServerDescription is out of date. Standalones and mongos do not currently experience significant topology changes but they may in the future.
regular hello or legacy hello command
A default {hello: 1}
or legacy hello command where the server responds immediately.
streamable hello or legacy hello command
The hello or legacy hello command feature which allows the server to stream multiple replies back to the client.
RTT
Round trip time. The client's measurement of the duration of one hello or legacy hello call. The RTT is used to support localThresholdMS from the Server Selection spec and timeoutMS from the Client Side Operations Timeout Spec.
FaaS
A Function-as-a-Service (FaaS) environment like AWS Lambda.
serverMonitoringMode
The serverMonitoringMode option configures which server monitoring protocol to use. Valid modes are "stream", "poll", or "auto". The default value MUST be "auto":
- With "stream" mode, the client MUST use the streaming protocol when the server supports it or fall back to the polling protocol otherwise.
- With "poll" mode, the client MUST use the polling protocol.
- With "auto" mode, the client MUST behave the same as "poll" mode when running on a FaaS platform or the same as
"stream" mode otherwise. The client detects that it's running on a FaaS platform via the same rules for generating
the
client.env
handshake metadata field in the MongoDB Handshake spec.
Multi-threaded or asynchronous drivers MUST implement this option. See Why disable the streaming protocol on FaaS platforms like AWS Lambda? and Why introduce a knob for serverMonitoringMode?
Monitoring
The client monitors servers using the hello or legacy hello commands. In MongoDB 4.4+, a monitor uses the Streaming Protocol to continuously stream hello or legacy hello responses from the server. In MongoDB <= 4.2, a monitor uses the Polling Protocol pausing heartbeatFrequencyMS between checks. Clients check servers sooner in response to certain events.
If a server API version is requested, then the driver must use hello for
monitoring. If a server API version is not requested, the initial handshake using the legacy hello command must include
helloOk: true
. If the response contains helloOk: true
, then the driver must use the hello
command for monitoring.
If the response does not contain helloOk: true
, then the driver must use the legacy hello command for monitoring.
The socket used to check a server MUST use the same connectTimeoutMS as regular sockets. Multi-threaded clients SHOULD set monitoring sockets' socketTimeoutMS to the connectTimeoutMS. (See socket timeout for monitoring is connectTimeoutMS. Drivers MAY let users configure the timeouts for monitoring sockets separately if necessary to preserve backwards compatibility.)
The client begins monitoring a server when:
- ... the client is initialized and begins monitoring each seed. See initial servers.
- ... updateRSWithoutPrimary or updateRSFromPrimary discovers new replica set members.
The following subsections specify how monitoring works, first in multi-threaded or asynchronous clients, and second in single-threaded clients. This spec provides detailed requirements for monitoring because it intends to make all drivers behave consistently.
Multi-threaded or asynchronous monitoring
Servers are monitored in parallel
All servers' monitors run independently, in parallel: If some monitors block calling hello or legacy hello over slow connections, other monitors MUST proceed unimpeded.
The natural implementation is a thread per server, but the decision is left to the implementer. (See thread per server.)
Servers are monitored with dedicated sockets
A monitor SHOULD NOT use the client's regular connection pool to acquire a socket; it uses a dedicated socket that does not count toward the pool's maximum size.
Drivers MUST NOT authenticate on sockets used for monitoring nor include SCRAM mechanism negotiation (i.e.
saslSupportedMechs
), as doing so would make monitoring checks more expensive for the server.
Servers are checked periodically
Each monitor checks its server and notifies the client of the outcome so the client can update the TopologyDescription.
After each check, the next check SHOULD be scheduled heartbeatFrequencyMS later; a check MUST NOT run while a previous check is still in progress.
Requesting an immediate check
At any time, the client can request that a monitor check its server immediately. (For example, after a "not writable primary" error. See error handling.) If the monitor is sleeping when this request arrives, it MUST wake and check as soon as possible. If a hello or legacy hello call is already in progress, the request MUST be ignored. If the previous check ended less than minHeartbeatFrequencyMS ago, the monitor MUST sleep until the minimum delay has passed, then check the server.
Application operations are unblocked when a server is found
Each time a check completes, threads waiting for a suitable server are unblocked. Each unblocked thread MUST proceed if the new TopologyDescription now contains a suitable server.
Clients update the topology from each handshake
When a monitor check creates a new connection, the connection handshake response MUST be used to satisfy the check and update the topology.
When a client successfully calls hello or legacy hello to handshake a new connection for application operations, it SHOULD use the hello or legacy hello reply to update the ServerDescription and TopologyDescription, the same as with a hello or legacy hello reply on a monitoring socket. If the hello or legacy hello call fails, the client SHOULD mark the server Unknown and update its TopologyDescription, the same as a failed server check on monitoring socket.
Clients use the streaming protocol when supported
When a monitor discovers that the server supports the streamable hello or legacy hello command and the client does not have streaming disabled, it MUST use the streaming protocol.
Single-threaded monitoring
cooldownMS
After a single-threaded client gets a network error trying to check a server, the client skips re-checking the server until cooldownMS has passed.
This avoids spending connectTimeoutMS on each unavailable server during each scan.
This value MUST be 5000 ms, and it MUST NOT be configurable.
Scanning
Single-threaded clients MUST scan all servers synchronously, inline with regular application operations. Before each operation, the client checks if heartbeatFrequencyMS has passed since the previous scan ended, or if the topology is marked "stale"; if so it scans all the servers before selecting a server and performing the operation.
Selection failure triggers an immediate scan. When a client that uses single-threaded monitoring fails to select a suitable server for any operation, it scans the servers, then attempts selection again, to see if the scan discovered suitable servers. It repeats, waiting minHeartbeatFrequencyMS after each scan, until a timeout.
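A minimal sketch of this scan-and-retry loop, assuming hypothetical helpers (`scan_all_servers`, `try_select`, `heartbeat_frequency_elapsed`, `wait_min_heartbeat_frequency_since_last_scan`, `now`) and a server selection timeout:

```python
def select_server_single_threaded(criteria, timeout_ms):
    deadline = now() + timeout_ms
    # Scan inline before the operation if heartbeatFrequencyMS has passed
    # since the previous scan ended or the topology is marked "stale".
    if heartbeat_frequency_elapsed() or topology.stale:
        scan_all_servers()
    while True:
        server = try_select(criteria)
        if server is not None:
            return server
        if now() >= deadline:
            raise ServerSelectionTimeoutError()
        # Selection failure triggers another scan; each scan waits
        # minHeartbeatFrequencyMS after the previous scan ended.
        wait_min_heartbeat_frequency_since_last_scan()
        scan_all_servers()
```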
Scanning order
If the topology is a replica set, the client attempts to contact the primary as soon as possible to get an authoritative list of members. Otherwise, the client attempts to check all members it knows of, in order from the least-recently to the most-recently checked.
When all servers have been checked the scan is complete. New servers discovered during the scan MUST be checked before the scan is complete. Sometimes servers are removed during a scan so they are not checked, depending on the order of events.
The scanning order is expressed in this pseudocode:
scanStartTime = now()
# You'll likely need to convert units here.
beforeCoolDown = scanStartTime - cooldownMS
while true:
serversToCheck = all servers with lastUpdateTime before scanStartTime
remove from serversToCheck any Unknowns with lastUpdateTime > beforeCoolDown
if no serversToCheck:
# This scan has completed.
break
if a server in serversToCheck is RSPrimary:
check it
else if there is a PossiblePrimary:
check it
else if any servers are not of type Unknown or RSGhost:
check the one with the oldest lastUpdateTime
if several servers have the same lastUpdateTime, choose one at random
else:
check the Unknown or RSGhost server with the oldest lastUpdateTime
if several servers have the same lastUpdateTime, choose one at random
This algorithm might be better understood with an example:
- The client is configured with one seed and TopologyType Unknown. It begins a scan.
- When it checks the seed, it discovers a secondary.
- The secondary's hello or legacy hello response includes the "primary" field with the address of the server that the secondary thinks is primary.
- The client creates a ServerDescription with that address, type PossiblePrimary, and lastUpdateTime "infinity ago". (See updateRSWithoutPrimary.)
- On the next iteration, there is still no RSPrimary, so the new PossiblePrimary is the top-priority server to check.
- The PossiblePrimary is checked and replaced with an RSPrimary. The client has now acquired an authoritative host list. Any new hosts in the list are added to the TopologyDescription with lastUpdateTime "infinity ago". (See updateRSFromPrimary.)
- The client continues scanning until all known hosts have been checked.
Another common case might be scanning a pool of mongoses. When the client first scans its seed list, they all have the default lastUpdateTime "infinity ago", so it scans them in random order. This randomness provides some load-balancing if many clients start at once. A client's subsequent scans of the mongoses are always in the same order, since their lastUpdateTimes are always in the same order by the time a scan ends.
minHeartbeatFrequencyMS
If a client frequently rechecks a server, it MUST wait at least minHeartbeatFrequencyMS milliseconds since the previous check ended, to avoid pointless effort. This value MUST be 500 ms, and it MUST NOT be configurable (no knobs).
heartbeatFrequencyMS
The interval between server checks, counted from the end of the previous check until the beginning of the next one.
For multi-threaded and asynchronous drivers it MUST default to 10 seconds and MUST be configurable. For single-threaded drivers it MUST default to 60 seconds and MUST be configurable. It MUST be called heartbeatFrequencyMS unless this breaks backwards compatibility.
For both multi- and single-threaded drivers, the driver MUST NOT permit users to configure it less than minHeartbeatFrequencyMS (500ms).
(See heartbeatFrequencyMS in the main SDAM spec.)
Awaitable hello or legacy hello Server Specification
As of MongoDB 4.4 the hello or legacy hello command can wait to reply until there is a topology change or a maximum time has elapsed. Clients opt in to this "awaitable hello" feature by passing new parameters "topologyVersion" and "maxAwaitTimeMS" to the hello or legacy hello commands. Exhaust support has also been added, which clients can enable in the usual manner by setting the OP_MSG exhaustAllowed flag.
Clients use the awaitable hello feature as the basis of the streaming heartbeat protocol to learn much sooner about stepdowns, elections, reconfigs, and other events.
topologyVersion
A server that supports awaitable hello or legacy hello includes a "topologyVersion" field in all hello or legacy hello replies and State Change Error replies. The topologyVersion is a subdocument with two fields, "processId" and "counter":
{
topologyVersion: {processId: <ObjectId>, counter: <int64>},
( ... other fields ...)
}
processId
An ObjectId maintained in memory by the server. It is reinitialized by the server using the standard ObjectId logic each time this server process starts.
counter
An int64 State change counter, maintained in memory by the server. It begins at 0 when the server starts, and it is incremented whenever there is a significant topology change.
maxAwaitTimeMS
To enable awaitable hello or legacy hello, the client includes a new int64 field "maxAwaitTimeMS" in the hello or legacy hello request. This field determines the maximum duration in milliseconds a server will wait for a significant topology change before replying.
Feature Discovery
To discover if the connected server supports awaitable hello or legacy hello, a client checks the most recent hello or legacy hello command reply. If the reply includes "topologyVersion" then the server supports awaitable hello or legacy hello.
Awaitable hello or legacy hello Protocol
To initiate an awaitable hello or legacy hello command, the client includes both maxAwaitTimeMS and topologyVersion in the request, for example:
{
hello: 1,
maxAwaitTimeMS: 10000,
topologyVersion: {processId: <ObjectId>, counter: <int64>},
( ... other fields ...)
}
Clients MAY additionally set the OP_MSG exhaustAllowed flag to enable streaming hello or legacy hello. With streaming hello or legacy hello, the server MAY send multiple hello or legacy hello responses without waiting for further requests.
A server that implements the new protocol follows these rules:
- Always include the server's topologyVersion in hello, legacy hello, and State Change Error replies.
- If the request includes topologyVersion without maxAwaitTimeMS or vice versa, return an error.
- If the request omits topologyVersion and maxAwaitTimeMS, reply immediately.
- If the request includes topologyVersion and maxAwaitTimeMS, then reply immediately if the server's topologyVersion.processId does not match the request's, otherwise reply when the server's topologyVersion.counter is greater than the request's, or maxAwaitTimeMS elapses, whichever comes first.
- Following the OP_MSG spec, if the request omits the exhaustAllowed flag, the server MUST NOT set the moreToCome flag on the reply. If the request's exhaustAllowed flag is set, the server MAY set the moreToCome flag on the reply. If the server sets moreToCome, it MUST continue streaming replies without awaiting further requests. Between replies it MUST wait until the server's topologyVersion.counter is incremented or maxAwaitTimeMS elapses, whichever comes first. If the reply includes `ok: 0` the server MUST NOT set the moreToCome flag.
- On a topology change that changes the horizon parameters, the server will close all application connections.
Example awaitable hello conversation:
| Client | Server |
| --- | --- |
| hello handshake -> | |
| | <- reply with topologyVersion |
| hello as OP_MSG with maxAwaitTimeMS and topologyVersion -> | |
| | wait for change or timeout |
| | <- OP_MSG with topologyVersion |
| ... | |
Example streaming hello conversation (awaitable hello with exhaust):
| Client | Server |
| --- | --- |
| hello handshake -> | |
| | <- reply with topologyVersion |
| hello as OP_MSG with exhaustAllowed, maxAwaitTimeMS, and topologyVersion -> | |
| | wait for change or timeout |
| | <- OP_MSG with moreToCome and topologyVersion |
| | wait for change or timeout |
| | <- OP_MSG with moreToCome and topologyVersion |
| | ... |
| | <- OP_MSG without moreToCome |
| ... | |
Streaming Protocol
The streaming protocol is used to monitor MongoDB 4.4+ servers and optimally reduces the time it takes for a client to discover server state changes. Multi-threaded or asynchronous drivers MUST use the streaming protocol when connected to a server that supports the awaitable hello or legacy hello commands. This protocol requires an extra thread and an extra socket for each monitor to perform RTT calculations.
Streaming disabled
The streaming protocol MUST be disabled when either:
- the client is configured with serverMonitoringMode=poll, or
- the client is configured with serverMonitoringMode=auto and a FaaS platform is detected, or
- the server does not support streaming (e.g. MongoDB < 4.4).
When the streaming protocol is disabled the client MUST use the polling protocol and MUST NOT start an extra thread or connection for Measuring RTT.
See Why disable the streaming protocol on FaaS platforms like AWS Lambda?.
Streaming hello or legacy hello
The streaming hello or legacy hello protocol uses awaitable hello or legacy hello with the OP_MSG exhaustAllowed flag to continuously stream hello or legacy hello responses from the server. Drivers MUST set the OP_MSG exhaustAllowed flag with the awaitable hello or legacy hello command and MUST process each hello or legacy hello response. (I.e., they MUST process responses strictly in the order they were received.)
A client follows these rules when processing the hello or legacy hello exhaust response:
- If the response indicates a command error, or a network error or timeout occurs, the client MUST close the connection and restart the monitoring protocol on a new connection. (See Network or command error during server check.)
- If the response is successful (includes "ok:1") and includes the OP_MSG moreToCome flag, then the client begins reading the next response.
- If the response is successful (includes "ok:1") and does not include the OP_MSG moreToCome flag, then the client initiates a new awaitable hello or legacy hello with the topologyVersion field from the previous response.
Socket timeout
Clients MUST use connectTimeoutMS as the timeout for the connection handshake. When connectTimeoutMS=0, the timeout is unlimited and MUST remain unlimited for awaitable hello and legacy hello replies. Otherwise, connectTimeoutMS is non-zero and clients MUST use connectTimeoutMS + heartbeatFrequencyMS as the timeout for awaitable hello and legacy hello replies.
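A minimal sketch of this timeout rule (values in milliseconds, with 0 meaning no timeout; the function name is illustrative):

```python
def awaitable_hello_read_timeout(connect_timeout_ms, heartbeat_frequency_ms):
    if connect_timeout_ms == 0:
        return 0  # unlimited, matching connectTimeoutMS=0
    return connect_timeout_ms + heartbeat_frequency_ms
```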
Measuring RTT
When using the streaming protocol, clients MUST issue a hello or legacy hello command to each server to measure RTT every heartbeatFrequencyMS. The RTT command MUST be run on a dedicated connection to each server. Clients MUST NOT use dedicated connections to measure RTT when the streaming protocol is not used. (See Monitors MUST use a dedicated connection for RTT commands.)
Clients MUST update the RTT from the hello or legacy hello duration of the initial connection handshake. Clients MUST NOT update RTT based on streaming hello or legacy hello responses.
Clients MUST ignore the response to the hello or legacy hello command when measuring RTT. Errors encountered when running a hello or legacy hello command MUST NOT update the topology. (See Why don't clients mark a server unknown when an RTT command fails?)
Clients MUST track the minimum RTT out of the (at most) last 10 samples. Clients MUST report the minimum RTT as 0 until at least 2 samples have been gathered.
When constructing a ServerDescription from a streaming hello or legacy hello response, clients MUST set the average and minimum round trip times from the RTT task as the "roundTripTime" and "minRoundTripTime" fields, respectively.
See the pseudocode in the RTT thread section for an example implementation.
SDAM Monitoring
Clients MUST publish a ServerHeartbeatStartedEvent before attempting to read the next hello or legacy hello exhaust response. (See Why must streaming hello or legacy hello clients publish ServerHeartbeatStartedEvents?)
Clients MUST NOT publish any events when running an RTT command. (See Why don't streaming hello or legacy hello clients publish events for RTT commands?)
Heartbeat frequency
In the polling protocol, a client sleeps between each hello or legacy hello check (for at least minHeartbeatFrequencyMS and up to heartbeatFrequencyMS). In the streaming protocol, after processing an "ok:1" hello or legacy hello response, the client MUST NOT sleep and MUST begin the next check immediately.
Clients MUST set maxAwaitTimeMS to heartbeatFrequencyMS.
hello or legacy hello Cancellation
When a client is closed, clients MUST cancel all hello and legacy hello checks; a monitor blocked waiting for the next streaming hello or legacy hello response MUST be interrupted such that threads may exit promptly without waiting maxAwaitTimeMS.
When a client marks a server Unknown from Network error when reading or writing, clients MUST cancel the hello or legacy hello check on that server and close the current monitoring connection. (See Drivers cancel in-progress monitor checks.)
Polling Protocol
The polling protocol is used to monitor MongoDB < 4.4 servers or when streaming is disabled. The client checks a server with a hello or legacy hello command and then sleeps for heartbeatFrequencyMS before running another check.
Marking the connection pool as ready (CMAP only)
When a monitor completes a successful check against a server, it MUST mark the connection pool for that server as "ready", and doing so MUST be synchronized with the update to the topology (e.g. by marking the pool as ready in onServerDescriptionChanged). This is required to ensure a server does not get selected while its pool is still paused. See the Connection Pool definition in the CMAP specification for more details on marking the pool as "ready".
Error handling
Network or command error during server check
When a server check fails due to a network error (including a network timeout) or a command error (`ok: 0`), the client MUST follow these steps:

- Close the current monitoring connection.
- Mark the server Unknown.
- Clear the connection pool for the server (See Clear the connection pool on both network and command errors). For CMAP compliant drivers, clearing the pool MUST be synchronized with marking the server as Unknown (see Why synchronize clearing a server's pool with updating the topology?). If this was a network timeout error, then the pool MUST be cleared with interruptInUseConnections = true (see Why does the pool need to support closing in use connections as part of its clear logic?).
- If this was a network error and the server was in a known state before the error, the client MUST NOT sleep and MUST begin the next check immediately. (See retry hello or legacy hello calls once and JAVA-1159.)
- Otherwise, wait for heartbeatFrequencyMS (or minHeartbeatFrequencyMS if a check is requested) before restarting the monitoring protocol on a new connection.
  - Note that even in the streaming protocol, a monitor in this state will wait for an application operation to request an immediate check or for the heartbeatFrequencyMS timeout to expire before beginning the next check.
See the pseudocode in the Monitor thread
section.
Note that this rule applies only to server checks during monitoring. It does not apply when multi-threaded clients update the topology from each handshake.
Implementation notes
This section intends to provide generous guidance to driver authors. It is complementary to the reference implementations. Words like "should", "may", and so on are used more casually here.
Monitor thread
Most platforms can use an event object to control the monitor thread. The event API here is assumed to be like the standard Python Event. heartbeatFrequencyMS is configurable, minHeartbeatFrequencyMS is always 500 milliseconds:
class Monitor(Thread):
def __init__():
# Monitor options:
serverAddress = serverAddress
connectTimeoutMS = connectTimeoutMS
heartbeatFrequencyMS = heartbeatFrequencyMS
minHeartbeatFrequencyMS = 500
stableApi = stableApi
if serverMonitoringMode == "stream":
streamingEnabled = True
elif serverMonitoringMode == "poll":
streamingEnabled = False
else: # serverMonitoringMode == "auto"
streamingEnabled = not isFaas()
# Internal Monitor state:
connection = Null
# Server API versioning implies that the server supports hello.
helloOk = stableApi != Null
description = default ServerDescription
lock = Mutex()
rttMonitor = RttMonitor(serverAddress, stableApi)
def run():
while this monitor is not stopped:
previousDescription = description
try:
description = checkServer(previousDescription)
except CheckCancelledError:
if this monitor is stopped:
# The client was closed.
return
# The client marked this server Unknown and cancelled this
# check during "Network error when reading or writing".
# Wait before running the next check.
wait()
continue
with client.lock:
topology.onServerDescriptionChanged(description, connection pool for server)
if description.error != Null:
# Clear the connection pool only after the server description is set to Unknown.
clear(interruptInUseConnections: isNetworkTimeout(description.error)) connection pool for server
# Immediately proceed to the next check if the previous response
# was successful and included the topologyVersion field, or the
# previous response included the moreToCome flag, or the server
# has just transitioned to Unknown from a network error.
serverSupportsStreaming = description.type != Unknown and description.topologyVersion != Null
connectionIsStreaming = connection != Null and connection.moreToCome
transitionedWithNetworkError = isNetworkError(description.error) and previousDescription.type != Unknown
if streamingEnabled and serverSupportsStreaming and not rttMonitor.started:
# Start the RttMonitor.
rttMonitor.run()
if (streamingEnabled and (serverSupportsStreaming or connectionIsStreaming)) or transitionedWithNetworkError:
continue
wait()
def setUpConnection():
# Take the mutex to avoid a data race because this code writes to the connection field and a concurrent
# cancelCheck call could be reading from it.
with lock:
# Server API versioning implies that the server supports hello.
helloOk = stableApi != Null
connection = new Connection(serverAddress)
if connectTimeoutMS != 0:
set connection timeout to connectTimeoutMS
# Do any potentially blocking operations after releasing the mutex.
create the socket and perform connection handshake
def checkServer(previousDescription):
try:
# The connection is null if this is the first check. It's closed if there was an error during the previous
# check or the previous check was cancelled.
if helloOk:
helloCommand = hello
            else:
helloCommand = legacy hello
if not connection or connection.isClosed():
setUpConnection()
rttMonitor.addSample(connection.handshakeDuration)
response = connection.handshakeResponse
elif connection.moreToCome:
response = read next helloCommand exhaust response
elif streamingEnabled and previousDescription.topologyVersion:
# Initiate streaming hello or legacy hello
if connectTimeoutMS != 0:
set connection timeout to connectTimeoutMS+heartbeatFrequencyMS
response = call {helloCommand: 1, helloOk: True, topologyVersion: previousDescription.topologyVersion, maxAwaitTimeMS: heartbeatFrequencyMS}
else:
# The server does not support topologyVersion or streamingEnabled=False.
response = call {helloCommand: 1, helloOk: True}
# If the server supports hello, then response.helloOk will be true
# and hello will be used for subsequent monitoring commands.
# If the server does not support hello, then response.helloOk will be undefined
# and legacy hello will be used for subsequent monitoring commands.
helloOk = response.helloOk
            return ServerDescription(response, rtt=rttMonitor.average(), minRoundTripTime=rttMonitor.min())
except Exception as exc:
close connection
rttMonitor.reset()
return ServerDescription(type=Unknown, error=exc)
def wait():
start = gettime()
# Can be awakened by requestCheck().
event.wait(heartbeatFrequencyMS)
event.clear()
waitTime = gettime() - start
if waitTime < minHeartbeatFrequencyMS:
# Cannot be awakened.
sleep(minHeartbeatFrequencyMS - waitTime)
Requesting an immediate check:
def requestCheck():
event.set()
hello or legacy hello Cancellation:
def cancelCheck():
# Take the mutex to avoid reading the connection value while setUpConnection is writing to it.
# Copy the connection value in the lock but do the actual cancellation outside.
with lock:
tempConnection = connection
if tempConnection:
interrupt connection read
close tempConnection
RTT thread
The requirements in the Measuring RTT section can be satisfied with an additional thread that periodically runs the hello or legacy hello command on a dedicated connection, for example:
class RttMonitor(Thread):
    def __init__():
        # Options:
        serverAddress = serverAddress
        connectTimeoutMS = connectTimeoutMS
        heartbeatFrequencyMS = heartbeatFrequencyMS
        stableApi = stableApi

        # Internal state:
        connection = Null
        # Server API versioning implies that the server supports hello.
        helloOk = stableApi != Null
        lock = Mutex()
        movingAverage = MovingAverage()
        # Track the min RTT seen in the most recent 10 samples.
        recentSamples = deque(maxlen=10)

    def reset():
        with lock:
            movingAverage.reset()
            recentSamples.clear()

    def addSample(rtt):
        with lock:
            movingAverage.update(rtt)
            recentSamples.append(rtt)

    def average():
        with lock:
            return movingAverage.get()

    def min():
        with lock:
            # Need at least 2 RTT samples.
            if len(recentSamples) < 2:
                return 0
            return min(recentSamples)

    def run():
        while this monitor is not stopped:
            try:
                rtt = pingServer()
                addSample(rtt)
            except Exception as exc:
                # Don't call reset() here. The Monitor thread is responsible
                # for resetting the average RTT.
                close connection
                connection = Null
                helloOk = stableApi != Null

            # Can be awakened when the client is closed.
            event.wait(heartbeatFrequencyMS)
            event.clear()

    def setUpConnection():
        # Server API versioning implies that the server supports hello.
        helloOk = stableApi != Null
        connection = new Connection(serverAddress)
        if connectTimeoutMS != 0:
            set connection timeout to connectTimeoutMS
        perform connection handshake

    def pingServer():
        if helloOk:
            helloCommand = hello
        else:
            helloCommand = legacy hello

        if not connection:
            setUpConnection()
            return RTT of the connection handshake

        start = time()
        response = call {helloCommand: 1, helloOk: True}
        rtt = time() - start
        helloOk = response.helloOk
        return rtt
Design Alternatives
Alternating hello or legacy hello to check servers and RTT without adding an extra connection
The streaming hello or legacy hello protocol is optimal in terms of latency: because clients are always blocked waiting for the server to stream updated hello or legacy hello information, they learn of server state changes as soon as possible. However, streaming hello or legacy hello has two downsides:
- Streaming hello or legacy hello requires a new connection to each server to calculate the RTT.
- Streaming hello or legacy hello requires a new thread (or threads) to calculate the RTT of each server.
To address these concerns we designed the alternating hello or legacy hello protocol. This protocol would have alternated between awaitable hello or legacy hello and regular hello or legacy hello. The awaitable hello or legacy hello replaces the polling protocol's client side sleep and allows the client to receive updated hello or legacy hello responses sooner. The regular hello or legacy hello allows the client to maintain accurate RTT calculations without requiring any extra threads or sockets.
We reject this design because streaming hello or legacy hello is strictly better at reducing the client's time-to-recovery. We determined that one extra connection per server per MongoClient is reasonable for all drivers. Applications that upgrade may see a modest increase in connections and memory usage on the server. We don't expect this increase to be problematic; however, we have several projects planned for future MongoDB releases to make the streaming hello or legacy hello protocol cheaper server-side which should mitigate the cost of the extra monitoring connections.
Use TCP smoothed round-trip time instead of measuring RTT explicitly
TCP sockets internally maintain a "smoothed round-trip time" or SRTT. Drivers could use this SRTT instead of measuring RTT explicitly via hello or legacy hello commands. The server could even include this value on all hello or legacy hello responses. We reject this idea for a few reasons:
- Not all programming languages have an API to access the TCP socket's RTT.
- On Windows, RTT access requires Admin privileges.
- TCP's SRTT would likely differ substantially from RTT measurements in the current protocol. For example, the SRTT can be reset on retransmission timeouts.
Rationale
Thread per server
Mongos uses a monitor thread per replica set, rather than a thread per server. A thread per server is impractical if mongos is monitoring a large number of replica sets. But a driver only monitors one.
In mongos, threads trying to do reads and writes join the effort to scan the replica set. Such threads are more likely to be abundant in mongos than in drivers, so mongos can rely on them to help with monitoring.
In short: mongos has different scaling concerns than a multi-threaded or asynchronous driver, so it allocates threads differently.
Socket timeout for monitoring is connectTimeoutMS
When a client waits for a server to respond to a connection, the client does not know if the server will respond eventually or if it is down. Users can help the client guess correctly by supplying a reasonable connectTimeoutMS for their network: on some networks a server is probably down if it hasn't responded in 10 ms, on others a server might still be up even if it hasn't responded in 10 seconds.
The socketTimeoutMS, on the other hand, must account for both network latency and the operation's duration on the server. Applications should typically set a very long or infinite socketTimeoutMS so they can wait for long-running MongoDB operations.
Multi-threaded clients use distinct sockets for monitoring and for application operations. A socket used for monitoring does two things: it connects and calls hello or legacy hello. Both operations are fast on the server, so only network latency matters. Thus both operations SHOULD use connectTimeoutMS, since that is the value users supply to help the client guess if a server is down, based on users' knowledge of expected latencies on their networks.
A monitor SHOULD NOT use the client's regular connection pool
If a multi-threaded driver's connection pool enforces a maximum size and monitors use sockets from the pool, there are two bad options: either monitors compete with the application for sockets, or monitors have the exceptional ability to create sockets even when the pool has reached its maximum size. The former risks starving the monitor. The latter is more complex than it is worth. (A lesson learned from PyMongo 2.6's pool, which implemented this option.)
Since this rule is justified for drivers that enforce a maximum pool size, this spec recommends that all drivers follow the same rule for the sake of consistency.
Monitors MUST use a dedicated connection for RTT commands
When using the streaming protocol, a monitor needs to maintain an extra dedicated connection to periodically update its average round trip time in order to support localThresholdMS from the Server Selection spec.
It could pop a connection from its regular pool, but we rejected this option for a few reasons:
- Under contention the RTT task may block application operations from completing in a timely manner.
- Under contention the application may block the RTT task from completing in a timely manner.
- Under contention the RTT task may often result in an extra connection anyway because the pool creates new connections under contention up to maxPoolSize.
- This would be inconsistent with the rule that a monitor SHOULD NOT use the client's regular connection pool.
The client could open and close a new connection for each RTT check. We rejected this design, because if we ping every heartbeatFrequencyMS (default 10 seconds) then the cost to the client and the server of creating and destroying the connection might exceed the cost of keeping a dedicated connection open.
Instead, the client must use a dedicated connection reserved for RTT commands. Despite the cost of the additional connection per server, we chose this option as the safest and least likely to result in surprising behavior under load.
Monitors MUST use the hello or legacy hello command to measure RTT
In the streaming protocol, clients could use the "ping", "hello", or legacy hello commands to measure RTT. This spec chooses "hello" or legacy hello for consistency with the polling protocol, as well as for consistency with the initial RTT provided by the connection handshake, which also uses the hello or legacy hello commands. Additionally, mongocryptd does not allow the ping command but does allow hello or legacy hello.
Why not use awaitedTimeMS in the server response to calculate RTT in the streaming protocol?
One approach to calculating RTT in the streaming protocol would be to have the server return an awaitedTimeMS in its hello or legacy hello response. A driver could then determine the RTT by calculating the difference between the initial request, or last response, and the awaitedTimeMS.
We rejected this design because of a number of issues with the unreliability of clocks in distributed systems. Local and remote system clocks skew relative to one another, and this approach mixes two notions of time: the local clock times the whole operation while the remote clock times the wait. If these clocks tick at different rates, or there are anomalies like clock changes, the results will be wrong. To make matters worse, the driver would be comparing times from multiple servers whose clocks could each tick at different rates. This approach biases toward servers with the fastest-ticking clocks, since they appear to spend the least time on the wire.
Additionally, systems using NTP will experience clock "slew". ntpd "slews" time by up to 500 parts-per-million to have the local time gradually approach the "true" time without big jumps - over a 10 second window that means a 5ms difference. If both sides are slewing in opposite directions, that can result in an effective difference of 10ms. Both of these times are close enough to localThresholdMS to significantly affect which servers are viable in NEAREST calculations.
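As a quick sanity check of those numbers (a sketch, not part of the specification):

# Worst-case NTP slew over one 10-second streaming window.
SLEW_PPM = 500          # ntpd slews by up to 500 parts-per-million
WINDOW_MS = 10_000      # one awaitable hello window (heartbeatFrequencyMS default)

one_sided_ms = WINDOW_MS * SLEW_PPM / 1_000_000   # 5.0 ms
two_sided_ms = 2 * one_sided_ms                   # 10.0 ms if the two sides slew in opposite directions

# Both values are a sizable fraction of the default localThresholdMS (15 ms).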
Ensuring that all measurements use the same clock obviates the need for a more complicated solution, and mitigates the above mentioned concerns.
Why don't clients mark a server unknown when an RTT command fails?
In the streaming protocol, clients use the hello or legacy hello command on a dedicated connection to measure a server's RTT. However, errors encountered when running the RTT command MUST NOT mark a server Unknown. We reached this decision because the dedicated RTT connection does not come from a connection pool and thus does not have a generation number associated with it. Without a generation number we cannot handle errors from the RTT command without introducing race conditions. Introducing such a generation number would add complexity to this design without much benefit. It is safe to ignore these errors because the Monitor will soon discover the server's state regardless (either through an updated streaming response, an error on the streaming connection, or by handling an error on an application connection).
Drivers cancel in-progress monitor checks
When an application operation fails with a non-timeout network error, drivers cancel that monitor's in-progress check.
We assume that a non-timeout network error on one application connection implies that all other connections to that server are also bad. This means that it is redundant to continue reading on the current monitoring connection. Instead, we cancel the current monitor check, close the monitoring connection, and start a new check soon. Note that we rely on the connection/pool generation number checking to avoid races and ensure that the monitoring connection is only closed once.
This approach also handles the rare case where the client sees a network error on an application connection but the monitoring connection is still healthy. If we did not cancel the monitor check in this scenario, then the server would remain in the Unknown state until the next hello or legacy hello response (up to maxAwaitTimeMS). A potential real world example of this behavior is when Azure closes an idle connection in the application pool.
Retry hello or legacy hello calls once
A monitor's connection to a server is long-lived and used only for hello or legacy hello calls. So if a server has responded in the past, a network error on the monitor's connection means that there was a network glitch, or a server restart since the last check, or that the server is truly down. To handle the case that the server is truly down, the monitor makes the server unselectable by marking it Unknown. To handle the case of a transient network glitch or restart, the monitor immediately runs the next check without waiting.
Clear the connection pool on both network and command errors
A monitor clears the connection pool when a server check fails with a network or command error (Network or command error during server check). When the check fails with a network error, it is likely that all connections to that server are also closed. (See JAVA-1252.) When the check fails with a network timeout error, a monitor MUST set interruptInUseConnections to true. See Why does the pool need to support closing in use connections as part of its clear logic?
When the server is shutting down, it may respond to hello or legacy hello commands with ShutdownInProgress errors before closing connections. In this case, the monitor clears the connection pool because all connections will be closed soon. Other command errors are unexpected but are handled identically.
Why must streaming hello or legacy hello clients publish ServerHeartbeatStartedEvents?
The SDAM Monitoring spec guarantees that every ServerHeartbeatStartedEvent has either a correlating ServerHeartbeatSucceededEvent or ServerHeartbeatFailedEvent. This is consistent with Command Monitoring on exhaust cursors where the driver publishes a fake CommandStartedEvent before reading the next getMore response.
Why don't streaming hello or legacy hello clients publish events for RTT commands?
In the streaming protocol, clients MUST NOT publish any events (server, topology, command, CMAP, etc..) when running an RTT command. We considered introducing new RTT events (ServerRTTStartedEvent, ServerRTTSucceededEvent, ServerRTTFailedEvent) but it's not clear that there is a demand for this. Applications can still monitor changes to a server's RTT by listening to TopologyDescriptionChangedEvents.
What is the purpose of the "awaited" field on server heartbeat events?
ServerHeartbeatSucceededEvents published from awaitable hello or legacy hello responses will regularly have 10 second durations. The spec introduces the "awaited" field on server heartbeat events so that applications can differentiate a slow heartbeat in the polling protocol from a normal awaitable hello or legacy hello heartbeat in the new protocol.
Why disable the streaming protocol on FaaS platforms like AWS Lambda?
The streaming protocol relies on the assumption that the client can read the server's heartbeat responses in a timely manner, otherwise the client will be acting on stale information. In many FaaS platforms, like AWS Lambda, host applications will be suspended and resumed many minutes later. This behavior causes a build up of heartbeat responses and the client can end up spending a long time in a catch up phase processing outdated responses. This problem was discovered in DRIVERS-2246.
Additionally, the streaming protocol requires an extra connection and thread per monitored server which is expensive on platforms like AWS Lambda. The extra connection is particularly inefficient when thousands of AWS instances and thus thousands of clients are used.
We decided to make polling the default behavior when running on FaaS platforms like AWS Lambda to improve scalability, performance, and reliability.
Why introduce a knob for serverMonitoringMode?
The serverMonitoringMode knob provides a workaround in cases where the polling protocol would be a better choice but the driver is not running on a FaaS platform. It also provides a workaround in case the FaaS detection logic becomes outdated or inaccurate.
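For illustration, an application could select the mode through the serverMonitoringMode URI option. This sketch assumes a PyMongo-style constructor; the exact spelling of the option may differ by driver:

from pymongo import MongoClient

# Force the polling protocol even when not running on a FaaS platform:
client = MongoClient("mongodb://db.example.com/?serverMonitoringMode=poll")

# The default, "auto", picks polling on detected FaaS platforms and
# streaming everywhere else; "stream" forces the streaming protocol.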
Changelog
- 2024-05-02: Migrated from reStructuredText to Markdown.
- 2020-02-20: Extracted server monitoring from SDAM into this new spec.
- 2020-03-09: A monitor check that creates a new connection MUST use the connection's handshake to update the topology.
- 2020-04-20: Add streaming heartbeat protocol.
- 2020-05-20: Include rationale for why we don't use awaitedTimeMS.
- 2020-06-11: Support connectTimeoutMS=0 in streaming heartbeat protocol.
- 2020-12-17: Mark the pool for a server as "ready" after performing a successful check. Synchronize pool clearing with SDAM updates.
- 2021-06-21: Added support for hello/helloOk to handshake and monitoring.
- 2021-06-24: Remove optimization mention that no longer applies.
- 2022-01-19: Add 90th percentile RTT tracking.
- 2022-02-24: Rename Versioned API to Stable API.
- 2022-04-05: Preemptively cancel in-progress operations when SDAM heartbeats time out.
- 2022-10-05: Remove spec front matter and reformat changelog.
- 2022-11-17: Add minimum RTT tracking and remove 90th percentile RTT.
- 2023-10-05: Add serverMonitoringMode and default to the polling protocol on FaaS. Clients MUST NOT use dedicated connections to measure RTT when using the polling protocol.
Polling SRV Records for mongos Discovery
- Status: Accepted
- Minimum Server Version: N/A
Abstract
Currently the Initial DNS Seedlist Discovery functionality provides a static seedlist when a MongoClient is constructed. Periodically polling the DNS SRV records would allow for the mongos proxy list to be updated without having to change client configuration.
This specification builds on top of the original Initial DNS Seedlist Discovery specification, and modifies the Server Discovery and Monitoring specification's definition of monitoring a set of mongos servers in a Sharded TopologyType.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Terms
rescan, rescanning
A rescan is the periodic scan of all DNS SRV records to discover a new set of mongos hosts.
rescanSRVIntervalMS
An internal value representing how often the DNS SRV records should be queried.
Implementation
If the initial topology was created through a mongodb+srv:// URI, then drivers MUST implement this specification by periodically rescanning the SRV DNS records. There MUST NOT be an option to turn this behaviour off.
Drivers MUST NOT implement this specification if they do not adhere fully to the Initial DNS Seedlist Discovery specification.
This feature is only available when Server Discovery has determined that the TopologyType is Sharded or Unknown. Drivers MUST NOT rescan SRV DNS records when the Topology is not Sharded (i.e. Single, ReplicaSetNoPrimary, or ReplicaSetWithPrimary).
The discovery of a set of mongos servers is explained in the seedlist discovery section of the original specification. The behaviour of the periodic rescan is similar, but not identical, to the behaviour of initial seedlist discovery. A periodic rescan MUST follow these rules:
- The driver will query the DNS server for SRV records on {hostname}.{domainname}, prefixed with the SRV service name and protocol. The SRV service name is provided in the srvServiceName URI option and defaults to mongodb. The protocol is always tcp. After prefixing, the URI should look like: _{srvServiceName}._tcp.{hostname}.{domainname}.
- A driver MUST verify that the host names returned through SRV records have the same parent {domainname}. When this verification fails, a driver:
  - MUST NOT add such a non-compliant host name to the topology
  - MUST NOT raise an error
  - SHOULD log the non-compliance, including the host name
  - MUST NOT initiate a connection to any such host
- If the DNS request returns no verified hosts in SRV records, no SRV records at all, or a DNS error happens, the driver:
  - MUST NOT change the topology
  - MUST NOT raise an error
  - SHOULD log this situation, including the reason why the DNS records could not be found, if possible
  - MUST temporarily set rescanSRVIntervalMS to heartbeatFrequencyMS until at least one verified SRV record is obtained
- For all verified host names, as returned through the DNS SRV query, the driver:
  - MUST remove all hosts that are part of the topology but are no longer in the returned set of valid hosts
  - MUST NOT remove all hosts and then re-add the ones that were returned; hosts that have not changed MUST be left alone and unchanged
  - If srvMaxHosts is zero or greater than or equal to the number of valid hosts, each valid new host MUST be added to the topology as Unknown
  - If srvMaxHosts is greater than zero and less than the number of valid hosts, valid new hosts MUST be randomly selected and added to the topology as Unknown until the topology has srvMaxHosts hosts. Drivers MUST use the same randomization algorithm as they do for initial selection.
- Priorities and weights in SRV records MUST continue to be ignored, and MUST NOT dictate which mongos server is used for new connections.
The rescan needs to happen periodically. As SRV records contain a TTL value, this value can be used to indicate when a rescan needs to happen. Different SRV records can have different TTL values. The rescanSRVIntervalMS value MUST be set to the lowest of the individual TTL values associated with the different SRV records in the most recent rescan, but MUST NOT be lower than 60 seconds. If a driver is unable to access the TTL values of SRV records, it MUST rescan every 60 seconds.
Drivers SHOULD endeavour to rescan and obtain a new list of mongos servers every rescanSRVIntervalMS milliseconds. The rescanSRVIntervalMS period SHOULD be calculated from the end of the previous rescan (or the end of the initial DNS seedlist discovery scan).
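A minimal sketch of a single rescan pass under the rules above. The helpers are hypothetical: each record is assumed to expose host and ttl attributes, and topology is assumed to expose hosts, remove, and add_unknown:

import random

MIN_RESCAN_MS = 60_000  # never rescan more often than every 60 seconds

def rescan(topology, records, parent_domain, srv_max_hosts, heartbeat_frequency_ms):
    """One SRV rescan pass; returns the next rescanSRVIntervalMS."""
    # Drop (and log) host names that fail parent-domain verification; never raise.
    valid = [r for r in records if r.host.endswith("." + parent_domain)]
    if not valid:
        # No verified hosts, no records, or a DNS error upstream: leave the
        # topology unchanged and retry at heartbeatFrequencyMS.
        return heartbeat_frequency_ms

    new_hosts = {r.host for r in valid}
    current = set(topology.hosts)
    for host in current - new_hosts:
        topology.remove(host)            # drop vanished hosts; leave the rest alone
    added = sorted(new_hosts - current)
    if srv_max_hosts > 0:
        room = max(0, srv_max_hosts - len(set(topology.hosts)))
        added = random.sample(added, min(room, len(added)))
    for host in added:
        topology.add_unknown(host)       # new hosts enter the topology as Unknown

    # Next interval: the lowest TTL among the records, clamped to the 60s floor.
    return max(MIN_RESCAN_MS, min(r.ttl * 1000 for r in valid))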
Multi-Threaded Drivers
A threaded driver MUST use a separate monitoring thread for scanning the DNS records so that DNS lookups don't block other operations.
Single-Threaded Drivers
The rescan MUST happen before scanning all servers as part of the normal scanning functionality, but only if rescanSRVIntervalMS has passed.
Test Plan
See README.md in the accompanying test directory.
Motivation for Change
The original Initial DNS Seedlist Discovery specification only regulates the initial list of mongos hosts to be used instead of a single hostname from a connection URI. Although this makes the initial configuration of a set of mongos servers a lot easier, it does not provide a method for updating the list of mongos servers in the topology.
Since the introduction of the mongodb+srv:// scheme to provide an initial seedlist, some users have requested additional functionality to be able to update the configured list of mongos hosts that make up the initially seeded topology.
Design Rationale
From the scope document
Should DNS polling use heartbeatFrequencyMS or DNS cache TTLs?
We have selected to use the lowest TTL among all DNS SRV records, with the caveat that the rescan interval must not be shorter than 60 seconds.
Should DNS polling also have a "fast polling" mode when no servers are available?
We have not opted to have a "fast polling" mode, but we did include a provision that a rescan happens every heartbeatFrequencyMS when DNS records are not available. The rationale is that polling DNS very frequently makes little sense because of DNS caching, which often relies on the TTL already anyway; but when we have no TTL records to reference, we still need a fallback frequency.
For the design
No option to turn off periodic rescanning
The design does not allow for an option to turn off the periodic rescanning of SRV records on the basis that we try to have as few options as possible: the "no knobs" philosophy.
Backwards Compatibility
This specification changes the behaviour of server monitoring by introducing a repeating DNS lookup of the SRV records.
Although this is an improvement in the mongodb+srv:// scheme, it can nonetheless break expectations of users who were familiar with the old behaviour. We do not expect this to negatively impact users.
Reference Implementation
Reference implementations are made for the following drivers:
- Perl
- C#
Security Implication
This specification has no security implications beyond the ones associated with the original Initial DNS Seedlist Discovery specification.
Future work
No future work is expected.
Changelog
- 2024-08-22: Migrated from reStructuredText to Markdown.
- 2022-10-05: Revise spec front matter and reformat changelog.
- 2021-10-14: Specify behavior for srvMaxHosts MongoClient option.
- 2021-09-15: Clarify that service name only defaults to mongodb, and should be defined by the srvServiceName URI option.
Server Selection
- Status: Accepted
- Minimum Server Version: 2.4
Abstract
MongoDB deployments may offer more than one server that can service an operation. This specification describes how MongoDB drivers and mongos shall select a server for either read or write operations. It includes the definition of a "read preference" document, configuration options, and algorithms for selecting a server for different deployment topologies.
Meta
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Motivation for Change
This specification builds upon the prior "Driver Read Preference" specification, which had a number of omissions, flaws or other deficiencies:
- Mandating features that implied monotonicity for situations where monotonicity is not guaranteed
- Mandating features that are not supported by mongos
- Neglecting to specify a single, standard way to calculate average latency times
- Specifying complex command-helper rules
- Omitting rules for applying read preferences to a single server or to select among multiple mongos servers
- Omitting test cases for verification of spec compliance
This revision addresses these problems as well as improving structure and specificity.
Additionally, it adds specifications for server selection more broadly:
- Selection of a server for write operations
- Server selection retry and timeout
Specification
Scope and general requirements
This specification describes how MongoDB drivers and mongos select a server for read and write operations, including commands, OP_QUERY, OP_INSERT, OP_UPDATE, and OP_DELETE. For read operations, it describes how drivers and mongos shall interpret a read preference document.
This specification does not apply to OP_GET_MORE or OP_KILL_CURSORS operations on cursors, which need to go to the same server that received an OP_QUERY and returned a cursor ID.
For operations that are part of a sharded transaction this specification only applies to the initial operation which starts the transaction on a mongos. This specification does not apply to subsequent operations that are part of the sharded transaction because all operations in a sharded transaction need to go to the same mongos server.
Drivers and mongos MUST conform to the semantics of this document, but SHOULD use language-appropriate data models or variable names.
This specification does not apply to commands issued for server monitoring or authentication.
Terms
Available
Describes a server that is believed to be reachable over the network and able to respond to requests. A server of type Unknown or PossiblePrimary is not available; other types are available.
Client
Software that communicates with a MongoDB deployment. This includes both drivers and mongos.
Candidate
Describes servers in a deployment that enter the selection process, determined by the read preference mode parameter and the servers' type. Depending on the mode, candidate servers might only include secondaries or might apply to all servers in the deployment.
Deployment
One or more servers that collectively provide access to a single logical set of MongoDB databases.
Command
An OP_QUERY operation targeting the '$cmd' collection namespace.
Direct connection
A driver connection mode that sends all database operations to a single server without regard for type.
Eligible
Describes candidate servers that also meet the criteria specified by the tag_sets and maxStalenessSeconds read preference parameters.
Hedged Read
A server mode in which the same query is dispatched in parallel to multiple replica set members.
Immediate topology check
For a multi-threaded or asynchronous client, this means waking all server monitors for an immediate check. For a single-threaded client, this means a (blocking) scan of all servers.
Latency window
When choosing between several suitable servers, the latency window is the range of acceptable RTTs from the shortest RTT to the shortest RTT plus the local threshold. E.g. if the shortest RTT is 15ms and the local threshold is 200ms, then the latency window ranges from 15ms to 215ms.
Local threshold
The maximum acceptable difference in milliseconds between the shortest RTT and the longest RTT of servers suitable to be selected.
Mode
One of several enumerated values used as part of a read preference, defining which server types are candidates for reads and the semantics for choosing a specific one.
Primary
Describes a server of type RSPrimary.
Query
An OP_QUERY operation targeting a regular (non '$cmd') collection namespace.
Read preference
The parameters describing which servers in a deployment can receive read operations, including mode, tag_sets, maxStalenessSeconds, and hedge.
RS
Abbreviation for "replica set".
RTT
Abbreviation for "round trip time".
Round trip time
The time in milliseconds to execute a hello or legacy hello command and receive a response for a given server. This spec differentiates between the RTT of a single hello or legacy hello command and a server's average RTT over several such commands.
Secondary
A server of type RSSecondary.
Staleness
A worst-case estimate of how far a secondary's replication lags behind the primary's last write.
Server
A mongod or mongos process.
Server selection
The process by which a server is chosen for a database operation out of all potential servers in a deployment.
Server type
An enumerated type indicating whether a server is up or down, whether it is a mongod or mongos, whether it belongs to a replica set and, if so, what role it serves in the replica set. See the Server Discovery and Monitoring spec for more details.
Suitable
Describes a server that meets all specified criteria for a read or write operation.
Tag
A single key/value pair describing either (1) a user-specified characteristic of a replica set member or (2) a desired characteristic for the target of a read operation. The key and value have no semantic meaning to the driver; they are arbitrary user choices.
Tag set
A document of zero or more tags. Each member of a replica set can be configured with zero or one tag set.
Tag set list
A list of zero or more tag sets. A read preference might have a tag set list used for selecting servers.
Topology
The state of a deployment, including its type, which servers are members, and the server types of members.
Topology type
An enumerated type indicating the semantics for monitoring servers and selecting servers for database operations. See the Server Discovery and Monitoring spec for more details.
Assumptions
- Unless they explicitly override these priorities, we assume our users prefer their applications to be, in order:
- Predictable: the behavior of the application should not change based on the deployment type, whether single mongod, replica set or sharded cluster.
- Resilient: applications will adapt to topology changes, if possible, without raising errors or requiring manual reconfiguration.
- Low-latency: all else being equal, faster responses to queries and writes are preferable.
- Clients know the state of a deployment based on some form of ongoing monitoring, following the rules defined in the
Server Discovery and Monitoring spec.
- They know which members are up or down, what their tag sets are, and their types.
- They know average round trip times to each available member.
- They detect reconfiguration and the addition or removal of members.
- The state of a deployment could change at any time, in between any network interaction.
- Servers might or might not be reachable; they can change type at any time, whether due to partitions, elections, or misconfiguration.
- Data rollbacks could occur at any time.
MongoClient Configuration
Selecting a server requires the following client-level configuration options:
localThresholdMS
This defines the size of the latency window for selecting among multiple suitable servers. The default is 15 (milliseconds). It MUST be configurable at the client level. It MUST NOT be configurable at the level of a database object, collection object, or at the level of an individual query.
In the prior read preference specification, localThresholdMS was called secondaryAcceptableLatencyMS by drivers. Drivers MUST support the new name for consistency, but MAY continue to support the legacy name to avoid a backward-breaking change.
mongos currently uses localThreshold and MAY continue to do so.
serverSelectionTimeoutMS
This defines the maximum time to block for server selection before throwing an exception. The default is 30,000 (milliseconds). It MUST be configurable at the client level. It MUST NOT be configurable at the level of a database object, collection object, or at the level of an individual query.
The actual timeout for server selection can be less than serverSelectionTimeoutMS. See Timeouts for rules to compute the exact value.
This default value was chosen to be sufficient for a typical server primary election to complete. As the server improves the speed of elections, this number may be revised downward.
Users that can tolerate long delays for server selection when the topology is in flux can set this higher. Users that want to "fail fast" when the topology is in flux can set this to a small number.
A serverSelectionTimeoutMS of zero MAY have special meaning in some drivers; zero's meaning is not defined in this spec, but all drivers SHOULD document the meaning of zero.
serverSelectionTryOnce
Single-threaded drivers MUST provide a "serverSelectionTryOnce" mode, in which the driver scans the topology exactly once after server selection fails, then either selects a server or raises an error.
The serverSelectionTryOnce option MUST be true by default. If it is set false, then the driver repeatedly searches for an appropriate server until the selection process times out (pausing minHeartbeatFrequencyMS between attempts, as required by the Server Discovery and Monitoring spec).
Users of single-threaded drivers MUST be able to control this mode in one or both of these ways:
- In code, pass true or false for an option called serverSelectionTryOnce, spelled idiomatically for the language, to the MongoClient constructor.
- Include "serverSelectionTryOnce=true" or "serverSelectionTryOnce=false" in the URI. The URI option is spelled the same for all drivers.
Conflicting usage of the URI option and the in-code option is an error.
Multi-threaded drivers MUST NOT provide this mode. (See single-threaded server selection implementation and the rationale for a "try once" mode.)
heartbeatFrequencyMS
This controls when topology updates are scheduled. See heartbeatFrequencyMS in the Server Discovery and Monitoring spec for details.
socketCheckIntervalMS
Only for single-threaded drivers.
The default socketCheckIntervalMS MUST be 5000 (5 seconds), and it MAY be configurable. If a socket has been idle for at least this long, it MUST be checked before being used again.
See checking an idle socket after socketCheckIntervalMS and what is the purpose of socketCheckIntervalMS?.
idleWritePeriodMS
A constant, how often an idle primary writes a no-op to the oplog. See idleWritePeriodMS in the Max Staleness spec for details.
smallestMaxStalenessSeconds
A constant, 90 seconds. See "Smallest allowed value for maxStalenessSeconds" in the Max Staleness Spec.
serverSelector
Implementations MAY allow configuration of an optional, application-provided function that augments the server selection rules. The function takes as a parameter a list of server descriptions representing the suitable servers for the read or write operation, and returns a list of server descriptions that should still be considered suitable.
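For example, an application might supply a selector that narrows reads to servers carrying a particular tag, while falling back to the full list so selection never starves (a sketch; server descriptions are assumed to expose a tags dictionary, and how the selector is registered is driver-specific):

def analytics_selector(suitable_servers):
    # Keep only servers tagged for analytics workloads, if any exist.
    tagged = [s for s in suitable_servers if s.tags.get("use") == "analytics"]
    return tagged or suitable_servers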
Read Preference
A read preference determines which servers are considered suitable for read operations. Read preferences are interpreted differently based on topology type. See topology-type-specific server selection rules for details.
When no servers are suitable, the selection might be retried or will eventually fail following the rules described in the Rules for server selection section.
Components of a read preference
A read preference consists of a mode and optional tag_sets, maxStalenessSeconds, and hedge. The mode prioritizes between primaries and secondaries to produce either a single suitable server or a list of candidate servers. If tag_sets and maxStalenessSeconds are set, they determine which candidate servers are eligible for selection. If hedge is set, it configures how server hedged reads are used.
The default mode is 'primary'. The default tag_sets is a list with an empty tag set: [{}]. The default maxStalenessSeconds is -1 or null, depending on the language. The default hedge is unset.
Each is explained in greater detail below.
mode
For a deployment with topology type ReplicaSetWithPrimary or ReplicaSetNoPrimary, the mode parameter controls whether primaries or secondaries are deemed suitable. Topology types Single and Sharded have different selection criteria and are described elsewhere.
Clients MUST support these modes:
primary
Only an available primary is suitable.
secondary
All secondaries (and only secondaries) are candidates, but only eligible candidates (i.e. after applying tag_sets and maxStalenessSeconds) are suitable.
primaryPreferred
If a primary is available, only the primary is suitable. Otherwise, all secondaries are candidates, but only eligible secondaries are suitable.
secondaryPreferred
All secondaries are candidates. If there is at least one eligible secondary, only eligible secondaries are suitable. Otherwise, when there are no eligible secondaries, the primary is suitable.
nearest
The primary and all secondaries are candidates, but only eligible candidates are suitable.
Note on other server types: The Server Discovery and Monitoring spec defines several other server types that could appear in a replica set. Such types are never candidates, eligible or suitable.
maxStalenessSeconds
The maximum replication lag, in wall clock time, that a secondary can suffer and still be eligible.
The default is no maximum staleness.
A maxStalenessSeconds of -1 MUST mean "no maximum". Drivers are also free to use None, null, or other representations of "no value" to represent "no max staleness".
Drivers MUST raise an error if maxStalenessSeconds is a positive number and the mode field is 'primary'.
A driver MUST raise an error if the TopologyType is ReplicaSetWithPrimary or ReplicaSetNoPrimary and either of these conditions is false:
maxStalenessSeconds * 1000 >= heartbeatFrequencyMS + idleWritePeriodMS
maxStalenessSeconds >= smallestMaxStalenessSeconds
heartbeatFrequencyMS is defined in the Server Discovery and Monitoring spec, and idleWritePeriodMS is defined to be 10 seconds in the Max Staleness spec.
See "Smallest allowed value for maxStalenessSeconds" in the Max Staleness Spec.
mongos MUST reject a read with maxStalenessSeconds provided and a mode of 'primary'.
mongos MUST reject a read with maxStalenessSeconds that is not a positive integer.
mongos MUST reject a read if maxStalenessSeconds is less than smallestMaxStalenessSeconds, with error code 160 (SERVER-24421).
During server selection, drivers (but not mongos) with minWireVersion < 5 MUST raise an error if maxStalenessSeconds is a positive number, and any available server's maxWireVersion is less than 5.
After filtering servers according to mode, and before filtering with tag_sets, eligibility MUST be determined from maxStalenessSeconds as follows:
- If maxStalenessSeconds is not a positive number, then all servers are eligible.
- Otherwise, calculate staleness. Non-secondary servers (including Mongos servers) have zero staleness. If TopologyType is ReplicaSetWithPrimary, a secondary's staleness is calculated using its ServerDescription "S" and the primary's ServerDescription "P":
  (S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
  (All datetime units are in milliseconds.)
  If TopologyType is ReplicaSetNoPrimary, a secondary's staleness is calculated using its ServerDescription "S" and the ServerDescription of the secondary with the greatest lastWriteDate, "SMax":
  SMax.lastWriteDate - S.lastWriteDate + heartbeatFrequencyMS
  Servers with staleness less than or equal to maxStalenessSeconds are eligible.
See the Max Staleness Spec for overall description and justification of this feature.
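Read as code, the two staleness formulas amount to the following sketch (all times in milliseconds; the attributes are snake_case renderings of the ServerDescription fields used above):

def staleness_ms(s, primary, smax_last_write_date, heartbeat_frequency_ms):
    """Worst-case staleness of secondary S, per the formulas above."""
    if primary is not None:
        # TopologyType is ReplicaSetWithPrimary; 'primary' is P's description.
        return ((s.last_update_time - s.last_write_date)
                - (primary.last_update_time - primary.last_write_date)
                + heartbeat_frequency_ms)
    # TopologyType is ReplicaSetNoPrimary; SMax is the secondary with the
    # greatest lastWriteDate.
    return smax_last_write_date - s.last_write_date + heartbeat_frequency_ms

# A secondary is then eligible iff staleness_ms(...) <= maxStalenessSeconds * 1000.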
tag_sets
The read preference tag_sets parameter is an ordered list of tag sets used to restrict the eligibility of servers, such as for data center awareness.
Clients MUST raise an error if a non-empty tag set is given in tag_sets and the mode field is 'primary'.
A read preference tag set (T) matches a server tag set (S), or equivalently a server tag set (S) matches a read preference tag set (T), if T is a subset of S (i.e. T ⊆ S).
For example, the read preference tag set "{ dc: 'ny', rack: '2' }" matches a secondary server with tag set "{ dc: 'ny', rack: '2', size: 'large' }".
A tag set that is an empty document matches any server, because the empty tag set is a subset of any tag set. This means the default tag_sets parameter ([{}]) matches all servers.
Tag sets are applied after filtering servers by mode and maxStalenessSeconds, and before selecting one server within the latency window.
Eligibility MUST be determined from tag_sets as follows:
- If the tag_sets list is empty then all candidate servers are eligible servers. (Note, the default of [{}] means an empty list probably won't often be seen, but if the client does not forbid an empty list, this rule MUST be implemented to handle that case.)
- If the tag_sets list is not empty, then tag sets are tried in order until a tag set matches at least one candidate server. All candidate servers matching that tag set are eligible servers. Subsequent tag sets in the list are ignored.
- If the tag_sets list is not empty and no tag set in the list matches any candidate server, no servers are eligible servers.
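The subset rule and the ordered-list semantics amount to the following sketch (candidate server descriptions are assumed to expose a tags dictionary):

def tag_set_matches(t, s):
    # A read preference tag set T matches a server tag set S iff T is a subset of S.
    return all(s.get(key) == value for key, value in t.items())

def eligible_by_tags(candidates, tag_sets):
    if not tag_sets:
        return list(candidates)    # empty list: all candidates are eligible
    for t in tag_sets:             # tag sets are tried strictly in order
        matched = [c for c in candidates if tag_set_matches(t, c.tags)]
        if matched:
            return matched         # subsequent tag sets are ignored
    return []                      # nothing matched: no servers are eligible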
hedge
The read preference hedge parameter is a document that configures how the server will perform hedged reads. It consists of the following keys:
- enabled: Enables or disables hedging
Hedged reads are automatically enabled in MongoDB 4.4+ when using a nearest read preference. To explicitly enable hedging, the hedge document must be passed. An empty document uses server defaults to control hedging, but the enabled key may be set to true or false to explicitly enable or disable hedged reads.
Drivers MAY allow users to specify an empty hedge document if they accept documents for read preference options. Any driver that exposes a builder API for read preference objects MUST NOT allow an empty hedge document to be constructed. In this case, the user MUST specify a value for enabled, which MUST default to true. If the user does not call a hedge API method, drivers MUST NOT send a hedge option to the server.
Read preference configuration
Drivers MUST allow users to configure a default read preference on a MongoClient object. Drivers MAY allow users to configure a default read preference on a Database or Collection object.
A read preference MAY be specified as an object, document or individual mode, tag_sets, and maxStalenessSeconds parameters, depending on what is most idiomatic for the language.
If more than one object has a default read preference, the default of the most specific object takes precedence. I.e. Collection is preferred over Database, which is preferred over MongoClient.
Drivers MAY allow users to set a read preference on queries on a per-operation basis similar to how hint or batchSize are set. E.g., in Python:
db.collection.find({}, read_preference=ReadPreference.SECONDARY)
db.collection.find({},
                   read_preference=ReadPreference.NEAREST,
                   tag_sets=[{'dc': 'ny'}],
                   maxStalenessSeconds=120,
                   hedge={'enabled': True})
Passing read preference to mongos and load balancers
If a server of type Mongos or LoadBalancer is selected for a read operation, the read preference is passed to the selected mongos through the use of $readPreference (as a Global Command Argument for OP_MSG or a query modifier for OP_QUERY) and, for OP_QUERY only, the SecondaryOk wire protocol flag, according to the following rules.
For OP_MSG:
- For mode 'primary', drivers MUST NOT set $readPreference.
- For all other read preference modes (i.e. 'secondary', 'primaryPreferred', ...), drivers MUST set $readPreference.
For OP_QUERY:
If the read preference contains only a mode parameter and the mode is 'primary' or 'secondaryPreferred', for maximum backwards compatibility with older versions of mongos, drivers MUST only use the value of the SecondaryOk wire protocol flag (i.e. set or unset) to indicate the desired read preference and MUST NOT use a $readPreference query modifier.
Therefore, when sending queries to a mongos or load balancer, the following rules apply:
- For mode 'primary', drivers MUST NOT set the SecondaryOk wire protocol flag and MUST NOT use $readPreference.
- For mode 'secondary', drivers MUST set the SecondaryOk wire protocol flag and MUST also use $readPreference.
- For mode 'primaryPreferred', drivers MUST set the SecondaryOk wire protocol flag and MUST also use $readPreference.
- For mode 'secondaryPreferred', drivers MUST set the SecondaryOk wire protocol flag. If the read preference contains a non-empty tag_sets parameter, maxStalenessSeconds is a positive integer, or the hedge parameter is non-empty, drivers MUST use $readPreference; otherwise, drivers MUST NOT use $readPreference.
- For mode 'nearest', drivers MUST set the SecondaryOk wire protocol flag and MUST also use $readPreference.
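Condensed into code, the OP_QUERY rules look like this sketch (the read preference object and its attribute names are assumptions for illustration):

def op_query_read_pref(rp):
    """Return (secondaryOk flag, attach $readPreference?) for a mongos query."""
    if rp.mode == "primary":
        return (False, False)
    if rp.mode == "secondaryPreferred":
        # $readPreference is only needed when more than the bare mode is in play.
        non_default = (any(rp.tag_sets or [])              # a non-empty tag set
                       or (rp.max_staleness_seconds or 0) > 0
                       or bool(rp.hedge))
        return (True, non_default)
    # 'secondary', 'primaryPreferred', 'nearest'
    return (True, True)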
The $readPreference query modifier sends the read preference as part of the query. The read preference field tag_sets is represented in a $readPreference document using the field name tags.
When sending a read operation via OP_QUERY and any $ modifier is used, including the $readPreference modifier, the query MUST be provided using the $query modifier like so:
{
    $query: {
        field1: 'query_value',
        field2: 'another_query_value'
    },
    $readPreference: {
        mode: 'secondary',
        tags: [ { 'dc': 'ny' } ],
        maxStalenessSeconds: 120,
        hedge: { enabled: true }
    }
}
Document structure
A valid $readPreference document for mongos or load balancer has the following requirements:
1. The mode field MUST be present exactly once with the mode represented in camel case:
   - 'primary'
   - 'secondary'
   - 'primaryPreferred'
   - 'secondaryPreferred'
   - 'nearest'
2. If the mode field is "primary", the tags, maxStalenessSeconds, and hedge fields MUST be absent.
   Otherwise, for other mode values, the tags field MUST either be absent or be present exactly once and have an array value containing at least one document. It MUST contain only documents, no other type.
   The maxStalenessSeconds field MUST either be absent or be present exactly once with an integer value.
   The hedge field MUST either be absent or be a document.
A mongos or service receiving a query with $readPreference SHOULD validate the mode, tags, maxStalenessSeconds, and hedge fields according to rules 1 and 2 above, but SHOULD ignore unrecognized fields for forward-compatibility rather than throwing an error.
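A sketch of the validation a receiving service might perform under rules 1 and 2, ignoring unrecognized fields (illustrative only; real error handling differs):

VALID_MODES = {"primary", "secondary", "primaryPreferred",
               "secondaryPreferred", "nearest"}

def validate_read_preference(doc):
    mode = doc.get("mode")
    if mode not in VALID_MODES:
        raise ValueError("mode must be present exactly once, in camel case")
    if mode == "primary":
        if any(k in doc for k in ("tags", "maxStalenessSeconds", "hedge")):
            raise ValueError("tags, maxStalenessSeconds, and hedge are forbidden with mode 'primary'")
        return
    tags = doc.get("tags")
    if tags is not None and not (isinstance(tags, list) and tags
                                 and all(isinstance(t, dict) for t in tags)):
        raise ValueError("tags must be an array of at least one document")
    mss = doc.get("maxStalenessSeconds")
    if mss is not None and not isinstance(mss, int):
        raise ValueError("maxStalenessSeconds must be an integer")
    hedge = doc.get("hedge")
    if hedge is not None and not isinstance(hedge, dict):
        raise ValueError("hedge must be a document")
    # Unrecognized fields are deliberately ignored for forward compatibility.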
Use of read preferences with commands
Because some commands are used for writes, deployment-changes or other state-changing side-effects, the use of read preference by a driver depends on the command and how it is invoked:
- Write commands: insert, update, delete, findAndModify
  Write commands are considered write operations and MUST follow the corresponding Rules for server selection for each topology type.
- Generic command method: typically command or runCommand
  The generic command method MUST act as a read operation for the purposes of server selection.
  The generic command method has a default read preference of mode 'primary'. The generic command method MUST ignore any default read preference from client, database or collection configuration. The generic command method SHOULD allow an optional read preference argument.
  If an explicit read preference argument is provided as part of the generic command method call, it MUST be used for server selection, regardless of the name of the command. It is up to the user to use an appropriate read preference, e.g. not calling renameCollection with a mode of 'secondary'.
  N.B.: "used for server selection" does not supersede rules for server selection on "Standalone" topologies, which ignore any requested read preference.
- Command-specific helper: methods that wrap database commands, like count, distinct, listCollections or renameCollection.
  Command-specific helpers MUST act as read operations for the purposes of server selection, with read preference rules defined by the following three categories of commands:
  - "must-use-primary": these commands have state-modifying effects and will only succeed on a primary. An example is renameCollection.
    These command-specific helpers MUST use a read preference mode of 'primary', MUST NOT take a read preference argument and MUST ignore any default read preference from client, database or collection configuration. Languages with dynamic argument lists MUST throw an error if a read preference is provided as an argument.
    Clients SHOULD rely on the server to return a "not writable primary" or other error if the command is "must-use-primary". Clients MAY raise an exception before sending the command if the topology type is Single and the server type is not "Standalone", "RSPrimary" or "Mongos", but the identification of the set of 'must-use-primary' commands is out of scope for this specification.
  - "should-use-primary": these commands are intended to be run on a primary, but would succeed -- albeit with possibly stale data -- when run against a secondary. An example is listCollections.
    These command-specific helpers MUST use a read preference mode of 'primary', MUST NOT take a read preference argument and MUST ignore any default read preference from client, database or collection configuration. Languages with dynamic argument lists MUST throw an error if a read preference is provided as an argument.
    Clients MUST NOT raise an exception if the topology type is Single.
  - "may-use-secondary": these commands run against primaries or secondaries, according to users' read preferences. They are sometimes called "query-like" commands.
    The current list of "may-use-secondary" commands includes:
    - aggregate without a write stage (e.g. $out, $merge)
    - collStats
    - count
    - dbStats
    - distinct
    - find
    - geoNear
    - geoSearch
    - group
    - mapReduce where the out option is { inline: 1 }
    - parallelCollectionScan
    Associated command-specific helpers SHOULD take a read preference argument and otherwise MUST use the default read preference from client, database, or collection configuration.
    For pre-5.0 servers, an aggregate command is "must-use-primary" if its pipeline contains a write stage (e.g. $out, $merge); otherwise, it is "may-use-secondary". For 5.0+ servers, secondaries can execute an aggregate command with a write stage and all aggregate commands are "may-use-secondary". This is discussed in more detail in Read preferences and server selection in the CRUD spec.
    If a client provides a specific helper for inline mapReduce, then it is "may-use-secondary" and the regular mapReduce helper is "must-use-primary". Otherwise, the mapReduce helper is "may-use-secondary" and it is the user's responsibility to specify {inline: 1} when running mapReduce on a secondary.
New command-specific helpers implemented in the future will be considered "must-use-primary", "should-use-primary" or "may-use-secondary" according to the specifications for those future commands. Command helper specifications SHOULD use those terms for clarity.
Rules for server selection
Server selection is a process which takes an operation type (read or write), a ClusterDescription, and optionally a read preference and, on success, returns a ServerDescription for an operation of the given type.
Server selection varies depending on whether a client is multi-threaded/asynchronous or single-threaded because a single-threaded client cannot rely on the topology state being updated in the background.
Timeouts
Multi-threaded drivers and single-threaded drivers with serverSelectionTryOnce set to false MUST enforce a timeout for the server selection process. The timeout MUST be computed as described in Client Side Operations Timeout: Server Selection.
Multi-threaded or asynchronous server selection
A driver that uses multi-threaded or asynchronous monitoring MUST unblock waiting operations as soon as server selection completes, even if not all servers have been checked by a monitor. Put differently, the client MUST NOT block server selection while waiting for server discovery to finish.
For example, if the client is discovering a replica set and the application attempts a read operation with mode 'primaryPreferred', the operation MUST proceed immediately if a suitable secondary is found, rather than blocking until the client has checked all members and possibly discovered a primary.
The number of threads allowed to wait for server selection SHOULD be either (a) the same as the number of threads allowed to wait for a connection from a pool; or (b) governed by a global or client-wide limit on number of waiting threads, depending on how resource limits are implemented by a driver.
operationCount
Multi-threaded or async drivers MUST keep track of the number of operations that a given server is currently executing (the server's operationCount). This value MUST be incremented once a server is selected for an operation and MUST be decremented once that operation has completed, regardless of its outcome. Where this value is stored is left as an implementation detail of the driver; some example locations include the Server type that also owns the connection pool for the server (if there exists such a type in the driver's implementation) or on the pool itself. Incrementing or decrementing a server's operationCount MUST NOT wake up any threads that are waiting for a topology update as part of server selection. See operationCount-based selection within the latency window (multi-threaded or async) for the rationale behind the way this value is used.
Server Selection Algorithm
For multi-threaded clients, the server selection algorithm is as follows:
1. Record the server selection start time and log a "Server selection started" message.
2. If the topology wire version is invalid, raise an error and log a "Server selection failed" message.
3. Find suitable servers by topology type and operation type. If a list of deprioritized servers is provided, and the topology is a sharded cluster, these servers should be selected only if there are no other suitable servers. The server selection algorithm MUST ignore the deprioritized servers if the topology is not a sharded cluster.
4. Filter the suitable servers by calling the optional, application-provided server selector.
5. If there are any suitable servers, filter them according to Filtering suitable servers based on the latency window and continue to the next step; otherwise, log a "Waiting for suitable server to become available" message if one has not already been logged for this operation, and goto Step #9.
6. Choose two servers at random from the set of suitable servers in the latency window. If there is only 1 server in the latency window, just select that server and goto Step #8.
7. Of the two randomly chosen servers, select the one with the lower operationCount. If both servers have the same operationCount, select arbitrarily between the two of them.
8. Increment the operationCount of the selected server and return it. Log a "Server selection succeeded" message. Do not go onto later steps.
9. Request an immediate topology check, then block the server selection thread until the topology changes or until the server selection timeout has elapsed.
10. If server selection has timed out, raise a server selection error and log a "Server selection failed" message.
11. Goto Step #2
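Steps 6 through 8 are a "power of two random choices" selection; as a sketch (servers are assumed to expose a mutable operation_count, the latency-window filter has already been applied, and in_window is non-empty):

import random

def choose_in_window(in_window):
    if len(in_window) == 1:
        chosen = in_window[0]
    else:
        a, b = random.sample(in_window, 2)
        # Prefer the less-loaded server; ties are broken arbitrarily.
        chosen = a if a.operation_count <= b.operation_count else b
    chosen.operation_count += 1   # decremented again when the operation completes
    return chosen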
Single-threaded server selection
Single-threaded drivers do not monitor the topology in the background. Instead, they MUST periodically update the topology during server selection as described below.
When serverSelectionTryOnce is true, server selection timeouts have no effect; a single immediate topology check will be done if the topology starts stale or if the first selection attempt fails.
When serverSelectionTryOnce is false, then the server selection loops until a server is successfully selected or until the selection timeout is exceeded.
Therefore, for single-threaded clients, the server selection algorithm is as follows:
1. Record the server selection start time and log a "Server selection started" message.
2. Record the maximum time as start time plus the computed timeout.
3. If the topology has not been scanned in heartbeatFrequencyMS milliseconds, mark the topology stale.
4. If the topology is stale, proceed as follows:
   - record the target scan time as last scan time plus minHeartbeatFrequencyMS
   - if serverSelectionTryOnce is false and the target scan time would exceed the maximum time, raise a server selection error and log a "Server selection failed" message
   - if the current time is less than the target scan time, sleep until the target scan time
   - do a blocking immediate topology check (which must also update the last scan time and mark the topology as no longer stale)
5. If the topology wire version is invalid, raise an error and log a "Server selection failed" message.
6. Find suitable servers by topology type and operation type. If a list of deprioritized servers is provided, and the topology is a sharded cluster, these servers should be selected only if there are no other suitable servers. The server selection algorithm MUST ignore the deprioritized servers if the topology is not a sharded cluster.
7. Filter the suitable servers by calling the optional, application-provided server selector.
8. If there are any suitable servers, filter them according to Filtering suitable servers based on the latency window, return one at random from the filtered servers, and log a "Server selection succeeded" message; otherwise, mark the topology stale and continue to Step #9.
9. If serverSelectionTryOnce is true and the last scan time is newer than the selection start time, raise a server selection error and log a "Server selection failed" message; otherwise, log a "Waiting for suitable server to become available" message if one has not already been logged for this operation, and goto Step #4.
10. If the current time exceeds the maximum time, raise a server selection error and log a "Server selection failed" message.
11. Goto Step #4.
Before using a socket to the selected server, drivers MUST check whether the socket has been used within the last socketCheckIntervalMS milliseconds. If the socket has been idle for longer, the driver MUST update the ServerDescription for the selected server. After updating, if the server is no longer suitable, the driver MUST repeat the server selection algorithm and select a new server.
Because single-threaded selection can do a blocking immediate check, the server selection timeout is not a hard deadline. The actual maximum server selection time for any given request can vary from the timeout minus minHeartbeatFrequencyMS to the timeout plus the time required for a blocking scan.

Single-threaded drivers MUST document that when serverSelectionTryOnce is true, selection may take up to the time required for a blocking scan, and when serverSelectionTryOnce is false, selection may take up to the timeout plus the time required for a blocking scan.
Topology type: Unknown
When a deployment has topology type "Unknown", no servers are suitable for read or write operations.
Topology type: Single
A deployment of topology type Single contains only a single server of any type. Topology type Single signifies a direct connection intended to receive all read and write operations.
Therefore, read preference is ignored during server selection with topology type Single. The single server is always suitable for reads if it is available. Depending on server type, the read preference is communicated to the server differently:

- Type Mongos: the read preference is sent to the server using the rules for Passing read preference to mongos and load balancers.
- Type Standalone: clients MUST NOT send the read preference to the server.
- For all other types, using OP_QUERY: clients MUST always set the SecondaryOk wire protocol flag on reads to ensure that any server type can handle the request.
- For all other types, using OP_MSG: if no read preference is configured by the application, or if the application read preference is Primary, then $readPreference MUST be set to { "mode": "primaryPreferred" } to ensure that any server type can handle the request (see the sketch following this list). If the application read preference is set otherwise, $readPreference MUST be set following Document structure.
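As a non-normative sketch of the OP_MSG rule above, a driver might derive the $readPreference document for a direct connection to a non-mongos server as follows (the function and parameter names are hypothetical; the tags and maxStalenessSeconds field names follow the Document structure section):

```python
def read_preference_for_op_msg(mode=None, tags=None, max_staleness=None):
    """$readPreference for OP_MSG on a topology type Single connection
    to a non-mongos server."""
    if mode is None or mode == "primary":
        # Upgrade so that any server type can handle the request.
        return {"mode": "primaryPreferred"}
    doc = {"mode": mode}
    if tags:
        doc["tags"] = tags
    if max_staleness is not None:
        doc["maxStalenessSeconds"] = max_staleness
    return doc

print(read_preference_for_op_msg())             # {'mode': 'primaryPreferred'}
print(read_preference_for_op_msg("secondary"))  # {'mode': 'secondary'}
```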
The single server is always suitable for write operations if it is available.
Topology type: LoadBalanced
During command construction, drivers MUST add a $readPreference
field to the command when required by
Passing read preference to mongos and load balancers; see the
Load Balancer Specification for details.
Topology types: ReplicaSetWithPrimary or ReplicaSetNoPrimary
A deployment with topology type ReplicaSetWithPrimary or ReplicaSetNoPrimary can have a mix of server types: RSPrimary (only in ReplicaSetWithPrimary), RSSecondary, RSArbiter, RSOther, RSGhost, Unknown or PossiblePrimary.
Read operations
For the purpose of selecting a server for read operations, the same rules apply to both ReplicaSetWithPrimary and ReplicaSetNoPrimary.
To select from the topology a server that matches the user's Read Preference:

If mode is 'primary', select the primary server.

If mode is 'secondary' or 'nearest':

1. Select all secondaries if mode is 'secondary', or all secondaries and the primary if mode is 'nearest'.
2. From these, filter out servers staler than maxStalenessSeconds if it is a positive number.
3. From the remaining servers, select servers matching the tag_sets.
4. From these, select one server within the latency window.

(See algorithm for filtering by staleness, algorithm for filtering by tag_sets, and filtering suitable servers based on the latency window for details on each step, and why is maxStalenessSeconds applied before tag_sets?.)

If mode is 'secondaryPreferred', attempt the selection algorithm with mode 'secondary' and the user's maxStalenessSeconds and tag_sets. If no server matches, select the primary.

If mode is 'primaryPreferred', select the primary if it is known, otherwise attempt the selection algorithm with mode 'secondary' and the user's maxStalenessSeconds and tag_sets.
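Taken together, these rules can be condensed into a short, non-normative sketch. The server records, field names (type, tags, staleness), and helper structure below are illustrative only, and the final latency-window step is assumed to be applied by the caller:

```python
def select_for_read(servers, mode, tag_sets=None, max_staleness=None):
    """Candidate servers for a replica-set read, before the latency window."""
    primaries = [s for s in servers if s["type"] == "RSPrimary"]
    secondaries = [s for s in servers if s["type"] == "RSSecondary"]

    def eligible(candidates):
        # maxStalenessSeconds is applied before tag_sets; see the Q&A.
        if max_staleness is not None and max_staleness > 0:
            candidates = [s for s in candidates if s["staleness"] <= max_staleness]
        # Fall through the tag sets in order; the first matching set wins.
        for tag_set in (tag_sets or [{}]):
            matching = [s for s in candidates
                        if all(s["tags"].get(k) == v for k, v in tag_set.items())]
            if matching:
                return matching
        return []

    if mode == "primary":
        return primaries
    if mode == "secondary":
        return eligible(secondaries)
    if mode == "nearest":
        return eligible(secondaries + primaries)
    if mode == "secondaryPreferred":
        return eligible(secondaries) or primaries
    if mode == "primaryPreferred":
        return primaries or eligible(secondaries)
    raise ValueError("unknown read preference mode: %s" % mode)

servers = [
    {"type": "RSPrimary",   "tags": {},             "staleness": 0},
    {"type": "RSSecondary", "tags": {"dc": "east"}, "staleness": 30},
    {"type": "RSSecondary", "tags": {"dc": "west"}, "staleness": 200},
]
print(select_for_read(servers, "secondaryPreferred",
                      tag_sets=[{"dc": "west"}, {"dc": "east"}],
                      max_staleness=90))  # the "east" secondary
```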
For all read preference modes except 'primary', clients MUST set the SecondaryOk wire protocol flag (OP_QUERY) or $readPreference global command argument (OP_MSG) to ensure that any suitable server can handle the request. If the read preference mode is 'primary', clients MUST NOT set the SecondaryOk wire protocol flag (OP_QUERY) or $readPreference global command argument (OP_MSG).
Write operations
If the topology type is ReplicaSetWithPrimary, only an available primary is suitable for write operations.
If the topology type is ReplicaSetNoPrimary, no servers are suitable for write operations.
Topology type: Sharded
A deployment of topology type Sharded contains one or more servers of type Mongos or Unknown.
For read operations, all servers of type Mongos are suitable; the mode, tag_sets, and maxStalenessSeconds read preference parameters are ignored for selecting a server, but are passed through to mongos. See Passing read preference to mongos and load balancers.
For write operations, all servers of type Mongos are suitable.
If more than one mongos is suitable, drivers MUST select a suitable server within the latency window (see Filtering suitable servers based on the latency window).
Round Trip Times and the Latency Window
Calculation of Average Round Trip Times
For every available server, clients MUST track the average RTT of server monitoring hello or legacy hello commands.

An Unknown server has no average RTT. When a server becomes unavailable, its average RTT MUST be cleared. Clients MAY implement this idiomatically (e.g. nil, -1, etc.).

When there is no average RTT for a server, the average RTT MUST be set equal to the first RTT measurement (i.e. the first hello or legacy hello command after the server becomes available).
After the first measurement, average RTT MUST be computed using an exponentially-weighted moving average formula, with a weighting factor (alpha) of 0.2. If the prior average is denoted old_rtt, then the new average (new_rtt) is computed from a new RTT measurement (x) using the following formula:
alpha = 0.2
new_rtt = alpha * x + (1 - alpha) * old_rtt
A weighting factor of 0.2 was chosen to put about 85% of the weight of the average RTT on the 9 most recent observations.
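For illustration, a minimal, non-normative implementation with a worked example (None stands in for "no average RTT"):

```python
ALPHA = 0.2  # the weighting factor from the formula above

def update_average_rtt(old_rtt, measurement):
    """Exponentially-weighted moving average of RTT, in milliseconds."""
    if old_rtt is None:  # no prior average: seed with the first sample
        return measurement
    return ALPHA * measurement + (1 - ALPHA) * old_rtt

# Successive hello RTT samples of 20, 30, and 25 ms:
rtt = None
for sample in (20.0, 30.0, 25.0):
    rtt = update_average_rtt(rtt, sample)
print(rtt)  # 20.0 -> 22.0 -> 22.6
```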
Filtering suitable servers based on the latency window
Server selection results in a set of zero or more suitable servers. If more than one server is suitable, a server MUST be selected from among those within the latency window.
The localThresholdMS configuration parameter controls the size of the latency window used to select a suitable server.

The shortest average round trip time (RTT) from among suitable servers anchors one end of the latency window (A). The other end is determined by adding localThresholdMS (B = A + localThresholdMS).

A server MUST be selected from among suitable servers that have an average RTT (RTT) within the latency window (i.e. A ≤ RTT ≤ B). In other words, the suitable server with the shortest average RTT is always a possible choice. Other servers could be chosen if their average RTTs are no more than localThresholdMS more than the shortest average RTT.
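A non-normative sketch of this filter, assuming a hypothetical avg_rtt_ms field and the default localThresholdMS of 15 milliseconds:

```python
def filter_latency_window(suitable, local_threshold_ms=15):
    """Keep suitable servers whose average RTT lies in [A, A + localThresholdMS],
    where A is the shortest average RTT among them."""
    if not suitable:
        return []
    shortest = min(s["avg_rtt_ms"] for s in suitable)
    return [s for s in suitable
            if s["avg_rtt_ms"] <= shortest + local_threshold_ms]

servers = [{"host": "a", "avg_rtt_ms": 10},
           {"host": "b", "avg_rtt_ms": 20},
           {"host": "c", "avg_rtt_ms": 40}]
print([s["host"] for s in filter_latency_window(servers)])  # ['a', 'b']
```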
See either Single-threaded server selection or Multi-threaded or asynchronous server selection for information on how to select a server from among those within the latency window.
Checking an Idle Socket After socketCheckIntervalMS
Only for single-threaded drivers.
If a server is selected that has an existing connection that has been idle for socketCheckIntervalMS, the driver MUST check the connection with the "ping" command. If the ping succeeds, use the selected connection. If not, set the server's type to Unknown and update the Topology Description according to the Server Discovery and Monitoring Spec, and attempt once more to select a server.
The logic is expressed in this pseudocode. The algorithm for the "getServer" function is suggested below, in Single-threaded server selection implementation:
    def getConnection(criteria):
        # Get a server for writes, or a server matching read prefs, by
        # running the server selection algorithm.
        server = getServer(criteria)
        if not server:
            throw server selection error

        connection = server.connection
        if connection is NULL:
            # connect to server and return connection
        else if connection has been idle < socketCheckIntervalMS:
            return connection
        else:
            try:
                use connection for "ping" command
                return connection
            except network error:
                close connection
                mark server Unknown and update Topology Description

                # Attempt *once* more to select.
                server = getServer(criteria)
                if not server:
                    throw server selection error
                # connect to server and return connection
See What is the purpose of socketCheckIntervalMS?.
Requests and Pinning Deprecated
The prior read preference specification included the concept of a "request", which pinned a server to a thread for subsequent, related reads. Requests and pinning are now deprecated. See What happened to pinning? for the rationale for this change.
Drivers with an existing request API MAY continue to provide it for backwards compatibility, but MUST document that pinning for the request does not guarantee monotonic reads.
Drivers MUST NOT automatically pin the client or a thread to a particular server without an explicit start_request (or comparable) method call.
Outside a legacy "request" API, drivers MUST use server selection for each individual read operation.
Logging
Please refer to the logging specification for details on logging implementations in general, including log levels, log components, and structured versus unstructured logging.
Drivers MUST support logging of server selection information via the following log messages. These messages MUST use the serverSelection log component.
The types used in the structured message definitions below are demonstrative, and drivers MAY use similar types instead so long as the information is present (e.g. a double instead of an integer, or a string instead of an integer if the structured logging framework does not support numeric types.)
Common Fields
The following key-value pairs MUST be included in all server selection log messages:
Key | Suggested Type | Value |
---|---|---|
selector | String | String representation of the selector being used to select the server. This can be a read preference or an application-provided custom selector. The exact content of the string is flexible depending on what the driver is able to log. At minimum, when the selector is a read preference this string MUST contain all components of the read preference, and when it is an application-provided custom selector the string MUST somehow indicate that it is a custom selector. |
operationId | Int | The driver-generated operation ID. Optional; only present if the driver generates operation IDs and this command has one. |
operation | String | The name of the operation for which a server is being selected. When server selection is being performed to select a server for a command, this MUST be the command name. |
topologyDescription | String | String representation of the current topology description. The format of the string is flexible and could be e.g. the toString() implementation for a driver's topology type, or an extended JSON representation of the topology object. |
"Server selection started" message
This message MUST be logged at debug
level. It MUST be emitted on the occasions specified either in
Multi-threaded or asynchronous server selection or
Single-threaded server selection, depending on which algorithm the driver
implements.
This message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Server selection started" |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Server selection started for operation {{operation}} with ID {{operationId}}. Selector: {{selector}}, topology description: {{topologyDescription}}
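For example, a driver using a structured logging framework might emit the message roughly as follows. This is a non-normative sketch; encoding the fields as JSON via Python's standard logging module is just one possible approach:

```python
import json
import logging

logger = logging.getLogger("serverSelection")  # the required log component

def log_server_selection_started(operation, operation_id, selector,
                                 topology_description):
    fields = {
        "message": "Server selection started",
        "operation": operation,
        "operationId": operation_id,
        "selector": selector,
        "topologyDescription": topology_description,
    }
    logger.debug(json.dumps(fields))  # emitted at debug level
```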
"Server selection succeeded" message
This message MUST be logged at debug
level. It MUST be emitted on the occasions specified either in
Multi-threaded or asynchronous server selection or
Single-threaded server selection, depending on which algorithm the driver
implements.
This message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Server selection succeeded" |
serverHost | String | The hostname, IP address, or Unix domain socket path for the selected server. |
serverPort | Int | The port for the selected server. Optional; not present for Unix domain sockets. When the user does not specify a port and the default (27017) is used, the driver SHOULD include it here. |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Server selection succeeded for operation {{operation}} with ID {{operationId}}. Selected server: {{serverHost}}:{{serverPort}}. Selector: {{selector}}, topology description: {{topologyDescription}}
"Server selection failed" message
This message MUST be logged at debug
level. It MUST be emitted on the occasions specified either in
Multi-threaded or asynchronous server selection or
Single-threaded server selection, depending on which algorithm the driver
implements.
This message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Server selection failed" |
failure | Flexible | Representation of the error the driver will throw regarding server selection failing. The type and format of this value is flexible; see the logging specification for details on representing errors in log messages. Drivers MUST take care to not include any information in this field that is already included in the log message; e.g. the topology description should not be duplicated within this field. |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Server selection failed for operation {{operation}} with ID {{operationId}}. Failure: {{failure}}. Selector: {{selector}}, topology description: {{topologyDescription}}
"Waiting for suitable server to become available" message
This message MUST be logged at info
level. It MUST be emitted on the occasions specified either in
Multi-threaded or asynchronous server selection or
Single-threaded server selection, depending on which algorithm the driver
implements.
In order to avoid generating redundant log messages, the driver MUST take care to emit this message only once per operation. We log the message only once because the only values that can change over time are:

- The remaining time: given the timestamp of the initial message and the timeout, the time remaining can always be inferred from the original message.
- The topology description: rather than logging these changes on a per-operation basis, users should observe them with a single set of messages for the entire client via SDAM log messages.
This message MUST contain the following key-value pairs:
Key | Suggested Type | Value |
---|---|---|
message | String | "Waiting for suitable server to become available" |
remainingTimeMS | Int | The remaining time left until server selection will time out. This MAY be omitted if the driver supports disabling server selection timeout altogether. |
The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:
Waiting for server to become available for operation {{operation}} with ID {{operationId}}. Remaining time: {{remainingTimeMS}} ms. Selector: {{selector}}, topology description: {{topologyDescription}}.
Implementation Notes
These are suggestions. As always, driver authors should balance cross-language standardization with backwards compatibility and the idioms of their language.
Modes
Modes ('primary', 'secondary', ...) are constants declared in whatever way is idiomatic for the programming language.
The constant values may be ints, strings, or whatever. However, when attaching modes to $readPreference, camel case must be used as described above in Passing read preference to mongos and load balancers.
primaryPreferred and secondaryPreferred
'primaryPreferred' is equivalent to selecting a server with read preference mode 'primary' (without tag_sets or maxStalenessSeconds), or, if that fails, falling back to selecting with read preference mode 'secondary' (with tag_sets and maxStalenessSeconds, if provided).

'secondaryPreferred' is the inverse: selecting with mode 'secondary' (with tag_sets and maxStalenessSeconds) and falling back to selecting with mode 'primary' (without tag_sets or maxStalenessSeconds).
Depending on the implementation, this may result in cleaner code.
nearest
The term 'nearest' is unfortunate, as it implies a choice based on geographic locality or absolute lowest latency, neither of which is true.
Instead, and unlike the other read preference modes, 'nearest' does not favor either primaries or secondaries; all servers are candidates and are filtered by tag_sets and maxStalenessSeconds.
To always select the server with the lowest RTT, users should use mode 'nearest' without tag_sets or maxStalenessSeconds and set localThresholdMS to zero.

To distribute reads across all members evenly regardless of RTT, users should use mode 'nearest' without tag_sets or maxStalenessSeconds and set localThresholdMS very high so that all servers fall within the latency window.

In both cases, tag_sets and maxStalenessSeconds could be used to further restrict the set of eligible servers, if desired.
Tag set lists
Tag set lists can be configured in the driver in whatever way is natural for the language.
Multi-threaded server selection implementation
The following example uses a single lock for clarity. Drivers are free to implement whatever concurrency model best suits their design.
The following is pseudocode for multi-threaded or asynchronous server selection:
    def getServer(criteria):
        client.lock.acquire()

        now = gettime()
        endTime = now + computed server selection timeout
        log a "server selection started" message

        while true:
            # The topologyDescription keeps track of whether any server
            # has an invalid wire version range
            if not topologyDescription.compatible:
                client.lock.release()
                log a "server selection failed" message
                throw invalid wire protocol range error with details

            if maxStalenessSeconds is set:
                if client minWireVersion < 5 and <any available server's maxWireVersion < 5>:
                    client.lock.release()
                    throw error
                if topologyDescription.type in (ReplicaSetWithPrimary, ReplicaSetNoPrimary):
                    if (maxStalenessSeconds * 1000 < heartbeatFrequencyMS + idleWritePeriodMS or
                        maxStalenessSeconds < smallestMaxStalenessSeconds):
                        client.lock.release()
                        throw error

            servers = all servers in topologyDescription matching criteria

            if serverSelector is not null:
                servers = serverSelector(servers)

            if servers is not empty:
                in_window = servers within the latency window
                if len(in_window) == 1:
                    selected = in_window[0]
                else:
                    server1, server2 = random two entries from in_window
                    if server1.operation_count <= server2.operation_count:
                        selected = server1
                    else:
                        selected = server2
                selected.operation_count += 1
                client.lock.release()
                return selected

            request that all monitors check immediately
            if the message was not logged already for this operation:
                log a "waiting for suitable server to become available" message

            # Wait for a new TopologyDescription. condition.wait() releases
            # client.lock while waiting and reacquires it before returning.
            # While a thread is waiting on client.condition, it is awakened
            # early whenever a server check completes.
            timeout_left = endTime - gettime()
            client.condition.wait(timeout_left)

            if now after endTime:
                client.lock.release()
                throw server selection error
Single-threaded server selection implementation
The following is pseudocode for single-threaded server selection:
    def getServer(criteria):
        startTime = gettime()
        loopEndTime = startTime
        maxTime = startTime + computed server selection timeout
        nextUpdateTime = topologyDescription.lastUpdateTime
            + heartbeatFrequencyMS/1000

        if nextUpdateTime < startTime:
            topologyDescription.stale = true

        while true:
            if topologyDescription.stale:
                scanReadyTime = topologyDescription.lastUpdateTime
                    + minHeartbeatFrequencyMS/1000
                if ((not serverSelectionTryOnce) and (scanReadyTime > maxTime)):
                    throw server selection error with details

                # using loopEndTime below is a proxy for "now" but avoids
                # the overhead of another gettime() call
                sleepTime = scanReadyTime - loopEndTime
                if sleepTime > 0:
                    sleep sleepTime

                rescan all servers
                topologyDescription.lastUpdateTime = gettime()
                topologyDescription.stale = false

            # topologyDescription keeps a record of whether any
            # server has an incompatible wire version range
            if not topologyDescription.compatible:
                topologyDescription.stale = true
                throw invalid wire version range error with details

            if maxStalenessSeconds is set:
                if client minWireVersion < 5 and <any available server's maxWireVersion < 5>:
                    throw error
                if topologyDescription.type in (ReplicaSetWithPrimary, ReplicaSetNoPrimary):
                    if (maxStalenessSeconds * 1000 < heartbeatFrequencyMS + idleWritePeriodMS or
                        maxStalenessSeconds < smallestMaxStalenessSeconds):
                        throw error

            servers = all servers in topologyDescription matching criteria

            if serverSelector is not null:
                servers = serverSelector(servers)

            if servers is not empty:
                in_window = servers within the latency window
                return random entry from in_window
            else:
                topologyDescription.stale = true

            loopEndTime = gettime()

            if serverSelectionTryOnce:
                if topologyDescription.lastUpdateTime > startTime:
                    throw server selection error with details
            else if loopEndTime > maxTime:
                throw server selection error with details

            if the message was not logged already:
                log a "waiting for suitable server to become available" message
Server Selection Errors
Drivers should use server descriptions and their error attributes (if set) to return useful error messages.
For example, when there are no members matching the ReadPreference:
- "No server available for query with ReadPreference primary"
- "No server available for query with ReadPreference secondary"
- "No server available for query with ReadPreference " + mode + ", tag set list " + tag_sets + ", and
maxStalenessSeconds
" + maxStalenessSeconds
Or, if authentication failed:
- "Authentication failed:
[specific error message]
"
Here is a sketch of some pseudocode for handling error reporting when errors could be different across servers:
    if there are any available servers:
        error_message = "No servers are suitable for " + criteria
    else if all ServerDescriptions' errors are the same:
        error_message = a ServerDescription.error value
    else:
        error_message = ', '.join(all ServerDescriptions' errors)
Cursors
Cursor operations OP_GET_MORE and OP_KILL_CURSORS do not go through the server selection process. Cursor operations must be sent to the original server that received the query and sent the OP_REPLY. For exhaust cursors, the same socket must be used for OP_GET_MORE until the cursor is exhausted.
Sharded Transactions
Operations that are part of a sharded transaction (after the initial command) do not go through the server selection process. Sharded transaction operations MUST be sent to the original mongos server on which the transaction was started.
The 'text' command and mongos
Note: As of MongoDB 2.6, mongos doesn't distribute the "text" command to secondaries, see SERVER-10947.
However, the "text" command is deprecated in 2.6, so this command-specific helper may become deprecated before this is fixed.
Test Plan
The server selection test plan is given in a separate document that describes the tests and supporting data files: Server Selection Tests
Design Rationale
Use of topology types
The prior version of the read preference spec had only a loose definition of server or topology types. The Server Discovery and Monitoring spec defines these terms explicitly and they are used here for consistency and clarity.
Consistency with mongos
In order to ensure that behavior is consistent regardless of topology type, read preference behaviors are limited to those that mongos can proxy.
For example, mongos ignores read preference 'secondary' when a shard consists of a single server. Therefore, this spec calls for topology type Single to ignore read preferences for consistency.
The spec has been written with the intention that it can apply to both drivers and mongos and the term "client" has been used when behaviors should apply to both. Behaviors that are specific to drivers are largely limited to those for communicating with a mongos.
New localThresholdMS configuration option name
Because this does not apply only to secondaries and does not limit absolute latency, the name secondaryAcceptableLatencyMS is misleading.

The mongos name localThreshold misleads because it has nothing to do with locality. It also doesn't include the MS units suffix for consistency with other time-related configuration options.

However, given a choice between the two, localThreshold is the more general term. For drivers, we add the MS suffix for clarity about units and consistency with other configuration options.
Random selection within the latency window (single-threaded)
When more than one server is judged to be suitable, the spec calls for random selection to ensure a fair distribution of work among servers within the latency window.
It would be hard to ensure a fair round-robin approach given the potential for servers to come and go. Making newly available servers either first or last could lead to unbalanced work. Random selection has a better fairness guarantee and keeps the design simpler.
operationCount-based selection within the latency window (multi-threaded or async)
As operation execution slows down on a node (e.g. due to degraded server-side performance or increased network latency), checked-out pooled connections to that node will begin to remain checked out for longer periods of time. Assuming at least constant incoming operation load, more connections will then need to be opened against the node to service new operations that it gets selected for, further straining it and slowing it down. This can lead to runaway connection creation scenarios that can cripple a deployment ("connection storms"). As part of DRIVERS-781, the random choice portion of multi-threaded server selection was changed to more evenly spread out the workload among suitable servers in order to prevent any single node from being overloaded. The new steps achieve this by approximating an individual server's load via the number of concurrent operations that node is processing (operationCount) and then routing operations to servers with less load. This should reduce the number of new operations routed towards nodes that are busier and thus increase the number routed towards nodes that are servicing operations faster or are simply less busy. The previous random selection mechanism did not take load into account and could assign work to nodes that were under too much stress already.
As an added benefit, the new approach gives preference to nodes that have recently been discovered and are thus are more likely to be alive (e.g. during a rolling restart). The narrowing to two random choices first ensures new servers aren't overly preferred however, preventing a "thundering herd" situation. Additionally, the maxConnecting provisions included in the CMAP specification prevent drivers from crippling new nodes with connection storms.
This approach is based on the "Power of Two Random Choices with Least Connections" load balancing algorithm.
An alternative approach to this would be to prefer selecting servers that already have available connections. While that approach could help reduce latency, it does not achieve the benefits of routing operations away from slow servers or of preferring newly introduced servers. Additionally, that approach could lead to the same node being selected repeatedly rather than spreading the load out among all suitable servers.
The SecondaryOk wire protocol flag
In server selection, there is a race condition that could exist between what a selected server type is believed to be and what it actually is.
The SecondaryOk wire protocol flag solves the race problem by communicating to the server whether a secondary is acceptable. The server knows its type and can return a "not writable primary" error if SecondaryOk is false and the server is a secondary.

However, because topology type Single is used for direct connections, we want read operations to succeed even against a secondary, so the SecondaryOk wire protocol flag must be sent to mongods with topology type Single.
(If the server type is Mongos, follow the rules for Passing read preference to mongos and load balancers, even for topology type Single.)
General command method going to primary
The list of commands that can go to secondaries changes over time and depends not just on the command but on parameters.
For example, the mapReduce command may or may not be able to be run on secondaries depending on the value of the out parameter.
It significantly simplifies implementation for the general command method to always go to the primary unless an explicit read preference is set, and to rely on users of the general command method to provide a read preference appropriate to the command.
The command-specific helpers will need to implement a check of read preferences against the semantics of the command and its parameters, but keeping this logic close to the command rather than in a generic method is a better design than either delegating this check to the generic method, duplicating the logic in the generic method, or coupling both to another validation method.
Average round trip time calculation
Using an exponentially-weighted moving average avoids having to store and rotate an arbitrary number of RTT observations. All observations count towards the average. The weighting makes recent observations count more heavily while smoothing volatility.
Verbose errors
Error messages should be sufficiently verbose to allow users and/or support engineers to determine the reasons for server selection failures from log or other error messages.
"Try once" mode
Single-threaded drivers in languages like PHP and Perl are typically deployed as many processes per application server. Each process must independently discover and monitor the MongoDB deployment.
When no suitable server is available (due to a partition or misconfiguration), it is better for each request to fail as soon as its process detects a problem, instead of waiting and retrying to see if the deployment recovers.
Minimizing response latency is important for maximizing request-handling capacity and for user experience (e.g. a quick fail message instead of a slow web page).
However, when a request arrives and the topology information is already stale, or no suitable server is known, making a single attempt to update the topology to service the request is acceptable.
A user of a single-threaded driver who prefers resilience in the face of topology problems, rather than short response times, can turn the "try once" mode off. The driver then rescans the topology every minHeartbeatFrequencyMS until a suitable server is found or the timeout expires.
What is the purpose of socketCheckIntervalMS?
Single-threaded clients need to make a compromise: if they check servers too frequently it slows down regular operations, but if they check too rarely they cannot proactively avoid errors.
Errors are more disruptive for single-threaded clients than for multi-threaded. If one thread in a multi-threaded process encounters an error, it warns the other threads not to use the disconnected server. But single-threaded clients are deployed as many independent processes per application server, and each process must throw an error until all have discovered that a server is down.
The compromise specified here balances the cost of frequent checks against the disruption of many errors. The client preemptively checks individual sockets that have not been used in the last socketCheckIntervalMS, which is more frequent by default than the heartbeatFrequencyMS defined in the Server Discovery and Monitoring Spec.
The client checks the socket with a "ping" command, rather than "hello" or legacy hello, because it is not checking the server's full state as in the Server Discovery and Monitoring Spec; it is only verifying that the connection is still open. We might also consider a select or poll call to check if the socket layer considers the socket closed, without requiring a round-trip to the server. However, this technique usually will not detect an uncleanly shut down server or a network outage.
Backwards Compatibility
In general, backwards breaking changes have been made in the name of consistency with mongos and avoiding misleading users about monotonicity.
- Features removed:
  - Automatic pinning (see What happened to pinning?)
  - Auto retry (replaced by the general server selection algorithm)
  - mongos "high availability" mode (effectively, mongos pinning)
- Other features and behaviors have changed explicitly:
  - Ignoring read preferences for topology type Single
  - Default read preference for the generic command method
- Changes with grandfather clauses:
  - Alternate names for localThresholdMS
  - Pinning for legacy request APIs
- Internal changes with little user-visibility:
  - Clarifying calculation of average RTT
Questions and Answers
What happened to pinning?
The prior read preference spec, which was implemented in the versions of the drivers and mongos released concomitantly with MongoDB 2.2, stated that a thread / client should remain pinned to an RS member as long as that member matched the current mode, tags, and acceptable latency. This increased the odds that reads would be monotonic (assuming no rollback), but had the following surprising consequence:
- Thread / client reads with mode 'secondary' or 'secondaryPreferred', gets pinned to a secondary
- Thread / client reads with mode 'primaryPreferred', driver / mongos sees that the pinned member (a secondary) matches the mode (which allows for a secondary) and reads from secondary, even though the primary is available and preferable
The old spec also had the swapped problem, reading from the primary with 'secondaryPreferred', except for mongos which was changed at the last minute before release with SERVER-6565.
This left application developers with two problems:
- 'primaryPreferred' and 'secondaryPreferred' acted surprisingly and unpredictably within requests
- There was no way to specify a common need: read from a secondary if possible with 'secondaryPreferred', then from primary if possible with 'primaryPreferred', all within a request. Instead an application developer would have to do the second read with 'primary', which would unpin the thread but risk unavailability if only secondaries were up.
Additionally, mongos 2.4 introduced the releaseConnectionsAfterResponse option (RCAR), mongos 2.6 made it the default and mongos 2.8 will remove the ability to turn it off. This means that pinning to a mongos offers no guarantee that connections to shards are pinned. Since we can't provide the same guarantees for replica sets and sharded clusters, we removed automatic pinning entirely and deprecated "requests". See SERVER-11956 and SERVER-12273.
Regardless, even for replica sets, pinning offers no monotonicity because of the ever-present possibility of rollbacks. Through MongoDB 2.6, secondaries did not close sockets on rollback, so a rollback could happen between any two queries without any indication to the driver.
Therefore, an inconsistent feature that doesn't actually do what people think it does has no place in the spec and has been removed. Should the server eventually implement some form of "sessions", this spec will need to be revised accordingly.
Why change from mongos High Availability (HA) to random selection?
Mongos HA has similar problems with pinning, in that one can wind up pinned to a high-latency mongos even if a lower-latency mongos later becomes available.
Selection within the latency window avoids this problem and makes server selection exactly analogous to having multiple suitable servers from a replica set. This is easier to explain and implement.
What happened to auto-retry?
The old auto-retry mechanism was closely connected to server pinning, which has been removed. It also mandated exactly three attempts to carry out a query on different servers, with no way to disable or adjust that value, and only for the first query within a request.
To the extent that auto-retry was trying to compensate for unavailable servers, the Server Discovery and Monitoring spec and new server selection algorithm provide a more robust and configurable way to direct all queries to available servers.
After a server is selected, several error conditions could still occur that make the selected server unsuitable for sending the operation, such as:
- the server could have shutdown the socket (e.g. a primary stepping down),
- a connection pool could be empty, requiring new connections; those connections could fail to connect or could fail the server handshake
Once an operation is sent over the wire, several additional error conditions could occur, such as:
- a socket timeout could occur before the server responds
- the server might send an RST packet, indicating the socket was already closed
- for write operations, the server might return a "not writable primary" error
This specification does not require nor prohibit drivers from attempting automatic recovery for various cases where it might be considered reasonable to do so, such as:
- repeating server selection if, after selection, a socket is determined to be unsuitable before a message is sent on it
- for a read operation, after a socket error, selecting a new server meeting the read preference and resending the query
- for a write operation, after a "not writable primary" error, selecting a new server (to locate the primary) and resending the write operation
Driver-common rules for retrying operations (and configuring such retries) could be the topic of a different, future specification.
Why is maxStalenessSeconds applied before tag_sets?
The intention of read preference's list of tag sets is to allow a user to prefer the first tag set but fall back to members matching later tag sets. In order to know whether to fall back or not, we must first filter by all other criteria.
Say you have two secondaries:
- Node 1, tagged {'tag': 'value1'}, estimated staleness 5 minutes
- Node 2, tagged {'tag': 'value2'}, estimated staleness 1 minute
And a read preference:
- mode: "secondary"
- maxStalenessSeconds: 120 (2 minutes)
- tag_sets: [{'tag': 'value1'}, {'tag': 'value2'}]
If tag sets were applied before maxStalenessSeconds, we would select Node 1 since it matches the first tag set, then filter it out because it is too stale, and be left with no eligible servers.
The user's intent in specifying two tag sets was to fall back to the second set if needed, so we filter by maxStalenessSeconds first, then tag_sets, and select Node 2.
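The following runnable, non-normative sketch reproduces that ordering on this example:

```python
nodes = [
    {"name": "node1", "tags": {"tag": "value1"}, "staleness": 300},  # 5 minutes
    {"name": "node2", "tags": {"tag": "value2"}, "staleness": 60},   # 1 minute
]
max_staleness_seconds = 120
tag_sets = [{"tag": "value1"}, {"tag": "value2"}]

# 1. Filter by staleness first...
fresh = [n for n in nodes if n["staleness"] <= max_staleness_seconds]

# 2. ...then fall through the tag sets in order.
selected = []
for tag_set in tag_sets:
    selected = [n for n in fresh
                if all(n["tags"].get(k) == v for k, v in tag_set.items())]
    if selected:
        break

print([n["name"] for n in selected])  # ['node2']
```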
References
- Server Discovery and Monitoring specification
- Driver Authentication specification
- Connection Monitoring and Pooling specification
Changelog
- 2015-06-26: Updated single-threaded selection logic with "stale" and serverSelectionTryOnce.
- 2015-08-10: Updated single-threaded selection logic to ensure a scan always happens at least once under serverSelectionTryOnce if selection fails. Removed the general selection algorithm and put full algorithms for each of the single- and multi-threaded sections. Added a requirement that single-threaded drivers document selection time expectations.
- 2016-07-21: Updated for Max Staleness support.
- 2016-08-03: Clarify selection algorithm, in particular that maxStalenessMS comes before tag_sets.
- 2016-10-24: Rename option from "maxStalenessMS" to "maxStalenessSeconds".
- 2016-10-25: Change minimum maxStalenessSeconds value from 2 * heartbeatFrequencyMS to heartbeatFrequencyMS + idleWritePeriodMS (with proper conversions, of course).
- 2016-11-01: Update formula for secondary staleness estimate with the equivalent, and clearer, expression of this formula from the Max Staleness Spec.
- 2016-11-21: Revert changes that would allow idleWritePeriodMS to change in the future; require maxStalenessSeconds to be at least 90.
- 2017-06-07: Clarify socketCheckIntervalMS behavior: single-threaded drivers must retry selection after checking an idle socket and discovering it is broken.
- 2017-11-10: Added application-configured server selector.
- 2017-11-12: Specify read preferences for OP_MSG with direct connection, and delete obsolete comment that direct connections to secondaries get "not writable primary" errors by design.
- 2018-01-22: Clarify that $query wrapping is only for OP_QUERY.
- 2018-01-22: Clarify that $out on aggregate follows the "$out Aggregation Pipeline Operator" spec and warns if read preference is not primary.
- 2018-01-29: Remove reference to the "$out Aggregation spec". Clarify runCommand selection rules.
- 2018-12-13: Update tag_set example to use only String values.
- 2019-05-20: Added rule to not send read preference to standalone servers.
- 2019-06-07: Clarify language for aggregate and mapReduce commands that write.
- 2020-03-17: Specify read preferences with support for server hedged reads.
- 2020-10-10: Consider server load when selecting servers within the latency window.
- 2021-04-07: Adding in behaviour for load balancer mode.
- 2021-05-12: Removed deprecated URI option in favour of readPreference=secondaryPreferred.
- 2021-05-13: Updated to use modern terminology.
- 2021-08-05: Updated $readPreference logic to describe OP_MSG behavior.
- 2021-09-03: Clarify that wire version check only applies to available servers.
- 2021-09-28: Note that 5.0+ secondaries support aggregate with write stages (e.g. $out and $merge). Clarify setting SecondaryOk wire protocol flag or $readPreference global command argument for replica set topology.
- 2022-01-19: Require that timeouts be applied per the client-side operations timeout spec.
- 2022-10-05: Remove spec front matter, move footnote, and reformat changelog.
- 2022-11-09: Add log messages and tests.
- 2023-08-26: Add list of deprioritized servers for sharded cluster topology.
- 2024-02-07: Migrated from reStructuredText to Markdown.
mongos 3.4 refuses to connect to mongods with maxWireVersion < 5, so it does no additional wire version checks related to maxStalenessSeconds.
Max Staleness
- Status: Accepted
- Minimum Server Version: 3.4
Abstract
Read preference gains a new option, "maxStalenessSeconds".
A client (driver or mongos) MUST estimate the staleness of each secondary, based on lastWriteDate values provided in server hello responses, and select only those secondaries whose staleness is less than or equal to maxStalenessSeconds.
Most of the implementation of the maxStalenessSeconds option is specified in the Server Discovery And Monitoring Spec and the Server Selection Spec. This document supplements those specs by collecting information specifically about maxStalenessSeconds.
Meta
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Motivation for Change
Users have often asked for ways to avoid reading from stale secondaries. An application with a geographically distributed replica set may want to prefer nearby members to minimize latency, while at the same time avoiding extremely laggy secondaries to mitigate the risk of very stale reads.
Goals
- Provide an approximate means of limiting the staleness of secondary reads.
- Provide a client-side knob to adjust the tradeoff between network-local reads and data recency.
- Be robust in the face of clock skew between the client and servers, and skew between the primary and secondaries.
- Avoid "inadvertent primary read preference": prevent a maxStalenessSeconds setting so small it forces all reads to the primary regardless of actual replication lag.
- Specify how mongos routers and shards track the opTimes of Config Servers as Replica Sets ("CSRS").
Non-Goals
- Provide a global server-side configuration of max acceptable staleness (see rejected ideas).
- Support small values for max staleness.
- Make a consistency guarantee resembling readConcern "afterOpTime".
- Specify how maxStalenessSeconds interacts with readConcern "afterOpTime" in drivers (distinct from the goal for routers and shards).
- Compensate for the duration of server checks in staleness estimations.
Specification
API
"maxStalenessSeconds" is a new read preference option, with a positive integer value. It MUST be configurable similar to other read preference options like "readPreference" and "tag_sets". Clients MUST also recognize it in the connection string:
mongodb://host/?readPreference=secondary&maxStalenessSeconds=120
Clients MUST consider "maxStalenessSeconds=-1" in the connection string to mean "no maximum staleness".
A connection string combining a positive maxStalenessSeconds with read preference mode "primary" MUST be considered invalid; this includes connection strings with no explicit read preference mode.
By default there is no maximum staleness.
A driver connected to a replica set requires that maxStalenessSeconds be absent, or be at least smallestMaxStalenessSeconds (90 seconds) and at least heartbeatFrequencyMS + idleWritePeriodMS. The exact mechanism for enforcement is defined in the Server Selection Spec.
Besides configuring maxStalenessSeconds in the connection string, the API for configuring it in code is not specified; drivers are free to use None, null, -1, or other representations of "no value" to represent "no max staleness".
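A non-normative sketch of these rules, collapsing the connection-string checks and the replica-set minimums (enforced at selection time per the Server Selection Spec) into a single function for illustration:

```python
SMALLEST_MAX_STALENESS_SECONDS = 90
IDLE_WRITE_PERIOD_MS = 10_000

def validate_max_staleness(max_staleness_seconds, mode, heartbeat_frequency_ms):
    """Return the effective max staleness, or None for "no maximum"."""
    if max_staleness_seconds == -1:
        return None  # "no maximum staleness"
    if max_staleness_seconds <= 0:
        raise ValueError("maxStalenessSeconds must be positive or -1")
    if mode is None or mode == "primary":
        raise ValueError("maxStalenessSeconds cannot be combined with mode 'primary'")
    if max_staleness_seconds < SMALLEST_MAX_STALENESS_SECONDS:
        raise ValueError("maxStalenessSeconds must be at least 90")
    if max_staleness_seconds * 1000 < heartbeat_frequency_ms + IDLE_WRITE_PERIOD_MS:
        raise ValueError("maxStalenessSeconds must be at least "
                         "heartbeatFrequencyMS + idleWritePeriodMS")
    return max_staleness_seconds

print(validate_max_staleness(120, "secondary", 10_000))  # 120
```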
Replica Sets
Replica set primaries and secondaries implement the following features to support maxStalenessSeconds.
idleWritePeriodMS
An idle primary writes a no-op to the oplog every 10 seconds to refresh secondaries' lastWriteDate values (see
SERVER-23892 and primary must write periodic no-ops). This spec refers to this
period as idleWritePeriodMS
with constant value 10,000.
lastWrite
A primary's or secondary's hello response contains a "lastWrite" subdocument with these fields (SERVER-8858):
- lastWriteDate: a BSON UTC datetime, the wall-clock time of the primary when it most recently recorded a write to the oplog.
- opTime: an opaque value representing the position in the oplog of the most recently seen write. Needed for sharding, not used for the maxStalenessSeconds read preference option.
Wire Version
The maxWireVersion MUST be incremented to 5 to indicate that the server includes maxStalenessSeconds features (SERVER-23893).
Client
A client (driver or mongos) MUST estimate the staleness of each secondary, based on lastWriteDate values provided in server hello responses, and select for reads only those secondaries whose estimated staleness is less than or equal to maxStalenessSeconds.
If any server's maxWireVersion is less than 5 and maxStalenessSeconds is a positive number, every attempt at server selection throws an error.
When there is a known primary, a secondary S's staleness is estimated with this formula:
(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
Where "P" and "S" are the primary's and secondary's ServerDescriptions. All datetimes are in milliseconds. The staleness estimate could be temporarily negative.
When there is no known primary, a secondary S's staleness is estimated with this formula:
SMax.lastWriteDate - S.lastWriteDate + heartbeatFrequencyMS
Where "SMax" is the secondary with the greatest lastWriteDate.
Explanation of Staleness Estimate With Primary
- When the client checks the primary, it gets the delta between the primary's lastWriteDate and the client clock. Call this "Client_to_Primary".
- When the client checks a secondary, it gets the delta between the secondary's lastWriteDate and the client clock. Call this "Client_to_Secondary".
- The difference of these two is an estimate of the delta between the primary's and secondary's lastWriteDate.
Thus:
staleness = Client_to_Secondary - Client_to_Primary
= (S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate)
Finally, add heartbeatFrequencyMS:
(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
This adjusts for the pessimistic assumption that S stops replicating right after S.lastUpdateTime, so it will be heartbeatFrequencyMS more stale by the time it is checked again. This means S must be fresh enough at S.lastUpdateTime to be eligible for reads from now until the next check, even if it stops replicating.
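Both estimates are easy to express in code. This non-normative sketch uses plain dicts for ServerDescriptions and reproduces the first worked example below (values in seconds, heartbeat frequency of 10):

```python
def estimate_staleness(secondary, primary=None, secondaries=(),
                       heartbeat_frequency=10):
    """Estimated staleness of `secondary`; all values share one unit."""
    if primary is not None:
        # Known primary: compare client-observed write deltas.
        return ((secondary["lastUpdateTime"] - secondary["lastWriteDate"])
                - (primary["lastUpdateTime"] - primary["lastWriteDate"])
                + heartbeat_frequency)
    # No known primary: compare against the freshest secondary, "SMax".
    smax_last_write = max(s["lastWriteDate"] for s in secondaries)
    return smax_last_write - secondary["lastWriteDate"] + heartbeat_frequency

S = {"lastUpdateTime": 60, "lastWriteDate": 0}
P = {"lastUpdateTime": 60, "lastWriteDate": 10}
print(estimate_staleness(S, primary=P))  # (60 - 0) - (60 - 10) + 10 = 20
```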
See the Server Discovery and Monitoring Spec and Server Selection Spec for details of client implementation.
Routers and shards
Background: Shard servers and mongos servers in a sharded cluster with CSRS use readConcern "afterOpTime" for consistency guarantees when querying the shard config.
Besides tracking lastWriteDate, routers and shards additionally track the opTime of CSRS members if they have maxWireVersion 5 or greater. (See Server Discovery and Monitoring Spec for details.)
When a router or shard selects a CSRS member to read from with readConcern like:
readConcern: { afterOpTime: OPTIME }
... then it follows this selection logic:
- Make a list of known CSRS data members.
- Filter out those whose last known opTime is older than OPTIME.
- If no servers remain, select the primary.
- Otherwise, select randomly one of the CSRS members whose roundTripTime is within localThresholdMS of the member with the fastest roundTripTime.
Step 4 is the standard localThresholdMS logic from the Server Selection Spec.
This algorithm helps routers and shards select a secondary that is likely to satisfy readConcern "afterOpTime" without blocking.
This feature is only for routers and shards, not drivers. See Future Work.
Reference Implementation
The C Driver (CDRIVER-1363) and Perl Driver (PERL-626).
Estimating Staleness: Example With a Primary and Continuous Writes
Consider a primary P and a secondary S, and a client with heartbeatFrequencyMS set to 10 seconds. Say that the primary's clock is 50 seconds skewed ahead of the client's.
The client checks P and S at time 60 (meaning 60 seconds past midnight) by the client's clock. The primary reports its lastWriteDate is 10.
Then, S reports its lastWriteDate is 0. The client estimates S's staleness as:
(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
= (60 - 0) - (60 - 10) + 10
= 20 seconds
(Values converted from milliseconds to seconds for the sake of discussion.)
Note that the secondary appears only 10 seconds stale at this moment, but the client adds heartbeatFrequencyMS, pessimistically assuming that the secondary will not replicate at all between now and the next check. If the current staleness plus heartbeatFrequencyMS is still less than maxStalenessSeconds, then we can safely read from the secondary from now until the next check.
The client re-checks P and S 10 seconds later, at time 70 by the client's clock. S responds first with a lastWriteDate of 5: it has fallen 5 seconds further behind. The client updates S's lastWriteDate and lastUpdateTime. The client now estimates S's staleness as:
(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
= (70 - 5) - (60 - 10) + 10
= 25 seconds
Say that P's response arrives 10 seconds later, at client time 80, and reports its lastWriteDate is 30. S's staleness is still 25 seconds:
(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
= (70 - 5) - (80 - 30) + 10
= 25 seconds
The same story as a table:
Client clock | Primary clock | Event | S.lastUpdateTime | S.lastWriteDate | P.lastUpdateTime | P.lastWriteDate | S staleness |
---|---|---|---|---|---|---|---|
60 | 10 | P and S respond | 60 | 0 | 60 | 10 | 20 seconds |
70 | 20 | S responds | 70 | 5 | 60 | 10 | 25 seconds |
80 | 30 | P responds | 70 | 5 | 80 | 30 | 25 seconds |
Estimating Staleness: Example With No Primary
Consider a replica set with secondaries S1 and S2, and no primary. S2 lags 15 seconds farther behind S1 and has not yet caught up. The client has heartbeatFrequencyMS set to 10 seconds.
When the client checks the two secondaries, S1's lastWriteDate is 20 and S2's lastWriteDate is 5.
Because S1 is the secondary with the maximum lastWriteDate, "SMax", its staleness estimate equals heartbeatFrequencyMS:
SMax.lastWriteDate - S.lastWriteDate + heartbeatFrequencyMS = 20 - 20 + 10 = 10
(Since max staleness must be at least heartbeatFrequencyMS + idleWritePeriodMS, S1 is eligible for reads no matter what.)
S2's staleness estimate is:
SMax.lastWriteDate - S.lastWriteDate + heartbeatFrequencyMS
= 20 - 5 + 10
= 25
Estimating Staleness: Example of Worst-Case Accuracy With Idle Replica Set
Consider a primary P and a secondary S, and a client with heartbeatFrequencyMS set to 500 ms. There is no clock skew. (Previous examples show that skew has no effect.)
The primary has been idle for 10 seconds and writes a no-op to the oplog at time 50 (meaning 50 seconds past midnight), and again at time 60.
Before the secondary can replicate the no-op at time 60, the client checks both servers. The primary reports its lastWriteDate is 60, the secondary reports 50.
The client estimates S's staleness as:
(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
= (60 - 50) - (60 - 60) + 0.5
= 10.5
The same story as a table:
Clock | Event | S.lastUpdateTime | S.lastWriteDate | P.lastUpdateTime | P.lastWriteDate | S staleness |
---|---|---|---|---|---|---|
50 | Idle write | | 50 | | 50 | |
60 | Idle write begins | | 50 | | 60 | |
60 | Client checks P and S | 60 | 50 | 60 | 60 | 10.5 |
60 | Idle write completes | | 60 | | 60 | |
In this scenario the actual secondary lag is between 0 and 10 seconds. But the staleness estimate can be as large as:
staleness = idleWritePeriodMS + heartbeatFrequencyMS
To ensure the secondary is always eligible for reads in an idle replica set, we require:
maxStalenessSeconds * 1000 >= heartbeatFrequencyMS + idleWritePeriodMS
Supplemental
Python scripts in this document's source directory:
- test_max_staleness_spo.py: Uses scipy.optimize to determine worst-case accuracy of the staleness estimate in an idle replica set.
- test_staleness_estimate.py: Tests whether a client would correctly select a secondary from an idle replica set, given a random distribution of values for maxStalenessSeconds, heartbeatFrequencyMS, lastWriteDate, and lastUpdateTime.
Test Plan
See max-staleness-tests.md, and the YAML and JSON tests in the tests directory.
Design Rationale
Specify max staleness in seconds
Other driver options that are timespans are in milliseconds, for example serverSelectionTimeoutMS. The max staleness option is specified in seconds, however, to make it obvious to users that clients can only enforce large, imprecise max staleness values.
maxStalenessSeconds is part of Read Preferences
maxStalenessSeconds MAY be configurable at the client, database, and collection level, and per operation, the same as other read preference fields are, because users expressed that their tolerance for stale reads varies per operation.
Primary must write periodic no-ops
Consider a scenario in which the primary does not write periodic no-ops:
- There are no writes for an hour.
- A client performs a heavy read-only workload with read preference mode "nearest" and maxStalenessSeconds of 90 seconds.
- The primary receives a write.
- In the brief time before any secondary replicates the write, the client re-checks all servers.
- Since the primary's lastWriteDate is an hour ahead of all secondaries', the client only queries the primary.
- After heartbeatFrequencyMS, the client re-checks all servers and finds that the secondaries aren't lagging after all, and resumes querying them.
This apparent "replication lag spike" is just a measurement error, but it causes exactly the behavior the user wanted to avoid: a small replication lag makes the client route all queries from the secondaries to the primary.
Therefore an idle primary must execute a no-op every 10 seconds (idleWritePeriodMS) to keep secondaries' lastWriteDate values close to the primary's clock. The no-op also keeps opTimes close to the primary's, which helps mongos choose an up-to-date secondary to read from in a CSRS.
Monitoring software like MongoDB Cloud Manager that charts replication lag will also benefit when spurious lag spikes are solved.
See Estimating Staleness: Example of Worst-Case Accuracy With Idle Replica Set, and SERVER-23892.
Smallest allowed value for maxStalenessSeconds
If maxStalenessSeconds is a positive number, it must be at least smallestMaxStalenessSeconds (90 seconds) and at least heartbeatFrequencyMS + idleWritePeriodMS. The exact mechanism for enforcement is defined in the Server Selection Spec.
The justification for heartbeatFrequencyMS + idleWritePeriodMS is technical: If maxStalenessSeconds is set to exactly heartbeatFrequencyMS (converted to seconds), then so long as a secondary lags even a millisecond it is ineligible. Despite the user's read preference mode, the client will always read from the primary.
This is an example of "inadvertent primary read preference": a maxStalenessSeconds setting so small it forces all reads to the primary regardless of actual replication lag. We want to prohibit this effect (see goals).
We also want to ensure that a secondary in an idle replica set is always considered eligible for reads with maxStalenessSeconds. See Estimating Staleness: Example of Worst-Case Accuracy With Idle Replica Set.
Requiring maxStalenessSeconds to be at least 90 seconds is a design choice. If the only requirement were that maxStalenessSeconds be at least heartbeatFrequencyMS + idleWritePeriodMS, then the smallest value would be 20 seconds for multi-threaded drivers (10 second idleWritePeriodMS plus multi-threaded drivers' default 10 second heartbeatFrequencyMS), 70 seconds for single-threaded drivers (whose default heartbeatFrequencyMS is 60 seconds), and 40 seconds for mongos (whose replica set monitor checks servers every 30 seconds).
The smallest configurable value for heartbeatFrequencyMS is 0.5 seconds, so maxStalenessSeconds could be as small as 10.5 when using a driver connected to a replica set, but mongos provides no such flexibility.
Therefore, this spec also requires that maxStalenessSeconds is at least 90:
- To provide a minimum for all languages and topologies that is easy to document and explain
- To avoid application breakage when moving from replica set to sharded cluster, or when using the same URI with different drivers
- To emphasize that maxStalenessSeconds is a low-precision heuristic
- To avoid the arbitrary-seeming minimum of 70 seconds imposed by single-threaded drivers
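A sketch of how a client might enforce these minimums when validating options (the normative enforcement mechanism is defined in the Server Selection Spec; names here are illustrative):

```python
SMALLEST_MAX_STALENESS_SECONDS = 90
IDLE_WRITE_PERIOD_MS = 10_000

def validate_max_staleness(max_staleness_seconds, heartbeat_frequency_ms):
    """Raise if a positive maxStalenessSeconds violates the spec's minimums."""
    if max_staleness_seconds == -1:
        return  # -1 means "no max staleness"
    if max_staleness_seconds < SMALLEST_MAX_STALENESS_SECONDS:
        raise ValueError("maxStalenessSeconds must be at least 90 seconds")
    if max_staleness_seconds * 1000 < heartbeat_frequency_ms + IDLE_WRITE_PERIOD_MS:
        raise ValueError("maxStalenessSeconds must be at least "
                         "heartbeatFrequencyMS + idleWritePeriodMS, in seconds")
```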
All servers must have wire version 5 to support maxStalenessSeconds
Clients with minWireVersion < 5 MUST throw an error if maxStalenessSeconds is set, and any available server in the topology has maxWireVersion less than 5.
An available server is defined in the Server Selection specification.
Servers began reporting lastWriteDate in wire protocol version 5, and clients require some or all servers' lastWriteDate in order to estimate any servers' staleness. The exact requirements of the formula vary according to TopologyType, so this spec makes a simple ruling: if any server is running an outdated version, maxStalenessSeconds cannot be supported.
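For example, the check could be applied during server selection along these lines (a sketch; the normative wording lives in the Server Selection specification):

```python
MAX_STALENESS_MIN_WIRE_VERSION = 5

def check_wire_versions(available_servers, max_staleness_seconds):
    """Raise if maxStalenessSeconds is set but an available server is too old."""
    if max_staleness_seconds == -1:
        return
    for server in available_servers:
        if server.max_wire_version < MAX_STALENESS_MIN_WIRE_VERSION:
            raise ValueError(
                "maxStalenessSeconds requires that all available servers "
                "report lastWriteDate (wire version 5 or later)")
```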
Rejected ideas
Add all secondaries' opTimes to primary's hello response
Not needed; each secondary's self-report of its opTime is just as good as the primary's.
Use opTimes from command responses besides hello
An idea was to add opTime to command responses that don't already include it (e.g., "find"), and use these opTimes to update ServerDescriptions more frequently than the periodic hello calls.
But while a server is not being used (e.g., while it is too stale, or while it does not match some other part of the Read Preference), only its periodic hello responses can update its opTime. Therefore, heartbeatFrequencyMS sets a lower bound on maxStalenessSeconds, so there is no benefit in recording each server's opTime more frequently. On the other hand there would be costs: effort adding opTime to all command responses, lock contention getting the opTime on the server and recording it on the client, complexity in the spec and the client code.
Use current time in staleness estimate
A proposed staleness formula estimated the secondary's worst possible staleness:
P.lastWriteDate + (now - P.lastUpdateTime) - S.lastWriteDate
In this proposed formula, the place occupied by "S.lastUpdateTime" in the actual formula is replaced with "now", at the moment in the server selection process when staleness is being estimated.
This formula attempted a worst-case estimate right now: it assumed the primary kept writing after the client checked it, and that the secondary replicated nothing since the client last checked the secondary. The formula was rejected because it would slosh load to and from the secondary during the interval between checks.
For example: Say heartbeatFrequencyMS is 10 seconds and maxStalenessSeconds is set to 25 seconds, and immediately after a secondary is checked its staleness is estimated at 20 seconds. It is eligible for reads until 5 seconds after the check, then it becomes ineligible, causing all queries to be directed to the primary until the next check, 5 seconds later.
Server-side Configuration
We considered a deployment-wide "max staleness" setting that servers communicate to clients in hello, e.g., "120 seconds is the max staleness." With such a setting, the read preference configuration would be simplified: "maxStalenessSeconds" is gone; instead we have "staleOk: true" (the default?) and "staleOk: false".
Based on Customer Advisory Board feedback, configuring staleness per-operation on the client side is more useful. We should merely avoid closing the door on a future server-side configuration feature.
References
Complaints about stale reads, and proposed solutions:
Future Work
Future feature to support readConcern "afterOpTime"
If a future spec allows applications to use readConcern "afterOpTime", clients should prefer secondaries that have already replicated to that opTime, so reads do not block. This is an extension of the mongos logic for CSRS to applications.
Future feature to support server-side configuration
For this spec, we chose to control maxStalenessSeconds in client code. A future spec could allow database administrators to configure from the server side how much replication lag makes a secondary too stale to read from. (See Server-side Configuration above.) This could be implemented atop the current feature: if a server communicates its staleness configuration in its hello response like:

```
{ hello: true, maxStalenessSeconds: 30 }
```

... then a future client can use the value from the server as its default maxStalenessSeconds when there is no client-side setting.
Changelog
- 2024-08-09: Updated wire versions in tests to 4.0+.
- 2024-04-30: Migrated from reStructuredText to Markdown.
- 2022-10-05: Remove spec front matter and revise changelog.
- 2021-09-08: Updated tests to support driver removal of support for server versions older than 3.6.
- 2021-09-03: Clarify that wire version check only applies to available servers.
- 2021-04-06: Updated to use hello command.
- 2016-11-21: Revert changes that would allow idleWritePeriodMS to change in the future, require maxStalenessSeconds to be at least 90.
- 2016-10-25: Change minimum maxStalenessSeconds value from 2 * heartbeatFrequencyMS to heartbeatFrequencyMS + idleWritePeriodMS (with proper conversions of course).
- 2016-10-24: Rename option from "maxStalenessMS" to "maxStalenessSeconds".
- 2016-09-29: Specify "no max staleness" in the URI with "maxStalenessMS=-1" instead of "maxStalenessMS=0".
Retryable Reads
- Status: Accepted
- Minimum Server Version: 3.6
Abstract
This specification is about the ability for drivers to automatically retry any read operation that has not yet received any results—due to a transient network error, a "not writable primary" error after a replica set failover, etc.
This specification will
- outline how an API for retryable read operations will be implemented in drivers
- define an option to enable retryable reads for an application.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Terms
Retryable Error
An error is considered retryable if it meets any of the criteria defined under Retryable Writes: Terms: Retryable Error, minus the final criterion about write concern errors. For convenience, the relevant criteria have been adapted to retryable reads and reproduced below.
An error is considered retryable if it meets any of the following criteria:
- any network exception (e.g. socket timeout or error)
- a server error response with any of the following codes:
Error Name | Error Code |
---|---|
ExceededTimeLimit | 262 |
InterruptedAtShutdown | 11600 |
InterruptedDueToReplStateChange | 11602 |
NotWritablePrimary | 10107 |
NotPrimaryNoSecondaryOk | 13435 |
NotPrimaryOrSecondary | 13436 |
PrimarySteppedDown | 189 |
ReadConcernMajorityNotAvailableYet | 134 |
ShutdownInProgress | 91 |
HostNotFound | 7 |
HostUnreachable | 6 |
NetworkTimeout | 89 |
SocketException | 9001 |
- a PoolClearedError
- Any of the above retryable errors that occur during a connection handshake (including the authentication step). For example, a network error or ShutdownInProgress error encountered when running the hello or saslContinue commands.
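Taken together, these criteria amount to a simple classifier. A Python sketch, with placeholder exception types standing in for a driver's own (illustrative only):

```python
class NetworkError(Exception):
    """Placeholder for a driver's network exception type."""

class PoolClearedError(Exception):
    """Placeholder for a CMAP PoolClearedError."""

RETRYABLE_READ_ERROR_CODES = {
    262,    # ExceededTimeLimit
    11600,  # InterruptedAtShutdown
    11602,  # InterruptedDueToReplStateChange
    10107,  # NotWritablePrimary
    13435,  # NotPrimaryNoSecondaryOk
    13436,  # NotPrimaryOrSecondary
    189,    # PrimarySteppedDown
    134,    # ReadConcernMajorityNotAvailableYet
    91,     # ShutdownInProgress
    7,      # HostNotFound
    6,      # HostUnreachable
    89,     # NetworkTimeout
    9001,   # SocketException
}

def is_retryable_read_error(exc):
    """Classify an error per the criteria above, including errors raised
    during the connection handshake or authentication step."""
    if isinstance(exc, (NetworkError, PoolClearedError)):
        return True
    return getattr(exc, "code", None) in RETRYABLE_READ_ERROR_CODES
```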
MongoClient Configuration
This specification introduces the following client-level configuration option.
retryReads
This boolean option determines whether retryable behavior will be applied to all read operations executed within the MongoClient. This option MUST default to true. As with retryable writes, this option MUST NOT be configurable at the level of an individual read operation, collection object, or database object. Drivers that expose a "high" and "core" API (e.g. Java and C# driver) MUST NOT expose a configurable option at the level of an individual read operation, collection object, or database object in "high", but MAY expose the option in "core."
Naming Deviations
As with retryable writes, drivers MUST use the defined name of `retryReads` for the connection string parameter to ensure portability of connection strings across applications and drivers. If drivers solicit MongoClient options through another mechanism (e.g. an options dictionary provided to the MongoClient constructor), drivers SHOULD use the defined name but MAY deviate to comply with their existing conventions. For example, a driver may use `retry_reads` instead of `retryReads`. For any other names in the spec, drivers SHOULD use the defined name but MAY deviate to comply with their existing conventions.
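As an illustration only (not any specific driver's documented API), a driver following these conventions might accept both spellings:

```python
# The URI option name is fixed by the spec:
client = MongoClient("mongodb://localhost:27017/?retryReads=false")

# A constructor option may follow the driver's own naming conventions:
client = MongoClient("mongodb://localhost:27017", retry_reads=False)
```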
Requirements for Retryable Reads
Supported Server Versions
Drivers MUST verify server eligibility by ensuring that `maxWireVersion` is at least 6, because retryable reads require a MongoDB 3.6 standalone, replica set, or sharded cluster, and the MongoDB 3.6 server wire version is 6 as defined in the Server Wire Version and Feature List specification.

The minimum server version is 3.6 because:

- It gives us version parity with retryable writes.
- It forces the retry attempt(s) to use the same implicit session, which makes it easier to track operations and kill any errant longer-running operation.
- It limits the scope of the implementation (`OP_QUERY` will not need to be supported).
Supported Read Operations
Drivers MUST support retryability for the following operations:
- All read operations defined in the CRUD specification, i.e.
  - `Collection.find()`
    - This includes the `find` operations backing the GridFS API.
  - `Collection.aggregate()`
    - Only if the pipeline does not include a write stage (e.g. `$out`, `$merge`)
  - `Collection.distinct()`
  - `Collection.count()`
    - Only required if the driver already provides `count()`
  - `Collection.estimatedDocumentCount()`
  - `Collection.countDocuments()`
- All read operation helpers in the change streams specification, i.e.
  - `Collection.watch()`
  - `Database.watch()`
  - `MongoClient.watch()`
- All enumeration commands, e.g.
  - `MongoClient.listDatabases()`
  - `Database.listCollections()`
  - `Collection.listIndexes()`
- Any read operations not defined in the aforementioned specifications:
  - Any read operation helpers, e.g. `Collection.findOne()`
Drivers SHOULD support retryability for the following operations:

- Any driver that provides generic command runners for read commands (with logic to inherit a client-level read concern) SHOULD implement retryability for the read-only command runner.
Most of the above methods are defined in the following specifications:
Unsupported Read Operations
Drivers MUST NOT retry the following operations:
- `Collection.mapReduce()`
  - This is due to the "Early Failure on Socket Disconnect" feature not supporting `mapReduce`.
  - N.B. If `mapReduce` is executed via a generic command runner for read commands, drivers SHOULD NOT inspect the command to prevent `mapReduce` from retrying.
- `Cursor.getMore()`
- The generic `runCommand` helper, even if it is passed a read command.
  - N.B.: This applies only to a generic command runner, which is agnostic about the read/write nature of the command.
Implementing Retryable Reads
Executing Retryable Read Commands
Executing retryable read commands is extremely similar to executing retryable write commands. The following explanation for executing retryable read commands has been adapted from the explanation for executing retryable write commands.
1. Selecting the initial server
The driver selects the initial server for the command as usual. When selecting a server for the first attempt of a retryable read command, drivers MUST allow a server selection error to propagate. In this case, the caller is able to infer that no attempt was made.
2. Determining whether retry should be allowed
The driver then determines whether a retry attempt should be allowed.
2a. When not to allow retry
Drivers MUST attempt to execute the read command exactly once and allow any errors to propagate under any of the following conditions:

- if retryable reads is not enabled or
- if the selected server does not support retryable reads or
- if the session is in a transaction
By allowing the error to propagate, the caller is able to infer that one attempt was made.
2b. When to allow retry
Drivers MUST only attempt to retry a read command if
- retryable reads are enabled and
- the selected server supports retryable reads and
- the previous attempt yields a retryable error
3. Deciding to allow retry, encountering the initial retryable error, and selecting a server
If the driver decides to allow retry and the previous attempt of a retryable read command encounters a retryable error, the driver MUST update its topology according to the Server Discovery and Monitoring spec (see SDAM: Error Handling) and capture this original retryable error. Drivers should then proceed with selecting a server for a retry attempt.
3a. Selecting the server for retry
In a sharded cluster, the server on which the operation failed MUST be provided to the server selection mechanism as a deprioritized server.
If the driver cannot select a server for a retry attempt or the newly selected server does not support retryable reads, retrying is not possible and drivers MUST raise the previous retryable error. In both cases, the caller is able to infer that an attempt was made.
3b. Sending an equivalent command for a retry attempt
After server selection, a driver MUST send a valid command to the newly selected server that is equivalent[^1] to the initial command sent to the first server. If the driver determines that the newly selected server may not be able to support a command equivalent to the initial command, drivers MUST NOT retry and MUST raise the previous retryable error.
The above requirement can be fulfilled in one of two ways:

1. During a retry attempt, the driver SHOULD recreate the command while adhering to that operation's specification's server/wire version requirements. If an error occurs while recreating the command, then the driver MUST raise the original retryable error.

   For example, if the wire version dips from W0 to W1 after server selection, and the spec for operation O notes that for wire version W1, field F should be omitted, then field F should be omitted. If the spec for operation O requires the driver to error out if field F is defined when talking to a server with wire version W1, then the driver must error out and raise the original retryable error.

2. Alternatively, if a driver chooses not to recreate the command as described above, then the driver MUST NOT retry if the server/wire version dips after server selection and MUST raise the original retryable error.

   For example, if the wire version dips after server selection, the driver can choose to not retry and simply raise the original retryable error because there is no guarantee that the lower versioned server can support the original command.
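A sketch of the second, simpler approach (names are illustrative):

```python
def can_retry_on(original_server, new_server):
    """Option 2: refuse to retry if the wire version dipped after reselection.

    A lower-versioned server may not support a command equivalent to the one
    sent on the first attempt, so the original retryable error is raised
    instead of retrying.
    """
    return new_server.max_wire_version >= original_server.max_wire_version
```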
3c. If a retry attempt fails
If a retry attempt also fails and Client Side Operations Timeout (CSOT) is enabled and the timeout has not yet expired, then the driver MUST jump back to step 2b above in order to allow multiple retry attempts.
Otherwise, drivers MUST update their topology according to the SDAM spec (see SDAM: Error Handling). If an error would not allow the caller to infer that an attempt was made (e.g. connection pool exception originating from the driver), the previous error should be raised. If a retry failed due to another retryable error or some other error originating from the server, that error should be raised instead as the caller can infer that an attempt was made and the second error is likely more relevant (with respect to the current topology state).
If a driver associates server information (e.g. the server address or description) with an error, the driver MUST ensure that the reported server information corresponds to the server that originated the error.
4. Implementation constraints
When retrying a read command, drivers MUST NOT resend the original wire protocol message (see: Can drivers resend the same wire protocol message on retry attempts?).
Pseudocode
The following pseudocode for executing retryable read commands has been adapted from the pseudocode for executing retryable write commands and reflects the flow described above.
```
/**
 * Checks if a connection supports retryable reads.
 */
function isRetryableReadsSupported(connection) {
    return connection.MaxWireVersion >= RETRYABLE_READS_MIN_WIRE_VERSION;
}

/**
 * Executes a read command in the context of a MongoClient where retryable
 * reads have been enabled. The session parameter may be an implicit or
 * explicit client session (depending on how the CRUD method was invoked).
 */
function executeRetryableRead(command, session) {
    Exception previousError = null;
    retrying = false;
    Server previousServer = null;
    while true {
        if (previousError != null) {
            retrying = true;
        }
        try {
            if (previousServer == null) {
                server = selectServer();
            } else {
                /* If a previous attempt was made, deprioritize the previous
                 * server where the command failed. */
                deprioritizedServers = [ previousServer ];
                server = selectServer(deprioritizedServers);
            }
        } catch (ServerSelectionException exception) {
            if (previousError == null) {
                /* If this is the first attempt, propagate the exception. */
                throw exception;
            }
            /* For retries, propagate the previous error. */
            throw previousError;
        }
        try {
            connection = server.getConnection();
        } catch (PoolClearedException poolClearedError) {
            /* PoolClearedException indicates the operation did not even attempt to
             * create a connection, let alone execute the operation. This means we
             * are always safe to attempt a retry. We do not need to update SDAM,
             * since whatever error caused the pool to be cleared will do so itself. */
            if (previousError == null) {
                previousError = poolClearedError;
            }
            /* CSOT is enabled and the operation has timed out. */
            if (timeoutMS != null && isExpired(timeoutMS)) {
                throw previousError;
            }
            continue;
        }
        if ( !isRetryableReadsSupported(connection) || session.inTransaction()) {
            /* If this is the first loop iteration and we determine that retryable
             * reads are not supported, execute the command once and allow any
             * errors to propagate. */
            if (previousError == null) {
                return executeCommand(connection, command);
            }
            /* If the server selected for retrying is too old, throw the previous error.
             * The caller can then infer that an attempt was made and failed. This case
             * is very rare, and likely means that the cluster is in the midst of a
             * downgrade. */
            throw previousError;
        }
        /* NetworkException and NotWritablePrimaryException are both retryable errors. If
         * caught, remember the exception, update SDAM accordingly, and proceed with
         * retrying the operation.
         *
         * Exceptions that originate from the driver (e.g. no socket available
         * from the connection pool) are treated as fatal. Any such exception
         * that occurs on the previous attempt is propagated as-is. On retries,
         * the error from the previous attempt is raised as it will be more
         * relevant for the user. */
        try {
            return executeCommand(connection, command);
        } catch (NetworkException networkError) {
            updateTopologyDescriptionForNetworkError(server, networkError);
            previousError = networkError;
            previousServer = server;
        } catch (NotWritablePrimaryException notPrimaryError) {
            updateTopologyDescriptionForNotWritablePrimaryError(server, notPrimaryError);
            previousError = notPrimaryError;
            previousServer = server;
        } catch (DriverException error) {
            if ( previousError != null ) {
                throw previousError;
            }
            throw error;
        }
        if (timeoutMS == null) {
            /* If CSOT is not enabled, allow any retryable error from the second
             * attempt to propagate to our caller, as it will be just as relevant
             * (if not more relevant) than the original error. */
            if (retrying) {
                throw previousError;
            }
        } else if (isExpired(timeoutMS)) {
            /* CSOT is enabled and the operation has timed out. */
            throw previousError;
        }
    }
}
```
Logging Retry Attempts
As with retryable writes, drivers MAY choose to log retry attempts for read operations. This specification does not define a format for such log messages.
Command Monitoring
As with retryable writes, in accordance with the Command Logging and Monitoring specification, drivers MUST guarantee that each `CommandStartedEvent` has either a correlating `CommandSucceededEvent` or `CommandFailedEvent` and that every "command started" log message has either a correlating "command succeeded" log message or "command failed" log message. If the first attempt of a retryable read operation encounters a retryable error, drivers MUST fire a `CommandFailedEvent` and emit a "command failed" log message for the retryable error and fire a separate `CommandStartedEvent` and emit a separate "command started" log message when executing the subsequent retry attempt. Note that the second `CommandStartedEvent` and "command started" log message may have a different `connectionId`, since a server is reselected for a retry attempt.
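For example, a `find` that fails once and succeeds on retry would produce a sequence like the following (all identifiers are illustrative):

```
CommandStartedEvent   { commandName: "find", requestId: 1, connectionId: "host-a:27017" }
CommandFailedEvent    { commandName: "find", requestId: 1, connectionId: "host-a:27017" }
CommandStartedEvent   { commandName: "find", requestId: 2, connectionId: "host-b:27017" }
CommandSucceededEvent { commandName: "find", requestId: 2, connectionId: "host-b:27017" }
```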
Documentation
- Drivers MUST document all read operations that support retryable behavior.
- Drivers MUST document that the operations in Unsupported Read Operations do not support retryable behavior.
- Driver release notes MUST make it clear to users that they may need to adjust custom retry logic to prevent an application from inadvertently retrying for too long (see Backwards Compatibility for details).
- Drivers implementing retryability for their generic command runner for read commands MUST document that `mapReduce` will be retried if it is passed as a command to the command runner. These drivers also MUST document the potential for degraded performance given that the "Early Failure on Socket Disconnect" feature does not support `mapReduce`.
Test Plan
See the README for tests.
At a high level, the test plan will cover executing supported read operations within a MongoClient where retryable reads have been enabled, ensuring that reads are retried.
Motivation for Change
Drivers currently have an API for the retryability of write operations but not for read operations. The driver API needs to be extended to include support for retryable behavior for read operations.
Design Rationale
The design of this specification is based on the Retryable Writes specification. It modifies the driver API as little as possible to introduce the concept of retryable behavior for read operations.
Alternative retry strategies (e.g. exponential back-off, incremental intervals, regular intervals, immediate retry, randomization) were considered, but the behavior of a single, immediate retry attempt was chosen in the interests of simplicity as well as consistency with the design for retryable writes.
See the future work section for potential upcoming changes to retry mechanics.
Backwards Compatibility
The API changes to support retryable reads extend the existing API but do not introduce any backward breaking changes. Existing programs that do not make use of retryable reads will continue to compile and run correctly.
N.B.: Applications with custom retry logic that choose to enable retryable reads may need to redo their custom retry logic to ensure that the reads are retried as desired. e.g. if an application has custom logic that retries reads n times and enables retryable reads, then the application could end up retrying reads up to 2n times.
The note above also applies if an application upgrades to a version of the driver that enables retryable reads by default.
Rejected Designs
- To improve performance on servers without "Early Failure on Socket Disconnect", we considered using `killSessions` to automatically kill the previous attempt before running a retry. We decided against this because after killing the session, parts of it still may be running if there are any errors. Additionally, killing sessions takes time because a kill has to talk to every non-config `mongod` in the cluster (i.e. all the primaries and secondaries of each shard). In addition, in order to protect the system against getting overloaded with these requests, every server allows no more than one `killSessions` operation at a time. Operations that attempt to run `killSessions` while one is already running are batched together and run simultaneously after the current one finishes.
Reference Implementation
The C# and Python drivers will provide the reference implementations. See CSHARP-2429 and PYTHON-1674.
Security Implications
None.
Future work
- A later specification may allow operations (including read) to be retried any number of times during a singular timeout period.
- Any future changes to the applicable parts of the retryable writes specification may also need to be reflected in the retryable reads specification, and vice versa.
- We may revisit the decision not to retry `Cursor.getMore()` (see Q&A).
- Once DRIVERS-560 is resolved, tests will be added to allow testing Retryable Reads on MongoDB 3.6. See the test plan for additional information.
Q&A
Why is retrying `Cursor.getMore()` not supported?

`Cursor.getMore()` cannot be retried because of the inability for the client to discern if the cursor was advanced. In other words, since the driver does not know if the original `getMore()` succeeded or not, the driver cannot reliably know if results might be inadvertently skipped.

For example, if a transient network error occurs as a driver requests the second batch of results via a `getMore()` and the driver were to silently retry the `getMore()`, it is possible that the server had actually received the initial `getMore()`. In such a case, the server will advance the cursor once more and return the third batch instead of the desired second batch.

Furthermore, even if the driver could detect such a scenario, it is impossible to return previously iterated data from a cursor because the server currently only allows forward iteration.

It is worth noting that the "Cursors survive primary stepdown" feature avoids this issue in certain common circumstances, so that we may revisit this decision to disallow retrying `getMore()` in the future.
Why are read operations only retried once by default?
Read operations are only retried once for the same reasons that writes are also only retried once. For convenience's sake, that reasoning has been adapted for reads and reproduced below:
The spec concerns itself with retrying read operations that encounter a retryable error (i.e. no response due to network error or a response indicating that the node is no longer a primary). A retryable error may be classified as either a transient error (e.g. dropped connection, replica set failover) or persistent outage. If a transient error results in the server being marked as "unknown", a subsequent retry attempt will allow the driver to rediscover the primary within the designated server selection timeout period (30 seconds by default). If server selection times out during this retry attempt, we can reasonably assume that there is a persistent outage. In the case of a persistent outage, multiple retry attempts are fruitless and would waste time. See How To Write Resilient MongoDB Applications for additional discussion on this strategy.
However when Client Side Operations Timeout is enabled, the driver will retry multiple times until the operation succeeds, a non-retryable error is encountered, or the timeout expires. Retrying multiple times provides greater resilience to cascading failures such as rolling server restarts during planned maintenance events.
Can drivers resend the same wire protocol message on retry attempts?
No. This is in contrast to the answer supplied in the retryable writes specification. However, when retryable writes were implemented, no driver actually chose to resend the same wire protocol message. Today, if a driver attempted to resend the same wire protocol message, this could violate the rules for gossiping `$clusterTime`: specifically the rule that a driver must send the highest seen `$clusterTime`.

Additionally, there would be a behavioral difference between a driver resending the same wire protocol message and one that does not. For example, a driver that creates a new wire protocol message could exhibit the following characteristics:

- The second attempt to send the read command could have a higher `$clusterTime`.
- If the initial attempt failed with a server error, then the session's `operationTime` would be advanced and the next read would include a larger `readConcern.afterClusterTime`.

A driver that resends the same wire protocol message would not exhibit the above characteristics. Thus, in order to avoid this behavioral difference and not violate the rules about gossiping `$clusterTime`, drivers MUST NOT resend the same wire protocol message.
Why isn't MongoDB 4.2 required?
MongoDB 4.2 was initially considered as a requirement for retryable reads because MongoDB 4.2 implements support for "Early Failure on Socket Disconnect," changing the semantics of socket disconnect to prevent ops from doing work that no client is interested in. This prevents applications from seeing degraded performance when an expensive read is retried. Upon further discussion, we decided that "Early Failure on Socket Disconnect" should not be required to retry reads because the resilience benefit of retryable reads outweighs the minor risk of degraded performance. Additionally, any customers experiencing degraded performance can simply disable `retryableReads`.
Changelog
- 2024-04-30: Migrated from reStructuredText to Markdown.
- 2023-12-05: Add that any server information associated with retryable exceptions MUST reflect the originating server, even in the presence of retries.
- 2023-11-30: Add ReadConcernMajorityNotAvailableYet to the list of error codes that should be retried.
- 2023-11-28: Add ExceededTimeLimit to the list of error codes that should be retried.
- 2023-08-26: Require that in a sharded cluster the server on which the operation failed MUST be provided to the server selection mechanism as a deprioritized server.
- 2023-08-21: Update Q&A that contradicts SDAM transient error logic.
- 2022-11-09: CLAM must apply both events and log messages.
- 2022-10-18: When CSOT is enabled multiple retry attempts may occur.
- 2022-10-05: Remove spec front matter, move footnote, and reformat changelog.
- 2022-01-25: Note that drivers should retry handshake network failures.
- 2021-04-26: Replaced deprecated terminology; removed requirement to parse error message text as MongoDB 3.6+ servers will always return an error code.
- 2021-03-23: Require that PoolClearedErrors are retried.
- 2019-06-07: Mention $merge stage for aggregate alongside $out.
- 2019-05-29: Renamed InterruptedDueToStepDown to InterruptedDueToReplStateChange.
[^1]: The first and second commands will be identical unless variations in parameters exist between wire/server versions.
Retryable Writes
- Status: Accepted
- Minimum Server Version: 3.6
Abstract
MongoDB 3.6 will implement support for server sessions, which are shared resources within a cluster identified by a session ID. Drivers compatible with MongoDB 3.6 will also implement support for client sessions, which are always associated with a server session and will allow for certain commands to be executed within the context of a server session.
Additionally, MongoDB 3.6 will utilize server sessions to allow some write commands to specify a transaction ID to enforce at-most-once semantics for the write operation(s) and allow for retrying the operation if the driver fails to obtain a write result (e.g. network error or "not writable primary" error after a replica set failover). This specification will outline how an API for retryable write operations will be implemented in drivers. The specification will define an option to enable retryable writes for an application and describe how a transaction ID will be provided to write commands executed therein.
META
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Specification
Terms
Transaction ID
The transaction ID identifies the transaction as part of which the command is running. In a write command where the client has requested retryable behavior, it is expressed by the top-level `lsid` and `txnNumber` fields. The `lsid` component is the corresponding server session ID, which is a BSON value defined in the Driver Session specification. The `txnNumber` component is a monotonically increasing (per server session), positive 64-bit integer.
ClientSession
Driver object representing a client session, which is defined in the Driver Session specification. This object is always associated with a server session; however, drivers will pool server sessions so that creating a ClientSession will not always entail creation of a new server session. The name of this object MAY vary across drivers.
Retryable Error
An error is considered retryable if it has a RetryableWriteError label in its top-level "errorLabels" field. See Determining Retryable Errors for more information.
Additional terms may be defined in the Driver Session specification.
Naming Deviations
This specification defines the name for a new MongoClient option, `retryWrites`. Drivers MUST use the defined name for the connection string parameter to ensure portability of connection strings across applications and drivers.

If drivers solicit MongoClient options through another mechanism (e.g. an options dictionary provided to the MongoClient constructor), drivers SHOULD use the defined name but MAY deviate to comply with their existing conventions. For example, a driver may use `retry_writes` instead of `retryWrites`.

For any other names in the spec, drivers SHOULD use the defined name but MAY deviate to comply with their existing conventions.
MongoClient Configuration
This specification introduces the following client-level configuration option.
retryWrites
This boolean option determines whether retryable behavior will be applied to all supported write operations executed within the MongoClient. This option MUST default to true.
This option MUST NOT be configurable at the level of a database object, collection object, or at the level of an individual write operation.
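For example, retryable writes can be disabled for a deployment that cannot support them via the connection string (an illustrative URI):

```
mongodb://localhost:27017/?retryWrites=false
```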
Requirements for Retryable Writes
Supported Server Versions
Like sessions, retryable writes require a MongoDB 3.6 replica set or shard cluster operating with feature compatibility version 3.6 (i.e. the `{setFeatureCompatibilityVersion: 3.6}` administrative command has been run on the cluster). Drivers MUST verify server eligibility by ensuring that `maxWireVersion` is at least six, the `logicalSessionTimeoutMinutes` field is present in the server's `hello` or legacy hello response, and the server type is not standalone.

Retryable writes are only supported by storage engines that support document-level locking. Notably, that excludes the MMAPv1 storage engine which is available in both MongoDB 3.6 and 4.0. Since `retryWrites` defaults to `true`, drivers MUST raise an actionable error message when the server returns code 20 with errmsg starting with "Transaction numbers". The replacement error message MUST be:

```
This MongoDB deployment does not support retryable writes. Please add
retryWrites=false to your connection string.
```

If the server selected for the first attempt of a retryable write operation does not support retryable writes, drivers MUST execute the write as if retryable writes were not enabled. Drivers MUST NOT include a transaction ID in the write command and MUST NOT retry the command under any circumstances.

In a sharded cluster, it is possible that mongos may appear to support retryable writes but one or more shards in the cluster do not (e.g. a replica set shard is configured with feature compatibility version 3.4, or a standalone is added as a new shard). In these rare cases, a write command that fans out to a shard that does not support retryable writes may partially fail and an error may be reported in the write result from mongos (e.g. the `writeErrors` array in the bulk write result). This does not constitute a retryable error. Drivers MUST relay such errors to the user.
Supported Write Operations
MongoDB 3.6 will support retryability for some, but not all, write operations.

Supported single-statement write operations include `insertOne()`, `updateOne()`, `replaceOne()`, `deleteOne()`, `findOneAndDelete()`, `findOneAndReplace()`, and `findOneAndUpdate()`.

Supported multi-statement write operations include `insertMany()` and `bulkWrite()`. The ordered option may be `true` or `false`. For both the collection-level and client-level `bulkWrite()` methods, a bulk write batch is only retryable if it does not contain any `multi: true` writes (i.e. `UpdateMany` and `DeleteMany`). Drivers MUST evaluate eligibility for each write command sent as part of the `bulkWrite()` (after order and batch splitting) individually, as sketched below. Drivers MUST NOT alter existing logic for order and batch splitting in an attempt to maximize retryability for operations within a bulk write.

The methods above are defined in the CRUD specification.

Later versions of MongoDB may add support for additional write operations.

Drivers MUST document operations that support retryable behavior and the conditions for which retryability is determined (see: How will users know which operations are supported?). Drivers are not required to exhaustively document all operations that do not support retryable behavior.
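As a sketch of the per-batch eligibility rule above (statement shapes follow the update and delete command formats; names are illustrative):

```python
def batch_is_retryable(command):
    """True if a write command in a bulkWrite batch may be retried.

    A batch is retryable only if none of its statements is a multi-document
    write: an update with multi: true, or a delete with limit: 0.
    """
    for statement in command.get("updates", []):
        if statement.get("multi", False):
            return False
    for statement in command.get("deletes", []):
        if statement.get("limit", 1) == 0:
            return False
    return True

assert batch_is_retryable(
    {"update": "coll", "updates": [{"q": {"x": 1}, "u": {"$inc": {"y": 1}}}]})
assert not batch_is_retryable(
    {"delete": "coll", "deletes": [{"q": {}, "limit": 0}]})
```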
Unsupported Write Operations
Write commands specifying an unacknowledged write concern (e.g. `{w: 0}`) do not support retryable behavior. Drivers MUST NOT add a transaction ID to any write command with an unacknowledged write concern executed within a MongoClient where retryable writes have been enabled. Drivers MUST NOT retry these commands.

Write commands where a single statement might affect multiple documents will not be initially supported by MongoDB 3.6, although this may change in the future. This includes an update command where any statement in the `updates` sequence specifies a `multi` option of `true` or a delete command where any statement in the `deletes` sequence specifies a `limit` option of `0`. In the context of the CRUD specification, this includes the `updateMany()` and `deleteMany()` methods and, in some cases, `bulkWrite()`. Drivers MUST NOT add a transaction ID to any single- or multi-statement write commands that include one or more multi-document write operations. Drivers MUST NOT retry these commands if they fail to return a response. With regard to `bulkWrite()`, drivers MUST evaluate eligibility for each write command sent as part of the `bulkWrite()` (after order and batch splitting) individually.

Write commands other than insert, update, delete, or findAndModify will not be initially supported by MongoDB 3.6, although this may change in the future. This includes, but is not limited to, an aggregate command using a write stage (e.g. `$out`, `$merge`). Drivers MUST NOT add a transaction ID to these commands and MUST NOT retry these commands if they fail to return a response.
Retryable Writes Within Transactions
In MongoDB 4.0 the only supported retryable write commands within a transaction are `commitTransaction` and `abortTransaction`. Therefore drivers MUST NOT retry write commands within transactions even when `retryWrites` has been set to true on the `MongoClient`. In addition, drivers MUST NOT add the `RetryableWriteError` label to any error that occurs during a write command within a transaction (excepting `commitTransaction` and `abortTransaction`), even when `retryWrites` has been set to true on the `MongoClient`.
Implementing Retryable Writes
Determining Retryable Errors
When connected to a MongoDB instance that supports retryable writes (versions 3.6+), the driver MUST treat all errors with the RetryableWriteError label as retryable. This error label can be found in the top-level "errorLabels" field of the error.
RetryableWriteError Labels
The RetryableWriteError label might be added to an error in a variety of ways:

- When the driver encounters a network error establishing an initial connection to a server, it MUST add a RetryableWriteError label to that error if the MongoClient performing the operation has the retryWrites configuration option set to true.
- When the driver encounters a network error communicating with any server version that supports retryable writes, it MUST add a RetryableWriteError label to that error if the MongoClient performing the operation has the retryWrites configuration option set to true.
- When a CMAP-compliant driver encounters a PoolClearedError during connection check out, it MUST add a RetryableWriteError label to that error if the MongoClient performing the operation has the retryWrites configuration option set to true.
- For server versions 4.4 and newer, the server will add a RetryableWriteError label to errors or server responses that it considers retryable before returning them to the driver. As new server versions are released, the errors that are labeled with the RetryableWriteError label may change. Drivers MUST NOT add a RetryableWriteError label to any error derived from a 4.4+ server response (i.e. any error that is not a network error).
- When receiving a command result with an error from a pre-4.4 server that supports retryable writes, the driver MUST add a RetryableWriteError label to errors that meet the following criteria if the retryWrites option is set to true on the client performing the relevant operation:
  - a mongod or mongos response with any of the following error codes in the top-level `code` field:

    | Error Name | Error Code |
    | --- | --- |
    | InterruptedAtShutdown | 11600 |
    | InterruptedDueToReplStateChange | 11602 |
    | NotWritablePrimary | 10107 |
    | NotPrimaryNoSecondaryOk | 13435 |
    | NotPrimaryOrSecondary | 13436 |
    | PrimarySteppedDown | 189 |
    | ShutdownInProgress | 91 |
    | HostNotFound | 7 |
    | HostUnreachable | 6 |
    | NetworkTimeout | 89 |
    | SocketException | 9001 |
    | ExceededTimeLimit | 262 |

  - a mongod response with any of the previously listed codes in the `writeConcernError.code` field.
Drivers MUST NOT add a RetryableWriteError label based on the following:

- any `writeErrors[].code` fields in a mongod or mongos response
- the `writeConcernError.code` field in a mongos response
The criteria for retryable errors are similar to the discussion in the SDAM spec's section on Error Handling, but include additional error codes. See What do the additional error codes mean? for the reasoning behind these additional errors.
To understand why the driver should only add the RetryableWriteError label to an error when the retryWrites option is true on the MongoClient performing the operation, see Why does the driver only add the RetryableWriteError label to errors that occur on a MongoClient with retryWrites set to true?
Note: During a retryable write operation on a sharded cluster, mongos may retry the operation internally, in which case it will not add a RetryableWriteError label to any error that occurs after those internal retries to prevent excessive retrying.
For more information about error labels, see the Transactions specification.
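A condensed sketch of the labeling rules above (the error type and its attributes are illustrative placeholders for a driver's own):

```python
from dataclasses import dataclass, field

@dataclass
class CommandError:
    code: int | None = None
    write_concern_error: dict | None = None  # mongod responses only
    error_labels: set = field(default_factory=set)

RETRYABLE_WRITE_ERROR_CODES = {
    11600, 11602, 10107, 13435, 13436, 189, 91, 7, 6, 89, 9001, 262,
}

def maybe_add_retryable_write_label(error, *, retry_writes,
                                    is_network_error, max_wire_version):
    """Client-side labeling. 4.4+ servers (wire version 9, an assumption of
    this sketch) label retryable errors themselves, so for them only network
    errors are labeled by the client."""
    if not retry_writes:
        return
    if is_network_error:
        error.error_labels.add("RetryableWriteError")
        return
    if max_wire_version >= 9:
        return
    # Pre-4.4: check the top-level code and, for mongod, writeConcernError.code
    # (never writeErrors[].code, and never a mongos writeConcernError).
    codes = {error.code}
    if error.write_concern_error:
        codes.add(error.write_concern_error.get("code"))
    if codes & RETRYABLE_WRITE_ERROR_CODES:
        error.error_labels.add("RetryableWriteError")
```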
Generating Transaction IDs
The server requires each retryable write operation to provide a unique transaction ID in its command document. The transaction ID consists of a server session ID and a monotonically increasing transaction number. The session ID is obtained from the ClientSession object, which will have either been passed to the write operation from the application or constructed internally for the operation. Drivers will be responsible for maintaining a monotonically increasing transaction number for each server session used by a ClientSession object. Drivers that pool server sessions MUST preserve the transaction number when reusing a server session from the pool with a new ClientSession (this can be tracked as another property on the driver's object for the server session).
Drivers MUST ensure that each retryable write command specifies a transaction number larger than any previously used transaction number for its session ID.
Since ClientSession objects are not thread safe and may only be used by one thread at a time, drivers should not need to worry about race conditions when incrementing the transaction number.
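A sketch of this bookkeeping, assuming pooled server-session objects (names are illustrative):

```python
class ServerSession:
    """Pooled server session; the transaction number is preserved when the
    session is reused from the pool with a new ClientSession."""

    def __init__(self, session_id):
        self.session_id = session_id   # the lsid document
        self.transaction_number = 0    # monotonically increasing, per session

class ClientSession:
    """Used by at most one thread at a time, so incrementing the transaction
    number requires no locking."""

    def __init__(self, server_session):
        self._server_session = server_session

    def next_transaction_number(self):
        self._server_session.transaction_number += 1
        return self._server_session.transaction_number
```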
Behavioral Changes for Write Commands
Drivers MUST automatically add a transaction ID to all supported write commands executed via a specific CRUD method (e.g. `updateOne()`) or write command method (e.g. `executeWriteCommand()`) within a MongoClient where retryable writes have been enabled and when the selected server supports retryable writes.

If your driver offers a generic command method on your database object (e.g. `runCommand()`), it MUST NOT check the user's command document to determine if it is a supported write operation and MUST NOT automatically add a transaction ID. The method should send the user's command document to the server as-is.
This specification does not affect write commands executed within a MongoClient where retryable writes have not been enabled.
Constructing Write Commands
When constructing a supported write command that will be executed within a MongoClient where retryable writes have been enabled, drivers MUST increment the transaction number for the corresponding server session and include the server session ID and transaction number in top-level `lsid` and `txnNumber` fields, respectively. `lsid` is a BSON value (discussed in the Driver Session specification). `txnNumber` MUST be a positive 64-bit integer (BSON type 0x12).

The following example illustrates a possible write command for an `updateOne()` operation:

```
{
    update: "coll",
    lsid: { ... },
    txnNumber: 100,
    updates: [
        { q: { x: 1 }, u: { $inc: { y: 1 } } },
    ],
    ordered: true
}
```
When constructing multiple write commands for a multi-statement write operation (i.e. `insertMany()` and `bulkWrite()`), drivers MUST increment the transaction number for each supported write command in the batch.
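For example, an `insertMany()` split into two batches might issue the following commands, incrementing `txnNumber` for each batch (values are illustrative):

```
{ insert: "coll", lsid: { ... }, txnNumber: 100, documents: [ { _id: 1 }, { _id: 2 } ], ordered: true }
{ insert: "coll", lsid: { ... }, txnNumber: 101, documents: [ { _id: 3 }, { _id: 4 } ], ordered: true }
```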
Executing Retryable Write Commands
When selecting a writable server for the first attempt of a retryable write command, drivers MUST allow a server selection error to propagate. In this case, the caller is able to infer that no attempt was made.
If retryable writes is not enabled or the selected server does not support retryable writes, drivers MUST NOT include a transaction ID in the command and MUST attempt to execute the write command exactly once and allow any errors to propagate. In this case, the caller is able to infer that an attempt was made.
If retryable writes are enabled and the selected server supports retryable writes, drivers MUST add a transaction ID to the command. Drivers MUST only attempt to retry a write command if the first attempt yields a retryable error. Drivers MUST NOT attempt to retry a write command on any other error.
If the first attempt of a write command including a transaction ID encounters a retryable error, the driver MUST update its topology according to the SDAM spec (see: Error Handling) and capture this original retryable error.
Drivers MUST then retry the operation as many times as necessary until any one of the following conditions is reached:
- the operation succeeds.
- the operation fails with a non-retryable error.
- CSOT is enabled and the operation times out per Client Side Operations Timeout: Retryability.
- CSOT is not enabled and one retry was attempted.
For each retry attempt, drivers MUST select a writable server. In a sharded cluster, the server on which the operation failed MUST be provided to the server selection mechanism as a deprioritized server.
If the driver cannot select a server for a retry attempt or the selected server does not support retryable writes, retrying is not possible and drivers MUST raise the retryable error from the previous attempt. In both cases, the caller is able to infer that an attempt was made.
If a retry attempt also fails, drivers MUST update their topology according to the SDAM spec (see: Error Handling). If an error would not allow the caller to infer that an attempt was made (e.g. connection pool exception originating from the driver) or the error is labeled "NoWritesPerformed", the error from the previous attempt should be raised. If all server errors are labeled "NoWritesPerformed", then the first error should be raised.
If a driver associates server information (e.g. the server address or description) with an error, the driver MUST ensure that the reported server information corresponds to the server that originated the error.
The above rules are implemented in the following pseudo-code:
```
/**
 * Checks if a server supports retryable writes.
 */
function isRetryableWritesSupported(server) {
    if (server.getMaxWireVersion() < RETRYABLE_WIRE_VERSION) {
        return false;
    }
    if ( ! server.hasLogicalSessionTimeoutMinutes()) {
        return false;
    }
    if (server.isStandalone()) {
        return false;
    }
    return true;
}

/**
 * Executes a write command in the context of a MongoClient where retryable
 * writes have been enabled. The session parameter may be an implicit or
 * explicit client session (depending on how the CRUD method was invoked).
 */
function executeRetryableWrite(command, session) {
    /* Allow ServerSelectionException to propagate to our caller, which can then
     * assume that no attempts were made. */
    server = selectServer("writable");

    /* If the server does not support retryable writes, execute the write as if
     * retryable writes are not enabled. */
    if ( ! isRetryableWritesSupported(server)) {
        return executeCommand(server, command);
    }

    /* Incorporate lsid and txnNumber fields into the command document. These
     * values will be derived from the implicit or explicit session object. */
    retryableCommand = addTransactionIdToCommand(command, session);

    Exception previousError = null;
    retrying = false;
    while true {
        try {
            return executeCommand(server, retryableCommand);
        } catch (Exception currentError) {
            handleError(currentError);

            /* If the error has a RetryableWriteError label, remember the exception
             * and proceed with retrying the operation.
             *
             * IllegalOperation (code 20) with errmsg starting with "Transaction
             * numbers" MUST be re-raised with an actionable error message.
             */
            if (!currentError.hasErrorLabel("RetryableWriteError")) {
                if ( currentError.code == 20 && currentError.errmsg.startsWith("Transaction numbers") ) {
                    currentError.errmsg = "This MongoDB deployment does not support retryable...";
                }
                throw currentError;
            }

            /*
             * If the "previousError" is "null", then the "currentError" is the
             * first error encountered during the retry attempt cycle. We must
             * persist the first error in the case where all succeeding errors are
             * labeled "NoWritesPerformed", which would otherwise raise "null" as
             * the error.
             */
            if (previousError == null) {
                previousError = currentError;
            }

            /*
             * For exceptions that originate from the driver (e.g. no socket available
             * from the connection pool), we should raise the previous error if there
             * was one.
             */
            if (currentError is not DriverException && ! previousError.hasErrorLabel("NoWritesPerformed")) {
                previousError = currentError;
            }
        }

        /*
         * We try to select a server that is not the one that failed by passing
         * the failed server as a deprioritized server.
         * If we cannot select a writable server, do not proceed with retrying
         * and throw the previous error. The caller can then infer that an
         * attempt was made and failed. */
        try {
            deprioritizedServers = [ server ];
            server = selectServer("writable", deprioritizedServers);
        } catch (Exception ignoredError) {
            throw previousError;
        }

        /* If the server selected for retrying is too old, throw the previous error.
         * The caller can then infer that an attempt was made and failed. This case
         * is very rare, and likely means that the cluster is in the midst of a
         * downgrade. */
        if ( ! isRetryableWritesSupported(server)) {
            throw previousError;
        }

        if (timeoutMS == null) {
            /* If CSOT is not enabled, allow any retryable error from the second
             * attempt to propagate to our caller, as it will be just as relevant
             * (if not more relevant) than the original error. */
            if (retrying) {
                throw previousError;
            }
        } else if (isExpired(timeoutMS)) {
            /* CSOT is enabled and the operation has timed out. */
            throw previousError;
        }

        retrying = true;
    }
}
```
`handleError` in the above pseudocode refers to the function defined in the Error handling pseudocode section of the SDAM specification.
When retrying a write command, drivers MUST resend the command with the same transaction ID. Drivers MUST NOT resend the original wire protocol message if doing so would violate rules for gossiping the cluster time (see: Can drivers resend the same wire protocol message on retry attempts?).
In the case of a multi-statement write operation split across multiple write commands, a failed retry attempt will also interrupt execution of any additional write operations in the batch (regardless of the ordered option). This is no different than if a retryable error had been encountered without retryable behavior enabled or supported by the driver. Drivers are encouraged to provide access to an intermediary write result (e.g. BulkWriteResult, InsertManyResult) through the BulkWriteException, in accordance with the CRUD specification.
Logging Retry Attempts
Drivers MAY choose to log retry attempts for write operations. This specification does not define a format for such log messages.
Command Monitoring
In accordance with the Command Logging and Monitoring specification, drivers MUST guarantee that each CommandStartedEvent has either a correlating CommandSucceededEvent or CommandFailedEvent, and that every "command started" log message has either a correlating "command succeeded" log message or "command failed" log message. If the first attempt of a retryable write operation encounters a retryable error, drivers MUST fire a CommandFailedEvent and emit a "command failed" log message for the retryable error, then fire a separate CommandStartedEvent and emit a separate "command started" log message when executing the subsequent retry attempt. Note that the second CommandStartedEvent and "command started" log message may have a different connectionId, since a writable server is reselected for the retry attempt.
Each attempt of a retryable write operation SHOULD report a different requestId so that events for each attempt can be properly correlated with one another.
The Command Logging and Monitoring specification states that the operationId field is a driver-generated, 64-bit integer and may be "used to link events together such as bulk write operations." Each attempt of a retryable write operation SHOULD report the same operationId; however, drivers SHOULD NOT use the operationId field to relay information about a transaction ID. A bulk write operation may consist of multiple write commands, each of which may specify a unique transaction ID.
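The following sketch shows how a consumer of these events might use the two fields together; the event shape here is a plain dict rather than any driver's event class.

```python
from collections import defaultdict

class AttemptCorrelator:
    """Group command attempts by operationId; tell them apart by requestId."""
    def __init__(self):
        self.attempts = defaultdict(list)  # operationId -> [requestId, ...]

    def command_started(self, event):
        self.attempts[event["operationId"]].append(event["requestId"])

correlator = AttemptCorrelator()
correlator.command_started({"operationId": 7, "requestId": 41})  # first attempt
correlator.command_started({"operationId": 7, "requestId": 42})  # retry attempt
assert correlator.attempts[7] == [41, 42]  # one logical operation, two attempts
```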
Test Plan
See the README for tests.
At a high level, the test plan will cover the following scenarios for executing supported write operations within a MongoClient where retryable writes have been enabled:
- Executing the same write operation (and transaction ID) multiple times should yield an identical write result.
- Test at-most-once behavior by observing that subsequent executions of the same write operation do not incur further modifications to the collection data.
- Exercise supported single-statement write operations (i.e. deleteOne, insertOne, replaceOne, updateOne, and findAndModify).
- Exercise supported multi-statement insertMany and bulkWrite operations, which contain only supported single-statement write operations. Both ordered and unordered execution should be tested.
Additional prose tests for other scenarios are also included.
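The first two scenarios reduce to the idea sketched below, where an in-memory stand-in for the server records results keyed by (lsid, txnNumber) so that a replayed command is not applied twice; all names and structures are illustrative.

```python
executed = {}  # (lsid, txnNumber) -> recorded write result

def server_execute(command):
    # A replayed transaction ID returns the recorded result instead of
    # re-applying the write (at-most-once semantics).
    key = (command["lsid"], command["txnNumber"])
    if key not in executed:
        executed[key] = {"n": len(command["documents"])}
    return executed[key]

cmd = {"insert": "coll", "documents": [{"_id": 1}], "lsid": "s1", "txnNumber": 1}
first = server_execute(cmd)
second = server_execute(cmd)  # same write operation and transaction ID
assert first == second        # identical write result
assert len(executed) == 1     # no further modification to the collection data
```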
Motivation for Change
Drivers currently have no API for specifying at-most-once semantics and retryable behavior for write operations. The driver API needs to be extended to support this behavior.
Design Rationale
The design of this specification piggybacks on that of the Driver Session specification in that it modifies the driver API as little as possible to introduce the concept of at-most-once semantics and retryable behavior for write operations. A transaction ID will be included in all supported write commands executed within the scope of a MongoClient where retryable writes have been enabled.
Drivers expect the server to yield an error if a transaction ID is included in an unsupported write command. This requires drivers to maintain an allow list and track which write operations support retryable behavior for a given server version (see: Why must drivers maintain an allow list of supported operations?).
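A minimal sketch of such an allow list follows; the set reflects the command types this specification treats as retryable, though a real driver tracks this per operation and server version, and the function name here is invented.

```python
# Single-statement writes qualify; multi-document update/delete do not.
RETRYABLE_WRITE_COMMANDS = {"insert", "update", "delete", "findAndModify"}

def supports_retryable_write(command_name, multi=False):
    if multi and command_name in {"update", "delete"}:
        return False  # updateMany / deleteMany are not retryable
    return command_name in RETRYABLE_WRITE_COMMANDS

assert supports_retryable_write("insert")
assert supports_retryable_write("findAndModify")
assert not supports_retryable_write("update", multi=True)  # updateMany
```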
While this approach will allow applications to take advantage of retryable write behavior with minimal code changes, it also presents a documentation challenge. Users must understand exactly what can and will be retried (see: How will users know which operations are supported?).
Backwards Compatibility
The API changes to support retryable writes extend the existing API but do not introduce any backward breaking changes. Existing programs that do not make use of retryable writes will continue to compile and run correctly.
Reference Implementation
The C# and C drivers will provide reference implementations. JIRA links will be added here at a later point.
Future Work
Supporting at-most-once semantics and retryable behavior for updateMany and deleteMany operations may become possible once the server implements support for multi-document transactions.
A separate specification for retryable read operations could complement this specification. Retrying read operations would not require client or server sessions and could be implemented independently of retryable writes.
Q & A
What do the additional error codes mean?
The errors HostNotFound, HostUnreachable, NetworkTimeout, and SocketException may be returned by mongos when it encounters problems routing an operation to a shard. These errors may be transient, or localized to that mongos.
Why are write operations only retried once by default?
The spec concerns itself with retrying write operations that encounter a retryable error (i.e. no response due to network error or a response indicating that the node is no longer a primary). A retryable error may be classified as either a transient error (e.g. dropped connection, replica set failover) or persistent outage. In the case of a transient error, the driver will mark the server as "unknown" per the SDAM spec. A subsequent retry attempt will allow the driver to rediscover the primary within the designated server selection timeout period (30 seconds by default). If server selection times out during this retry attempt, we can reasonably assume that there is a persistent outage. In the case of a persistent outage, multiple retry attempts are fruitless and would waste time. See How To Write Resilient MongoDB Applications for additional discussion on this strategy.
However, when Client Side Operations Timeout is enabled, the driver will retry multiple times until the