MongoDB Driver Specifications

The modern MongoDB driver consists of a number of components, each of which is thoroughly documented in this repository. Though this information is readily available and extremely helpful, what it lacks is a high-level overview tying the specs together into a cohesive picture of what a MongoDB driver is.

Architecturally, an implicit hierarchy exists within the drivers, so expressing them in terms of an onion model feels appropriate.

Layers of the Onion

The "drivers onion" is meant to represent how various concepts, components and APIs can be layered atop each other to build a MongoDB driver from the ground up, or to help understand how existing drivers have been structured. Hopefully this representation of MongoDB’s drivers helps provide some clarity, as the complexity of these libraries - like the onion above - could otherwise bring you to tears.

Serialization

At their lowest level all MongoDB drivers need to know how to work with BSON. BSON (short for "Binary JSON") is a binary-encoded serialization of JSON-like documents, and like JSON, it supports the nesting of arrays and documents. BSON also contains extensions that allow representation of data types that are not part of the JSON spec.
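As a concrete example, here is a minimal sketch of BSON round-tripping using the Node.js "bson" package (an assumption; any driver's BSON library exposes equivalent serialize/deserialize primitives):

const { serialize, deserialize, ObjectId } = require('bson');

// A JSON-like document containing a type (ObjectId) that plain JSON cannot express
const doc = { _id: new ObjectId(), name: 'onion', layers: 7 };

const bytes = serialize(doc);          // Uint8Array of BSON bytes
console.log(bytes.length, 'bytes');
console.log(deserialize(bytes));       // round-trips back to the original document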

Specifications: BSON, ObjectId, Decimal128, UUID, DBRef, Extended JSON

Communication

Once BSON documents can be created and manipulated, the foundation for interacting with a MongoDB host process has been laid. Drivers communicate by sending database commands as serialized BSON documents using MongoDB’s wire protocol.

From the provided connection string and options, a socket connection is established to a host. An initial handshake then verifies that the host is in fact a valid MongoDB server by sending a simple hello command. Based on the response to this first command, a driver can continue to establish and authenticate connections.
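The sketch below (Node.js driver, assuming a locally reachable mongod) issues the same hello command explicitly; drivers perform this exchange internally when establishing each connection:

const { MongoClient } = require('mongodb');

async function handshake() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  // The hello reply identifies the host type (standalone, replica set member,
  // mongos) and the wire protocol versions it supports.
  const reply = await client.db('admin').command({ hello: 1 });
  console.log(reply.maxWireVersion, reply.topologyVersion);
  await client.close();
}
handshake();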

Specifications: OP_MSG, Command Execution, Connection String, URI Options, OCSP, Initial Handshake, Wire Compression, SOCKS5, Initial DNS Seedlist Discovery

Connectivity

Now that a valid host has been found, the cluster’s topology can be discovered and monitoring connections can be established. Connection pools can then be created and populated with connections. The monitoring connections will subsequently be used for ensuring operations are routed to available hosts, or hosts that meet certain criteria (such as a configured read preference or acceptable latency window).
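The options below (Node.js driver shown; names follow the URI Options, CMAP, and SDAM specifications, though exact support can vary slightly by driver) illustrate how this layer is typically configured:

const { MongoClient } = require('mongodb');

const client = new MongoClient(
  'mongodb://host1:27017,host2:27017/?replicaSet=rs0',  // illustrative seed list
  {
    maxPoolSize: 50,              // CMAP: upper bound on pooled connections per host
    minPoolSize: 0,               // CMAP: connections kept ready in the pool
    heartbeatFrequencyMS: 10000,  // SDAM: how often monitors poll each host
    localThresholdMS: 15,         // acceptable latency window for server selection
    readPreference: 'secondaryPreferred',
  }
);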

Specifications: SDAM, CMAP, Load Balancer Support

Authentication

Establishing and monitoring connections to MongoDB ensures they’re available, but MongoDB server processes typically require connections to be authenticated before commands are accepted. MongoDB offers many authentication mechanisms such as SCRAM, x.509, Kerberos, LDAP, OpenID Connect and AWS IAM, which MongoDB drivers support using the Simple Authentication and Security Layer (SASL) framework.
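In practice the mechanism is usually selected through the connection string, as in this sketch (Node.js driver; hosts and credentials are illustrative):

const { MongoClient } = require('mongodb');

// SCRAM-SHA-256, the default challenge-response mechanism for modern servers
const scramClient = new MongoClient(
  'mongodb://appUser:secret@localhost:27017/?authSource=admin&authMechanism=SCRAM-SHA-256'
);

// MONGODB-AWS authenticates with IAM credentials taken from the environment
const awsClient = new MongoClient(
  'mongodb+srv://cluster0.example.mongodb.net/?authMechanism=MONGODB-AWS'
);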

Specifications: Authentication

Availability

All client operations will be serialized as BSON and sent to MongoDB over a connection that will first be checked out of a connection pool. Various monitoring processes exist to ensure a driver’s internal state machine contains an accurate view of the cluster’s topology so that read and write requests can always be appropriately routed according to MongoDB’s server selection algorithm.
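For example, a read preference and maximum staleness can be supplied per operation, and server selection will route the read accordingly (a sketch using the Node.js driver and an assumed already-connected client; collection names and filters are illustrative):

const { ReadPreference } = require('mongodb');

async function recentOrders(client) {
  const orders = client.db('shop').collection('orders');
  // Eligible secondaries within the latency window are candidates; those lagging
  // more than maxStalenessSeconds behind are excluded (per the Max Staleness spec).
  return orders
    .find({ status: 'A' })
    .withReadPreference(new ReadPreference('secondaryPreferred', undefined, { maxStalenessSeconds: 120 }))
    .toArray();
}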

Specifications: Server Monitoring, SRV Polling for mongos Discovery, Server Selection, Max Staleness

Resilience

At their core, database drivers are client libraries meant to facilitate interactions between an application and the database. MongoDB’s drivers are no different in that regard, as they abstract away the underlying serialization, communication, connectivity, and availability functions required to programmatically interact with your data.

To further enhance the developer experience while working with MongoDB, various resilience features can be added based on logical sessions such as retryable writes, causal consistency, and transactions.
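The Convenient Transactions API ties these together; the sketch below (Node.js driver, assuming an already-connected client and illustrative collection names) retries the callback on transient errors on the application's behalf:

async function placeOrder(client) {
  const session = client.startSession();
  try {
    await session.withTransaction(async () => {
      const orders = client.db('shop').collection('orders');
      const audit = client.db('shop').collection('audit');
      await orders.insertOne({ sku: 'abc', qty: 1 }, { session });
      await audit.insertOne({ event: 'order-created' }, { session });
    });
  } finally {
    await session.endSession();
  }
}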

Specifications: Retryability (Reads, Writes), CSOT, Consistency (Sessions, Causal Consistency, Snapshot Reads, Transactions, Convenient Transactions API)

Programmability

Now that we can serialize commands and send them over the wire through an authenticated connection, we can begin actually manipulating data. Since all database interactions are in the form of commands, if we wanted to remove documents matching a filter we might issue a delete command such as the following:

db.runCommand(
  {
     delete: "orders",
     deletes: [ { q: { status: "D" }, limit: 0 } ]
  }
)

Though the command form is not exceedingly complex, a better developer experience can be achieved through more single-purpose APIs, which would allow the above example to be expressed as:

db.orders.deleteMany({ status: "D" })

To provide a cleaner and clearer developer experience, many specifications exist to describe how these APIs should be consistently presented across driver implementations, while still providing the flexibility to make APIs more idiomatic for each language.

Advanced security features such as client-side field level encryption are also defined at this layer.

Specifications: Resource Management (Databases, Collections, Indexes), Data Management (CRUD, Collation, Write Commands, Bulk API, Bulk Write, R/W Concern), Cursors (Change Streams, find/getMore/killCursors), GridFS, Stable API, Security (Client Side Encryption, BSON Binary Subtype 6)

Observability

With database commands being serialized and sent to MongoDB servers and responses being received and deserialized, our driver can be considered fully functional for most read and write operations. Since MongoDB drivers abstract away most of the complexity involved in creating and maintaining the connections these commands are sent over, providing mechanisms for introspection into a driver’s functionality gives developers added confidence that things are working as expected.

The inner workings of connection pools, connection lifecycle, server monitoring, topology changes, command execution and other driver components are exposed as events that developers can capture by registering listeners. This can be an invaluable troubleshooting tool and can help facilitate monitoring the health of an application.

const { MongoClient, BSON: { EJSON } } = require('mongodb');

function debugPrint(label, event) {
  console.log(`${label}: ${EJSON.stringify(event)}`);
}

async function main() {
  const client = new MongoClient("mongodb://localhost:27017", { monitorCommands: true });
  client.on('commandStarted', (event) => debugPrint('commandStarted', event));
  client.on('connectionCheckedOut', (event) => debugPrint('connectionCheckedOut', event));
  await client.connect();
  const coll = client.db("test").collection("foo");
  const result = await coll.findOne();
  await client.close();
}
main();

Given the example above (using the Node.js driver) the specified connection events and command events would be logged as they’re emitted by the driver:

connectionCheckedOut: {"time":{"$date":"2024-05-17T15:18:18.589Z"},"address":"localhost:27018","name":"connectionCheckedOut","connectionId":1}
commandStarted: {"name":"commandStarted","address":"127.0.0.1:27018","connectionId":1,"serviceId":null,"requestId":5,"databaseName":"test","commandName":"find","command":{"find":"foo","filter":{},"limit":1,"singleBatch":true,"batchSize":1,"lsid":{"id":{"$binary":{"base64":"4B1kOPCGRUe/641MKhGT4Q==","subType":"04"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1715959097,"i":1}},"signature":{"hash":{"$binary":"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"00"}},"keyId":0}},"$db":"test"},"serverConnectionId":140}

The preferred method of observing internal behavior will be standardized logging once it is available in all drivers (DRIVERS-1204); until that time, only event monitoring is consistently available. In the future, additional observability tooling such as OpenTelemetry support may also be introduced.

Specifications: Command Logging and Monitoring, SDAM Logging and Monitoring, Standardized Logging, Connection Pool Logging

Testability

To ensure that existing as well as net-new drivers can be effectively tested for correctness and performance, most specifications define a standard set of YAML tests to improve driver conformance. This allows specification authors and maintainers to describe functionality once, with the confidence that the tests can be executed consistently by language-specific test runners across all drivers.

Though the unified test format greatly simplifies language-specific implementations, not all tests can be represented in this fashion. In those cases the specifications may describe tests to be manually implemented as prose. By limiting the number of prose tests that each driver must implement, engineers can deliver functionality with greater confidence while also minimizing the burden of upstream verification.

Specifications: Unified Test Format, Atlas Data Federation Testing, Performance Benchmarking, BSON Corpus, Replication Event Resilience, FAAS Automated Testing, Atlas Serverless Testing

Conclusion

Most (if not all) of the information required to build a new driver or maintain existing drivers technically exists within the specifications; however, without a mental model of their composition and architecture it can be extremely challenging to know where to look.

Peeling the "drivers onion" should hopefully make reasoning about them a little easier, especially with the understanding that everything can be tested to validate individual implementations are "up to spec".

Driver Mantras

When developing specifications -- and the drivers themselves -- we adhere to the following principles:

Strive to be idiomatic, but favor consistency

Drivers attempt to provide the easiest way to work with MongoDB in a given language ecosystem, while specifications attempt to provide a consistent behavior and experience across all languages. Drivers should strive to be as idiomatic as possible while meeting the specification and staying true to the original intent.

No Knobs

Too many choices stress out users. Whenever possible, we aim to minimize the number of configuration options exposed to users. In particular, if a typical user would have no idea how to choose a correct value, we pick a good default instead of adding a knob.

Topology agnostic

Users test and deploy against different topologies or might scale up from replica sets to sharded clusters. Applications should never need to use the driver differently based on topology type.

Where possible, depend on server to return errors

The features available to users depend on a server's version, topology, storage engine and configuration. So that drivers don't need to code and test all possible variations, and to maximize forward compatibility, always let users attempt operations and let the server error when it can't comply. Exceptions should be rare, reserved for cases where the server might not return an error even though correctness is at stake.

Minimize administrative helpers

Administrative helpers are methods for admin tasks, like user creation. These are rarely used and have maintenance costs as the server changes the administrative API. Don't create administrative helpers; let users rely on "RunCommand" for administrative commands.

Check wire version, not server version

When determining server capabilities within the driver, rely only on the maxWireVersion in the hello response, not on the X.Y.Z server version. An exception is testing server development releases, as the server bumps the wire version early and then continues to add features until the GA release.

When in doubt, use "MUST" not "SHOULD" in specs

Specs guide our work. While there are occasionally valid technical reasons for drivers to differ in their behavior, avoid encouraging it with a wishy-washy "SHOULD" instead of a more assertive "MUST".

Defy augury

While we have some idea of what the server will do in the future, don't design features with those expectations in mind. Design and implement based on what is expected in the next release.

Case Study: In designing OP_MSG, we held off on designing support for Document Sequences in Replies in drivers until the server would support it. We subsequently decided not to implement that feature in the server.

The best way to see what the server does is to test it

For any unusual case, relying on documentation or anecdote to anticipate the server's behavior in different versions/topologies/etc. is error-prone. The best way to check the server's behavior is to use a driver or the shell and test it directly.

Drivers follow semantic versioning

Drivers should follow X.Y.Z versioning, where breaking API changes require a bump to X. See semver.org for more.

Backward breaking behavior changes and semver

Backward breaking behavior changes can be more dangerous and disruptive than backward breaking API changes. When thinking about the implications of a behavior change, ask yourself what could happen if a user upgraded your library without carefully reading the changelog and/or adequately testing the change.

BSON

The latest version of the specification can be found at https://bsonspec.org/spec.html.


Specification Version 1.1

BSON is a binary format in which zero or more ordered key/value pairs are stored as a single entity. We call this entity a document.

The following grammar specifies version 1.1 of the BSON standard. We've written the grammar using a pseudo-BNF syntax. Valid BSON data is represented by the document non-terminal.

Basic Types

The following basic types are used as terminals in the rest of the grammar. Each type must be serialized in little-endian format.

byte	1 byte (8-bits)
signed_byte(n)	8-bit, two's complement signed integer for which the value is n
unsigned_byte(n)	8-bit unsigned integer for which the value is n
int32	4 bytes (32-bit signed integer, two's complement)
int64	8 bytes (64-bit signed integer, two's complement)
uint64	8 bytes (64-bit unsigned integer)
double	8 bytes (64-bit IEEE 754-2008 binary floating point)
decimal128	16 bytes (128-bit IEEE 754-2008 decimal floating point)

Non-terminals

The following specifies the rest of the BSON grammar. Note that we use the * operator as shorthand for repetition (e.g. (byte*2) is byte byte). When used as a unary operator, * means that the repetition can occur 0 or more times.

document	::=	int32 e_list unsigned_byte(0)	BSON Document. int32 is the total number of bytes comprising the document.
e_list	::=	element e_list
|	""
element	::=	signed_byte(1) e_name double	64-bit binary floating point
|	signed_byte(2) e_name string	UTF-8 string
|	signed_byte(3) e_name document	Embedded document
|	signed_byte(4) e_name document	Array
|	signed_byte(5) e_name binary	Binary data
|	signed_byte(6) e_name	Undefined (value) — Deprecated
|	signed_byte(7) e_name (byte*12)	ObjectId
|	signed_byte(8) e_name unsigned_byte(0)	Boolean - false
|	signed_byte(8) e_name unsigned_byte(1)	Boolean - true
|	signed_byte(9) e_name int64	UTC datetime
|	signed_byte(10) e_name	Null value
|	signed_byte(11) e_name cstring cstring	Regular expression - The first cstring is the regex pattern, the second is the regex options string. Options are identified by characters, which must be stored in alphabetical order. Valid option characters are i for case insensitive matching, m for multiline matching, s for dotall mode ("." matches everything), x for verbose mode, and u to make "\w", "\W", etc. match Unicode.
|	signed_byte(12) e_name string (byte*12)	DBPointer — Deprecated
|	signed_byte(13) e_name string	JavaScript code
|	signed_byte(14) e_name string	Symbol — Deprecated
|	signed_byte(15) e_name code_w_s	JavaScript code with scope — Deprecated
|	signed_byte(16) e_name int32	32-bit integer
|	signed_byte(17) e_name uint64	Timestamp
|	signed_byte(18) e_name int64	64-bit integer
|	signed_byte(19) e_name decimal128	128-bit decimal floating point
|	signed_byte(-1) e_name	Min key
|	signed_byte(127) e_name	Max key
e_name	::=	cstring	Key name
string	::=	int32 (byte*) unsigned_byte(0)	String - The int32 is the number of bytes in the (byte*) plus one for the trailing null byte. The (byte*) is zero or more UTF-8 encoded characters.
cstring	::=	(byte*) unsigned_byte(0)	Zero or more modified UTF-8 encoded characters followed by the null byte. The (byte*) MUST NOT contain unsigned_byte(0), hence it is not full UTF-8.
binary	::=	int32 subtype (byte*)	Binary - The int32 is the number of bytes in the (byte*).
subtype	::=	unsigned_byte(0)	Generic binary subtype
|	unsigned_byte(1)	Function
|	unsigned_byte(2)	Binary (Old)
|	unsigned_byte(3)	UUID (Old)
|	unsigned_byte(4)	UUID
|	unsigned_byte(5)	MD5
|	unsigned_byte(6)	Encrypted BSON value
|	unsigned_byte(7)	Compressed BSON column
|	unsigned_byte(8)	Sensitive
|	unsigned_byte(128)—unsigned_byte(255)	User defined
code_w_s	::=	int32 string document	Code with scope — Deprecated

Notes

  • Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.
  • UTC datetime - The int64 is UTC milliseconds since the Unix epoch.
  • Timestamp - Special internal type used by MongoDB replication and sharding. First 4 bytes are an increment, second 4 are a timestamp.
  • Min key - Special type which compares lower than all other possible BSON element values.
  • Max key - Special type which compares higher than all other possible BSON element values.
  • Generic binary subtype - This is the most commonly used binary subtype and should be the 'default' for drivers and tools.
  • Compressed BSON Column - Compact storage of BSON data. This data type uses delta and delta-of-delta compression and run-length-encoding for efficient element storage. Also has an encoding for sparse arrays containing missing values.
  • The BSON "binary" or "BinData" datatype is used to represent arrays of bytes. It is somewhat analogous to the Java notion of a ByteArray. BSON binary values have a subtype. This is used to indicate what kind of data is in the byte array. Subtypes from 0 to 127 are predefined or reserved. Subtypes from 128 to 255 are user-defined.
    • unsigned_byte(2) Binary (Old) - This used to be the default subtype, but was deprecated in favor of subtype 0. Drivers and tools should be sure to handle subtype 2 appropriately. The structure of the binary data (the byte* array in the binary non-terminal) must be an int32 followed by a (byte*). The int32 is the number of bytes in the repetition.
    • unsigned_byte(3) UUID (Old) - This used to be the UUID subtype, but was deprecated in favor of subtype 4. Drivers and tools for languages with a native UUID type should handle subtype 3 appropriately.
    • unsigned_byte(128)—unsigned_byte(255) User defined subtypes. The binary data can be anything.
  • Code with scope - Deprecated. The int32 is the length in bytes of the entire code_w_s value. The string is JavaScript code. The document is a mapping from identifiers to values, representing the scope in which the string should be evaluated.
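As a worked example of the grammar above, the document {"hello": "world"} always serializes to the same 22 bytes; the sketch below uses the Node.js "bson" package (an assumption) to show them:

const { serialize } = require('bson');

const bytes = Buffer.from(serialize({ hello: 'world' }));
console.log(bytes.toString('hex'));
// 16000000                  int32 total document size (22 bytes, little-endian)
// 02                        element type 0x02 (UTF-8 string)
// 68656c6c6f00              e_name "hello" plus null terminator
// 06000000 776f726c6400     string: int32 length 6, then "world" plus null terminator
// 00                        document terminator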


ObjectID format

  • Status: Accepted
  • Minimum Server Version: N/A



Abstract

This specification documents the format and data contents of ObjectID BSON values that the drivers and the server generate when no field values have been specified (e.g. creating an ObjectID BSON value when no _id field is present in a document). It is primarily aimed to provide an alternative to the historical use of the MD5 hashing algorithm for the machine information field of the ObjectID, which is problematic when providing a FIPS compliant implementation. It also documents existing best practices for the timestamp and counter fields.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 (https://www.ietf.org/rfc/rfc2119.txt).

Specification

The ObjectID (https://www.mongodb.com/docs/manual/reference/method/ObjectId/) BSON type is a 12-byte value consisting of three different portions (fields):

  • a 4-byte value representing the seconds since the Unix epoch in the highest order bytes,
  • a 5-byte random number unique to a machine and process,
  • a 3-byte counter, starting with a random value.


  4 byte timestamp    5 byte process unique   3 byte counter
|<----------------->|<---------------------->|<------------>|
[----|----|----|----|----|----|----|----|----|----|----|----]
0                   4                   8                   12


Timestamp Field

This 4-byte big endian field represents the seconds since the Unix epoch (Jan 1st, 1970, midnight UTC). It is an ever-increasing value whose range lasts until about Feb 7th, 2106.

Drivers MUST create ObjectIDs with this value representing the number of seconds since the Unix epoch.

Drivers MUST interpret this value as an unsigned 32-bit integer when converting it to language-specific date/time values and when converting it to a timestamp.

Drivers SHOULD have an accessor method on an ObjectID class for obtaining the timestamp value.
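For instance, using the Node.js "bson" package (an assumption; the ObjectId value below is illustrative), the timestamp is simply the first four big-endian bytes:

const { ObjectId } = require('bson');

const oid = new ObjectId('507f1f77bcf86cd799439011');
console.log(oid.getTimestamp());                         // 2012-10-17T21:13:27.000Z
// Equivalent manual decode of the first 8 hex characters:
const seconds = parseInt(oid.toHexString().slice(0, 8), 16);
console.log(new Date(seconds * 1000));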

Random Value

A 5-byte field consisting of a random value generated once per process. This random value is unique to the machine and process.

Drivers MUST NOT have an accessor method on an ObjectID class for obtaining this value.

The random number does not have to be cryptographic. If possible, use a PRNG with OS supplied entropy that SHOULD NOT block to wait for more entropy to become available. Otherwise, seed a deterministic PRNG to ensure uniqueness of process and machine by combining time, process ID, and hostname.

Counter

A 3-byte big endian counter.

This counter MUST be initialised to a random value when the driver is first activated. After initialisation, the counter MUST be increased by 1 for every ObjectID creation.

When the counter overflows (i.e., hits 16777215+1), the counter MUST be reset to 0.

Drivers MUST NOT have an accessor method on an ObjectID class for obtaining this value.

The random number does not have to be cryptographic. If possible, use a PRNG with OS supplied entropy that SHOULD NOT block to wait for more entropy to become available. Otherwise, seed a deterministic PRNG to ensure uniqueness of process and machine by combining time, process ID, and hostname.

Test Plan

Drivers MUST:

  • Ensure that the Timestamp field is represented as an unsigned 32-bit integer representing the number of seconds since the Epoch for the Timestamp values:

    • 0x00000000: To match "Jan 1st, 1970 00:00:00 UTC"
    • 0x7FFFFFFF: To match "Jan 19th, 2038 03:14:07 UTC"
    • 0x80000000: To match "Jan 19th, 2038 03:14:08 UTC"
    • 0xFFFFFFFF: To match "Feb 7th, 2106 06:28:15 UTC"
  • Ensure that the Counter field successfully overflows its sequence from 0xFFFFFF to 0x000000.

  • Ensure that after a new process is created through a fork() or similar process creation operation, the "random number unique to a machine and process" is no longer the same as the parent process that created the new process.

Motivation for Change

Besides the specific exclusion of MD5 as an allowed hashing algorithm, the information in this specification is meant to align the ObjectID generation algorithm of both drivers and the server.

Design Rationale

Timestamp: The timestamp is a 32-bit unsigned integer, as it allows us to extend the furthest date that the timestamp can represent from the year 2038 to 2106. There is no reason why MongoDB would generate a timestamp to mean a date before 1970, as MongoDB did not exist back then.

Random Value: Originally, this field consisted of the Machine ID and Process ID fields. There were numerous divergences between drivers due to implementation choices, and the Machine ID field traditionally used the MD5 hashing algorithm which can't be used on FIPS compliant machines. In order to allow for a similar behaviour among all drivers and the MongoDB Server, these two fields have been collated together into a single 5-byte random value, unique to a machine and process.

Counter: The counter makes it possible to have multiple ObjectIDs per second, per server, and per process. As the counter can overflow, there is a possibility of having duplicate ObjectIDs if you create more than 16 million ObjectIDs per second in the same process on a single machine.

Endianness: The Timestamp and Counter are big endian because we can then use memcmp to order ObjectIDs, and we want to ensure an increasing order.

Backwards Compatibility

This specification requires that the existing Machine ID and Process ID fields are merged into a single 5-byte value. This will change the behaviour of ObjectID generation, as well as the behaviour of drivers that currently have getters and setters for the original Machine ID and Process ID fields.

Reference Implementation

Currently there is no full reference implementation.

Changelog

  • 2022-10-05: Remove spec front matter and reformat changelog.
  • 2019-01-14: Clarify that the random numbers don't need to be cryptographically secure. Add a test to test that the unique value is different in forked processes.
  • 2018-10-11: Clarify that the Timestamp and Counter fields are big endian, and add the reason why.
  • 2018-07-02: Replaced Machine ID and Process ID fields with a single 5-byte unique value.
  • 2018-05-22: Initial Release.

BSON Decimal128 Type Handling in Drivers

  • Status: Accepted
  • Minimum Server Version: 3.4

Abstract

MongoDB 3.4 introduces a new BSON type representing high-precision decimal ("\x13"), known as Decimal128. 3.4-compatible drivers must support this type by creating a Value Object for it, possibly with accessor functions for retrieving its value in data types supported by the respective languages.

Round-tripping Decimal128 values between driver and server MUST NOT change their value or representation in any way. Conversion to and from native language types is complicated, and there are many pitfalls in representing Decimal128 precisely in all languages.

While many languages offer a native decimal type, the precision of these types often does not exactly match that of the MongoDB implementation. To ensure error-free conversion and consistency between official MongoDB drivers, this specification does not allow automatically converting the BSON Decimal128 type into a language-defined decimal type.

Language drivers will wrap their native type in value objects by default and SHOULD offer accessor functions for retrieving its value represented by language-defined types if appropriate. A driver that offers the ability to configure mappings to/from BSON types to native types MAY allow the option to automatically convert the BSON Decimal128 type to a native type. It should however be made abundantly clear to the user that converting to native data types risks incurring data loss.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Terminology

IEEE 754-2008 128-bit decimal floating point (Decimal128):
The Decimal128 specification supports 34 decimal digits of precision, a max value of approximately 10^6145, and min value of approximately -10^6145. This is the new BSON Decimal128 type ("\x13").

Clamping:
Clamping happens when a value's exponent is too large for the destination format. This works by adding zeros to the coefficient to reduce the exponent to the largest usable value. An overflow occurs if the number of digits required is more than allowed in the destination format.

Binary Integer Decimal (BID):
MongoDB uses this binary encoding for the coefficient as specified in IEEE 754-2008 section 3.5.2 using method 2 "binary encoding" rather than method 1 "decimal encoding". The byte order is little-endian, like the rest of the BSON types.

Value Object:
An immutable container type representing a value (e.g. Decimal128). This Value Object MAY provide accessors that retrieve the abstracted value as a different type (e.g. casting it). double x = valueObject.getAsDouble();

Specification

BSON Decimal128 implementation details

The BSON Decimal128 data type implements the Decimal Arithmetic Encodings specification, with certain exceptions around value integrity and the coefficient encoding. When a value cannot be represented exactly, the value will be rejected.

The coefficient MUST be stored as an unsigned binary integer (BID) rather than the densely-packed decimal (DPD) shown in the specification. See either the IEEE Std 754-2008 spec or the driver examples for further detail.

The specification defines several statuses which are meant to signal exceptional circumstances, such as when overflowing occurs, and how to handle them.

BSON Decimal128 Value Objects MUST implement these actions for these exceptions:

  • Overflow

    • When overflow occurs, the operation MUST emit an error and result in a failure
  • Underflow

    • When underflow occurs, the operation MUST emit an error and result in a failure
  • Clamping

    • Since clamping does not change the actual value, only the representation of it, clamping MUST occur without emitting an error.
  • Rounding

    • When the coefficient requires more digits than Decimal128 provides, rounding MUST be done without emitting an error, unless it would result in inexact rounding, in which case the operation MUST emit an error and result in a failure.
  • Conversion Syntax

    • Invalid strings MUST emit an error and result in a failure.

It should be noted that the given exponent is a preferred representation. If the value cannot be stored due to the value of the exponent being too large or too small, but can be stored using an alternative representation by clamping and/or rounding, a BSON Decimal128 compatible Value Object MUST do so, unless such an operation results in inexact rounding or other underflow or overflow.

Reading from BSON

A BSON type "\x13" MUST be represented by an immutable Value Object by default and MUST NOT be automatically converted into language native numeric type by default. A driver that offers users a way to configure the exact type mapping to and from BSON types MAY allow the BSON Decimal128 type to be converted to the user configured type.

A driver SHOULD provide accessors for this immutable Value Object, which can return a language-specific representation of the Decimal128 value, after converting it into the respective type. For example, Java may choose to provide Decimal128.getBigDecimal().

All drivers MUST provide an accessor for retrieving the value as a string. Drivers MAY provide other accessors, retrieving the value as other types.

Serializing and writing BSON

Drivers MUST provide a way of constructing the Value Object, as the driver representation of the BSON Decimal128 is an immutable Value Object by default.

A driver MUST have a way to construct this Value Object from a string. For example, Java MUST provide a method similar to Decimal128.valueOf("2.000").

A driver that has accessors for different types SHOULD provide a way to construct the Value Object from those types.
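A minimal sketch using the Node.js "bson" package (an assumption) shows the string-based round trip and the explicit nature of any native conversion:

const { Decimal128 } = require('bson');

const price = Decimal128.fromString('2.000');
console.log(price.toString());               // "2.000" (trailing zeros are preserved)
// Any math requires an explicit, user-chosen conversion to a native type:
const approx = Number(price.toString());     // potential precision loss is now explicit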

Reading from Extended JSON

The Extended JSON representation of Decimal128 is a document with the key $numberDecimal and a value of the Decimal128 as a string. Drivers that support Extended JSON formatting MUST support the $numberDecimal type specifier.

When an Extended JSON $numberDecimal is parsed, its type should be the same as that of a deserialized BSON Decimal128, as described in Reading from BSON.

The Extended JSON $numberDecimal value follows the same stringification rules as defined in From String Representation.

Writing to Extended JSON

The Extended JSON type identifier is $numberDecimal, while the value itself is a string. Drivers that support converting values to Extended JSON MUST be able to convert its Decimal128 value object to Extended JSON.

Converting a Decimal128 Value Object to Extended JSON MUST follow the conversion rules in To String Representation, and other stringification rules as when converting Decimal128 Value Object to a String.

Operator overloading and math on Decimal128 Value Objects

Drivers MUST NOT allow any mathematical operator overloading for the Decimal128 Value Objects. This includes adding two Decimal128 Value Objects and assigning the result to a new object.

If a user wants to perform mathematical operations on Decimal128 Value Objects, the user must explicitly retrieve the native language value representations of the objects and perform the operations on those native representations. The user will then create a new Decimal128 Value Object and optionally overwrite the original Decimal128 Value Object.

From String Representation

For finite numbers, we will use the definition at http://speleotrove.com/decimal/daconvs.html. It has been modified to account for a different NaN representation and whitespace rules and copied here:

Strings which are acceptable for conversion to the abstract representation of
numbers, or which might result from conversion from the abstract representation
to a string, are called numeric strings.


A numeric string is a character string that describes either a finite
number or a special value.
* If it describes a finite number, it includes one or more decimal digits,
  with an optional decimal point. The decimal point may be embedded in the
  digits, or may be prefixed or suffixed to them. The group of digits (and
  optional point) thus constructed may have an optional sign ('+' or '-')
  which must come before any digits or decimal point. 
* The string thus described may optionally be followed by an 'E'
  (indicating an exponential part), an optional sign, and an integer
  following the sign that represents a power of ten that is to be applied.
  The 'E' may be in uppercase or lowercase.
* If it describes a special value, it is one of the case-independent names
  'Infinity', 'Inf', or 'NaN' (where the first two represent infinity and
  the last represents NaN). The name may be preceded by an optional sign,
  as for finite numbers. 
* No blanks or other whitespace characters are permitted in a numeric string.

Formally

          sign           ::=  '+' | '-'
          digit          ::=  '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' |
                              '8' | '9'
          indicator      ::=  'e' | 'E'
          digits         ::=  digit [digit]...
          decimal-part   ::=  digits '.' [digits] | ['.'] digits
          exponent-part  ::=  indicator [sign] digits
          infinity       ::=  'Infinity' | 'Inf'
          nan            ::=  'NaN'
          numeric-value  ::=  decimal-part [exponent-part] | infinity
          numeric-string ::=  [sign] numeric-value | [sign] nan

where the characters in the strings accepted for 'infinity' and 'nan' may be in
any case.  If an implementation supports the concept of diagnostic information
on NaNs, the numeric strings for NaNs MAY include one or more digits, as shown
above.[3]  These digits encode the diagnostic information in an
implementation-defined manner; however, conversions to and from string for
diagnostic NaNs should be reversible if possible. If an implementation does not
support diagnostic information on NaNs, these digits should be ignored where
necessary. A plain 'NaN' is usually the same as 'NaN0'.


Drivers MAY choose to support signed NaN (sNaN), along with sNaN with
diagnostic information. 



Examples::
Some numeric strings are:
            "0"         -- zero
           "12"         -- a whole number
          "-76"         -- a signed whole number
           "12.70"      -- some decimal places
           "+0.003"     -- a plus sign is allowed, too
          "017."        -- the same as 17
             ".5"       -- the same as 0.5
           "4E+9"       -- exponential notation
            "0.73e-7"   -- exponential notation, negative power
           "Inf"        -- the same as Infinity
           "-infinity"  -- the same as -Infinity
           "NaN"        -- not-a-Number

Notes:
1. A single period alone or with a sign is not a valid numeric string.
2. A sign alone is not a valid numeric string.
3. Significant (after the decimal point) and insignificant leading zeros
       are permitted.

To String Representation

For finite numbers, we will use the definition at http://speleotrove.com/decimal/daconvs.html. It has been copied here:

The coefficient is first converted to a string in base ten using the characters
0 through 9 with no leading zeros (except if its value is zero, in which case a
single 0 character is used).


Next, the adjusted exponent is calculated; this is the exponent, plus the
number of characters in the converted coefficient, less one. That is,
exponent+(clength-1), where clength is the length of the coefficient in decimal
digits.


If the exponent is less than or equal to zero and the adjusted exponent is
greater than or equal to -6, the number will be converted to a character form
without using exponential notation. In this case, if the exponent is zero then
no decimal point is added. Otherwise (the exponent will be negative), a decimal
point will be inserted with the absolute value of the exponent specifying the
number of characters to the right of the decimal point. '0' characters are
added to the left of the converted coefficient as necessary. If no character
precedes the decimal point after this insertion then a conventional '0'
character is prefixed.


Otherwise (that is, if the exponent is positive, or the adjusted exponent is
less than -6), the number will be converted to a character form using
exponential notation. In this case, if the converted coefficient has more than
one digit a decimal point is inserted after the first digit. An exponent in
character form is then suffixed to the converted coefficient (perhaps with
inserted decimal point); this comprises the letter 'E' followed immediately by
the adjusted exponent converted to a character form. The latter is in base ten,
using the characters 0 through 9 with no leading zeros, always prefixed by a
sign character ('-' if the calculated exponent is negative, '+' otherwise).

This corresponds to the following code snippet:

var adjusted_exponent = _exponent + (clength - 1);
if (_exponent > 0 || adjusted_exponent < -6) {
    // exponential notation
} else {
    // character form without using exponential notation
}

For special numbers such as infinity or the not a number (NaN) variants, the below table is used:

Value                   String
Positive Infinite       Infinity
Negative Infinite       -Infinity
Positive NaN            NaN
Negative NaN            NaN
Signaled NaN            NaN
Negative Signaled NaN   NaN
NaN with a payload      NaN

Finally, there are certain other invalid representations that must be treated as zeros, as per IEEE 754-2008. The tests will verify that each special value has been accounted for.

The server log files as well as the Extended JSON Format for Decimal128 use this format.

Motivation for Change

BSON already contains support for double ("\x01"), but this type is insufficient for certain values that require strict precision and representation, such as money, where it is necessary to perform exact decimal rounding.

The new BSON type is the 128-bit IEEE 754-2008 decimal floating point number, which is specifically designed to cope with these issues.

Design Rationale

For simplicity and consistency between drivers, drivers must not automatically convert this type into a native type by default. This also ensures original data preservation, which is crucial to Decimal128. It is however recommended that drivers offer a way to convert the Value Object to a native type through accessors, and to create a new BSON type from native types. This forces the user to explicitly do the conversion and thus understand the difference between the MongoDB type and possible language precision and representation. Representations via conversions done outside MongoDB are not guaranteed to be identical.

Backwards Compatibility

There should be no backwards compatibility concerns. This specification merely deals with how to encode and decode BSON/Extended JSON Decimal128.

Reference Implementations

Tests

See the BSON Corpus for tests.

Most of the tests are converted from the General Decimal Arithmetic Testcases.

Q&A

  • Is it true Decimal128 doesn't normalize the value?

    • Yes. As a result of non-normalization rules of the Decimal128 data type, precision is represented exactly. For example, '2.00' always remains stored as 200E-2 in Decimal128, and it differs from the representation of '2.0' (20E-1). These two values compare equally, but represent different ideas.
  • How does Decimal128 "2.000" look in the shell?

    • NumberDecimal("2.000")
  • Should a driver avoid sending Decimal128 values to pre-3.4 servers?

    • No
  • Is there a wire version bump or something for Decimal128?

    • No

Changelog

  • 2024-02-08: Migrated from reStructuredText to Markdown.
  • 2022-10-05: Remove spec front matter.

Handling of Native UUID Types

  • Status: Accepted
  • Minimum Server Version: N/A



Abstract

The Java, C#, and Python drivers natively support platform types for UUID, all of which by default encode them to and decode them from BSON binary subtype 3. However, each encodes the bytes in a different order from the others. To improve interoperability, BSON binary subtype 4 was introduced and defined the byte order according to RFC 4122 (https://tools.ietf.org/html/rfc4122#section-4.1.2), and a mechanism to configure each driver to encode UUIDs this way was added to each driver. The legacy representation remained as the default for each driver.

This specification moves MongoDB drivers further towards the standard UUID representation by requiring an application relying on native UUID support to explicitly specify the representation it requires.

Drivers that support native UUID types will additionally create helpers on their BsonBinary class that will aid in conversion to and from the platform native UUID type.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 <https://www.ietf.org/rfc/rfc2119.txt>_.

Specification

Terms

UUID: A Universally Unique IDentifier

BsonBinary: An object that wraps an instance of a BSON binary value

Naming Deviations

All drivers MUST name operations, objects, and parameters as defined in the following sections.

The following deviations are permitted:

  • Drivers can use the platform's name for a UUID. For instance, in C# the platform class is Guid, whereas in Java it is UUID.
  • Drivers can use a "to" prefix instead of an "as" prefix for the BsonBinary method names.

Explicit encoding and decoding

Any driver with a native UUID type MUST add the following UuidRepresentation enumeration, and associated methods to its BsonBinary (or equivalent) class:


enum UuidRepresentation {

  /**
   * An unspecified representation of UUID.  Essentially, this is the null 
   * representation value. This value is not required for languages that      
   * have better ways of indicating, or preventing use of, a null value.
   */
  UNSPECIFIED("unspecified"),

  /**
   * The canonical representation of UUID according to RFC 4122, 
   * section 4.1.2 
   * 
   * It encodes as BSON binary subtype 4
  */
  STANDARD("standard"),

  /**
   * The legacy representation of UUID used by the C# driver.
   *
   * In this representation the order of bytes 0-3 are reversed, the 
   * order of bytes 4-5 are reversed, and the order of bytes 6-7 are 
   * reversed.
   *
   * It encodes as BSON binary subtype 3
   */
  C_SHARP_LEGACY("csharpLegacy"),
  
  /**
   * The legacy representation of UUID used by the Java driver.
   *
   * In this representation the order of bytes 0-7 are reversed, and the 
   * order of bytes 8-15 are reversed.
   *
   * It encodes as BSON binary subtype 3
   */
  JAVA_LEGACY("javaLegacy"),

 /**
  * The legacy representation of UUID used by the Python driver.
  *
  * As with STANDARD, this representation conforms with RFC 4122, section
  * 4.1.2 
  *
  * It encodes as BSON binary subtype 3
  */
  PYTHON_LEGACY("pythonLegacy")

}

class BsonBinary {

  /*
   * Construct from a UUID using the standard UUID representation.
   *
   * [Specification] This constructor SHOULD be included but MAY be
   * omitted if it creates backwards compatibility issues
   */
  constructor(Uuid uuid)

  /*
   * Construct from a UUID using the given UUID representation.
   *
   * The representation must not be equal to UNSPECIFIED
   */
  constructor(Uuid uuid, UuidRepresentation representation)

  /*
   * Decode a subtype 4 binary to a UUID, erroring when the subtype is not 4.
   */
  Uuid asUuid()  

  /*
   * Decode a subtype 3 or 4 to a UUID, according to the UUID    
   * representation, erroring when subtype does not match the
   * representation.
   */
  Uuid asUuid(UuidRepresentation representation)

}

Implicit decoding and encoding

A new driver for a language with a native UUID type MUST NOT implicitly encode from or decode to the native UUID type. Rather, explicit conversion MUST be used as described in the previous section.

Drivers that already do such implicit encoding and decoding SHOULD support a URI option, uuidRepresentation, which controls the default behavior of the UUID codec. Alternatively, a driver MAY specify the UUID representation via global state.

Value          Default?   Encode to                                           Decode subtype 4 to   Decode subtype 3 to
unspecified    yes        raise error                                         BsonBinary            BsonBinary
standard       no         BSON binary subtype 4                               native UUID           BsonBinary
csharpLegacy   no         BSON binary subtype 3 with C# legacy byte order     BsonBinary            native UUID
javaLegacy     no         BSON binary subtype 3 with Java legacy byte order   BsonBinary            native UUID
pythonLegacy   no         BSON binary subtype 3 with standard byte order      BsonBinary            native UUID

For scenarios where the application makes the choice (e.g. a POJO with a field of type UUID), or when serializers are strongly typed and are constrained to always return values of a certain type, the driver will raise an exception in cases where otherwise it would be required to decode to a different type (e.g. BsonBinary instead of UUID or vice versa).

Note also that none of the above applies when decoding to strictly typed maps, e.g. a Map<String, BsonValue> like Java or .NET's BsonDocument class. In those cases the driver is always decoding to BsonBinary, and applications would use the asUuid methods to explicitly convert from BsonBinary to UUID.

Implementation Notes

Since changing the default UUID representation can reasonably be considered a backwards-breaking change, drivers that implement the full specification should stage implementation according to semantic versioning guidelines. Specifically, support for this specification can be added to a minor release, but with several exceptions:

The default UUID representation should be left as is (e.g. JAVA_LEGACY for the Java driver) rather than be changed to UNSPECIFIED. In a subsequent major release, the default UUID representation can be changed to UNSPECIFIED (along with appropriate documentation indicating the backwards-breaking change). Drivers MUST document this in a prior minor release.

Test Plan

The test plan consists of a series of prose tests. They all operate on the same UUID, with the String representation of "00112233-4455-6677-8899-aabbccddeeff".

Explicit encoding

  1. Create a BsonBinary instance with the given UUID

    a. Assert that the BsonBinary instance's subtype is equal to 4 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"

  2. Create a BsonBinary instance with the given UUID and UuidRepresentation equal to STANDARD

    a. Assert that the BsonBinary instance's subtype is equal to 4 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"

  3. Create a BsonBinary instance with the given UUID and UuidRepresentation equal to JAVA_LEGACY

    a. Assert that the BsonBinary instance's subtype is equal to 3 and data equal to the hex-encoded string "7766554433221100FFEEDDCCBBAA9988"

  4. Create a BsonBinary instance with the given UUID and UuidRepresentation equal to CSHARP_LEGACY

    a. Assert that the BsonBinary instance's subtype is equal to 3 and data equal to the hex-encoded string "33221100554477668899AABBCCDDEEFF"

  5. Create a BsonBinary instance with the given UUID and UuidRepresentation equal to PYTHON_LEGACY

    a. Assert that the BsonBinary instance's subtype is equal to 3 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"

  6. Create a BsonBinary instance with the given UUID and UuidRepresentation equal to UNSPECIFIED

    a. Assert that an error is raised

Explicit Decoding

  1. Create a BsonBinary instance with subtype equal to 4 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"

    a. Assert that a call to BsonBinary.asUuid() returns the given UUID
    b. Assert that a call to BsonBinary.asUuid(STANDARD) returns the given UUID
    c. Assert that a call to BsonBinary.asUuid(UNSPECIFIED) raises an error
    d. Assert that a call to BsonBinary.asUuid(JAVA_LEGACY) raises an error
    e. Assert that a call to BsonBinary.asUuid(CSHARP_LEGACY) raises an error
    f. Assert that a call to BsonBinary.asUuid(PYTHON_LEGACY) raises an error

  2. Create a BsonBinary instance with subtype equal to 3 and data equal to the hex-encoded string "7766554433221100FFEEDDCCBBAA9988"

    a. Assert that a call to BsonBinary.asUuid() raises an error
    b. Assert that a call to BsonBinary.asUuid(STANDARD) raises an error
    c. Assert that a call to BsonBinary.asUuid(UNSPECIFIED) raises an error
    d. Assert that a call to BsonBinary.asUuid(JAVA_LEGACY) returns the given UUID

  3. Create a BsonBinary instance with subtype equal to 3 and data equal to the hex-encoded string "33221100554477668899AABBCCDDEEFF"

    a. Assert that a call to BsonBinary.asUuid() raises an error
    b. Assert that a call to BsonBinary.asUuid(STANDARD) raises an error
    c. Assert that a call to BsonBinary.asUuid(UNSPECIFIED) raises an error
    d. Assert that a call to BsonBinary.asUuid(CSHARP_LEGACY) returns the given UUID

  4. Create a BsonBinary instance with subtype equal to 3 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"

    a. Assert that a call to BsonBinary.asUuid() raises an error
    b. Assert that a call to BsonBinary.asUuid(STANDARD) raises an error
    c. Assert that a call to BsonBinary.asUuid(UNSPECIFIED) raises an error
    d. Assert that a call to BsonBinary.asUuid(PYTHON_LEGACY) returns the given UUID

Implicit encoding

  1. Set the uuidRepresentation of the client to "javaLegacy". Insert a document with an "_id" key set to the given native UUID value.

    a. Assert that the actual value inserted is a BSON binary with subtype 3 and data equal to the hex-encoded string "7766554433221100FFEEDDCCBBAA9988"

  2. Set the uuidRepresentation of the client to "csharpLegacy". Insert a document with an "_id" key set to the given native UUID value.

    a. Assert that the actual value inserted is a BSON binary with subtype 3 and data equal to the hex-encoded string "33221100554477668899AABBCCDDEEFF"

  3. Set the uuidRepresentation of the client to "pythonLegacy". Insert a document with an "_id" key set to the given native UUID value.

    a. Assert that the actual value inserted is a BSON binary with subtype 3 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"

  4. Set the uuidRepresentation of the client to "standard". Insert a document with an "_id" key set to the given native UUID value.

    a. Assert that the actual value inserted is a BSON binary with subtype 4 and data equal to the hex-encoded string "00112233445566778899AABBCCDDEEFF"

  5. Set the uuidRepresentation of the client to "unspecified". Insert a document with an "_id" key set to the given native UUID value.

    a. Assert that a BSON serialization exception is thrown

Implicit Decoding

  1. Set the uuidRepresentation of the client to "javaLegacy". Insert a document containing two fields. The "standard" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the STANDARD UuidRepresentation. The "legacy" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the JAVA_LEGACY UuidRepresentation. Find the document.

    a. Assert that the value of the "standard" field is of type BsonBinary and is equal to the inserted value.
    b. Assert that the value of the "legacy" field is of the native UUID type and is equal to the given UUID.

    Repeat this test with the uuidRepresentation of the client set to "csharpLegacy" and "pythonLegacy".

  2. Set the uuidRepresentation of the client to "standard". Insert a document containing two fields. The "standard" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the STANDARD UuidRepresentation. The "legacy" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the PYTHON_LEGACY UuidRepresentation. Find the document.

    a. Assert that the value of the "standard" field is of the native UUID type and is equal to the given UUID.
    b. Assert that the value of the "legacy" field is of type BsonBinary and is equal to the inserted value.

  3. Set the uuidRepresentation of the client to "unspecified". Insert a document containing two fields. The "standard" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the STANDARD UuidRepresentation. The "legacy" field should contain a BSON Binary created by creating a BsonBinary instance with the given UUID and the PYTHON_LEGACY UuidRepresentation. Find the document.

    a. Assert that the value of the "standard" field is of type BsonBinary and is equal to the inserted value.
    b. Assert that the value of the "legacy" field is of type BsonBinary and is equal to the inserted value.

    Repeat this test with the uuidRepresentation of the client set to "csharpLegacy" and "pythonLegacy".

Note: the assertions will be different in the release prior to the major release, to avoid breaking changes. Adjust accordingly!

Q & A

What's the rationale for the deviations allowed by the specification?

In short, the C# driver has existing behavior that makes it infeasible for it to work the same way as other drivers.

The C# driver has a global serialization registry. Since it's global and not per-MongoClient, it's not feasible to override the UUID representation on a per-MongoClient basis, since doing so would require a per-MongoClient registry. Instead, the specification allows for a global override so that the C# driver can implement the specification.

Additionally, the C# driver has an existing configuration parameter that controls the behavior of BSON readers and writers at a level below the serializers. This configuration affects the semantics of the existing BsonBinary class in a way that doesn't allow for the constructor(UUID) mentioned in the specification. For this reason, that constructor is specified as optional.

Changelog

  • 2022-10-05: Remove spec front matter.

Handling of DBRefs

  • Status: Accepted
  • Minimum Server Version: N/A

Abstract

DBRefs are a convention for expressing a reference to another document as an embedded document (i.e. BSON type 0x03). Several drivers provide a model class for encoding and/or decoding DBRef documents. This specification will both define the structure of a DBRef and provide guidance for implementing model classes in drivers that choose to do so.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

This specification presents documents as Extended JSON for readability and expressing special types (e.g. ObjectId). Although JSON fields are unordered, the order of fields presented herein should be considered pertinent. This is especially relevant for the Test Plan.

Specification

DBRef Structure

A DBRef is an embedded document with the following fields:

  • $ref: required string field. Contains the name of the collection where the referenced document resides. This MUST be the first field in the DBRef.
  • $id: required field. Contains the value of the _id field of the referenced document. This MUST be the second field in the DBRef.
  • $db: optional string field. Contains the name of the database where the referenced document resides. If specified, this MUST be the third field in the DBRef. If omitted, the referenced document is assumed to reside in the same database as the DBRef.
  • Extra, optional fields may follow after $id or $db (if specified). There are no inherent restrictions on extra field names; however, older server versions may impose their own restrictions (e.g. no dots or dollars).

DBRefs have no relation to the deprecated DBPointer BSON type (i.e. type 0x0C).

Examples of Valid DBRefs

The following examples are all valid DBRefs:

// Basic DBRef with only $ref and $id fields
{ "$ref": "coll0", "$id": { "$oid": "60a6fe9a54f4180c86309efa" } }

// DBRef $id is not necessarily an ObjectId
{ "$ref": "coll0", "$id": 1 }

// DBRef with optional $db field
{ "$ref": "coll0", "$id": 1, "$db": "db0" }

// DBRef with extra, optional fields (with or without $db)
{ "$ref": "coll0", "$id": 1, "$db": "db0", "foo": "bar" }
{ "$ref": "coll0", "$id": 1, "foo": true }

// Extra field names have no inherent restrictions
{ "$ref": "coll0", "$id": 1, "$foo": "bar" }
{ "$ref": "coll0", "$id": 1, "foo.bar": 0 }

Examples of Invalid DBRefs

The following examples are all invalid DBRefs:

// Required fields are omitted
{ "$ref": "coll0" }
{ "$id": { "$oid": "60a6fe9a54f4180c86309efa" } }

// Invalid types for $ref or $db
{ "$ref": true, "$id": 1 }
{ "$ref": "coll0", "$id": 1, "$db": 1 }

// Fields are out of order
{ "$id": 1, "$ref": "coll0" }

Implementing a DBRef Model

Drivers MAY provide a model class for encoding and/or decoding DBRef documents. For those drivers that do, this section defines expected behavior of that class. This section does not prohibit drivers from implementing additional functionality, provided it does not conflict with any of these guidelines.

Constructing a DBRef model

Drivers MAY provide an API for constructing a DBRef model directly from its constituent parts. If so:

  • Drivers MUST solicit a string value for $ref.
  • Drivers MUST solicit an arbitrary value for $id. Drivers SHOULD NOT enforce any restrictions on this value; however, this may be necessary if the driver is unable to differentiate between certain BSON types (e.g. null, undefined) and the parameter being unspecified.
  • Drivers SHOULD solicit an optional string value for $db.
  • Drivers MUST require $ref and $db (if specified) to be strings but MUST NOT enforce any naming restrictions on the string values.
  • Drivers MAY solicit extra, optional fields.
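
As a rough illustration of the construction rules above (not any particular driver's API), a minimal model class might look like this in Python; the class and field names are placeholders:

class DBRefModel:
    """Illustrative DBRef model; not any specific driver's class."""

    def __init__(self, ref, id, db=None, **extra):
        # $ref and $db (when given) must be strings; no naming restrictions are enforced.
        if not isinstance(ref, str):
            raise TypeError("$ref must be a string")
        if db is not None and not isinstance(db, str):
            raise TypeError("$db must be a string")
        self.ref = ref
        self.id = id              # any value is accepted for $id
        self.db = db
        self.extra = dict(extra)  # extra, optional fields

ref = DBRefModel("coll0", 1, db="db0", foo="bar")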

Decoding a BSON document to a DBRef model

Drivers MAY support explicit and/or implicit decoding. An example of explicit decoding might be a DBRef model constructor that takes a BSON document. An example of implicit decoding might be configuring the driver's BSON codec to automatically convert embedded documents that comply with the DBRef Structure into a DBRef model.

Drivers that provide implicit decoding SHOULD provide some way for applications to opt out and allow DBRefs to be decoded like any other embedded document.

When decoding a BSON document to a DBRef model:

  • Drivers MUST require $ref and $id to be present.
  • Drivers MUST require $ref and $db (if present) to be strings but MUST NOT enforce any naming restrictions on the string values.
  • Drivers MUST accept any BSON type for $id and MUST NOT enforce any restrictions on its value.
  • Drivers MUST preserve extra, optional fields (beyond $ref, $id, and $db) and MUST provide some way to access those fields via the DBRef model. For example, an accessor method that returns the original BSON document (including $ref, etc.) would fulfill this requirement.

If a BSON document cannot be implicitly decoded to a DBRef model, it MUST be left as-is (like any other embedded document). If a BSON document cannot be explicitly decoded to a DBRef model, the driver MUST raise an error.

Since DBRefs are a special type of embedded document, a DBRef model class used for decoding SHOULD inherit the class used to represent an embedded document (e.g. Hash in Ruby). This will allow applications to always expect an instance of a common class when decoding an embedded document (if desired) and should also support the requirement for DBRef models to provide access to any extra, optional fields.
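
A rough sketch of implicit decoding consistent with the rules above might look like the following; a driver's real codec would hook this into its document-decoding path, and an explicit decoder would raise an error instead of returning None. The function name is illustrative.

def try_decode_dbref(doc):
    """Return ($ref, $id, $db, extras) if doc is a valid DBRef, else None (leave the document as-is)."""
    if "$ref" not in doc or "$id" not in doc:
        return None
    if not isinstance(doc["$ref"], str):
        return None
    if "$db" in doc and not isinstance(doc["$db"], str):
        return None
    # Any BSON type is accepted for $id; extra fields are preserved for later access.
    extras = {k: v for k, v in doc.items() if k not in ("$ref", "$id", "$db")}
    return doc["$ref"], doc["$id"], doc.get("$db"), extras

try_decode_dbref({"$id": 1, "$ref": "coll0", "foo": "bar"})  # out-of-order fields are still accepted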

Encoding a DBRef model to a BSON document

Drivers MAY support explicit and/or implicit encoding. An example of explicit encoding might be a DBRef method that returns its corresponding representation as a BSON document. An example of implicit encoding might be configuring the driver's BSON codec to automatically convert DBRef models to the corresponding BSON document representation as needed.

If a driver supports implicit decoding of BSON to a DBRef model, it SHOULD also support implicit encoding. Doing so will allow applications to more easily round-trip DBRefs through the driver.

When encoding a DBRef model to a BSON document:

  • Drivers MUST encode all fields in the order defined in DBRef Structure.
  • Drivers MUST encode $ref and $id. If $db was specified, it MUST be encoded after $id. If any extra, optional fields were specified, they MUST be encoded after $id or $db.
  • If the DBRef includes any extra, optional fields after $id or $db, drivers SHOULD attempt to preserve the original order of those fields relative to one another.
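
A minimal encoding sketch following the field-order rules above; the helper itself is hypothetical and uses the field names defined in DBRef Structure:

def encode_dbref(ref, id, db=None, extras=None):
    """Emit DBRef fields in the required order: $ref, $id, then $db, then any extras."""
    doc = {"$ref": ref, "$id": id}   # Python 3.7+ dicts preserve insertion order
    if db is not None:
        doc["$db"] = db
    doc.update(extras or {})         # extras keep their relative order
    return doc

encode_dbref("coll0", 1, db="db0", extras={"foo": 1, "bar": 1})
# {'$ref': 'coll0', '$id': 1, '$db': 'db0', 'foo': 1, 'bar': 1}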

Test Plan

The test plan consists of a series of prose tests. These tests are only relevant to drivers that provide a DBRef model class.

The documents in these tests are presented as Extended JSON for readability; however, readers should consider the field order pertinent when translating to BSON (or their language equivalent). These tests are not intended to exercise a driver's Extended JSON parser. Implementations SHOULD construct the documents directly using native BSON types (e.g. Document, ObjectId).

Decoding

These tests are only relevant to drivers that allow decoding into a DBRef model. Drivers SHOULD implement these tests for both explicit and implicit decoding code paths as needed.

  1. Valid documents MUST be decoded to a DBRef model. For each of the following:

    1. { "$ref": "coll0", "$id": { "$oid": "60a6fe9a54f4180c86309efa" } }
    2. { "$ref": "coll0", "$id": 1 }
    3. { "$ref": "coll0", "$id": null }
    4. { "$ref": "coll0", "$id": 1, "$db": "db0" }

    Assert that each document is successfully decoded to a DBRef model. Assert that the $ref, $id, and $db (if applicable) fields have their expected value.

  2. Valid documents with extra fields MUST be decoded to a DBRef model and the model MUST provide some way to access those extra fields. For each of the following:

    1. { "$ref": "coll0", "$id": 1, "$db": "db0", "foo": "bar" }
    2. { "$ref": "coll0", "$id": 1, "foo": true, "bar": false }
    3. { "$ref": "coll0", "$id": 1, "meta": { "foo": 1, "bar": 2 } }
    4. { "$ref": "coll0", "$id": 1, "$foo": "bar" }
    5. { "$ref": "coll0", "$id": 1, "foo.bar": 0 }

    Assert that each document is successfully decoded to a DBRef model. Assert that the $ref, $id, and $db (if applicable) fields have their expected value. Assert that it is possible to access all extra fields and that those fields have their expected value.

  3. Documents with out of order fields that are otherwise valid MUST be decoded to a DBRef model. For each of the following:

    1. { "$id": 1, "$ref": "coll0" }
    2. { "$db": "db0", "$ref": "coll0", "$id": 1 }
    3. { "foo": 1, "$id": 1, "$ref": "coll0" }
    4. { "foo": 1, "$ref": "coll0", "$id": 1, "$db": "db0" }
    5. { "foo": 1, "$ref": "coll0", "$id": 1, "$db": "db0", "bar": 1 }

    Assert that each document is successfully decoded to a DBRef model. Assert that the $ref, $id, $db (if applicable), and any extra fields (if applicable) have their expected value.

  4. Documents missing required fields MUST NOT be decoded to a DBRef model. For each of the following:

    1. { "$ref": "coll0" }
    2. { "$id": { "$oid": "60a6fe9a54f4180c86309efa" } }
    3. { "$db": "db0" }

    Assert that each document is not decoded to a DBRef model. In the context of implicit decoding, the document MUST be decoded like any other embedded document. In the context of explicit decoding, the DBRef decoding method MUST raise an error.

  5. Documents with invalid types for $ref or $db MUST NOT be decoded to a DBRef model. For each of the following:

    1. { "$ref": true, "$id": 1 }
    2. { "$ref": "coll0", "$id": 1, "$db": 1 }

    Assert that each document is not decoded to a DBRef model. In the context of implicit decoding, the document MUST be decoded like any other embedded document. In the context of explicit decoding, the DBRef decoding method MUST raise an error.

Encoding

These tests are only relevant to drivers that allow encoding a DBRef model. Drivers SHOULD implement these tests for both explicit and implicit encoding code paths as needed.

Drivers MAY use any method to create the DBRef model for each test (e.g. constructor, explicit decoding method).

Drivers MAY skip tests that cannot be implemented as written (e.g. DBRef model constructor does not support extra, optional fields and the driver also does not support explicit/implicit decoding).

  1. Encoding DBRefs with basic fields. For each of the following:

    1. { "$ref": "coll0", "$id": { "$oid": "60a6fe9a54f4180c86309efa" } }
    2. { "$ref": "coll0", "$id": 1 }
    3. { "$ref": "coll0", "$id": null }
    4. { "$ref": "coll0", "$id": 1, "$db": "db0" }

    Assert that each DBRef model is successfully encoded to a BSON document. Assert that the $ref, $id, and $db (if applicable) fields appear in the correct order and have their expected values.

  2. Encoding DBRefs with extra, optional fields. For each of the following:

    1. { "$ref": "coll0", "$id": 1, "$db": "db0", "foo": "bar" }
    2. { "$ref": "coll0", "$id": 1, "foo": true, "bar": false }
    3. { "$ref": "coll0", "$id": 1, "meta": { "foo": 1, "bar": 2 } }
    4. { "$ref": "coll0", "$id": 1, "$foo": "bar" }
    5. { "$ref": "coll0", "$id": 1, "foo.bar": 0 }

    Assert that each DBRef model is successfully encoded to a BSON document. Assert that the $ref, $id, $db (if applicable), and any extra fields appear in the correct order and have their expected values.

  3. Encoding DBRefs re-orders any out of order fields during decoding. This test MUST NOT use a constructor that solicits fields individually. For each of the following:

    1. { "$id": 1, "$ref": "coll0" }
    2. { "$db": "db0", "$ref": "coll0", "$id": 1 }
    3. { "foo": 1, "$id": 1, "$ref": "coll0" }
    4. { "foo": 1, "$ref": "coll0", "$id": 1, "$db": "db0" }
    5. { "foo": 1, "$ref": "coll0", "$id": 1, "$db": "db0", "bar": 1 }

    Assert that each document is successfully decoded to a DBRef model and then successfully encoded back to a BSON document. Assert that the order of fields in each encoded BSON document matches the following, respectively:

    1. { "$ref": "coll0", "$id": 1 }
    2. { "$ref": "coll0", "$id": 1, "$db": "db0" }
    3. { "$ref": "coll0", "$id": 1, "foo": 1 }
    4. { "$ref": "coll0", "$id": 1, "$db": "db0", "foo": 1}
    5. { "$ref": "coll0", "$id": 1, "$db": "db0", "foo": 1, "bar": 1 }

Design Rationale

In contrast to always encoding DBRefs with the correct field order, decoding permits fields to be out of order (provided the document is otherwise valid). This follows the robustness principle in having the driver be liberal in what it accepts and conservative in what it emits. This does mean that round-tripping an out of order DBRef through a driver could result in its field order being changed; however, this behavior is consistent with existing behavior in drivers that model DBRefs (e.g. C#, Java, Node, Python, Ruby) and applications can opt out of implicit decoding if desired.

Changelog

  • 2024-02-26: Migrated from reStructuredText to Markdown.
  • 2022-10-05: Remove spec front matter.

Extended JSON

  • Status: Accepted
  • Minimum Server Version: N/A

Abstract

MongoDB Extended JSON is a string format for representing BSON documents. This specification defines the canonical format for representing each BSON type in the Extended JSON format. Thus, a tool that implements Extended JSON will be able to parse the output of any tool that emits Canonical Extended JSON. It also defines a Relaxed Extended JSON format that improves readability at the expense of type information preservation.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Naming

Acceptable naming deviations should fall within the basic style of the language. For example, CanonicalExtendedJSON would be a name in Java, where camel-case method names are used, but in Ruby canonical_extended_json would be acceptable.

Terms

Type wrapper object - a JSON value consisting of an object with one or more $-prefixed keys that collectively encode a BSON type and its corresponding value using only JSON value primitives.

Extended JSON - A general term for one of many string formats based on the JSON standard that describes how to represent BSON documents in JSON using standard JSON types and/or type wrapper objects. This specification gives a formal definition to variations of such a format.

Relaxed Extended JSON - A string format based on the JSON standard that describes BSON documents. Relaxed Extended JSON emphasizes readability and interoperability at the expense of type preservation.

Canonical Extended JSON - A string format based on the JSON standard that describes BSON documents. Canonical Extended JSON emphasizes type preservation at the expense of readability and interoperability.

Legacy Extended JSON - A string format based on the JSON standard that describes a BSON document. The Legacy Extended JSON format does not describe a specific, standardized format, and many tools, drivers, and libraries implement Extended JSON in conflicting ways.

Specification

Extended JSON Format

The Extended JSON grammar extends the JSON grammar as defined in section 2 of the JSON specification by augmenting the possible JSON values as defined in Section 3. This specification defines two formats for Extended JSON:

  • Canonical Extended JSON
  • Relaxed Extended JSON

An Extended JSON value MUST conform to one of these two formats as described in the table below.

Notes on grammar

  • Key order:
    • Keys within Canonical Extended JSON type wrapper objects SHOULD be emitted in the order described.
    • Keys within Relaxed Extended JSON type wrapper objects are unordered.
  • Terms in italics represent types defined elsewhere in the table or in the JSON specification.
  • JSON numbers (as defined in Section 6 of the JSON specification) include both integer and floating point types. For the purpose of this document, we define the following subtypes:
    • Type integer means a JSON number without frac or exp components; this is expressed in the JSON spec grammar as [minus] int.
    • Type non-integer means a JSON number that is not an integer; it must include either a frac or exp component or both.
    • Type pos-integer means a non-negative JSON number without frac or exp components; this is expressed in the JSON spec grammar as int.
  • A hex string is a JSON string that contains only hexadecimal digits [0-9a-f]. It SHOULD be emitted lower-case, but MUST be read in a case-insensitive fashion.
  • <Angle brackets> detail the contents of a value, including type information.
  • [Square brackets] specify a type constraint that restricts the specification to a particular range or set of values.

Conversion table

BSON 1.1 Type or Convention | Canonical Extended JSON Format | Relaxed Extended JSON Format
ObjectId | {"$oid": <ObjectId bytes as 24-character, big-endian hex string>} | <Same as Canonical>
Symbol | {"$symbol": string} | <Same as Canonical>
String | string | <Same as Canonical>
Int32 | {"$numberInt": <32-bit signed integer as a string>} | integer
Int64 | {"$numberLong": <64-bit signed integer as a string>} | integer
Double [finite] | {"$numberDouble": <64-bit signed floating point as a decimal string>} | non-integer
Double [non-finite] | {"$numberDouble": <One of the strings: "Infinity", "-Infinity", or "NaN">} | <Same as Canonical>
Decimal128 | {"$numberDecimal": <decimal as a string>}[1] | <Same as Canonical>
Binary | {"$binary": {"base64": <base64-encoded (with padding as =) payload as a string>, "subType": <BSON binary type as a one- or two-character hex string>}} | <Same as Canonical>
Code | {"$code": string} | <Same as Canonical>
CodeWScope | {"$code": string, "$scope": Document} | <Same as Canonical>
Document | object (with Extended JSON extensions) | <Same as Canonical>
Timestamp | {"$timestamp": {"t": pos-integer, "i": pos-integer}} | <Same as Canonical>
Regular Expression | {"$regularExpression": {pattern: string, "options": <BSON regular expression options as a string or ""[2]>}} | <Same as Canonical>
DBPointer | {"$dbPointer": {"$ref": <namespace[3] as a string>, "$id": ObjectId}} | <Same as Canonical>
Datetime [year from 1970 to 9999 inclusive] | {"$date": {"$numberLong": <64-bit signed integer giving millisecs relative to the epoch, as a string>}} | {"$date": <ISO-8601 Internet Date/Time Format as described in RFC-3339[4] with maximum time precision of milliseconds[5] as a string>}
Datetime [year before 1970 or after 9999] | {"$date": {"$numberLong": <64-bit signed integer giving millisecs relative to the epoch, as a string>}} | <Same as Canonical>
DBRef[6] (Note: this is not technically a BSON type, but it is a common convention.) | {"$ref": <collection name>, "$id": <value>}. If the generator supports DBRefs with a database component, and the database component is nonempty: {"$ref": <collection name>, "$id": <value>, "$db": <database name>}. DBRefs may also have other fields, which MUST appear after $id and $db (if supported). | <Same as Canonical>
MinKey | {"$minKey": 1} | <Same as Canonical>
MaxKey | {"$maxKey": 1} | <Same as Canonical>
Undefined | {"$undefined": true} | <Same as Canonical>
Array | array | <Same as Canonical>
Boolean | true or false | <Same as Canonical>
Null | null | <Same as Canonical>

Representation of Non-finite Numeric Values

Following the Extended JSON format for the Decimal128 type, non-finite numeric values are encoded as follows:

Value | String
Positive Infinity | Infinity
Negative Infinity | -Infinity
NaN (all variants) | NaN

For example, a BSON floating-point number with a value of negative infinity would be encoded as Extended JSON as follows:

{"$numberDouble": "-Infinity"}

Parsers

An Extended JSON parser (hereafter just "parser") is a tool that transforms an Extended JSON string into another representation, such as BSON or a language-native data structure.

By default, a parser MUST accept values in either Canonical Extended JSON format or Relaxed Extended JSON format as described in this specification. A parser MAY allow users to restrict parsing to only Canonical Extended JSON format or only Relaxed Extended JSON format.

A parser MAY also accept strings that adhere to other formats, such as Legacy Extended JSON formats emitted by old versions of mongoexport or other tools, but only if explicitly configured to do so.

A parser that accepts Legacy Extended JSON MUST be configurable such that a JSON text of a MongoDB query filter containing the regex query operator can be parsed, e.g.:

{ "$regex": {
    "$regularExpression" : { "pattern": "foo*", "options": "" }
    },
    "$options" : "ix"
}

or:

{ "$regex": {
    "$regularExpression" : { "pattern": "foo*", "options": "" }
    }
}

A parser that accepts Legacy Extended JSON MUST be configurable such that a JSON text of a MongoDB query filter containing the type query operator can be parsed, e.g.:

{ "zipCode" : { $type : 2 } }

or:

{ "zipCode" : { $type : "string" } }

A parser SHOULD support at least 200 levels of nesting in an Extended JSON document but MAY set other limits on strings it can accept as defined in section 9 of the JSON specification.

When parsing a JSON object other than the top-level object, the presence of a $-prefixed key indicates the object could be a type wrapper object as described in the Extended JSON Conversion table. In such a case, the parser MUST follow these rules, unless configured to allow Legacy Extended JSON, in which case it SHOULD follow these rules:

  • Parsers MUST NOT consider key order as having significance. For example, the document {"$code": "function(){}", "$scope": {}} must be considered identical to {"$scope": {}, "$code": "function(){}"}.

  • If the parsed object contains any of the special keys for a type in the Conversion table (e.g. "$binary", "$timestamp") then it must contain exactly the keys of the type wrapper. Any missing or extra keys constitute an error.

    DBRef is the lone exception to this rule, as it is only a common convention and not a proper type. An object that resembles a DBRef but fails to fully comply with its structure (e.g. has $ref but missing $id) MUST be left as-is and MUST NOT constitute an error.

  • If the keys of the parsed object exactly match the keys of a type wrapper in the Conversion table, and the values of the parsed object have the correct type for the type wrapper as described in the Conversion table, then the parser MUST interpret the parsed object as a type wrapper object of the corresponding type.

  • If the keys of the parsed object exactly match the keys of a type wrapper in the Conversion table, but any of the values are of an incorrect type, then the parser MUST report an error.

  • If the $-prefixed key does not match a known type wrapper in the Conversion table, the parser MUST NOT raise an error and MUST leave the value as-is. See Restrictions and limitations for additional information.
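
A simplified sketch of how a parser might apply these rules; the key sets shown are only a subset of the Conversion table, and value type-checking is elided:

# A few of the type wrapper key sets from the Conversion table (not exhaustive).
TYPE_WRAPPER_KEYS = {
    frozenset({"$oid"}),
    frozenset({"$numberInt"}),
    frozenset({"$numberLong"}),
    frozenset({"$numberDouble"}),
    frozenset({"$numberDecimal"}),
    frozenset({"$binary"}),
    frozenset({"$timestamp"}),
    frozenset({"$code"}),
    frozenset({"$code", "$scope"}),
}
SPECIAL_KEYS = {key for wrapper in TYPE_WRAPPER_KEYS for key in wrapper}

def classify(obj):
    keys = frozenset(obj)            # key order is never significant
    if keys in TYPE_WRAPPER_KEYS:
        return "type wrapper"        # values must still match the table; wrong value types are an error
    if keys & SPECIAL_KEYS:
        raise ValueError(f"missing or extra keys for a type wrapper: {sorted(keys)}")
    # Unknown $-prefixed keys and DBRef-like documents are left as-is.
    return "plain document"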

Special rules for parsing JSON numbers

The Relaxed Extended JSON format uses JSON numbers for several different BSON types. In order to allow parsers to use language-native JSON decoders (which may not distinguish numeric type when parsing), the following rules apply to parsing JSON numbers:

  • If the number is a non-integer, parsers SHOULD interpret it as BSON Double.
  • If the number is an integer, parsers SHOULD interpret it as being of the smallest BSON integer type that can represent the number exactly. If a parser is unable to represent the number exactly as an integer (e.g. a large 64-bit number on a 32-bit platform), it MUST interpret it as a BSON Double even if this results in a loss of precision. The parser MUST NOT interpret it as a BSON String containing a decimal representation of the number.
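
For example, a parser written in Python might pick the BSON numeric type like this; the function name and return convention are illustrative:

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def bson_type_for_json_number(value):
    """Map a parsed JSON number to a (BSON type name, value) pair."""
    if isinstance(value, float):
        return "double", value           # non-integers become BSON Double
    if INT32_MIN <= value <= INT32_MAX:
        return "int32", value            # smallest integer type that fits exactly
    if INT64_MIN <= value <= INT64_MAX:
        return "int64", value
    return "double", float(value)        # out-of-range integers fall back to Double, losing precision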

Special rules for parsing $uuid fields

As per the UUID specification, Binary subtype 3 or 4 are used to represent UUIDs in BSON. Consequently, UUIDs are handled as per the convention described for the Binary type in the Conversion table, e.g. the following document written with the MongoDB Python Driver:

{"Binary": uuid.UUID("c8edabc3-f738-4ca3-b68d-ab92a91478a3")}

is transformed into the following (newlines and spaces added for readability):

{"Binary": {
    "$binary": {
        "base64": "yO2rw/c4TKO2jauSqRR4ow==",
        "subType": "04"}
    }
}

Note: The above described type conversion assumes that UUID representation is set to STANDARD. See the UUID specification for more information about UUID representations.

While this transformation preserves BSON subtype information (since UUIDs can be represented as BSON subtype 3 or 4), base64-encoding is not the standard way of representing UUIDs and using it makes comparing these values against textual representations coming from platform libraries difficult. Consequently, we also allow UUIDs to be represented in extended JSON as:

{"$uuid": <canonical textual representation of a UUID>}

The rules for generating the canonical string representation of a UUID are defined in RFC 4122 Section 3. Use of this format results in a more readable extended JSON representation of the UUID from the previous example:

{"Binary": {
    "$uuid": "c8edabc3-f738-4ca3-b68d-ab92a91478a3"
    }
}

Parsers MUST interpret the $uuid key as BSON Binary subtype 4. Parsers MUST accept textual representations of UUIDs that omit the URN prefix (usually urn:uuid:). Parsers MAY also accept textual representations of UUIDs that omit the hyphens between hex character groups (e.g. c8edabc3f7384ca3b68dab92a91478a3).
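
A sketch of parsing a "$uuid" value into the payload of a BSON Binary with subtype 4, using only the Python standard library; the function name is illustrative:

import re
import uuid

def parse_uuid_field(text):
    """Return (subtype, payload bytes) for a "$uuid" string."""
    normalized = text.lower()
    if normalized.startswith("urn:uuid:"):      # the URN prefix may be omitted
        normalized = normalized[len("urn:uuid:"):]
    normalized = normalized.replace("-", "")    # hyphens between hex groups may be omitted
    if not re.fullmatch(r"[0-9a-f]{32}", normalized):
        raise ValueError(f"invalid $uuid value: {text!r}")
    return 4, uuid.UUID(hex=normalized).bytes

parse_uuid_field("c8edabc3-f738-4ca3-b68d-ab92a91478a3")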

Generators

An Extended JSON generator (hereafter just "generator") produces strings in an Extended JSON format.

A generator MUST allow users to produce strings in either the Canonical Extended JSON format or the Relaxed Extended JSON format. If generators provide a default format, the default SHOULD be the Relaxed Extended JSON format.

A generator MAY be capable of exporting strings that adhere to other formats, such as Legacy Extended JSON formats.

A generator SHOULD support at least 100 levels of nesting in a BSON document.

Transforming BSON

Given a BSON document (e.g. a buffer of bytes meeting the requirements of the BSON specification), a generator MUST use the corresponding JSON values or Extended JSON type wrapper objects for the BSON type given in the Extended JSON Conversion table for the desired format. When transforming a BSON document into Extended JSON text, a generator SHOULD emit the JSON keys and values in the same order as given in the BSON document.

Transforming Language-Native data

Given language-native data (e.g. type primitives, container types, classes, etc.), if there is a semantically-equivalent BSON type for a given language-native type, a generator MUST use the corresponding JSON values or Extended JSON type wrapper objects for the BSON type given in the Extended JSON Conversion table for the desired format. For example, a Python datetime object must be represented the same as a BSON datetime type. A generator SHOULD error if a language-native type has no semantically-equivalent BSON type.
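
As a concrete example, recent versions of PyMongo's bson.json_util expose both formats; the output strings shown in the comments are indicative:

from bson.int64 import Int64
from bson.json_util import dumps, CANONICAL_JSON_OPTIONS, RELAXED_JSON_OPTIONS

doc = {"count": Int64(42), "ratio": 0.5}
print(dumps(doc, json_options=CANONICAL_JSON_OPTIONS))
# {"count": {"$numberLong": "42"}, "ratio": {"$numberDouble": "0.5"}}
print(dumps(doc, json_options=RELAXED_JSON_OPTIONS))
# {"count": 42, "ratio": 0.5}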

Format and Method Names

The following format names SHOULD be used for selecting formats for generator output:

  • canonicalExtendedJSON (references Canonical Extended JSON as described in this specification)
  • relaxedExtendedJSON (references Relaxed Extended JSON as described in this specification)
  • legacyExtendedJSON (if supported: references Legacy Extended JSON, with implementation-defined behavior)

Generators MAY use these format names as part of function/method names or MAY use them as arguments or constants, as needed.

If a generator provides a generic to_json or to_extended_json method, it MUST default to producing Relaxed Extended JSON or MUST be deprecated in favor of a spec-compliant method.

Restrictions and limitations

Extended JSON is designed primarily for testing and human inspection of BSON documents. It is not designed to reliably round-trip BSON documents. One fundamental limitation is that JSON objects are inherently unordered and BSON objects are ordered.

Further, Extended JSON uses $-prefixed keys in type wrappers and has no provision for escaping a leading $ used elsewhere in a document. This means that the Extended JSON representation of a document with $-prefixed keys could be indistinguishable from another document with a type wrapper with the same keys.

Extended JSON formats SHOULD NOT be used in contexts where $-prefixed keys could exist in BSON documents (with the exception of the DBRef convention, which is accounted for in this spec).

Test Plan

Drivers, tools, and libraries can test their compliance to this specification by running the tests in version 2.0 and above of the BSON Corpus Test Suite.

Examples

Canonical Extended JSON Example

Consider the following document, written with the MongoDB Python Driver:

{
    "_id": bson.ObjectId("57e193d7a9cc81b4027498b5"),
    "String": "string",
    "Int32": 42,
    "Int64": bson.Int64(42),
    "Double": 42.42,
    "Decimal": bson.Decimal128("1234.5"),
    "Binary": uuid.UUID("c8edabc3-f738-4ca3-b68d-ab92a91478a3"),
    "BinaryUserDefined": bson.Binary(b'123', 80),
    "Code": bson.Code("function() {}"),
    "CodeWithScope": bson.Code("function() {}", scope={}),
    "Subdocument": {"foo": "bar"},
    "Array": [1, 2, 3, 4, 5],
    "Timestamp": bson.Timestamp(42, 1),
    "RegularExpression": bson.Regex("foo*", "xi"),
    "DatetimeEpoch": datetime.datetime.utcfromtimestamp(0),
    "DatetimePositive": datetime.datetime.max,
    "DatetimeNegative": datetime.datetime.min,
    "True": True,
    "False": False,
    "DBRef": bson.DBRef(
        "collection", bson.ObjectId("57e193d7a9cc81b4027498b1"), database="database"),
    "DBRefNoDB": bson.DBRef(
        "collection", bson.ObjectId("57fd71e96e32ab4225b723fb")),
    "Minkey": bson.MinKey(),
    "Maxkey": bson.MaxKey(),
    "Null": None
}

The above document is transformed into the following (newlines and spaces added for readability):

{
    "_id": {
        "$oid": "57e193d7a9cc81b4027498b5"
    },
    "String": "string",
    "Int32": {
        "$numberInt": "42"
    },
    "Int64": {
        "$numberLong": "42"
    },
    "Double": {
        "$numberDouble": "42.42"
    },
    "Decimal": {
        "$numberDecimal": "1234.5"
    },
    "Binary": {
        "$binary": {
            "base64": "yO2rw/c4TKO2jauSqRR4ow==",
            "subType": "04"
        }
    },
    "BinaryUserDefined": {
        "$binary": {
            "base64": "MTIz",
            "subType": "80"
        }
    },
    "Code": {
        "$code": "function() {}"
    },
    "CodeWithScope": {
        "$code": "function() {}",
        "$scope": {}
    },
    "Subdocument": {
        "foo": "bar"
    },
    "Array": [
        {"$numberInt": "1"},
        {"$numberInt": "2"},
        {"$numberInt": "3"},
        {"$numberInt": "4"},
        {"$numberInt": "5"}
    ],
    "Timestamp": {
        "$timestamp": { "t": 42, "i": 1 }
    },
    "RegularExpression": {
        "$regularExpression": {
            "pattern": "foo*",
            "options": "ix"
        }
    },
    "DatetimeEpoch": {
        "$date": {
            "$numberLong": "0"
        }
    },
    "DatetimePositive": {
        "$date": {
            "$numberLong": "253402300799999"
        }
    },
    "DatetimeNegative": {
        "$date": {
            "$numberLong": "-62135596800000"
        }
    },
    "True": true,
    "False": false,
    "DBRef": {
        "$ref": "collection",
        "$id": {
            "$oid": "57e193d7a9cc81b4027498b1"
        },
        "$db": "database"
    },
    "DBRefNoDB": {
        "$ref": "collection",
        "$id": {
            "$oid": "57fd71e96e32ab4225b723fb"
        }
    },
    "Minkey": {
        "$minKey": 1
    },
    "Maxkey": {
        "$maxKey": 1
    },
    "Null": null
}

Relaxed Extended JSON Example

In Relaxed Extended JSON, the example document is transformed similarly to Canonical Extended JSON, with the exception of the following keys (newlines and spaces added for readability):

{
    ...
    "Int32": 42,
    "Int64": 42,
    "Double": 42.42,
    ...
    "DatetimeEpoch": {
        "$date": "1970-01-01T00:00:00.000Z"
    },
    ...
}

Motivation for Change

There existed many Extended JSON parser and generator implementations prior to this specification that used conflicting formats, since there was no agreement on the precise format of Extended JSON. This resulted in problems where the output of some generators could not be consumed by some parsers.

MongoDB drivers needed a single, standard Extended JSON format for testing that covers all BSON types. However, there were BSON types that had no defined Extended JSON representation. This spec primarily addresses that need, but provides for slightly broader use as well.

Design Rationale

Of Relaxed and Canonical Formats

There are various use cases for expressing BSON documents in a text rather than binary format. They broadly fall into two categories:

  • Type preserving: for things like testing, where one has to describe the expected form of a BSON document, it's helpful to be able to precisely specify expected types. In particular, numeric types need to differentiate between Int32, Int64 and Double forms.
  • JSON-like: for things like a web API, where one is sending a document (or a projection of a document) that only uses ordinary JSON type primitives, it's desirable to represent numbers in the native JSON format. This output is also the most human readable and is useful for debugging and documentation.

The two formats in this specification address these two categories of use cases.

Of Parsers and Generators

Parsers need to accept any valid Extended JSON string that a generator can produce. Parsers and generators are permitted to accept and output strings in other formats as well for backwards compatibility.

Acceptable nesting depth has implications for resource usage so unlimited nesting is not permitted.

Generators support at least 100 levels of nesting in a BSON document being transformed to Extended JSON. This aligns with MongoDB's own limitation of 100 levels of nesting.

Parsers support at least 200 levels of nesting in Extended JSON text, since the Extended JSON language can double the level of apparent nesting of a BSON document by wrapping certain types in their own documents.

Of Canonical Type Wrapper Formats

Prior to this specification, BSON types fell into three categories with respect to Legacy Extended JSON:

  1. A single, portable representation for the type already existed.
  2. Multiple representations for the type existed among various Extended JSON generators, and those representations were in conflict with each other or with current portability goals.
  3. No Legacy Extended JSON representation existed.

If a BSON type fell into category (1), this specification just declares that form to be canonical, since all drivers, tools, and libraries already know how to parse or output this form. There are two exceptions:

RegularExpression

The form {"$regex: <string>, $options: <string>"} has until this specification been canonical. The change to {"$regularExpression": {pattern: <string>, "options": <string>"}} is motivated by a conflict between the previous canonical form and the $regex MongoDB query operator. The form specified here disambiguates between the two, such that a parser can accept any MongoDB query filter, even one containing the $regex operator.

Binary

The form {"$binary": "AQIDBAU=", "$type": "80"} has until this specification been canonical. The change to {"$binary": {"base64": "AQIDBAU=", "subType": "80"}} is motivated by a conflict between the previous canonical form and the $type MongoDB query operator. The form specified here disambiguates between the two, such that a parser can accept any MongoDB query filter, even one containing the $type operator.

Reconciled type wrappers

If a BSON type fell into category (2), this specification selects a new common representation for the type to be canonical. Conflicting formats were gathered by surveying a number of Extended JSON generators, including the MongoDB Java Driver (version 3.3.0), the MongoDB Python Driver (version 3.4.0.dev0), the MongoDB Extended JSON module on NPM (version 1.7.1), and each minor version of mongoexport from 2.4.14 through 3.3.12. When possible, we set the "strict" option on the JSON codec. The following BSON types had conflicting Extended JSON representations:

Binary

Some implementations write the Extended JSON form of a Binary object with a strict two-hexadecimal digit subtype (e.g. they output a leading 0 for subtypes < 16). However, the NPM mongodb-extended-json module and Java driver use a single hexadecimal digit to represent subtypes less than 16. This specification makes both one- and two-digit representations acceptable.

Code

Mongoexport 2.4 does not quote the Code value when writing out the extended JSON form of a BSON Code object. All other implementations do so. This spec canonicalises the form where the Javascript code is quoted, since the latter form adheres to the JSON specification and the former does not. As an additional note, the NPM mongodb-extended-json module uses the form {"code": "<javascript code>"}, omitting the dollar sign ($) from the key. This specification does not accommodate the eccentricity of a single library.

CodeWithScope

In addition to the same variants as BSON Code types, there are other variations when turning CodeWithScope objects into Extended JSON. Mongoexport 2.4 and 2.6 omit the scope portion of CodeWithScope if it is empty, making the output indistinguishable from a Code type. All other implementations include the empty scope. This specification therefore canonicalises the form where the scope is always included. The presence of $scope is what differentiates Code from CodeWithScope.

Datetime

Mongoexport 2.4 and the Java driver always transform a Datetime object into an Extended JSON string of the form {"$date": <ms since epoch>}. This form has the problem of a potential loss of precision or range on the Datetimes that can be represented. Mongoexport 2.6 transforms Datetime objects into an extended JSON string of the form {"$date": <ISO-8601 date string in local time>} for dates starting at or after the Unix epoch (UTC). Dates prior to the epoch take the form {"$date": {"$numberLong": "<ms since epoch>"}}. Starting in version 3.0, mongoexport always turns Datetime objects into strings of the form {"$date": <ISO-8601 date string in UTC>}. The NPM mongodb-extended-json module does the same. The Python driver can also transform Datetime objects into strings like {"$date": {"$numberLong": "<ms since epoch>"}}. This specification canonicalises this form, since it is the most portable. In Relaxed Extended JSON format, this specification provides for ISO-8601 representation for better readability, but limits it to a portable subset, from the epoch to the end of the largest year that can be represented with four digits. This should encompass most typical use of dates in applications.

DBPointer

Mongoexport 2.4 and 2.6 use the form {"$ref": <namespace>, "$id": <hex string>}. All other implementations studied include the canonical ObjectId form: {"$ref": <namespace>, "$id": {"$oid": <hex string>}}. Neither of these forms is distinguishable from that of DBRef, so this specification creates a new format: {"$dbPointer": {"$ref": <namespace>, "$id": {"$oid": <hex string>}}}.

Newly-added type wrappers

If a BSON type fell into category (3), above, this specification creates a type wrapper format for the type. The following new Extended JSON type wrappers are introduced by this spec:

  • $dbPointer- See above.
  • $numberInt - This is used to preserve the "int32" BSON type in Canonical Extended JSON. Without using $numberInt, this type will be indistinguishable from a double in certain languages where the distinction does not exist, such as Javascript.
  • $numberDouble - This is used to preserve the double type in Canonical Extended JSON, as some JSON generators might omit a trailing ".0" for integral values.
    It also supports representing non-finite values like NaN or Infinity, which are prohibited in the JSON specification for numbers.
  • $symbol - The use of the $symbol key preserves the symbol type in Canonical Extended JSON, distinguishing it from JSON strings.

Reference Implementation

[Canonical Extended JSON format reference implementation needs to be updated] PyMongo implements the Canonical Extended JSON format, which must be chosen by selecting the right option on the JSONOptions object:

from bson.json_util import dumps, DatetimeRepresentation, CANONICAL_JSON_OPTIONS    
dumps(document, json_options=CANONICAL_JSON_OPTIONS)  

[Relaxed Extended JSON format reference implementation is TBD]

Implementation Notes

JSON File Format

Some applications like mongoexport may wish to write multiple Extended JSON documents to a single file. One way to do this is to list each JSON document one-per-line. When doing this, it is important to ensure that special characters like newlines are encoded properly (e.g. as \n).

Duplicate Keys

The BSON specification does not prohibit duplicate key names within the same BSON document, but provides no semantics for the interpretation of duplicate keys. The JSON specification says that names within an object should be unique, and many JSON libraries are incapable of handling this scenario. This specification is silent on the matter, so as not to conflict with a future change by either specification.

Future Work

This specification will need to be amended if future BSON types are added to the BSON specification.

Q&A

Q. Why was version 2 of the spec necessary?
A. After Version 1 was released, several stakeholders raised concerns that not providing an option to output BSON numbers as ordinary JSON numbers limited the utility of Extended JSON for common historical uses. We decided to provide a second format option and more clearly distinguish the use cases (and limitations) inherent in each format.

Q. My BSON parser doesn't distinguish every BSON type. Does my Extended JSON generator need to distinguish these types?
A. No. Some BSON parsers do not emit a unique type for each BSON type, making round-tripping BSON through such libraries impossible without changing the document. For example, a DBPointer will be parsed into a DBRef by PyMongo. In such cases, a generator must emit the Extended JSON form for whatever type the BSON parser emitted. It does not need to preserve type information when that information has been lost by the BSON parser.

Q. How can implementations which require backwards compatibility with Legacy Extended JSON, in which BSON regular expressions were represented with $regex, handle parsing of extended JSON text representing a MongoDB query filter containing the $regex operator?
A. An implementation can handle this in a number of ways:

  • Introduce an enumeration that determines the behavior of the parser. If the value is LEGACY, it will parse $regex and not treat $regularExpression specially, and if the value is CANONICAL, it will parse $regularExpression and not treat $regex specially.
  • Support both legacy and canonical forms in the parser without requiring the application to specify one or the other. Making that work for the $regex query operator use case will require that the rules set forth in the 1.0.0 version of this specification are followed for $regex; specifically, that a document with a $regex key whose value is a JSON object should be parsed as a normal document and not reported as an error.

Q. How can implementations which require backwards compatibility with Legacy Extended JSON, in which BSON binary values were represented like {"$binary": "AQIDBAU=", "$type": "80"}, handle parsing of extended JSON text representing a MongoDB query filter containing the $type operator?
A. An implementation can handle this in a number of ways:

  • Introduce an enumeration that determines the behavior of the parser. If the value is LEGACY, it will parse the new binary form and not treat the legacy one specially, and if the value is CANONICAL, it will parse the new form and not treat the legacy form specially.
  • Support both legacy and canonical forms in the parser without requiring the application to specify one or the other. Making that work for the $type query operator use case will require that the rules set forth in the 1.0.0 version of this specification are followed for $type; specifically, that a document with a $type key whose value is an integral type, or a document with a $type key but without a $binary key, should be parsed as a normal document and not reported as an error.

Q. Sometimes I see the term "extjson" used in other specifications. Is "extjson" related to this specification?
A. Yes, "extjson" is short for "Extended JSON".

Changelog

  • 2024-05-29: Migrated from reStructuredText to Markdown.
  • 2022-10-05: Remove spec front matter and reformat changelog.
  • 2021-05-26:
    • Remove any mention of extra dollar-prefixed keys being prohibited in a DBRef. MongoDB 5.0 and compatible drivers no longer enforce such restrictions.
    • Objects that resemble a DBRef without fully complying to its structure should be left as-is during parsing.
  • 2020-09-01: Note that $-prefixed keys not matching a known type MUST be left as-is when parsing. This is a patch-level change as this behavior was already required in the BSON corpus tests ("Document with keys that start with $").
  • 2020-09-08:
    • Added support for parsing $uuid fields as BSON Binary subtype 4.
    • Changed the example to use the MongoDB Python Driver. It previously used the MongoDB Java Driver. The new example excludes the following BSON types that are unsupported in Python - Symbol, SpecialFloat, DBPointer, and Undefined. Transformations for these types are now only documented in the Conversion table.
  • 2017-07-20:
    • Bumped specification to version 2.0.
    • Added "Relaxed" format.
    • Changed BSON timestamp type wrapper back to {"t": *int*, "i": *int*} for backwards compatibility. (The change in v1 to unsigned 64-bit string was premature optimization)
    • Changed BSON regular expression type wrapper to {"$regularExpression": {pattern: *string*, "options": *string*}}.
    • Changed BSON binary type wrapper to {"$binary": {"base64": <base64-encoded payload as a *string*>, "subType": <BSON binary type as a one- or two-character *hex string*>}}
    • Added "Restrictions and limitations" section.
    • Clarified parser and generator rules.
  • 2017-02-01: Initial specification version 1.0.

Footnotes:

[1] This MUST conform to the Decimal128 specification.

[2] BSON Regular Expression options MUST be in alphabetical order.

[5] Fractional seconds SHOULD have exactly 3 decimal places if the fractional part is non-zero. Otherwise, fractional seconds SHOULD be omitted if zero.

OP_MSG

  • Status: Accepted
  • Minimum Server Version: 3.6

Abstract

OP_MSG is a bi-directional wire protocol opcode introduced in MongoDB 3.6 with the goal of replacing most existing opcodes, merging their use into one extendable opcode.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

Usage

OP_MSG is only available in MongoDB 3.6 (maxWireVersion >= 6) and later. MongoDB drivers MUST perform the MongoDB handshake using OP_MSG if an API version was declared on the client.

If no API version was declared, drivers that have historically supported MongoDB 3.4 and earlier MUST perform the handshake using OP_QUERY to determine if the node supports OP_MSG. Drivers that have only ever supported MongoDB 3.6 and newer MAY default to using OP_MSG.

If the node supports OP_MSG, any and all messages MUST use OP_MSG, optionally compressed with OP_COMPRESSED. Authentication messages MUST also use OP_MSG when it is supported, but MUST NOT use OP_COMPRESSED.

OP_MSG

Types used in this document

Type | Meaning
document | A BSON document
cstring | NULL terminated string
int32 | 4 bytes (32-bit signed integer, two's complement)
uint8 | 1 byte (8-bit unsigned integer)
uint32 | 4 bytes (32-bit unsigned integer)
union | One of the listed members

Elements inside brackets ([ ]) are optional parts of the message.

  • Zero or more instances: *
  • One or more instances: +

The new opcode, called OP_MSG, has the following structure:

struct Section {
    uint8 payloadType;
    union payload {
        document  document; // payloadType == 0
        struct sequence { // payloadType == 1
            int32      size;
            cstring    identifier;
            document*  documents;
        };
    };
};

struct OP_MSG {
    struct MsgHeader {
        int32  messageLength;
        int32  requestID;
        int32  responseTo;
        int32  opCode = 2013;
    };
    uint32      flagBits;
    Section+    sections;
    [uint32     checksum;]
};

Each OP_MSG MUST NOT exceed the maxMessageSizeBytes as configured by the MongoDB Handshake.

Each OP_MSG MUST have one section with Payload Type 0, and zero or more Payload Type 1. Bulk writes SHOULD use Payload Type 1, and MUST do so when the batch contains more than one entry.

Sections may exist in any order. Each OP_MSG MAY contain a checksum, and MUST set the relevant flagBits when that field is included.
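
A rough sketch of assembling a minimal OP_MSG with a single Payload Type 0 section and no checksum, using PyMongo's bson package for document encoding; the helper name and the example command are illustrative:

import struct

import bson  # PyMongo's BSON implementation

OP_MSG_OPCODE = 2013

def build_op_msg(command, request_id, flag_bits=0):
    """Serialize one OP_MSG containing a single Payload Type 0 section."""
    body = bson.encode(command)                 # command document, including "$db"
    section = bytes([0]) + body                 # payloadType 0, then the document
    message_length = 16 + 4 + len(section)      # MsgHeader + flagBits + section
    header = struct.pack("<iiii", message_length, request_id, 0, OP_MSG_OPCODE)
    return header + struct.pack("<I", flag_bits) + section

msg = build_op_msg({"ping": 1, "$db": "admin"}, request_id=1)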

Field | Description
flagBits | Network level flags, such as signaling recipient that another message is incoming without any other actions in the meantime, and availability of message checksums
sections | An array of one or more sections
checksum | crc32c message checksum. When present, the appropriate flag MUST be set in the flagBits.

flagBits

flagBits contains a bit vector of specialized network flags. The low 16 bits declare what the current message contains, and what the expectations of the recipient are. The high 16 bits are designed to declare optional attributes of the current message and expectations of the recipient.

All unused bits MUST be set to 0.

Clients MUST error if any unsupported or undefined required bits are set to 1 and MUST ignore all undefined optional bits.

The currently defined flags are:

Bit | Name | Request | Response | Description
0 | checksumPresent | x | x | Checksum present
1 | moreToCome | x | x | Sender will send another message and is not prepared for overlapping messages
16 | exhaustAllowed | x | | Client is prepared for multiple replies (using the moreToCome bit) to this request
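
A small sketch of how a client might validate incoming flagBits, treating the low 16 bits as required and the high 16 bits as optional; the constant and function names are illustrative:

CHECKSUM_PRESENT = 1 << 0
MORE_TO_COME = 1 << 1
EXHAUST_ALLOWED = 1 << 16

KNOWN_REQUIRED_BITS = CHECKSUM_PRESENT | MORE_TO_COME  # required (low 16) bits this client understands

def validate_flag_bits(flag_bits):
    """Error on unknown required bits; ignore unknown optional bits."""
    unknown_required = flag_bits & 0xFFFF & ~KNOWN_REQUIRED_BITS
    if unknown_required:
        raise ValueError(f"unsupported required flagBits: {unknown_required:#x}")
    return bool(flag_bits & MORE_TO_COME)
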
checksumPresent

This is a reserved field for future support of crc32c checksums.

moreToCome

The OP_MSG message is essentially a request-response protocol, one message per turn. However, setting the moreToCome flag indicates to the recipient that the sender is not ready to give up its turn and will send another message.

moreToCome On Requests

When the moreToCome flag is set on a request it signals to the recipient that the sender does not want to know the outcome of the message. There is no response to a request where moreToCome has been set. Clients doing unacknowledged writes MUST set the moreToCome flag, and MUST set the writeConcern to w=0.

If, during the processing of a moreToCome flagged write request, a server discovers that it is no longer primary, then the server will close the connection. All other errors during processing will be silently dropped, and will not result in the connection being closed.

moreToCome On Responses

When the moreToCome flag is set on a response it signals to the recipient that the sender will send additional responses on the connection. The recipient MUST continue to read responses until it reads a response with the moreToCome flag not set, and MUST NOT send any more requests on this connection until it reads a response with the moreToCome flag not set. The client MUST either consume all messages with the moreToCome flag set or close the connection.

When the server sends responses with the moreToCome flag set, each of these responses will have a unique messageId, and the responseTo field of every follow-up response will be the messageId of the previous response.

The client MUST be prepared to receive a response without moreToCome set prior to completing iteration of a cursor, even if an earlier response for the same cursor had the moreToCome flag set. To continue iterating such a cursor, the client MUST issue an explicit getMore request.

exhaustAllowed

Setting this flag on a request indicates to the recipient that the sender is prepared to handle multiple replies (using the moreToCome bit) to this request. The server will never produce replies with the moreToCome bit set unless the request has the exhaustAllowed bit set.

Setting the exhaustAllowed bit on a request does not guarantee that the responses will have the moreToCome bit set.

MongoDB server only handles the exhaustAllowed bit on the following operations. A driver MUST NOT set the exhaustAllowed bit on other operations.

Operation | Minimum MongoDB Version
getMore | 4.2
hello (including legacy hello) | 4.4 (discoverable via topologyVersion)

sections

Each message contains one or more sections. A section is composed of an uint8 which determines the payload's type, and a separate payload field. The payload size for payload type 0 and 1 is determined by the first 4 bytes of the payload field (includes the 4 bytes holding the size but not the payload type).

Field | Description
type | A byte indicating the layout and semantics of payload
payload | The payload of a section can either be a single document, or a document sequence.

When the Payload Type is 0, the content of the payload is:

Field | Description
document | The BSON document. The payload size is inferred from the document's leading int32.

When the Payload Type is 1, the content of the payload is:

Field | Description
size | Payload size (includes this 4-byte field)
identifier | A unique identifier (for this message). Generally the name of the "command argument" it contains the value for
documents | 0 or more BSON documents. Each BSON document cannot be larger than maxBSONObjectSize.

Any unknown Payload Types MUST result in an error and the socket MUST be closed. There is no ordering implied by payload types. A section with payload type 1 can be serialized before payload type 0.

A fully constructed OP_MSG MUST contain exactly one Payload Type 0, and optionally any number of Payload Type 1 where each identifier MUST be unique per message.
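
A companion sketch for serializing a Payload Type 1 section, again using PyMongo's bson package; the helper name is illustrative:

import struct

import bson  # PyMongo's BSON implementation

def build_type1_section(identifier, documents):
    """Serialize a Payload Type 1 section: size, cstring identifier, then back-to-back documents."""
    ident = identifier.encode("utf-8") + b"\x00"
    docs = b"".join(bson.encode(d) for d in documents)
    size = 4 + len(ident) + len(docs)             # the size field counts itself
    return bytes([1]) + struct.pack("<i", size) + ident + docs

section = build_type1_section("documents", [{"_id": 1}, {"_id": 2}])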

Command Arguments As Payload

Certain commands support "pulling out" certain arguments to the command, and providing them as Payload Type 1, where the identifier is the command argument's name. Specifying a command argument as a separate payload removes the need to use a BSON Array. For example, Payload Type 1 allows an array of documents to be specified as a sequence of BSON documents on the wire without the overhead of array keys.

MongoDB 3.6 only allows certain command arguments to be provided this way. These are:

Command Name | Command Argument
insert | documents
update | updates
delete | deletes

Global Command Arguments

The new opcode contains no field for providing the database name. Instead, the protocol now has the concept of global command arguments. These global command arguments can be passed to all MongoDB commands alongside the rest of the command arguments.

Currently defined global arguments:

Argument Name | Default Value | Description
$db | | The database name to execute the command on. MUST be provided and be a valid database name.
$readPreference | { "mode": "primary" } | Determines server selection, and also whether a secondary server permits reads or responds "not writable primary". See Server Selection Spec for rules about when read preference must or must not be included, and for rules about when read preference "primaryPreferred" must be added automatically.

Additional global arguments are likely to be introduced in the future and defined in their own specs.

User originating commands

Drivers MUST NOT mutate user provided command documents in any way, whether it is adding required arguments, pulling out arguments, compressing it, adding supplemental APM data or any other modification.

Examples

Command Arguments As Payload Examples

For example, an insert can be represented like:

{
   "insert": "collectionName",
   "documents": [
      {"_id": "Document#1", "example": 1},
      {"_id": "Document#2", "example": 2},
      {"_id": "Document#3", "example": 3}
   ],
   "writeConcern": { w: "majority" }
}

Or, pulling out the "documents" argument out of the command document and Into Payload Type 1. The Payload Type 0 would then be:

{
   "insert": "collectionName",
   "$db": "databaseName",
   "writeConcern": { w: "majority" }
}

And Payload Type 1:

identifier: "documents"
documents: {"_id": "Document#1", "example": 1}{"_id": "Document#2", "example": 2}{"_id": "Document#3", "example": 3}

Note that the BSON documents are placed immediately after each other, not with any separator. The writeConcern is also left intact as a command argument in the Payload Type 0 section. The command name MUST continue to be the first key of the command arguments in the Payload Type 0 section.


An update can for example be represented like:

{
   "update": "collectionName",
   "updates": [
      {
         "q": {"example": 1},
         "u": { "$set": { "example": 4} }
      },
      {
         "q": {"example": 2},
         "u": { "$set": { "example": 5} }
      }
   ]
}

Or, pulling out the "update" argument out of the command document and Into Payload Type 1. The Payload Type 0 would then be:

{
   "update": "collectionName",
   "$db": "databaseName"
}

And Payload Type 1:

identifier: updates
documents: {"q": {"example": 1}, "u": { "$set": { "example": 4}}}{"q": {"example": 2}, "u": { "$set": { "example": 5}}}

Note that the BSON documents are placed immediately after each other, not with any separator.


A delete can for example be represented like:

{
   "delete": "collectionName",
   "deletes": [
      {
         "q": {"example": 3},
         "limit": 1
      },
      {
         "q": {"example": 4},
         "limit": 1
      }
   ]
}

Or, pulling out the "deletes" argument out of the command document and into Payload Type 1. The Payload Type 0 would then be:

{
   "delete": "collectionName",
   "$db": "databaseName"
}

And Payload Type 1:

identifier: deletes
documents: {"q": {"example": 3}, "limit": 1}{"q": {"example": 4}, "limit": 1}

Note that the BSON documents are placed immediately after each other, not with any separator.
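
Before moving on, here is a minimal sketch of how a Payload Type 1 section could be assembled from already-serialized BSON documents. It illustrates the layout only and is not any driver's implementation; the caller is assumed to have produced the BSON bytes elsewhere.

// Sketch of an OP_MSG Payload Type 1 section:
//   kind byte (1) | int32 size | cstring identifier | BSON documents back to back
// The size field counts itself, the identifier, and the documents, but not the kind byte.
function buildPayloadType1(identifier: string, encodedDocs: Uint8Array[]): Uint8Array {
  const identifierBytes = new TextEncoder().encode(identifier + "\0"); // cstring
  const docsLength = encodedDocs.reduce((total, doc) => total + doc.length, 0);
  const size = 4 + identifierBytes.length + docsLength;

  const section = new Uint8Array(1 + size);
  section[0] = 1; // payload type 1
  new DataView(section.buffer).setInt32(1, size, true); // little-endian, as everywhere in the wire protocol
  section.set(identifierBytes, 5);
  let offset = 5 + identifierBytes.length;
  for (const doc of encodedDocs) {
    section.set(doc, offset); // documents are placed immediately after each other
    offset += doc.length;
  }
  return section;
}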

Test Plan

  • Create a single document and insert it over OP_MSG, ensure it works
  • Create two documents and insert them over OP_MSG, ensure each document is pulled out and presented as a document sequence.
  • hello.maxWriteBatchSize might change and be bumped to 100,000
  • Repeat the previous 5 tests as updates, and then deletes.
  • Create one small document, and one large 16mb document. Ensure they are inserted, updated and deleted in one roundtrip.

Motivation For Change

MongoDB clients are currently required to work around various issues that each current opcode has, such as having to determine what sort of node is on the other end as it affects the actual structure of certain messages. MongoDB 3.6 introduces a new wire protocol opcode, OP_MSG, which aims to resolve most historical issues along with providing a future compatible and extendable opcode.

Backwards Compatibility

The hello.maxWriteBatchSize is being bumped, which also affects OP_QUERY, not only OP_MSG. As a side effect, write errors will now have the message truncated, instead of overflowing the maxMessageSize, if the server determines it would overflow the allowed size. This applies to all commands that write. The error documents are structurally the same, with the error messages simply replaced with empty strings.

Reference Implementations

  • mongoc
  • .net

Future Work

In the near future, this opcode is expected to be extended to include support for:

  • Message checksum (crc32c)
  • Output document sequences
  • moreToCome can also be used for other commands, such as killCursors, to restore OP_KILL_CURSORS behaviour, as currently any errors/replies are ignored.

Q & A

  • Has the maximum number of documents per batch changed?

    • The maximum number of documents per batch is dictated by the maxWriteBatchSize value returned during the MongoDB Handshake. It is likely this value will be bumped from 1,000 to 100,000.
  • Has the maximum size of the message changed?

    • No. The maximum message size is still the maxMessageSizeBytes value returned during the MongoDB Handshake.
  • Is everything still little-endian?

    • Yes. As with BSON, all MongoDB opcodes must be serialized in little-endian format.
  • How does fire-and-forget (w=0 / unacknowledged write) work over OP_MSG?

    • The client sets the moreToCome flag on the request. The server will not send a response to such requests.
    • Malformed operations or errors, such as duplicate key errors, are not discoverable and will be swallowed by the server.
    • Write errors due to not-primary will close the connection, which clients will pick up the next time they use the connection. This means at least one unacknowledged write operation will be lost, as the client does not discover the failover until the next time the socket is used.
  • Should we provide runMoreToComeCommand() helpers? Since the protocol allows any command to be tagged with moreToCome, effectively allowing any operation to become fire & forget, it might be a good idea to add such a helper, rather than adding wire protocol headers as options to the existing runCommand helpers.

Changelog

  • 2024-04-30: Convert from RestructuredText to Markdown.
  • 2022-10-05: Remove spec front matter.
  • 2022-01-13: Clarify that OP_MSG must be used when using stable API
  • 2021-12-16: Clarify that old drivers should default to OP_QUERY handshakes
  • 2021-04-20: Suggest using OP_MSG for initial handshake when using stable API
  • 2021-04-06: Updated to use hello and not writable primary
  • 2017-11-12: Specify read preferences for OP_MSG with direct connection
  • 2017-08-17: Added the User originating command section
  • 2017-07-18: Published initial version

Run Command

  • Status: Accepted
  • Minimum Server Version: N/A


Abstract

This specification defines requirements and behaviors for drivers' run command and related APIs.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 (https://www.ietf.org/rfc/rfc2119.txt).

Specification


Terms

Command: A structure representing a BSON document that has a shape matching a supported MongoDB operation.


Implementation requirements

All drivers MAY offer the operations defined in the following sections. This does not preclude a driver from offering more.


Deviations

Please refer to The CRUD specification's Guidance <../crud/crud.md#guidance>_ on how APIs may deviate between languages.

Cursor iterating APIs MAY be offered via language syntax or predefined iterable methods.


runCommand

The following represents how a runCommand API SHOULD be exposed.

.. code:: typescript

interface Database {
  /**
   * Takes an argument representing an arbitrary BSON document and executes it against the server.
   */
  runCommand(command: BSONDocument, options: RunCommandOptions): BSONDocument;
}

interface RunCommandOptions {
  /**
   * An optional readPreference setting to apply to server selection logic.
   * This value MUST be applied to the command document as the $readPreference global command argument if not set to primary.
   *
   * @defaultValue ReadPreference(mode: primary)
   *
   * @see ../server-selection/server-selection.md#read-preference
   */
  readPreference?: ReadPreference;

  /**
   * An optional explicit client session.
   * The associated logical session id (`lsid`) the driver MUST apply to the command.
   *
   * @see ../sessions/driver-sessions.md#clientsession
   */
  session?: ClientSession;

  /**
   * An optional timeout option to govern the amount of time that a single operation can execute before control is returned to the user.
   * This timeout applies to all of the work done to execute the operation, including but not limited to server selection, connection checkout, and server-side execution.
   *
   * @see https://github.com/mongodb/specifications/blob/master/source/client-side-operations-timeout/client-side-operations-timeout.rst
   */
  timeoutMS?: number;
}

RunCommand implementation details

RunCommand provides a way to access MongoDB server commands directly without requiring a driver to implement a bespoke helper. The API is intended to take a document from a user and apply a number of common driver internal concerns before forwarding the command to a server. A driver MUST NOT inspect the user's command; this includes checking for the fields a driver MUST attach to the command sent, as described below. Depending on a driver's BSON implementation this can result in these fields being overwritten or duplicated, so a driver SHOULD document that using these fields has undefined behavior. A driver MUST NOT modify the user's command; a clone SHOULD be created before the driver attaches any of the required fields to the command.

Drivers that have historically modified user input SHOULD strive to instead clone the input such that appended fields do not affect the user's input in their next major version.

OP_MSG

The $db global command argument MUST be set on the command sent to the server and it MUST equal the database name RunCommand was invoked on.

  • See OP_MSG's section on Global Command Arguments <../message/OP_MSG.md#global-command-arguments>_

ReadPreference

For the purposes of server selection RunCommand MUST assume all commands are read operations. To facilitate server selection the RunCommand operation MUST accept an optional readPreference option.

  • See Server Selection's section on Use of read preferences with commands <../server-selection/server-selection.md#use-of-read-preferences-with-commands>_

If the provided ReadPreference is NOT {mode: primary} and the selected server is NOT a standalone, the command sent MUST include the $readPreference global command argument.

  • See OP_MSG's section on Global Command Arguments <../message/OP_MSG.md#global-command-arguments>_
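
Putting the $db and $readPreference rules together, a driver's internal assembly step might look like the following sketch. The helper name, the shallow copy, and the isStandalone flag are illustrative assumptions, not any driver's API.

type BSONDocument = Record<string, unknown>;

// Sketch only: clone the user's command, then attach the required global arguments.
// The user's document is never inspected or mutated.
function assembleWireCommand(
  userCommand: BSONDocument,
  dbName: string,
  readPreference: { mode: string; [key: string]: unknown },
  isStandalone: boolean
): BSONDocument {
  const wireCommand: BSONDocument = { ...userCommand }; // command name stays the first key
  wireCommand.$db = dbName; // always required
  if (readPreference.mode !== "primary" && !isStandalone) {
    wireCommand.$readPreference = readPreference; // only for non-primary, non-standalone
  }
  return wireCommand;
}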

Driver Sessions

A driver's RunCommand MUST provide an optional session option to support explicit sessions and transactions. If a session is not provided the driver MUST attach an implicit session if the connection supports sessions. Drivers MUST NOT attempt to check the command document for the presence of an lsid.

Every ClientSession has a corresponding logical session ID representing the server-side session ID. The logical session ID MUST be included under lsid in the command sent to the server without modifying user input.

  • See Driver Sessions' section on Sending the session ID to the server on all commands <../sessions/driver-sessions.md#sending-the-session-id-to-the-server-on-all-commands>_

The command sent to the server MUST gossip the $clusterTime if cluster time support is detected.

  • See Driver Sessions' section on Gossipping the cluster time <../sessions/driver-sessions.md#gossipping-the-cluster-time>_

Transactions

If RunCommand is used within a transaction the read preference MUST be sourced from the transaction's options. The command sent to the server MUST include the transaction-specific fields, summarized as follows (a sketch follows this list):

  • If runCommand is executing within a transaction:

    • autocommit - The autocommit flag MUST be set to false.
    • txnNumber - MUST be set.
  • If runCommand is the first operation of the transaction:

    • startTransaction - MUST be set to true.
    • readConcern - MUST be set to the transaction's read concern if it is NOT the default.
  • See Generic RunCommand helper within a transaction <../transactions/transactions.md#generic-runcommand-helper-within-a-transaction>_ in the Transactions specification.
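
The sketch below shows those fields being attached. The TxnState shape and helper name are assumptions made for illustration; how a driver tracks transaction state is implementation specific.

// Sketch: transaction fields runCommand would attach. Shapes are illustrative only.
interface TxnState {
  txnNumber: number;                     // the transaction number for this session (int64 in BSON)
  isFirstOperation: boolean;             // true if no operation has run in the transaction yet
  readConcern?: Record<string, unknown>; // the transaction's read concern, if not the default
}

function applyTransactionFields(command: Record<string, unknown>, txn: TxnState): void {
  command.autocommit = false;            // always false inside a transaction
  command.txnNumber = txn.txnNumber;
  if (txn.isFirstOperation) {
    command.startTransaction = true;
    if (txn.readConcern !== undefined) {
      command.readConcern = txn.readConcern;
    }
  }
}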

ReadConcern and WriteConcern

RunCommand MUST NOT support read concern and write concern options. Drivers MUST NOT attempt to check the command document for the presence of a readConcern and writeConcern field.

Additionally, unless executing within a transaction, RunCommand MUST NOT set the readConcern or writeConcern fields in the command document. For example, default values MUST NOT be inherited from client, database, or collection options.

If the user-provided command document already includes readConcern or writeConcern fields, the values MUST be left as-is.

  • See Read Concern's section on Generic Command Method <https://github.com/mongodb/specifications/blob/master/source/read-write-concern/read-write-concern.rst#generic-command-method>__
  • See Write Concern's section on Generic Command Method <https://github.com/mongodb/specifications/blob/master/source/read-write-concern/read-write-concern.rst#generic-command-method-1>__

Retryability

All commands executed via RunCommand are non-retryable operations. Drivers MUST NOT inspect the command to determine if it is a write and MUST NOT attach a txnNumber.

  • See Retryable Reads' section on Unsupported Read Operations <../retryable-reads/retryable-reads.md#unsupported-read-operations>_
  • See Retryable Writes' section on Behavioral Changes for Write Commands <../retryable-writes/retryable-writes.md#behavioral-changes-for-write-commands>_

Stable API

The command sent MUST attach stable API fields as configured on the MongoClient.

  • See Stable API's section on Generic Command Helper Behaviour <https://github.com/mongodb/specifications/blob/master/source/versioned-api/versioned-api.rst#generic-command-helper>_

Client Side Operations Timeout

RunCommand MUST provide an optional timeoutMS option to support client side operations timeout. Drivers MUST NOT attempt to check the command document for the presence of a maxTimeMS field. Drivers MUST document the behavior of RunCommand if a maxTimeMS field is already set on the command (such as overwriting the command field).

  • See Client Side Operations Timeout's section on runCommand <../client-side-operations-timeout/client-side-operations-timeout.md#runcommand>_
  • See Client Side Operations Timeout's section on runCommand behavior <../client-side-operations-timeout/client-side-operations-timeout.md#runcommand-behavior>_

runCursorCommand

Drivers MAY expose a runCursorCommand API with the following syntax.

.. code:: typescript

interface Database {
  /**
   * Takes an argument representing an arbitrary BSON document and executes it against the server.
   */
  runCursorCommand(command: BSONDocument, options: RunCursorCommandOptions): RunCommandCursor;
}

interface RunCursorCommandOptions extends RunCommandOptions {
  /**
   * This option is an enum with possible values CURSOR_LIFETIME and ITERATION.
   * For operations that create cursors, timeoutMS can either cap the lifetime of the cursor or be applied separately to the original operation and all subsequent calls.
   * To support both of these use cases, these operations MUST support a timeoutMode option.
   *
   * @defaultValue CURSOR_LIFETIME
   *
   * @see https://github.com/mongodb/specifications/blob/master/source/client-side-operations-timeout/client-side-operations-timeout.rst
   */
  timeoutMode?: ITERATION | CURSOR_LIFETIME;

  /**
   * See the `cursorType` enum defined in the crud specification.
   * @see https://github.com/mongodb/specifications/blob/master/source/crud/crud.md#read
   *
   * Identifies the type of cursor this is for client side operations timeout to properly apply timeoutMode settings.
   *
   * A tailable cursor can receive empty `nextBatch` arrays in `getMore` responses.
   * However, subsequent `getMore` operations may return documents if new data has become available.
   *
   * A tailableAwait cursor is an enhancement where instead of dealing with empty responses the server will block until data becomes available.
   *
   * @defaultValue NON_TAILABLE
   */
  cursorType?: CursorType;
}

/**
 * The following are the configurations a driver MUST provide to control how getMores are constructed.
 * How the options are controlled should be idiomatic to the driver's language.
 * See Executing ``getMore`` Commands.
 */
interface RunCursorCommandGetMoreOptions {
  /** Any positive integer is permitted. */
  batchSize?: int;
  /** Any non-negative integer is permitted. */
  maxTimeMS?: int;
  comment?: BSONValue;
}

RunCursorCommand implementation details

RunCursorCommand provides a way to access MongoDB server commands that return a cursor directly without requiring a driver to implement a bespoke cursor implementation. The API is intended to be built upon RunCommand and take a document from a user and apply a number of common driver internal concerns before forwarding the command to a server. A driver can expect that the result from running this command will return a document with a cursor field and MUST provide the caller with a language native abstraction to continue iterating the results from the server. If the response from the server does not include a cursor field the driver MUST throw an error either before returning from runCursorCommand or upon first iteration of the cursor.

High-level RunCursorCommand steps (a sketch follows this list):

  • Run the cursor creating command provided by the caller and retain the ClientSession used as well as the server the command was executed on.
  • Create a local cursor instance and store the firstBatch, ns, and id from the response.
  • When the current batch has been fully iterated, execute a getMore using the same server the initial command was executed on.
  • Store the nextBatch from the getMore response and update the cursor's id.
  • Continue to execute getMore commands as needed when the caller empties local batches until the cursor is exhausted or closed (i.e. id is zero).
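
As a rough illustration of those steps, the sketch below drives the getMore loop in terms of a generic runCommand callback. The response shapes (cursor.id, cursor.ns, firstBatch, nextBatch) follow the server's cursor contract; everything else, including the helper and parameter names, is assumed for the example.

// Sketch: iterating a command cursor. Session pinning, server pinning, batchSize,
// maxTimeMS, and error handling are omitted for brevity.
async function* iterateCommandCursor(
  runCommand: (db: string, cmd: Record<string, unknown>) => Promise<any>,
  db: string,
  initialCommand: Record<string, unknown>
): AsyncGenerator<unknown> {
  const first = await runCommand(db, initialCommand);
  if (first.cursor == null) throw new Error("Command did not return a cursor");

  const ns: string = first.cursor.ns;                 // "<db>.<collection>"
  const collection = ns.slice(ns.indexOf(".") + 1);
  let cursorId: number = first.cursor.id;             // int64 in BSON; number used here for brevity
  yield* first.cursor.firstBatch;

  while (cursorId !== 0) {
    // getMore MUST reuse the same server (and lsid) as the initial command.
    const more = await runCommand(db, { getMore: cursorId, collection });
    cursorId = more.cursor.id;
    yield* more.cursor.nextBatch;
  }
}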

Driver Sessions

A driver MUST create an implicit ClientSession if none is provided and it MUST be attached for the duration of the cursor's lifetime. All getMore commands constructed for this cursor MUST send the same lsid used on the initial command. A cursor is considered exhausted or closed when the server reports its id as zero. When the cursor is exhausted the client session MUST be ended and the server session returned to the pool as early as possible rather than waiting for a caller to completely iterate the final batch.

  • See Drivers Sessions' section on Sessions and Cursors <../sessions/driver-sessions.md#sessions-and-cursors>_

Server Selection

RunCursorCommand MUST support a readPreference option that MUST be used to determine server selection. The selected server MUST be used for subsequent getMore commands.

Load Balancers

When in loadBalanced mode, a driver MUST pin the connection used to execute the initial operation, and reuse it for subsequent getMore operations.

  • See Load Balancer's section on Behaviour With Cursors <../load-balancers/load-balancers.md#behaviour-with-cursors>_

Iterating the Cursor

Drivers MUST provide an API, typically a method named next(), that returns one document per invocation. If the cursor's batch is empty and the cursor id is nonzero, the driver MUST perform a getMore operation.

Executing getMore Commands

The cursor API returned to the caller MUST offer an API to configure batchSize, maxTimeMS, and comment options that are sent on subsequent getMore commands. If it is idiomatic for a driver to allow setting these options in RunCursorCommandOptions, the driver MUST document that the options only pertain to getMore commands. A driver MAY permit users to change getMore field settings at any time during the cursor's lifetime and subsequent getMore commands MUST be constructed with the changes to those fields. If that API is offered drivers MUST write tests asserting getMore commands are constructed with any updated fields.

  • See Find, getMore and killCursors commands' section on GetMore <https://github.com/mongodb/specifications/blob/master/source/find_getmore_killcursors_commands.rst#getmore>_

Tailable and TailableAwait

  • See first: Find, getMore and killCursors commands's section on Tailable cursors <https://github.com/mongodb/specifications/blob/master/source/find_getmore_killcursors_commands.rst#tailable-cursors>_

It is the responsibility of the caller to construct their initial command with awaitData and tailable flags as well as inform RunCursorCommand of the cursorType that should be constructed. Requesting a cursorType that does not align with the fields sent to the server on the initial command SHOULD be documented as undefined behavior.

Resource Cleanup

Drivers MUST provide an explicit mechanism for releasing the cursor resources, typically a .close() method. If the cursor id is nonzero a KillCursors operation MUST be attempted, the result of the operation SHOULD be ignored. The ClientSession associated with the cursor MUST be ended and the ServerSession returned to the pool.

  • See Driver Sessions' section on When sending a killCursors command <../sessions/driver-sessions.md#when-sending-a-killcursors-command>_
  • See Find, getMore and killCursors commands' section on killCursors <https://github.com/mongodb/specifications/blob/master/source/find_getmore_killcursors_commands.rst#killcursors>_
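
A minimal cleanup sketch, again in terms of a generic runCommand callback; the cursor shape and the endSession call are assumptions made for illustration:

// Sketch: explicit cursor cleanup. The killCursors result is ignored; the session is
// always ended so the server session can return to the pool.
async function closeCursor(
  runCommand: (db: string, cmd: Record<string, unknown>) => Promise<unknown>,
  cursor: { id: number; db: string; collection: string; session: { endSession(): void } }
): Promise<void> {
  try {
    if (cursor.id !== 0) {
      await runCommand(cursor.db, { killCursors: cursor.collection, cursors: [cursor.id] });
    }
  } catch {
    // The result of the killCursors attempt SHOULD be ignored.
  } finally {
    cursor.session.endSession();
  }
}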

Client Side Operations Timeout

RunCursorCommand MUST provide an optional timeoutMS option to support client side operations timeout. Drivers MUST NOT attempt to check the command document for the presence of a maxTimeMS field. Drivers MUST document the behavior of RunCursorCommand if a maxTimeMS field is already set on the command. Drivers SHOULD raise an error if both timeoutMS and the getMore-specific maxTimeMS option are specified (see: Executing getMore Commands_). Drivers MUST document that attempting to set both options can have undefined behavior and is not supported.

When timeoutMS and timeoutMode are provided the driver MUST support timeout functionality as described in the CSOT specification.

  • See Client Side Operations Timeout's section on Cursors <../client-side-operations-timeout/client-side-operations-timeout.md#cursors>_

Changelog

  • 2023-05-10: Add runCursorCommand API specification.
  • 2023-05-08: $readPreference is not sent to standalone servers.
  • 2023-04-20: Add run command specification.

Connection String Spec

  • Status: Accepted
  • Minimum Server Version: N/A

Abstract

The purpose of the Connection String is to provide a machine readable way of configuring a MongoClient, allowing users to configure and change the connection to their MongoDB system without requiring any application code changes.

This specification defines how the connection string is constructed and parsed. The aim is not to list all of connection string options and their semantics. Rather it defines the syntax of the connection string, including rules for parsing, naming conventions for options, and standard data types.

It should be noted that while the connection string specification is inspired by the URI specification as described in RFC 3986 and uses similar terminology, it does not conform to that specification.

Definitions

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

General Syntax

In general we follow URI style conventions, however unlike a URI the connection string supports multiple hosts.

mongodb://username:password@example.com:27017,example2.com:27017,...,example.comN:27017/database?key=value&keyN=valueN
\_____/   \_______________/ \_________/ \__/  \_______________________________________/ \______/ \_/ \___/
  |             |             |           |                    |                          |       |    |
Scheme          |            Host        Port        Alternative host identifiers         |      Key Value
             Userinfo       \_____________/                                               |      \_______/
                                   |                                              Auth database      |
                              Host Identifier                                                    Key Value Pair
                             \_______________________________________________________/          \___________________/
                                                      |                                                   |
                                                 Host Information                                  Connection Options

Scheme

The scheme mongodb represents that this is a connection string for a MongoClient.

Other schemes are also possible and are introduced through additional specifications. These additional schemes build on top of the connection string as documented in this specification.

For example the mongodb+srv specification, introduced with Initial DNS Seedlist Discovery, obtains information from DNS in addition to just the connection string.

Userinfo (optional)

The user information, if present, is followed by a commercial at-sign ("@") that delimits it from the host.

A password may be supplied as part of the user information and is anything after the first colon (":") up until the end of the user information.

RFC 3986 has guidance for encoding user information in Section 2.1 ("Percent-Encoding"), Section 2.2 ("Reserved Characters"), and Section 3.2.1 ("User Information").

Specifically, Section 3.2.1 provides for the following allowed characters:

userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )

If the user information contains an at-sign ("@"), more than one colon (":"), or a percent-sign ("%") that does not match the rules for "pct-encoded", then an exception MUST be thrown informing the user that the username and password must be URL encoded.

Above and beyond that restriction, drivers SHOULD require connection string user information to follow the "userinfo" encoding rules of RFC 3986 and SHOULD throw an exception if disallowed characters are detected. However, for backwards-compatibility reasons, drivers MAY allow reserved characters other than "@" and ":" to be present in user information without percent-encoding.
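
As an illustration of those rules, a standalone validator might look like the sketch below; it performs only the checks described above and leaves splitting and percent-decoding to later steps.

// Sketch: reject userinfo that is not URL encoded as required.
function validateUserinfo(userinfo: string): void {
  const colonCount = (userinfo.match(/:/g) ?? []).length;
  const badPercent = /%(?![0-9A-Fa-f]{2})/.test(userinfo); // "%" not matching pct-encoded
  if (userinfo.includes("@") || colonCount > 1 || badPercent) {
    throw new Error("Username and password must be URL encoded in the connection string");
  }
}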

Host Information

Unlike a standard URI, the connection string allows for identifying multiple hosts. The host information section of the connection string MAY be delimited by the trailing slash ("/") or end of string.

The host information must contain at least one host identifier but may contain more (see the alternative hosts / ports in the general syntax diagram above). Multiple host identifiers are delimited by a comma (",").

Host Identifier

A host identifier consists of a host and an optional port.

Host

Identifies a server address to connect to. It can identify either a hostname, IP address, IP Literal, or UNIX domain socket. For definitions of hostname, IP address and IP Literal formats see RFC 3986 Section 3.2.2.

UNIX domain sockets MUST end in ".sock" and MUST be URL encoded, for example:

mongodb://user:pass@%2Ftmp%2Fmongodb-27017.sock/authDB?replicaSet=rs

The host information cannot contain an unescaped slash ("/"), if it does then an exception MUST be thrown informing users that paths must be URL encoded. For example:

Unsupported host '/tmp/mongodb-27017.sock', UNIX socket domain paths must be URL encoded.

Support for UNIX domain sockets and IP Literals is OPTIONAL.

Unsupported host types MUST throw an exception informing the user they are not supported.

This specification does not define how host types should be differentiated (e.g. determining if a parsed host string is a socket path or hostname). It is merely concerned with extracting the host identifiers from the URI.

Port (optional)

The port is an integer between 1 and 65535 (inclusive) that identifies the port to connect to. See RFC 3986 3.2.3.

Auth Database (optional)

The database to authenticate against. If provided it is everything after the Host Information (ending with "/") and up to the first question mark ("?") or end of string. The auth database MUST be URL decoded by the parser.

The following characters MUST NOT appear in the database name, once it has been decoded: slash ("/"), backslash ("\"), space (" "), double-quote ('"'), or dollar sign ("$"). The MongoDB Manual says that period (".") is also prohibited, but drivers MAY allow periods in order to express a namespace (database and collection name, perhaps containing multiple periods) in this part of the URL.

The presence of the auth database component without other credential data such as Userinfo or authentication parameters in connection options MUST NOT be interpreted as a request for authentication.

Connection Options (optional)

Any extra options to configure the MongoClient connection can be specified in the connection options part of the connection string. If provided, it is everything after the Host Information (ending with "/"), optional auth database, and first question mark ("?") to the end of the string. Connection Options consist of an ordered list of Key Value Pairs that are delimited by an ampersand ("&"). A delimiter of a semi colon (";") MAY also be supported for connection options for legacy reasons.

Key Value Pair

A key value pair represents the option key and its associated value. The key is everything up to the first equals sign ("=") and the value is everything afterwards. Key values contain the following information:

  • Key:
    The connection option's key string. Keys should be normalised and character case should be ignored.

  • Value: (optional)
    The value if provided otherwise it defaults to an empty string.

Defining connection options

Connection option key values MUST be defined in the relevant specification that describes the usage of the key and value. The value data type MUST also be defined there. The value's default value SHOULD also be defined if it is relevant.

Keys

Keys are strings and the character case must be normalized by lower casing the uppercase ASCII characters A through Z; other characters are left as-is.

When defining and documenting keys, specifications should follow the camelCase naming convention with the first letter in lowercase; snake_case MUST NOT be used.

Keys that aren't supported by a driver MUST be ignored. A WARN level logging message MUST be issued for unsupported keys. For example:

Unsupported option 'connectMS'.

Keys should be descriptive and follow existing conventions:

Time based keys

If a key represents a unit of time it MUST end with that unit of time.

Key authors SHOULD follow the existing convention of defaulting to using milliseconds as the unit of time (e.g. connectionTimeoutMS).

Values

The values in connection options MUST be URL decoded by the parser. The values can represent the following data types:

  • Strings: The value

  • Integer: The value parsed as an integer. If the value is the empty string, the key MUST be ignored.

  • Boolean: "true" and "false" strings MUST be supported. If the value is the empty string, the key MUST be ignored.

    • For legacy reasons it is RECOMMENDED that alternative values for true and false be supported:
      • true: "1", "yes", "y" and "t"
      • false: "0", "-1", "no", "n" and "f".

    Alternative values are deprecated and MUST be removed from documentation and examples.

    If any of these alternative values are used, drivers MUST log a deprecation notice or issue a logging message at the WARNING level (as appropriate for your language). For example:

    Deprecated boolean value for "journal" : "1", please update to "journal=true"
    
  • Lists: Repeated keys represent a list in the Connection String consisting of the corresponding values in the same order as they appear in the Connection String. For example:

    ?readPreferenceTags=dc:ny,rack:1&readPreferenceTags=dc:ny&readPreferenceTags=
    
  • Key value pairs: A value that represents one or more key and value pairs. Multiple key value pairs are delimited by a comma (","). The key is everything up to the first colon sign (":") and the value is everything afterwards.

    For example:

    ?readPreferenceTags=dc:ny,rack:1
    

    Drivers MUST handle unencoded colon signs (":") within the value. For example, given the connection string:

    ?authMechanismProperties=TOKEN_RESOURCE:mongodb://foo
    

    the driver MUST interpret the key as TOKEN_RESOURCE and the value as mongodb://foo (a parsing sketch follows this list).

    For any option key-value pair that may contain a comma (such as TOKEN_RESOURCE), drivers MUST document that: a value containing a comma (",") MUST NOT be provided as part of the connection string. This prevents use of values that would interfere with parsing.
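
A minimal sketch of that parsing, assuming the overall option value has already been extracted and URL decoded:

// Sketch: parsing a key-value-pair option value such as authMechanismProperties.
// Pairs are comma-delimited; within a pair, only the first ":" separates key from value,
// so unencoded colons in the value (e.g. "TOKEN_RESOURCE:mongodb://foo") survive intact.
function parseKeyValuePairs(value: string): Map<string, string> {
  const pairs = new Map<string, string>();
  for (const pair of value.split(",")) {
    const separator = pair.indexOf(":");
    if (separator === -1) continue; // malformed pair; a real driver would warn and ignore
    pairs.set(pair.slice(0, separator), pair.slice(separator + 1));
  }
  return pairs;
}

// parseKeyValuePairs("TOKEN_RESOURCE:mongodb://foo")
//   => Map { "TOKEN_RESOURCE" => "mongodb://foo" }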

Any invalid Values for a given key MUST be ignored and MUST log a WARN level message. For example:

Unsupported value for "fsync" : "ifPossible"

Repeated Keys

If a key is repeated and the corresponding data type is not a List then the precedence of which key value pair will be used is undefined except where defined otherwise by the URI options spec.

Where possible, a warning SHOULD be raised to inform the user that multiple options were found for the same value.

Deprecated Key Value Pairs

If a key name was deprecated due to renaming it MUST still be supported. Users aren't expected to be vigilant on changes to key names.

If the renamed key is also defined in the connection string the deprecated key MUST NOT be applied and a WARN level message MUST be logged. For example:

Deprecated key "wtimeout" present and ignored as found replacement "wtimeoutms" value.

Deprecated keys MUST log a WARN level message informing the user that the option is deprecated and supply the alternative key name. For example:

Deprecated key "wtimeout" has been replaced with "wtimeoutms"

Legacy support

Semi colon (";") query parameter delimiters and alternative string representations of Boolean values MAY be supported only for legacy reasons.

As these options are not standard they might not be supported across all drivers. As such, these alternatives MUST NOT be used as general examples or documentation.

Language specific connection options

Connection strings are a mechanism to configure a MongoClient outside the user's application. As each driver may have language specific configuration options, those options SHOULD also be supported via the connection string. Where suitable, specifications MUST be updated to reflect new options.

Keys MUST follow existing connection option naming conventions as defined above. Values MUST also follow the existing, specific data types.

Any options that are not supported MUST raise a WARN log level as described in the keys section.

Connection options precedence

If a driver allows URI options to be specified outside of the connection string (e.g. dictionary parameter to the MongoClient constructor) it MUST document the precedence rules between all such mechanisms. For instance, a driver MAY allow a value for option foo in a dictionary parameter to override the value of foo in the connection string (or vice versa) so long as that behavior is documented.

Test Plan

See the README for tests.

Motivation for Change

The motivation for this specification is to publish how connection strings are formed and how they should be parsed. This is important because although the connection string follows the terminology of a standard URI format (as described in RFC 3986) it is not a standard URI and cannot be parsed by standard URI parsers.

The specification also formalizes the standard practice for the definition of new connection options and where the responsibility for their definition should be.

Design Rationale

The rationale for the Connection String is to provide a consistent, driver independent way to define the connection to a MongoDB system outside of the application. The connection string is an existing standard and is already widely used.

Backwards Compatibility

Connection Strings are already generally supported across languages and driver implementations. As the responsibility for the definitions of connections options relies on the specifications defining them, there should be no backwards compatibility breaks caused by this specification with regards to options.

Connection options precedence may cause some backwards incompatibilities as existing driver behaviour differs here. As such, it is currently only a recommendation.

Reference Implementation

The Java driver implements a ConnectionString class for the parsing of the connection string; however, it does not support UNIX domain sockets. The Python driver's uri_parser module implements connection string parsing for both hosts and UNIX domain sockets.

The following example parses a connection string into its components and can be used as a guide; a code sketch of the same steps follows the list.

Given the string mongodb://foo:bar%3A@mongodb.example.com,%2Ftmp%2Fmongodb-27018.sock/admin?w=1:

  1. Validate and remove the scheme prefix mongodb://, leaving: foo:bar%3A@mongodb.example.com,%2Ftmp%2Fmongodb-27018.sock/admin?w=1
  2. Split the string by the first, unescaped / (if any), yielding:
    1. User information and host identifiers: foo:bar%3A@mongodb.example.com,%2Ftmp%2Fmongodb-27018.sock.
    2. Auth database and connection options: admin?w=1.
  3. Split the user information and host identifiers string by the last, unescaped @, yielding:
    1. User information: foo:bar%3A.
    2. Host identifiers: mongodb.example.com,%2Ftmp%2Fmongodb-27018.sock.
  4. Validate, split (if applicable), and URL decode the user information. In this example, the username and password would be foo and bar:, respectively.
  5. Validate, split, and URL decode the host identifiers. In this example, the hosts would be ["mongodb.example.com", "/tmp/mongodb-27018.sock"].
  6. Split the auth database and connection options string by the first, unescaped ?, yielding:
    1. Auth database: admin.
    2. Connection options: w=1.
  7. URL decode the auth database. In this example, the auth database is admin.
  8. Validate that the database contains no prohibited characters.
  9. Validate, split, and URL decode the connection options. In this example, the connection options are {w: 1}.
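
A compact sketch of the same steps (validation, host-type differentiation, and option typing are omitted):

// Sketch: splitting a "mongodb://" connection string into its components.
function parseConnectionString(uri: string) {
  if (!uri.startsWith("mongodb://")) throw new Error("Invalid scheme");
  const rest = uri.slice("mongodb://".length);

  // Split host information from the auth database / options at the first unescaped "/".
  const slash = rest.indexOf("/");
  const hostPart = slash === -1 ? rest : rest.slice(0, slash);
  const tail = slash === -1 ? "" : rest.slice(slash + 1);

  // Split userinfo from host identifiers at the last "@".
  const at = hostPart.lastIndexOf("@");
  const userinfo = at === -1 ? null : hostPart.slice(0, at);
  const hosts = hostPart.slice(at + 1).split(",").map(decodeURIComponent);

  const [username, password] = userinfo
    ? userinfo.split(/:(.*)/s).slice(0, 2).map(decodeURIComponent)
    : [null, null];

  // Split the auth database from connection options at the first "?".
  const question = tail.indexOf("?");
  const authDatabase = decodeURIComponent(question === -1 ? tail : tail.slice(0, question));
  const options = question === -1 ? "" : tail.slice(question + 1);

  return { username, password, hosts, authDatabase, options };
}

// parseConnectionString("mongodb://foo:bar%3A@mongodb.example.com,%2Ftmp%2Fmongodb-27018.sock/admin?w=1")
// => { username: "foo", password: "bar:", hosts: ["mongodb.example.com", "/tmp/mongodb-27018.sock"],
//      authDatabase: "admin", options: "w=1" }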

Q&A

Q: What about existing Connection Options that aren't currently defined in a specification?
Ideally all MongoClient options would already belong in their relevant specifications. As we iterate and produce more specifications these options should be covered.

Q: Why is it recommended that Connection Options take precedence over application-set options?
This is only a recommendation, but the reasoning is that application code is much harder to change across deployments. By making the Connection String take precedence from outside the application, it is easier for the application to be portable across environments. The order of precedence of MongoClient hosts and options is recommended to be from low to high:

  1. Default values
  2. MongoClient hosts and options
  3. Connection String hosts and options

Q: Why WARN level warning on unknown options rather than throwing an exception?
It is responsible to inform users of possible misconfigurations, and both methods achieve that. However, there are conflicting requirements of a Connection String. One goal is that any given driver should be configurable by a connection string, but different drivers and languages have different feature sets. Another goal is that Connection Strings should be portable, and as such some options supported by language X might not be relevant to language Y. Any given driver does not know if an option is specific to a different driver, is misspelled, or is just not supported. So the only way to stay portable and support configuration of all options is to not throw an exception but rather log a warning.

Q: How long should deprecation options be supported?
This is not declared in this specification. It's not deemed responsible to give a single timeline for how long deprecated options should be supported. As such any specifications that deprecate options that do have the context of the decision should provide the timeline.

Q: Why can I not use a standard URI parser?
The connection string format does not follow the standard URI format (as described in RFC 3986); we differ in two key areas:

  1. Hosts
    The connection string allows for multiple hosts for high availability reasons, but standard URIs only ever define a single host.

  2. Query Parameters / Connection Options
    The connection string provides a concrete definition of how the Connection Options are parsed, including definitions of different data types. RFC 3986 only defines that they are key=value pairs and gives no instruction on parsing. In fact, different languages handle the parsing of query parameters in different ways, and as such there is no such thing as a standard URI parser.

Q: Can the connection string contain non-ASCII characters?
The connection string can contain non-ASCII characters. The connection string is text, which can be encoded in any way appropriate for the application (e.g. the C Driver requires you to pass it a UTF-8 encoded connection string).

Q: Why does reference implementation check for a .sock suffix when parsing a socket path and possible auth database?
To simplify parsing of a socket path followed by an auth database, we rely on MongoDB's naming restrictions, which do not allow database names to contain a dot character, and the fact that socket paths must end with .sock. This allows us to differentiate the last part of a socket path from a database name. While we could immediately rule out an auth database on the basis of the dot alone, this specification is primarily concerned with breaking down the components of a URI (e.g. hosts, auth database, options) in a deterministic manner, rather than applying strict validation to those parts (e.g. host types, database names, allowed values for an option). Additionally, some drivers might allow a namespace (e.g. "db.collection") for the auth database part, so we do not want to be more strict than is necessary for parsing.

Q: Why throw an exception if the userinfo contains a percent sign ("%"), at-sign ("@"), or more than one colon (":")?
This is done to help users format the connection string correctly. Although at-signs ("@") or colons (":") in the username must be URL encoded, users may not be aware of that requirement. Take the following example:

mongodb://anne:bob:pass@localhost:27017

Is the username anne and the password bob:pass, or is the username anne:bob and the password pass? Accepting this as the userinfo could cause authentication to fail, causing confusion for the user as to why. Allowing unescaped at-signs and percent symbols would invite further ambiguity. By throwing an exception, users are made aware and can then update the connection string to be explicit about what forms the username and password.

Q: Why must UNIX domain sockets be URL encoded?
This has been done to reduce ambiguity between the socket name and the database name. Take the following example:

mongodb:///tmp/mongodb.sock/mongodb.sock

Is the host /tmp/mongodb.sock and the auth database mongodb.sock or does the connection string just contain the host /tmp/mongodb.sock/mongodb.sock and no auth database? By enforcing URL encoding on UNIX domain sockets it makes users be explicit about the host and the auth database. By requiring an exception to be thrown when the host contains a slash ("/") users can be informed on how to migrate their connection strings.

Q: Why must the auth database be URL decoded by the parser?
On Linux systems database names can contain a question mark ("?"); in these rare cases the auth database must be URL encoded. This disambiguates between the auth database and the connection options. Take the following example:

mongodb://localhost/admin%3F?w=1

In this case the auth database would be admin? and the connection options w=1.

Q: How should the space character be encoded in a connection string?
Space characters SHOULD be encoded as %20 rather than +, this will be portable across all implementations. Implementations MAY support decoding + into a space, as many languages treat strings as x-www-form-urlencoded data by default.

Changelog

  • 2024-05-29: Clarify handling of key-value pairs and add specification test.

  • 2024-02-15: Migrated from reStructuredText to Markdown.

  • 2016-07-22: In Port section, clarify that zero is not an acceptable port.

  • 2017-01-09: In Userinfo section, clarify that percent signs must be encoded.

  • 2017-06-10: In Userinfo section, require username and password to be fully URI
    encoded, not just "%", "@", and ":". In Auth Database, list the prohibited characters. In Reference Implementation, split at the first "/", not the last.

  • 2018-01-09: Clarified that space characters should be encoded to %20.

  • 2018-06-04: Revised Userinfo section to provide an explicit list of allowed
    characters and clarify rules for exceptions.

  • 2019-02-04: In Repeated Keys section, clarified that the URI options spec may
    override the repeated key behavior described here for certain options.

  • 2019-03-04: Require drivers to document option precedence rules

  • 2019-04-26: Database name in URI alone does not trigger authentication

  • 2020-01-21: Clarified how empty values in a connection string are parsed.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2022-12-27: Note that host information ends with a "/" character in connection
    options description.

  • 2023-08-02: Make delimiting slash between host information and connection options
    optional and update tests

URI Options Specification

  • Status: Accepted

  • Minimum Server Version: N/A

Abstract

Historically, URI options have been defined in individual specs, and drivers have defined any additional options independently of one another. Because of the frustration due to there not being a single place where all of the URI options are defined, this spec aims to do just that—namely, provide a canonical list of URI options that each driver defines.

THIS SPEC DOES NOT REQUIRE DRIVERS TO MAKE ANY BREAKING CHANGES.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

Conflicting TLS options

Per the Connection String spec, the behavior of duplicates of most URI options is undefined. However, due to the security implications of certain options, drivers MUST raise an error to the user during parsing if any of the following circumstances occur (a validation sketch follows this list):

  1. Both tlsInsecure and tlsAllowInvalidCertificates appear in the URI options.
  2. Both tlsInsecure and tlsAllowInvalidHostnames appear in the URI options.
  3. Both tlsInsecure and tlsDisableOCSPEndpointCheck appear in the URI options.
  4. Both tlsInsecure and tlsDisableCertificateRevocationCheck appear in the URI options.
  5. Both tlsAllowInvalidCertificates and tlsDisableOCSPEndpointCheck appear in the URI options.
  6. Both tlsAllowInvalidCertificates and tlsDisableCertificateRevocationCheck appear in the URI options.
  7. Both tlsDisableOCSPEndpointCheck and tlsDisableCertificateRevocationCheck appear in the URI options.
  8. All instances of tls and ssl in the URI options do not have the same value. If all instances of tls and ssl have the same value, an error MUST NOT be raised.
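
A sketch of that validation, assuming parsed options are kept as a map from normalized key to the list of values seen for that key (the storage shape and function name are illustrative only):

// Sketch: rejecting conflicting TLS-related URI options during parsing.
const conflictingTlsPairs: [string, string][] = [
  ["tlsInsecure", "tlsAllowInvalidCertificates"],
  ["tlsInsecure", "tlsAllowInvalidHostnames"],
  ["tlsInsecure", "tlsDisableOCSPEndpointCheck"],
  ["tlsInsecure", "tlsDisableCertificateRevocationCheck"],
  ["tlsAllowInvalidCertificates", "tlsDisableOCSPEndpointCheck"],
  ["tlsAllowInvalidCertificates", "tlsDisableCertificateRevocationCheck"],
  ["tlsDisableOCSPEndpointCheck", "tlsDisableCertificateRevocationCheck"],
];

function checkTlsConflicts(options: Map<string, string[]>): void {
  for (const [first, second] of conflictingTlsPairs) {
    if (options.has(first) && options.has(second)) {
      throw new Error(`URI options ${first} and ${second} cannot both be specified`);
    }
  }
  // All instances of "tls" and "ssl" must agree on the same value.
  const tlsValues = [...(options.get("tls") ?? []), ...(options.get("ssl") ?? [])];
  if (new Set(tlsValues).size > 1) {
    throw new Error('All "tls" and "ssl" values in the URI must match');
  }
}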

directConnection URI option with multiple seeds or SRV URI

The driver MUST report an error if the directConnection=true URI option is specified with multiple seeds.

The driver MUST report an error if the directConnection=true URI option is specified with an SRV URI, because the URI may resolve to multiple hosts. The driver MUST allow specifying directConnection=false URI option with an SRV URI.

srvServiceName and srvMaxHosts URI options

For URI option validation pertaining to srvServiceName and srvMaxHosts, please see the Initial DNS Seedlist Discovery spec for details.

Load Balancer Mode

For URI option validation in Load Balancer mode (i.e. loadBalanced=true), please see the Load Balancer spec for details.

SOCKS5 options

For URI option validation pertaining to proxyHost, proxyPort, proxyUsername and proxyPassword please see the SOCKS5 support spec for details.

List of specified options

Each driver option below MUST be implemented in each driver unless marked as optional. If an option is marked as optional, a driver MUST meet any conditions specified for leaving it out if it is not included. If a driver already provides the option under a different name, the driver MAY implement the old and new names as aliases. All keys and values MUST be encoded in UTF-8. All integer options are 32-bit unless specified otherwise. Note that all requirements and recommendations described in the Connection String spec pertaining to URI options apply here.

NameAccepted ValuesDefault ValueOptional to implement?Description
appnameany string that meets the criteria listed in the handshake specno appname specifiednoPassed into the server in the client metadata as part of the connection handshake
authMechanismany string; valid values are defined in the auth specNone; default values for authentication exist for constructing authentication credentials per the auth spec, but there is no default for the URI option itself.noThe authentication mechanism method to use for connection to the server
authMechanismPropertiescomma separated key:value pairs, e.g. "opt1:val1,opt2:val2"no properties specifiednoAdditional options provided for authentication (e.g. to enable hostname canonicalization for GSSAPI)
authSourceany stringNone; default values for authentication exist for constructing authentication credentials per the auth spec, but there is no default for the URI option itself.noThe database that connections should authenticate against
compressorscomma separated list of strings, e.g. "snappy,zlib"defined in compression specnoThe list of allowed compression types for wire protocol messages sent or received from the server
connectTimeoutMSnon-negative integer; 0 means "no timeout"10,000 ms (unless a driver already has a different default)noAmount of time to wait for a single TCP socket connection to the server to be established before erroring; note that this applies to SDAM hello and legacy hello operations
directConnection"true" or "false"defined in SDAM specnoWhether to connect to the deployment in Single topology.
heartbeatFrequencyMSinteger greater than or equal to 500defined in SDAM specnothe interval between regular server monitoring checks
journal"true" or "false"no "j" field specifiednoDefault write concern "j" field for the client
loadBalanced"true" or "false"defined in Load Balancer specnoWhether the driver is connecting to a load balancer.
localThresholdMSnon-negative integer; 0 means 0 ms (i.e. the fastest eligible server must be selected)defined in the server selection specnoThe amount of time beyond the fastest round trip time that a given server’s round trip time can take and still be eligible for server selection
maxIdleTimeMSnon-negative integer; 0 means no minimumdefined in the Connection Pooling specrequired for drivers with connection poolsThe amount of time a connection can be idle before it's closed
maxPoolSizenon-negative integer; 0 means no maximumdefined in the Connection Pooling specrequired for drivers with connection poolsThe maximum number of clients or connections able to be created by a pool at a given time. This count includes connections which are currently checked out.
maxConnectingpositive integerdefined in the Connection Pooling specrequired for drivers with connection poolsThe maximum number of Connections a Pool may be establishing concurrently.
maxStalenessSeconds-1 (no max staleness check) or integer >= 90defined in max staleness specnoThe maximum replication lag, in wall clock time, that a secondary can suffer and still be eligible for server selection
minPoolSizenon-negative integerdefined in the Connection Pooling specrequired for drivers with connection poolsThe number of connections the driver should create and maintain in the pool even when no operations are occurring. This count includes connections which are currently checked out.
proxyHostany stringdefined in the SOCKS5 support specnoThe IPv4/IPv6 address or domain name of a SOCKS5 proxy server used for connecting to MongoDB services.
proxyPortnon-negative integerdefined in the SOCKS5 support specnoThe port of the SOCKS5 proxy server specified in proxyHost.
proxyUsernameany stringdefined in the SOCKS5 support specnoThe username for username/password authentication to the SOCKS5 proxy server specified in proxyHost.
proxyPasswordany stringdefined in the SOCKS5 support specnoThe password for username/password authentication to the SOCKS5 proxy server specified in proxyHost.
readConcernLevelany string (to allow for forwards compatibility with the server)no read concern specifiednoDefault read concern for the client
readPreferenceany string; currently supported values are defined in the server selection spec, but must be lowercase camelCase, e.g. "primaryPreferred"defined in server selection specnoDefault read preference for the client (excluding tags)
readPreferenceTagscomma-separated key:value pairs (e.g. "dc:ny,rack:1" and "dc:ny)

can be specified multiple times; each instance of this key is a separate tag set
no tags specifiednoDefault read preference tags for the client; only valid if the read preference mode is not primary

The order of the tag sets in the read preference is the same as the order they are specified in the URI
replicaSetany stringno replica set name providednoThe name of the replica set to connect to
retryReads"true" or "false"defined in retryable reads specnoEnables retryable reads on server 3.6+
retryWrites"true" or "false"defined in retryable writes specnoEnables retryable writes on server 3.6+
serverMonitoringMode"stream", "poll", or "auto"defined in SDAM specrequired for multi-threaded or asynchronous driversConfigures which server monitoring protocol to use.
serverSelectionTimeoutMSpositive integer; a driver may also accept 0 to be used for a special case, provided that it documents the meaningdefined in server selection specnoA timeout in milliseconds to block for server selection before raising an error
serverSelectionTryOnce"true" or "false"defined in server selection specrequired for single-threaded driversScan the topology only once after a server selection failure instead of repeatedly until the server selection times out
socketTimeoutMSnon-negative integer; 0 means no timeoutno timeoutnoNOTE: This option is deprecated in favor of timeoutMS

Amount of time spent attempting to send or receive on a socket before timing out; note that this only applies to application operations, not SDAM.
| Name | Accepted Values | Default Value | Optional to implement? | Description |
| --- | --- | --- | --- | --- |
| srvMaxHosts | non-negative integer; 0 means no maximum | defined in the Initial DNS Seedlist Discovery spec | no | The maximum number of SRV results to randomly select when initially populating the seedlist or, during SRV polling, adding new hosts to the topology. |
| srvServiceName | a valid SRV service name according to RFC 6335 | "mongodb" | no | The service name to use for SRV lookup in initial DNS seedlist discovery and SRV polling. |
| ssl | "true" or "false" | same as "tls" | no | Alias of "tls"; required to ensure that Atlas connection strings continue to work. |
| tls | "true" or "false" | TLS required if "mongodb+srv" scheme; otherwise, drivers may enable TLS by default if other "tls"-prefixed options are present. Drivers MUST clearly document the conditions under which TLS is enabled implicitly. | no | Whether or not to require TLS for connections to the server. |
| tlsAllowInvalidCertificates | "true" or "false" | error on invalid certificates | required if the driver's language/runtime allows bypassing certificate verification | Specifies whether or not the driver should error when the server's TLS certificate is invalid. |
| tlsAllowInvalidHostnames | "true" or "false" | error on mismatched hostnames | required if the driver's language/runtime allows bypassing hostname verification | Specifies whether or not the driver should error when there is a mismatch between the server's hostname and the hostname specified by the TLS certificate. |
| tlsCAFile | any string | no certificate authorities specified | required if the driver's language/runtime allows non-global configuration | Path to a file containing either a single certificate authority or a bundle of certificate authorities to be considered trusted when making a TLS connection. |
| tlsCertificateKeyFile | any string | no client certificate specified | required if the driver's language/runtime allows non-global configuration | Path to the client certificate file or the client private key file; in the case that both are needed, the files should be concatenated. |
| tlsCertificateKeyFilePassword | any string | no password specified | required if the driver's language/runtime allows non-global configuration | Password to decrypt the client private key to be used for TLS connections. |
| tlsDisableCertificateRevocationCheck | "true" or "false" | false i.e. the driver will check a certificate's revocation status | yes | Controls whether or not the driver will check a certificate's revocation status via CRLs or OCSP. See the OCSP Support Spec for additional information. |
| tlsDisableOCSPEndpointCheck | "true" or "false" | false i.e. the driver will reach out to OCSP endpoints if needed | yes | Controls whether or not the driver will reach out to OCSP endpoints if needed. See the OCSP Support Spec for additional information. |
| tlsInsecure | "true" or "false" | no TLS constraints are relaxed | no | Relax TLS constraints as much as possible (e.g. allowing invalid certificates or hostname mismatches); drivers MUST document the exact constraints which are relaxed by this option being true. |
| w | non-negative integer or string | no "w" value specified | no | Default write concern "w" field for the client. |
| waitQueueTimeoutMS | positive number | defined in the Connection Pooling spec | required for drivers with connection pools, with exceptions described in the Connection Pooling spec | NOTE: This option is deprecated in favor of timeoutMS. Amount of time spent attempting to check out a connection from a server's connection pool before timing out. |
| wTimeoutMS | non-negative 64-bit integer; 0 means no timeout | no timeout | no | NOTE: This option is deprecated in favor of timeoutMS. Default write concern "wtimeout" field for the client. |
| zlibCompressionLevel | integer between -1 and 9 (inclusive) | -1 (default compression level of the driver) | no | Specifies the level of compression when using zlib to compress wire protocol messages; -1 signifies the default level, 0 signifies no compression, 1 signifies the fastest speed, and 9 signifies the best compression. |
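
For illustration only (host names and values are made up), several of the options above can be combined in a single connection string:

```
mongodb://db0.example.com:27017,db1.example.com:27017/?tls=true&tlsCAFile=/etc/ssl/ca.pem&w=majority&compressors=zlib&zlibCompressionLevel=6
```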

Test Plan

Tests are implemented and described in the tests directory.

Design Rationale

Why allow drivers to provide the canonical names as aliases to existing options?

First and foremost, this spec aims not to introduce any breaking changes to drivers. Forcing a driver to change the name of an option that it provides will break any applications that use the old option. Moreover, it is already possible to provide duplicate options in the URI by specifying the same option more than once; drivers can use the same semantics to resolve the conflicts as they did before, whether it's raising an error, using the first option provided, using the last option provided, or simply telling users that the behavior is not defined.

Why use "tls" as the prefix for these options instead of "ssl"?

Technically speaking, drivers already only support TLS, which supersedes SSL. While "SSL" is commonly used in parlance to refer to TLS connections, the fact remains that SSL is a weaker cryptographic protocol than TLS, and we want to accurately reflect the strict requirements that drivers have in ensuring the security of a TLS connection.

Why use the names "tlsAllowInvalidHostnames" and "tlsAllowInvalidCertificates"?

The "tls" prefix is used for the same reasons described above. The use of the terms "AllowInvalidHostnames" and "AllowInvalidCertificates" is an intentional choice in order to convey the inherent unsafety of these options, which should only be used for testing purposes. Additionally, both the server and the shell use "AllowInvalid" for their equivalent options.

Why provide multiple implementation options for the insecure TLS options (i.e. "tlsInsecure" vs. "tlsAllowInvalidHostnames"/"tlsAllowInvalidCertificates")?

Some TLS libraries (e.g. Go's standard library implementation) do not provide the ability to distinguish between allowing invalid certificates and allowing invalid hostnames: either both are allowed, or neither is. However, when more granular options are available, it's better to expose these to the user to allow them to relax security constraints as little as they need.

Why leave the decision up to drivers to enable TLS implicitly when TLS options are present?

It can be useful to turn on TLS implicitly when options such as "tlsCAFile" are present and "tls" is not present. However, with options such as "tlsAllowInvalidHostnames", some drivers may not have the ability to distinguish between "false" being provided and the option not being specified. To keep the implicit enabling of TLS consistent between such options, we defer the decision to enable TLS based on the presence of "tls"-prefixed options (besides "tls" itself) to drivers.
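
For illustration, a minimal sketch of one way a driver might implement this deferred, implicit enabling of TLS. The helper name and the exact option list are illustrative, not normative:

```python
# Non-normative sketch: infer whether TLS should be enabled from URI options.
TLS_PREFIXED_OPTIONS = {
    "tlscafile", "tlscertificatekeyfile", "tlscertificatekeyfilepassword",
    "tlsallowinvalidcertificates", "tlsallowinvalidhostnames", "tlsinsecure",
    "tlsdisableocspendpointcheck", "tlsdisablecertificaterevocationcheck",
}

def tls_enabled(uri_options):
    opts = {k.lower(): v for k, v in uri_options.items()}
    if "tls" in opts or "ssl" in opts:
        return opts.get("tls", opts.get("ssl")) == "true"
    # Implicitly enable TLS when another "tls"-prefixed option is present.
    # A driver that adopts this behavior MUST document it.
    return any(k in TLS_PREFIXED_OPTIONS for k in opts)
```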

Reference Implementations

Ruby and Python

Security Implication

Each of the "insecure" TLS options (i.e. "tlsInsecure", "tlsAllowInvalidHostnames", "tlsAllowInvalidCertificates", "tlsDisableOCSPEndpointCheck", and "tlsDisableCertificateRevocationCheck") defaults to the more secure option when TLS is enabled. In order to be backwards compatible with existing driver behavior, neither TLS nor authentication is enabled by default.

Future Work

This specification is intended to represent the current state of drivers' URI options rather than be a static description of the options at the time it was written. Whenever another specification is written or modified in a way that changes the name or the semantics of a URI option or adds a new URI option, this specification MUST be updated to reflect those changes.

Changelog

  • 2024-05-08: Migrated from reStructuredText to Markdown.

  • 2023-08-21: Add serverMonitoringMode option.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2022-01-19: Add the timeoutMS option and deprecate some existing timeout options

  • 2021-12-14: Add SOCKS5 options

  • 2021-11-08: Add maxConnecting option.

  • 2021-10-14: Add srvMaxHosts option. Merge headings discussing URI validation
    for directConnection option.

  • 2021-09-15: Add srvServiceName option

  • 2021-09-13: Fix link to load balancer spec

  • 2021-04-15: Adding in behaviour for load balancer mode.

  • 2021-04-08: Updated to refer to hello and legacy hello

  • 2020-03-03: Add tlsDisableCertificateRevocationCheck option

  • 2020-02-26: Add tlsDisableOCSPEndpointCheck option

  • 2019-09-08: Add retryReads option

  • 2019-04-26: authSource and authMechanism have no default value

  • 2019-02-04: Specified errors for conflicting TLS-related URI options

  • 2019-01-25: Updated to reflect new Connection Monitoring and Pooling Spec


OCSP Support
============

:Status: Accepted
:Minimum Server Version: 4.4

.. contents::


Abstract

This specification is about the ability for drivers to support OCSP <https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol>—Online Certificate Status Protocol (RFC 6960 <https://tools.ietf.org/html/rfc6960>)—and two of its related extensions: OCSP stapling <https://en.wikipedia.org/wiki/OCSP_stapling>__ (RFC 6066 <https://tools.ietf.org/html/rfc6066>) and Must-Staple <https://scotthelme.co.uk/ocsp-must-staple/> (RFC 7633 <https://tools.ietf.org/html/rfc7633>__).

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 <https://www.ietf.org/rfc/rfc2119.txt>_.

Specification

Required Server Versions

The server supports attaching a stapled OCSP response with versions ≥ 4.4. Future backports will bring stapling support to server versions ≥ 3.6. Drivers need not worry about the version of the server as a driver’s TLS library should automatically perform the proper certificate revocation checking behavior once OCSP is enabled.

Enabling OCSP Support by Default

Drivers whose TLS libraries utilize application-wide settings for OCSP MUST respect the application’s settings and MUST NOT change any OCSP settings. Otherwise:

  • If a driver’s TLS library supports OCSP, OCSP MUST be enabled by default whenever possible (even if this also enables Certificate Revocation List (CRL) checking).

  • If a driver’s TLS library supports verifying stapled OCSP responses, this option MUST be enabled by default whenever possible (even if this also enables CRL checking).

.. _Suggested OCSP Behavior:

Suggested OCSP Behavior

Drivers SHOULD implement the OCSP behavior defined below to the extent that their TLS library allows. At any point in the steps defined below, if a certificate in or necessary to validate the chain is found to be invalid, the driver SHOULD end the connection.

  1. If a driver’s TLS library supports Stapled OCSP, the server has a Must-Staple certificate and the server does not present a stapled OCSP response, a driver SHOULD end the connection.

  2. If a driver’s TLS library supports Stapled OCSP and the server staples an OCSP response that does not cover the certificate it presents or is invalid per RFC 6960 Section 3.2 <https://tools.ietf.org/html/rfc6960#section-3.2>_, a driver SHOULD end the connection.

  3. If a driver’s TLS library supports Stapled OCSP and the server staples an OCSP response that does cover the certificate it presents, a driver SHOULD accept the stapled OCSP response and validate all of the certificates that are presented in the response.

  4. If any unvalidated certificates in the chain remain and the client possesses an OCSP cache, the driver SHOULD attempt to validate the status of the unvalidated certificates using the cache.

  5. If any unvalidated certificates in the chain remain and the driver has a user specified CRL, the driver SHOULD attempt to validate the status of the unvalidated certificates using the user-specified CRL.

  6. If any unvalidated certificates in the chain remain and the driver has access to cached CRLs (e.g. OS-level/application-level/user-level caches), the driver SHOULD attempt to validate the status of the unvalidated certificates using the cached CRLs.

  7. If the server’s certificate remains unvalidated, that certificate has a list of OCSP responder endpoints, and tlsDisableOCSPEndpointCheck or tlsDisableCertificateRevocationCheck is false (if the driver supports these options <MongoClient Configuration>), the driver SHOULD send HTTP requests to the responders in parallel. The first valid response that concretely marks the certificate status as good or revoked should be used. A timeout should be applied to requests per the Client Side Operations Timeout <../client-side-operations-timeout/client-side-operations-timeout.md>__ specification, with a default timeout of five seconds. The status for a response should only be checked if the response is valid per RFC 6960 Section 3.2 <https://tools.ietf.org/html/rfc6960#section-3.2>

  8. If any unvalidated intermediate certificates remain and those certificates have OCSP endpoints, for each certificate, the driver SHOULD NOT reach out to the OCSP endpoint specified and attempt to validate that certificate.*

  9. If any unvalidated intermediate certificates remain and those certificates have CRL distribution points, the driver SHOULD NOT download those CRLs and attempt to validate the status of all the other certificates using those CRLs.*

  10. Finally, the driver SHOULD continue the connection, even if the status of all the unvalidated certificates has not been confirmed yet. This means that the driver SHOULD default to "soft fail" behavior, connecting as long as there are no explicitly invalid certificates—i.e. the driver will connect even if the status of all the unvalidated certificates has not been confirmed yet (e.g. because an OCSP responder is down).

*: See Design Rationale: Suggested OCSP Behavior <ocsp-support.rst#id7>__

Suggested OCSP Response Validation Behavior

Drivers SHOULD validate OCSP Responses in the manner specified in RFC 6960: 3.2 <https://tools.ietf.org/html/rfc6960#section-3.2>__ to the extent that their TLS library allows.

Suggested OCSP Caching Behavior

Drivers with sufficient control over their TLS library's OCSP behavior SHOULD implement an OCSP cache. The key for this cache SHOULD be the certificate identifier (CertID) of the OCSP request as specified in RFC 6960: 4.1.1 <https://tools.ietf.org/html/rfc6960#section-4.1.1>__. For convenience, the relevant section has been duplicated below:

.. code::

    CertID ::= SEQUENCE {
        hashAlgorithm    AlgorithmIdentifier,
        issuerNameHash   OCTET STRING, -- Hash of issuer's DN
        issuerKeyHash    OCTET STRING, -- Hash of issuer's public key
        serialNumber     CertificateSerialNumber }

If a driver would accept a conclusive OCSP response (stapled or non-stapled), the driver SHOULD cache that response. We define a conclusive OCSP response as an OCSP response that indicates that a certificate is either valid or revoked. Thus, an unknown certificate status SHOULD NOT be considered conclusive, and the corresponding OCSP response SHOULD NOT be cached.

In accordance with RFC 6960: 3.2 <https://tools.ietf.org/html/rfc6960#section-3.2>__, a cached response SHOULD be considered valid up to and excluding the time specified in the response's nextUpdate field. In other words, if the current time is t, then the cache entry SHOULD be considered valid if thisUpdate ⩽ t < nextUpdate.

If a driver would accept a stapled OCSP response and that response has a later nextUpdate than the response already in the cache, drivers SHOULD replace the older entry in the cache with the fresher response.
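
The following is a minimal, non-normative sketch of such a cache. It assumes a parsed OCSP response object exposing cert_status, this_update, and next_update (timezone-aware datetimes); the class and attribute names are illustrative only.

.. code:: python

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass(frozen=True)
    class CertId:
        hash_algorithm: str
        issuer_name_hash: bytes
        issuer_key_hash: bytes
        serial_number: int

    class OcspCache:
        def __init__(self):
            self._entries = {}  # CertId -> conclusive OCSP response

        def add(self, cert_id, response):
            # Only conclusive ("good" or "revoked") responses are cached;
            # an "unknown" status is not considered conclusive.
            if response.cert_status not in ("good", "revoked"):
                return
            cached = self._entries.get(cert_id)
            # Keep whichever response has the later nextUpdate.
            if cached is None or response.next_update > cached.next_update:
                self._entries[cert_id] = response

        def get(self, cert_id):
            response = self._entries.get(cert_id)
            now = datetime.now(timezone.utc)
            # A cached entry is valid if thisUpdate <= now < nextUpdate.
            if response is not None and response.this_update <= now < response.next_update:
                return response
            self._entries.pop(cert_id, None)
            return None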

MongoClient Configuration

This specification introduces the client-level configuration options defined below.

tlsDisableOCSPEndpointCheck
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Drivers that can, on a per MongoClient basis, disable non-stapled OCSP while keeping stapled OCSP enabled MUST implement this option.

This boolean option determines whether a MongoClient should refrain from reaching out to an OCSP endpoint i.e. whether non-stapled OCSP should be disabled. When set to true, a driver MUST NOT reach out to OCSP endpoints. When set to false, a driver MUST reach out to OCSP endpoints if needed (as described in Specification: Suggested OCSP Behavior <ocsp-support.rst#id1>__).

For drivers that pass the "Soft Fail Test" <tests/README.rst#integration-tests-permutations-to-be-tested>__, this option MUST default to false.

For drivers that fail the "Soft Fail Test" because their TLS library exhibits hard-fail behavior when a responder is unreachable, this option MUST default to true, and a driver MUST document this behavior. If this hard-failure behavior is specific to a particular platform (e.g. the TLS library hard-fails only on Windows) then this option MUST default to true only on the platform where the driver exhibits hard-fail behavior, and a driver MUST document this behavior.

tlsDisableCertificateRevocationCheck
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Drivers whose TLS libraries support an option to toggle general certificate revocation checking MUST implement this option if enabling general certificate revocation checking causes hard-fail behavior when no revocation mechanisms are available (i.e. no methods are defined or the CRL distribution points/OCSP endpoints are unreachable).

This boolean option determines whether a MongoClient should refrain from checking certificate revocation status. When set to true, a driver MUST NOT check certificate revocation status via CRLs or OCSP. When set to false, a driver MUST check certificate revocation status and reach out to OCSP endpoints if needed (as described in Specification: Suggested OCSP Behavior <ocsp-support.rst#id1>__).

For drivers that pass the "Soft Fail Test" <tests/README.rst#integration-tests-permutations-to-be-tested>__ , this option MUST default to false.

If a driver does not support tlsDisableOCSPEndpointCheck and that driver fails the "Soft Fail Test" because their TLS library exhibits hard-fail behavior when a responder is unreachable, then that driver must default tlsDisableCertificateRevocationCheck to true. Such a driver also MUST document this behavior. If this hard-failure behavior is specific to a particular platform (e.g. the TLS library hard-fails only on Windows) then this option MUST default to true only on the platform where the driver exhibits hard-fail behavior, and a driver MUST document this behavior.

Naming Deviations
^^^^^^^^^^^^^^^^^

Drivers MUST use the defined names of tlsDisableOCSPEndpointCheck and tlsDisableCertificateRevocationCheck for the connection string parameters to ensure portability of connection strings across applications and drivers. If drivers solicit MongoClient options through another mechanism (e.g. an options dictionary provided to the MongoClient constructor), drivers SHOULD use the defined name but MAY deviate to comply with their existing conventions. For example, a driver may use tls_disable_ocsp_endpoint_check instead of tlsDisableOCSPEndpointCheck.

How OCSP interacts with existing configuration options

The following requirements apply only to drivers that are able to enable/disable OCSP on a per MongoClient basis.

  1. If a connection string specifies tlsInsecure=true then the driver MUST disable OCSP.

  2. If a connection string contains both tlsInsecure and tlsDisableOCSPEndpointCheck then the driver MUST throw an error.

  3. If a driver supports tlsAllowInvalidCertificates, and a connection string specifies tlsAllowInvalidCertificates=true, then the driver MUST disable OCSP.

  4. If a driver supports tlsAllowInvalidCertificates, and a connection string specifies both tlsAllowInvalidCertificates and tlsDisableOCSPEndpointCheck, then the driver MUST throw an error.

The remaining requirements in this section apply only to drivers that expose an option to enable/disable certificate revocation checking on a per MongoClient basis; a combined sketch of these and the preceding checks follows the list below.

  1. Drivers MUST enable OCSP support (with stapling if possible) when certificate revocation checking is enabled, unless the driver exhibits hard-fail behavior (see tlsDisableCertificateRevocationCheck_). In such a case, a driver MUST disable OCSP support on the platforms where its TLS library exhibits hard-fail behavior.

  2. Drivers SHOULD throw an error if any of tlsInsecure=true or tlsAllowInvalidCertificates=true or tlsDisableOCSPEndpointCheck=true is specified alongside the option to enable certificate revocation checking.

  3. If a connection string contains both tlsInsecure and tlsDisableCertificateRevocationCheck then the driver MUST throw an error.

  4. If a driver supports tlsAllowInvalidCertificates and a connection string specifies both tlsAllowInvalidCertificates and tlsDisableCertificateRevocationCheck, then the driver MUST throw an error.

  5. If a driver supports tlsDisableOCSPEndpointCheck, and a connection string specifies tlsDisableCertificateRevocationCheck, then the driver MUST throw an error.
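
A combined, non-normative sketch of the MUST-level checks from both lists above. The function name and error type are illustrative, and the SHOULD-level check (rule 2 of the second list) is omitted:

.. code:: python

    def apply_ocsp_uri_rules(options):
        """options maps canonical URI option names to their string values."""
        conflicts = [
            ("tlsInsecure", "tlsDisableOCSPEndpointCheck"),
            ("tlsAllowInvalidCertificates", "tlsDisableOCSPEndpointCheck"),
            ("tlsInsecure", "tlsDisableCertificateRevocationCheck"),
            ("tlsAllowInvalidCertificates", "tlsDisableCertificateRevocationCheck"),
            ("tlsDisableOCSPEndpointCheck", "tlsDisableCertificateRevocationCheck"),
        ]
        # Conflicting combinations MUST cause an error.
        for a, b in conflicts:
            if a in options and b in options:
                raise ValueError(f"URI options {a} and {b} cannot both be specified")

        # tlsInsecure=true or tlsAllowInvalidCertificates=true disables OCSP.
        return not (options.get("tlsInsecure") == "true"
                    or options.get("tlsAllowInvalidCertificates") == "true")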

TLS Requirements

Server Name Indication <https://en.wikipedia.org/wiki/Server_Name_Indication>__ (SNI) MUST be used in the TLS connection that obtains the server's certificate; otherwise, the server may present the incorrect certificate. This requirement is especially relevant to drivers whose TLS libraries allow for finer-grained control over their TLS behavior (e.g. Python, C).
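
As a minimal sketch with Python's standard library (host and port are illustrative values), the key point is passing server_hostname so that the SNI extension is included in the ClientHello:

.. code:: python

    import socket
    import ssl

    host, port = "mongodb0.example.com", 27017

    # Certificate and hostname verification are enabled by default here.
    context = ssl.create_default_context()

    with socket.create_connection((host, port)) as sock:
        # server_hostname is what places the SNI extension in the ClientHello.
        with context.wrap_socket(sock, server_hostname=host) as tls_sock:
            print(tls_sock.version())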

Documentation Requirements

Drivers that cannot support OCSP MUST document this lack of support. Additionally, such drivers MUST document the following:

  • They MUST document that they will be unable to support certificate revocation checking with Atlas when Atlas moves to OCSP-only certificates.

  • They MUST document that users should be aware that if they use a Certificate Authority (CA) that issues OCSP-only certificates, then the driver cannot perform certificate revocation checking.

Drivers that support OCSP without stapling MUST document this lack of support for stapling. They also MUST document their behavior when an OCSP responder is unavailable and a server has a Must-Staple certificate. If a driver is able to connect in such a scenario due to the prevalence of "\ soft-fail <https://www.imperialviolet.org/2014/04/19/revchecking.html>__\ " behavior in TLS libraries (where a certificate is accepted when an answer from an OCSP responder cannot be obtained), they additionally MUST document that this ability to connect to a server with a Must-Staple certificate when an OCSP responder is unavailable differs from the mongo shell or a driver that does support OCSP-stapling, both of which will fail to connect (i.e. "hard-fail") in such a scenario.

If a driver (e.g. Python <https://api.mongodb.com/python/current/examples/tls.html>, C <http://mongoc.org/libmongoc/current/mongoc_ssl_opt_t.html>) allows the user to provide their own certificate revocation list (CRL), then that driver MUST document their TLS library’s preference between the user-provided CRL and OCSP.

Drivers that cannot enable OCSP by default on a per MongoClient basis (e.g. Java) MUST document this limitation.

Drivers that fail either of the "Malicious Server Tests" (i.e. the driver connects to a test server without TLS constraints being relaxed) as defined in the test plan below MUST document that their chosen TLS library will connect in the case that a server with a Must-Staple certificate does not staple a response.

Drivers that fail "Malicious Server Test 2" (i.e. the driver connects to the test server without TLS constraints being relaxed) as defined in the test plan below MUST document that their chosen TLS library will connect in the case that a server with a Must-Staple certificate does not staple a response and the OCSP responder is down.

Drivers that fail "Soft Fail Test" MUST document that their driver’s TLS library utilizes "hard fail" behavior in the case of an unavailable OCSP responder in contrast to the mongo shell and drivers that utilize "soft fail" behavior. They also MUST document the change in defaults for the applicable options (see MongoClient Configuration_).

If any changes related to defaults for OCSP behavior are made after a driver version that supports OCSP has been released, the driver MUST document potential backwards compatibility issues as noted in the Backwards Compatibility_ section.

Test Plan

See tests/README <tests/README.rst>__ for tests.

Motivation for Change

MongoDB Atlas intends to use LetsEncrypt <https://letsencrypt.org/>__, a Certificate Authority (CA) that does not use CRLs and only uses OCSP. (Atlas currently uses DigiCert certificates which specify both OCSP endpoints and CRL distribution points.) Therefore, the MongoDB server is adding support for OCSP, and drivers need to support OCSP in order for applications to continue to have the ability to verify the revocation status of an Atlas server’s certificate. Other CAs have also stopped using CRLs, so enabling OCSP support will ensure that a customer’s choice in CAs is not limited by a driver’s lack of OCSP support.

OCSP stapling will also help applications deployed behind a firewall with an outbound allowList. It’s a very natural mistake to neglect to allowList the CRL distribution points and the OCSP endpoints, which can prevent an application from connecting to a MongoDB instance if certificate revocation checking is enabled but the driver does not support OCSP stapling.

Finally, drivers whose TLS libraries support the OCSP stapling <https://en.wikipedia.org/wiki/OCSP_stapling>__ extension will be able to minimize the number of network round trips for the client because the driver's TLS library will read an OCSP response stapled to the server's certificate that the server provides as part of the TLS handshake. Drivers whose TLS libraries support OCSP but not stapling will need to make an additional round trip to contact the OCSP endpoint.

Design Rationale

We have chosen not to force drivers whose TLS libraries do not support OCSP/stapling "out of the box" to implement OCSP support due to the extra work and research that this might require. Similarly, this specification uses "SHOULD" more commonly (when other specs would prefer "MUST") to account for the fact that some drivers may not be able to fully customize OCSP behavior in their TLS library.

We are requiring drivers to support both stapled OCSP and non-stapled OCSP in order to support revocation checking for server versions in Atlas that do not support stapling, especially after Atlas switches to Let’s Encrypt certificates (which do not have CRLs). Additionally, even when servers do support stapling, in the case of a non-"Must Staple" certificate (which is the type that Atlas is planning to use), if the server is unable to contact the OCSP responder (e.g. due to a network error) and staple a certificate, the driver being able to query the certificate’s OCSP endpoint allows for one final chance to attempt to verify the certificate’s validity.

Malicious Server Tests

"Malicious Server Test 2" is designed to reveal the behavior of TLS libraries of drivers in one of the worst case scenarios. Since a majority of the drivers will not have fine-grained control over their OCSP behavior, this test case provides signal about the soft/hard fail behavior in a driver’s TLS library so that we can document this.

A driver with control over its OCSP behavior will react the same in "Malicious Server Test 1" and "Malicious Server Test 2", terminating the connection as long as TLS constraints have not been relaxed.

Atlas Connectivity Tests

No additional Atlas connectivity tests will be added because the existing tests should provide sufficient coverage (provided that one of the non-free tier clusters is upgraded to ≥ 3.6).

.. _Design Rationale for Suggested OCSP Behavior:

Suggested OCSP Behavior

For drivers with finer-grain control over their OCSP behavior, the suggested OCSP behavior was chosen as a balance between security and availability, erring on availability while minimizing network round trips. Therefore, in the Specification: Suggested OCSP Behavior <ocsp-support.rst#id1>__ section, in order to minimize network round trips, drivers are advised not to reach out to OCSP endpoints and CRL distribution points in order to verify the revocation status of intermediate certificates.

Backwards Compatibility

An application behind a firewall with an outbound allowList that upgrades to a driver implementing this specification may experience connectivity issues when OCSP is enabled. This is because the driver may need to contact OCSP endpoints or CRL distribution points [1]_ specified in the server’s certificate and if these OCSP endpoints and/or CRL distribution points are not accessible, then the connection to the server may fail. (N.B.: TLS libraries typically implement "soft fail" <https://blog.hboeck.de/archives/886-The-Problem-with-OCSP-Stapling-and-Must-Staple-and-why-Certificate-Revocation-is-still-broken.html>__ such that connections can continue even if the OCSP server is inaccessible, so this issue is much more likely in the case of a server with a certificate that only contains CRL distribution points.) In such a scenario, connectivity may be able to be restored by disabling non-stapled OCSP via tlsDisableOCSPEndpointCheck or by disabling certificate revocation checking altogether via tlsDisableCertificateRevocationCheck.

An application that uses a driver that utilizes hard-fail behavior when there are no certificate revocation mechanisms available may also experience connectivity issues. Cases in which no certificate revocation mechanisms are available include:

  1. When a server's certificate defines neither OCSP endpoints nor CRL distribution points
  2. When a certificate defines CRL distribution points and/or OCSP endpoints but these points are unavailable (e.g. the points are down or the application is deployed behind a restrictive firewall).

In such a scenario, connectivity may be able to be restored by disabling non-stapled OCSP via tlsDisableOCSPEndpointCheck or by disabling certificate revocation checking via tlsDisableCertificateRevocationCheck.
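
For example (host name illustrative), either of the following connection strings relaxes revocation checking for such a deployment::

    mongodb://db0.example.com/?tls=true&tlsDisableOCSPEndpointCheck=true
    mongodb://db0.example.com/?tls=true&tlsDisableCertificateRevocationCheck=true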

Reference Implementation

The .NET/C#, Python, C, and Go drivers will provide the reference implementations. See CSHARP-2817 <https://jira.mongodb.org/browse/CSHARP-2817>, PYTHON-2093 <https://jira.mongodb.org/browse/PYTHON-2093>, CDRIVER-3408 <https://jira.mongodb.org/browse/CDRIVER-3408>__ and GODRIVER-1467 <http://jira.mongodb.org/browse/GODRIVER-1467>__.

Security Implications

Customers should be aware that if they choose to use a CA that only supports OCSP, they will not be able to check certificate validity in drivers that cannot support OCSP.

In the case that the server has a Must-Staple certificate and its OCSP responder is down (for longer than the server is able to cache and staple a previously acquired response), the mongo shell or a driver that supports OCSP stapling will not be able to connect while a driver that supports OCSP but not stapling will be able to connect.

TLS libraries may implement "\ soft-fail <https://www.imperialviolet.org/2014/04/19/revchecking.html>__\ " in the case of non-stapled OCSP which may be undesirable in highly secure contexts.

Drivers that fail the "Malicious Server" tests as defined in Test Plan will connect in the case that server with a Must-Staple certificate does not staple a response.

Testing Against Valid Certificate Chains

Some TLS libraries are stricter about the types of certificate chains they're willing to accept (and it can be difficult to debug why a particular certificate chain is considered invalid by a TLS library). Clients and servers with more control over their OCSP implementation may run into fewer up front costs, but this may be at the cost of not fully implementing every single aspect of OCSP.

For example, the server team’s certificate generation tool generated X509 V1 certificates which were used for testing OCSP without any issues in the server team’s tests. However, while we were creating a test plan for drivers, we discovered that Java’s keytool refused to import X509 V1 certificates into its trust store and thus had to modify the server team’s certificate generation tool to generate V3 certificates.

Another example comes from .NET on Linux <https://github.com/dotnet/corefx/issues/41475>, which currently enforces the CA/Browser forum requirement that while a leaf certificate can be covered solely by OCSP, "public CAs have to have CRL[s] covering their issuing CAs". This requirement is not enforced with Java’s default TLS libraries. See also: Future Work: CA/Browser Forum Requirements Complications <#cabrowser-forum-requirements-complications>.

Future Work

When the server work is backported, drivers will need to update their prose tests so that tests are run against a wider range of compatible servers.

Automated Atlas connectivity tests (DRIVERS-382 <https://jira.mongodb.org/browse/DRIVERS-382>__) may be updated with additional OCSP-related URIs when 4.4 becomes available for Atlas; alternatively, the clusters behind those URIs may be updated to 4.4 (or an earlier version where OCSP has been backported). Note: While the free tier cluster used for the Automated Atlas connectivity tests will automatically get updated to 4.4 when it is available, Atlas currently does not plan to enable OCSP for free and shared tier instances (i.e. Atlas Proxy).

Options to configure failure behavior (e.g. to maximize security or availability) may be added in the future.

CA/Browser Forum Requirements Complications

The test plan may need to be reworked if we discover that a driver’s TLS library strictly implements CA/Browser forum requirements (e.g. .NET on Linux <https://github.com/dotnet/corefx/issues/41475>__). This is because our current chain of certificates does not fulfill the following requirement: while a leaf certificate can be covered solely by OCSP, "public CAs have to have CRL[s] covering their issuing CAs." This rework of the test plan may happen during the initial implementation of OCSP support or happen later if a driver’s TLS library implements the relevant CA/Browser forum requirement.

Extending the chain to fulfill the CA/Browser requirement should solve this issue, although drivers that don't support manually supplying a CRL may need to host a web server that serves the required CRL during testing.

Q&A

Can we use one Evergreen task combined with distinct certificates for each column in the test matrix to prevent OCSP caching from affecting testing?

No. This is because Evergreen may reuse a host with an OCSP cache from a previous execution, so using distinct certificates per column would not obviate the need to clear all relevant OCSP caches prior to each test run. Since Evergreen does perform some cleanup between executions, having separate tasks for each test column offers an additional layer of safety in protecting against stale data in OCSP caches.

Should drivers use a nonce when creating an OCSP request?

A driver MAY use a nonce if desired, but including a nonce in an OCSP request <https://tools.ietf.org/html/rfc6960#section-4.4.1>__ is not required as the server does not explicitly support nonces.

Should drivers utilize a tolerance period when accepting OCSP responses?

No. Although RFC 5019, The Lightweight Online Certificate Status Protocol (OCSP) Profile for High-Volume Environments, <https://tools.ietf.org/html/rfc5019>__ allows for the configuration of a tolerance period for the acceptance of OCSP responses after nextUpdate, this spec is not adhering to that RFC.

Why was the decision made to allow OCSP endpoint checking to be enabled/disabled via a URI option?

We initially hoped that we would be able to avoid exposing any options specifically related to OCSP to the user, in accordance with the "No Knobs" drivers mantra <https://github.com/mongodb/specifications#no-knobs>__. However, we later decided that users may benefit from having the ability to disable OCSP endpoint checking when applications are deployed behind a restrictive firewall with an outbound allowList, and this benefit is worth adding another URI option.

Appendix

OS-Level OCSP Cache Manipulation

Windows
^^^^^^^

On Windows, the OCSP cache can be viewed like so:

.. code-block:: console

certutil -urlcache

To search the cache for "Lets Encrypt" OCSP cache entries, the following command could be used:

.. code-block:: console

certutil -urlcache | findstr letsencrypt.org

On Windows, the OCSP cache can be cleared like so:

.. code-block:: console

certutil -urlcache * delete

To delete only "Let’s Encrypt" related entries, the following command could be used:

.. code-block:: console

certutil -urlcache letsencrypt.org delete

macOS
^^^^^

On macOS 10.14, the OCSP cache can be viewed like so:

.. code-block:: console

find ~/profile/Library/Keychains -name 'ocspcache.sqlite3' \
-exec sqlite3 "{}" 'SELECT responderURI FROM responses;' \;

To search the cache for "Let’s Encrypt" OCSP cache entries, the following command could be used:

.. code-block:: console

find ~/profile/Library/Keychains \
-name 'ocspcache.sqlite3' \
-exec sqlite3 "{}" \
'SELECT responderURI FROM responses WHERE responderURI LIKE "http://%.letsencrypt.org%";' \;

On macOS 10.14, the OCSP cache can be cleared like so:

.. code-block:: console

find ~/profile/Library/Keychains -name 'ocspcache.sqlite3' \
-exec sqlite3 "{}" 'DELETE FROM responses;' \;

To delete only "Let’s Encrypt" related entries, the following command could be used:

.. code-block:: console

find ~/profile/Library/Keychains -name 'ocspcache.sqlite3' \
-exec sqlite3 "{}" \
'DELETE FROM responses WHERE responderURI LIKE "http://%.letsencrypt.org%";' \;

Optional Quick Manual Validation Tests of OCSP Support

These optional validation tests are not a required part of the test plan. However, these optional tests may be useful for drivers trying to quickly determine if their TLS library supports OCSP and/or as an initial manual testing goal when implementing OCSP support.

Optional test to ensure that the driver’s TLS library supports OCSP stapling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create a test application with a connection string with TLS enabled that connects to any server that has OCSP-only certificate and supports OCSP stapling.

For example, the test application could connect to C\ :sub:V, one of the special testing Atlas clusters with a valid OCSP-only certificate (see Future Work for additional information).

Alternatively, the test application can attempt to connect to a non-mongod server that supports OCSP stapling and has a valid OCSP-only certificate. The connection will of course fail, but we are only interested in the TLS handshake and the OCSP requests that may follow. For example, the following connection string could be used: mongodb://valid-isrgrootx1.letsencrypt.org:443/?tls=true

Run the test application and verify through packet analysis that the driver’s ClientHello message’s TLS extension section includes the status_request extension, thus indicating that the driver is advertising that it supports OCSP stapling.

Note: If using WireShark <https://www.wireshark.org/>__ as your chosen packet analyzer, the tls (case-sensitive) display filter may be useful in this endeavor.

OCSP Caching and the optional test to ensure that the driver’s TLS library supports non-stapled OCSP
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The "Optional test to ensure that the driver’s TLS library supports non-stapled OCSP" is complicated by the fact that OCSP allows the client to cache the OCSP responses <https://tools.ietf.org/html/rfc5019#section-6.1>__, so clearing an OCSP cache may be needed in order to force the TLS library to reach out to an OCSP endpoint. This cache may exist at the OS-level, application-level and/or at the user-level.

Optional test to ensure that the driver’s TLS library supports non-stapled OCSP
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create a test application with a connection string with TLS enabled that connects to any server with an OCSP-only certificate.

Alternatively, the test application can attempt to connect to a non-mongod server that does not support OCSP stapling and has a valid OCSP-only certificate. The connection will of course fail, but we are only interested in the TLS handshake and the OCSP requests that may follow.

Alternatively, if it’s known that a driver’s TLS library does not support stapling or if stapling support can be toggled off, then any non-mongod server that has a valid OCSP-only certificate will work, including the example shown in the "Optional test to ensure that the driver’s TLS library supports OCSP stapling."

Clear the OS/user/application OCSP cache, if one exists and the TLS library makes use of it.

Run the test application and ensure that the TLS handshake and connection succeed. Ensure that the driver’s TLS library has contacted the OCSP endpoint specified in the server’s certificate. Two simple ways of checking this are:

  • Use a packet analyzer while the test application is running to ensure that the driver’s TLS library contacts the OCSP endpoint. When using WireShark, the ocsp and tls (case-sensitive) display filters may be useful in this endeavor.

  • If the TLS library utilizes an OCSP cache and the cache was cleared prior to starting the test application, check the OCSP cache for a response from an OCSP endpoint specified in the server's certificate.

Changelog

:2022-10-05: Remove spec front matter and reformat changelog.
:2022-01-19: Require that timeouts be applied per the client-side operations timeout spec.
:2021-04-07: Updated terminology to use allowList.
:2020-07-01: Default tlsDisableOCSPEndpointCheck or tlsDisableCertificateRevocationCheck to true in the case that a driver's TLS library exhibits hard-fail behavior and add provision for platform-specific defaults.
:2020-03-20: Clarify OCSP documentation requirements for drivers unable to enable OCSP by default on a per MongoClient basis.
:2020-03-03: Add tlsDisableCertificateRevocationCheck URI option. Add Go as a reference implementation. Add hard-fail backwards compatibility documentation requirements.
:2020-02-26: Add tlsDisableOCSPEndpointCheck URI option.
:2020-02-19: Clarify behavior for reaching out to OCSP responders.
:2020-02-10: Add cache requirement.
:2020-01-31: Add SNI requirement and clarify design rationale regarding minimizing round trips.
:2020-01-28: Clarify behavior regarding nonces and tolerance periods.
:2020-01-16: Initial commit.

Endnotes

.. [1] Since this specification mandates that a driver must enable OCSP when possible, this may involve enabling certificate revocation checking in general, and thus the accessibility of CRL distribution points can become a factor.

MongoDB Handshake
=================

:Status: Accepted
:Minimum Server Version: 3.4

.. contents::


Abstract

MongoDB 3.4 has the ability to annotate connections with metadata provided by the connecting client. The intent of this metadata is to be able to identify client-level information about the connection, such as application name, driver name and version. The provided information will be logged through the mongo[d|s].log and the profile logs; this should enable sysadmins to easily backtrack log entries to the offending application. The active connection data will also be queryable through the aggregation pipeline, to enable collecting and analyzing driver trends.

After connecting to a MongoDB node a hello command (if Stable API is requested) or a legacy hello command is issued, followed by authentication, if appropriate. This specification augments this handshake and defines certain arguments that clients provide as part of the handshake.

This spec furthermore adds a new connection string argument for an application to declare its application name to the server.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 <https://www.ietf.org/rfc/rfc2119.txt>_.

Terms

hello command The command named hello. It is the preferred and modern command for handshakes and topology monitoring.

legacy hello command The command named isMaster. It is the deprecated equivalent of the hello command. It was deprecated in MongoDB 5.0.

isMaster / ismaster The correct casing is isMaster, but servers will accept the alternate casing ismaster. Other case variations result in CommandNotFound. Drivers MUST take this case variation into account when determining which commands to encrypt, redact, or otherwise treat specially.

Specification


Connection handshake

MongoDB uses the hello or isMaster commands for handshakes and topology monitoring. hello is the modern and preferred command. hello must always be sent using the OP_MSG protocol. isMaster is referred to as "legacy hello" and is maintained for backwards compatibility with servers that do not support the hello command.

If a server API version <../versioned-api/versioned-api.rst>__ is requested or loadBalanced: True, drivers MUST use the hello command for the initial handshake and use the OP_MSG protocol. If server API version is not requested and loadBalanced: False, drivers MUST use legacy hello for the first message of the initial handshake with the OP_QUERY protocol (before switching to OP_MSG if the maxWireVersion indicates compatibility), and include helloOk:true in the handshake request.

ASIDE: If the legacy handshake response includes helloOk: true, then subsequent topology monitoring commands MUST use the hello command. If the legacy handshake response does not include helloOk: true, then subsequent topology monitoring commands MUST use the legacy hello command. See the Server Discovery and Monitoring spec <../server-discovery-and-monitoring/server-discovery-and-monitoring-summary.md>__ for further information.

The initial handshake MUST be performed on every socket to any and all servers upon establishing the connection to MongoDB, including reconnects of dropped connections and newly discovered members of a cluster. It MUST be the first command sent over the respective socket. If the command fails the client MUST disconnect. Timeouts MUST be applied to this command per the Client Side Operations Timeout <../client-side-operations-timeout/client-side-operations-timeout.md>__ specification.

hello and legacy hello commands issued after the initial connection handshake MUST NOT contain handshake arguments. Any subsequent hello or legacy hello calls, such as the ones for topology monitoring purposes, MUST NOT include this argument.

Example Implementation


Consider the following pseudo-code for establishing a new connection:

.. code:: python

 conn = Connection()
 conn.connect()  # Connect via TCP / TLS
 if stable_api_configured or client_options.load_balanced:
     cmd = {"hello": 1}
     conn.supports_op_msg = True  # Send the initial command via OP_MSG.
 else:
     cmd = {"legacy hello": 1, "helloOk": 1}
     conn.supports_op_msg = False  # Send the initial command via OP_QUERY.
 cmd["client"] = client_metadata
 if client_options.compressors:
     cmd["compression"] = client_options.compressors
 if client_options.load_balanced:
     cmd["loadBalanced"] = True
 creds = client_options.credentials
 if creds:
     # Negotiate auth mechanism and perform speculative auth. See Auth spec for details.
     if not creds.has_mechanism_configured():
         cmd["saslSupportedMechs"] = ...
     cmd["speculativeAuthenticate"] = ...

 reply = conn.send_command("admin", cmd)

 if reply["maxWireVersion"] >= 6:
     # Use OP_MSG for all future commands, including authentication.
     conn.supports_op_msg = True

 # Store the negotiated compressor, see OP_COMPRESSED spec.
 if reply.get("compression"):
     conn.compressor = reply["compression"][0]

 # Perform connection authentication. See Auth spec for details.
 negotiated_mechs = reply.get("saslSupportedMechs")
 speculative_auth = reply.get("speculativeAuthenticate")
 conn.authenticate(creds, negotiated_mechs, speculative_auth)

Hello Command
~~~~~~~~~~~~~

The initial handshake, as of MongoDB 3.4, supports a new argument, ``client``,
provided as a BSON object. This object has the following structure::

    {
        hello: 1,
        helloOk: true,
        client: {
            /* OPTIONAL. If present, the "name" is REQUIRED */
            application: {
                name: "<string>"
            },
            /* REQUIRED, including all sub fields */
            driver: {
                name: "<string>",
                version: "<string>"
            },
            /* REQUIRED */
            os: {
                type: "<string>",         /* REQUIRED */
                name: "<string>",         /* OPTIONAL */
                architecture: "<string>", /* OPTIONAL */
                version: "<string>"       /* OPTIONAL */
            },
            /* OPTIONAL */
            platform: "<string>",
            /* OPTIONAL */
            env: {
                name: "<string>",         /* OPTIONAL */
                timeout_sec: 42,          /* OPTIONAL */
                memory_mb: 1024,          /* OPTIONAL */
                region: "<string>",       /* OPTIONAL */
                /* OPTIONAL */
                container: {
                    runtime: "<string>",  /* OPTIONAL */
                    orchestrator: "<string>"  /* OPTIONAL */
                }
            }
        }
    }




client.application.name
~~~~~~~~~~~~~~~~~~~~~~~

This value is application configurable.

The application name is printed to the mongod logs upon establishing the connection. It is also recorded in the slow query logs and profile collections.

The recommended way for applications to provide this value is through the connection URI. The connection string key is appname.

Example connection string::

mongodb://server:27017/db?appname=mongodump

This option MAY also be provided on the MongoClient itself, if normal for the driver. It is only valid to set this attribute before any connection has been made to a server. Any attempt to set client.application.name MUST result in a failure when doing so will either change the existing value, or have any connection to MongoDB reporting inconsistent values.

Drivers MUST NOT provide a default value for this key.

client.driver.name
~~~~~~~~~~~~~~~~~~


This value is required and is not application configurable.

The internal driver name. For drivers written on top of other core drivers, the
underlying driver will typically expose a function to append an additional name to
this field.

Example::

        - "pymongo"
        - "mongoc / phongo"


client.driver.version
~~~~~~~~~~~~~~~~~~~~~

This value is required and is not application configurable.

The internal driver version. The version formatting is not defined. For drivers written on top of other core drivers, the underlying driver will typically expose a function to append an additional version to this field.

Example::

    - "1.1.2-beta0"
    - "1.4.1 / 1.2.0"

client.os.type
~~~~~~~~~~~~~~


This value is required and is not application configurable.

The Operating System primary identification type the client is running on.
Equivalent to ``uname -s`` on POSIX systems.  This field is REQUIRED and clients
must default to ``unknown`` when an appropriate value cannot be determined.

Example::

        - "Linux"
        - "Darwin"
        - "Windows"
        - "BSD"
        - "Unix"


client.os.name
~~~~~~~~~~~~~~

This value is optional, but RECOMMENDED; it is not application configurable.

Detailed name of the Operating System, such as the fully qualified distribution name. On systemd systems, this is typically the PRETTY_NAME of os-release(5) (/etc/os-release) or the DISTRIB_DESCRIPTION (/etc/lsb-release, lsb_release(1) --description) on LSB systems. The exact value and method used to determine this value are undefined.

Example::

    - "Ubuntu 16.04 LTS"
    - "macOS"
    - "CygWin"
    - "FreeBSD"
    - "AIX"

client.os.architecture
~~~~~~~~~~~~~~~~~~~~~~


This value is optional, but RECOMMENDED; it is not application configurable.
The machine hardware name. Equivalent to ``uname -m`` on POSIX systems.

Example::

        - "x86_64"
        - "ppc64le"


client.os.version
~~~~~~~~~~~~~~~~~

This value is optional and is not application configurable.

The Operating System version.

Example::

        - "10"
        - "8.1"
        - "16.04.1"


client.platform
~~~~~~~~~~~~~~~

This value is optional and is not application configurable.

Driver specific platform details.

Example::

        - clang 3.8.0 CFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium"
        - "Oracle JVM EE 9.1.1"

client.env
~~~~~~~~~~

This value is optional and is not application configurable.

Information about the execution environment, including Function-as-a-Service (FaaS)
identification and container runtime.

The contents of ``client.env`` MUST be adjusted to keep the handshake below the size limit;
see `Limitations`_ for specifics.

If no fields of ``client.env`` would be populated, ``client.env`` MUST be entirely omitted.

FaaS
^^^^

FaaS details are captured in the ``name``, ``timeout_sec``, ``memory_mb``, and ``region`` fields
of ``client.env``. The ``name`` field is determined by which of the following environment
variables are populated:

+----------------+----------------------------------------------------------+
| ``aws.lambda`` | ``AWS_EXECUTION_ENV`` [#]_ or ``AWS_LAMBDA_RUNTIME_API`` |
+----------------+----------------------------------------------------------+
| ``azure.func`` | ``FUNCTIONS_WORKER_RUNTIME``                             |
+----------------+----------------------------------------------------------+
| ``gcp.func``   | ``K_SERVICE`` or ``FUNCTION_NAME``                       |
+----------------+----------------------------------------------------------+
| ``vercel``     | ``VERCEL``                                               |
+----------------+----------------------------------------------------------+

.. [#] ``AWS_EXECUTION_ENV`` must start with the string ``"AWS_Lambda_"``.

If none of those variables are populated the other FaaS values MUST be entirely omitted.  When
variables for multiple ``client.env.name`` values are present, ``vercel`` takes precedence over
``aws.lambda``; any other combination MUST cause the other FaaS values to be entirely omitted.

Depending on which ``client.env.name`` has been selected, other FaaS fields in ``client.env``
SHOULD be populated:

+----------------+----------------------------+-------------------------------------+---------------+
| Name           | Field                      | Environment Variable                | Expected Type |
+================+============================+=====================================+===============+
| ``aws.lambda`` | ``client.env.region``      | ``AWS_REGION``                      | string        |
+----------------+----------------------------+-------------------------------------+---------------+
|                | ``client.env.memory_mb``   | ``AWS_LAMBDA_FUNCTION_MEMORY_SIZE`` | int32         |
+----------------+----------------------------+-------------------------------------+---------------+
| ``gcp.func``   | ``client.env.memory_mb``   | ``FUNCTION_MEMORY_MB``              | int32         |
+----------------+----------------------------+-------------------------------------+---------------+
|                | ``client.env.timeout_sec`` | ``FUNCTION_TIMEOUT_SEC``            | int32         |
+----------------+----------------------------+-------------------------------------+---------------+
|                | ``client.env.region``      | ``FUNCTION_REGION``                 | string        |
+----------------+----------------------------+-------------------------------------+---------------+
| ``vercel``     | ``client.env.region``      | ``VERCEL_REGION``                   | string        |
+----------------+----------------------------+-------------------------------------+---------------+

Missing variables or variables with values not matching the expected type MUST cause the
corresponding ``client.env`` field to be omitted and MUST NOT cause a user-visible error.

Container
^^^^^^^^^

Container runtime information is captured in ``client.env.container``.

``client.env.container.runtime`` MUST be set to ``"docker"`` if the file ``.dockerenv``
exists in the root directory.

``client.env.container.orchestrator`` MUST be set to ``"kubernetes"`` if the environment
variable ``KUBERNETES_SERVICE_HOST`` is populated.

If no fields of ``client.env.container`` would be populated, ``client.env.container`` MUST
be entirely omitted.
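
A non-normative sketch of the FaaS and container detection rules above (the function name is illustrative, and ``memory_mb``/``timeout_sec`` handling is elided; values that do not parse as int32 would simply be omitted):

.. code:: python

    import os

    FAAS_NAME_VARS = {
        "aws.lambda": ("AWS_EXECUTION_ENV", "AWS_LAMBDA_RUNTIME_API"),
        "azure.func": ("FUNCTIONS_WORKER_RUNTIME",),
        "gcp.func": ("K_SERVICE", "FUNCTION_NAME"),
        "vercel": ("VERCEL",),
    }
    REGION_VARS = {"aws.lambda": "AWS_REGION", "gcp.func": "FUNCTION_REGION", "vercel": "VERCEL_REGION"}

    def detect_client_env():
        names = set()
        for name, variables in FAAS_NAME_VARS.items():
            values = [os.environ.get(v) or "" for v in variables]
            if name == "aws.lambda":
                # AWS_EXECUTION_ENV only counts when it starts with "AWS_Lambda_".
                if values[0].startswith("AWS_Lambda_") or values[1]:
                    names.add(name)
            elif any(values):
                names.add(name)

        if names == {"aws.lambda", "vercel"}:
            names = {"vercel"}  # vercel takes precedence over aws.lambda

        env = {}
        if len(names) == 1:  # any other combination omits the FaaS fields entirely
            env["name"] = names.pop()
            region_var = REGION_VARS.get(env["name"])
            if region_var and os.environ.get(region_var):
                env["region"] = os.environ[region_var]

        container = {}
        if os.path.exists("/.dockerenv"):
            container["runtime"] = "docker"
        if os.environ.get("KUBERNETES_SERVICE_HOST"):
            container["orchestrator"] = "kubernetes"
        if container:
            env["container"] = container

        return env or None  # client.env is omitted entirely when nothing was detected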

--------------------------
Speculative Authentication
--------------------------

:since: 4.4

The initial handshake supports a new argument, ``speculativeAuthenticate``,
provided as a BSON document. Clients specifying this argument to ``hello`` or legacy
hello will speculatively include the first command of an authentication handshake.
This command may be provided to the server in parallel with any standard request for
supported authentication mechanisms (i.e. ``saslSupportedMechs``). This would permit
clients to merge the contents of their first authentication command with their
initial handshake request, and receive the first authentication reply along with
the initial handshake reply.

When the mechanism is ``MONGODB-X509``, ``speculativeAuthenticate`` has the same
structure as seen in the MONGODB-X509 conversation section in the
`Driver Authentication spec <../auth/auth.md#supported-authentication-methods>`_.

When the mechanism is ``SCRAM-SHA-1`` or ``SCRAM-SHA-256``, ``speculativeAuthenticate``
has the same fields as seen in the conversation subsection of the SCRAM-SHA-1 and
SCRAM-SHA-256 sections in the `Driver Authentication spec <../auth/auth.md#supported-authentication-methods>`_
with an additional ``db`` field to specify the name of the authentication database.

When the mechanism is ``MONGODB-OIDC``, ``speculativeAuthenticate`` has the same
structure as seen in the MONGODB-OIDC conversation section in the `Driver
Authentication spec
<../auth/auth.md#supported-authentication-methods>`_.

If the initial handshake command with a ``speculativeAuthenticate`` argument succeeds,
the client should proceed with the next step of the exchange. If the initial handshake
response does not include a ``speculativeAuthenticate`` reply and the ``ok`` field
in the initial handshake response is set to 1, drivers MUST authenticate using the standard
authentication handshake.

The ``speculativeAuthenticate`` reply has the same fields, except for the ``ok`` field,
as seen in the conversation sections for MONGODB-X509, SCRAM-SHA-1 and SCRAM-SHA-256
in the `Driver Authentication spec <../auth/auth.md#supported-authentication-methods>`_.

Drivers MUST NOT validate the contents of the ``saslSupportedMechs`` attribute
of the initial handshake reply. Drivers MUST NOT raise an error if the
``saslSupportedMechs`` attribute of the reply includes an unknown mechanism.

If an authentication mechanism is not provided either via connection string or code, but
a credential is provided, drivers MUST use the SCRAM-SHA-256 mechanism for speculative
authentication and drivers MUST send ``saslSupportedMechs``.

Older servers will ignore the ``speculativeAuthenticate`` argument. New servers will
participate in the standard authentication conversation if this argument is missing.
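
A minimal sketch, continuing the pseudo-code style of the Example Implementation above, of how a driver might consume the handshake reply; method names such as ``continue_authentication`` are illustrative and not part of any driver's API:

.. code:: python

    speculative_reply = reply.get("speculativeAuthenticate")
    if speculative_reply is not None:
        # The server already processed the first authentication command;
        # continue the SASL/X.509/OIDC conversation from its reply.
        conn.continue_authentication(creds, speculative_reply)
    elif reply.get("ok") == 1:
        # No speculative reply: fall back to the standard authentication handshake.
        conn.authenticate(creds, reply.get("saslSupportedMechs"), None)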


Supporting Wrapping Libraries
=============================

Drivers MUST allow libraries which wrap the driver to append to the client
metadata generated by the driver. The following class definition defines the
options which MUST be supported:

.. code:: typescript

    class DriverInfoOptions {
        /**
        * The name of the library wrapping the driver.
        */
        name: String;

        /**
        * The version of the library wrapping the driver.
        */
        version: Optional<String>;

        /**
        * Optional platform information for the wrapping driver.
        */
        platform: Optional<String>;
    }


Note that how these options are provided to a driver is left up to the implementer.

If provided, these options MUST NOT replace the values used for metadata generation.
The provided options MUST be appended to their respective fields, and be delimited by
a ``|`` character. For example, when `Motor <https://www.mongodb.com/docs/ecosystem/drivers/motor/>`_
wraps PyMongo, the following fields are updated to include Motor's "driver info":

.. code:: typescript

    {
        client: {
            driver: {
                name: "PyMongo|Motor",
                version: "3.6.0|2.0.0"
            }
        }
    }


**NOTE:** All strings provided as part of the driver info MUST NOT contain the delimiter used
for metadata concatenation. Drivers MUST throw an error if any of these strings contains that
character.
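A minimal sketch of the append logic described above, using a hypothetical ``appendDriverInfo`` helper and the recommended ``|`` delimiter:

.. code:: typescript

    const DELIMITER = "|";

    // Hypothetical helper: merges wrapping-library info into the metadata the
    // driver already generated for the handshake. Throws if any provided string
    // contains the delimiter, per the NOTE above.
    function appendDriverInfo(
        generated: { name: string; version: string; platform: string },
        info: { name: string; version?: string; platform?: string }
    ) {
        for (const value of [info.name, info.version, info.platform]) {
            if (value !== undefined && value.includes(DELIMITER)) {
                throw new Error(`driver info must not contain "${DELIMITER}"`);
            }
        }
        return {
            name: `${generated.name}${DELIMITER}${info.name}`,
            version: info.version
                ? `${generated.version}${DELIMITER}${info.version}`
                : generated.version,
            platform: info.platform
                ? `${generated.platform}${DELIMITER}${info.platform}`
                : generated.platform,
        };
    }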

----------
Deviations
----------

Some drivers have already implemented such functionality, and should not be required to make
breaking changes to comply with the requirements set forth here. A non-exhaustive list of
acceptable deviations is as follows:

* The name of ``DriverInfoOptions`` is non-normative; implementers may name this class whatever they like.
* The choice of delimiter is not fixed, ``|`` is the recommended value, but some drivers currently use ``/``.
* For cases where we own a particular stack of drivers (more than two), it may be preferable to accept a *list* of strings for each field.

Limitations
===========

The entire ``client`` metadata BSON document MUST NOT exceed 512 bytes. This includes
all BSON overhead.  The ``client.application.name`` cannot exceed 128 bytes.  MongoDB
will return an error if these limits are not adhered to, which will result in
handshake failure. Drivers MUST validate these values and truncate or omit driver
provided values if necessary.  Implementers SHOULD cumulatively update fields in
the following order until the document is under the size limit:

1. Omit fields from ``env`` except ``env.name``.
2. Omit fields from ``os`` except ``os.type``.
3. Omit the ``env`` document entirely.
4. Truncate ``platform``.

Additionally, implementers are encouraged to place high priority information about the
platform earlier in the string, in order to avoid possible truncation of those details.
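A minimal sketch of this truncation order, where ``bsonSize`` stands in for the driver's own BSON serializer (an assumption of this example):

.. code:: typescript

    const MAX_CLIENT_METADATA_BYTES = 512;

    function truncateClientMetadata(
        client: any,
        bsonSize: (doc: any) => number
    ): any {
        const fits = () => bsonSize(client) <= MAX_CLIENT_METADATA_BYTES;
        if (!fits() && client.env) client.env = { name: client.env.name }; // 1. keep only env.name
        if (!fits() && client.os) client.os = { type: client.os.type };    // 2. keep only os.type
        if (!fits()) delete client.env;                                    // 3. drop env entirely
        while (!fits() && client.platform?.length) {
            client.platform = client.platform.slice(0, -1);                // 4. truncate platform
        }
        return client;
    }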

Test Plan
=========

Drivers that capture values for ``client.env`` should test that a connection and hello
command succeed in the presence of the following sets of environment variables:

1. Valid AWS

+-------------------------------------+----------------------+
| ``AWS_EXECUTION_ENV``               | ``AWS_Lambda_java8`` |
+-------------------------------------+----------------------+
| ``AWS_REGION``                      | ``us-east-2``        |
+-------------------------------------+----------------------+
| ``AWS_LAMBDA_FUNCTION_MEMORY_SIZE`` | ``1024``             |
+-------------------------------------+----------------------+

2. Valid Azure

+------------------------------+----------+
| ``FUNCTIONS_WORKER_RUNTIME`` | ``node`` |
+------------------------------+----------+

3. Valid GCP

+--------------------------+-----------------+
| ``K_SERVICE``            | ``servicename`` |
+--------------------------+-----------------+
| ``FUNCTION_MEMORY_MB``   | ``1024``        |
+--------------------------+-----------------+
| ``FUNCTION_TIMEOUT_SEC`` | ``60``          |
+--------------------------+-----------------+
| ``FUNCTION_REGION``      | ``us-central1`` |
+--------------------------+-----------------+

4. Valid Vercel

+-------------------+------------------+
| ``VERCEL``        | ``1``            |
+-------------------+------------------+
| ``VERCEL_REGION`` | ``cdg1``         |
+-------------------+------------------+

5. Invalid - multiple providers

+------------------------------+----------------------+
| ``AWS_EXECUTION_ENV``        | ``AWS_Lambda_java8`` |
+------------------------------+----------------------+
| ``FUNCTIONS_WORKER_RUNTIME`` | ``node``             |
+------------------------------+----------------------+

6. Invalid - long string

+-----------------------+--------------------------+
| ``AWS_EXECUTION_ENV`` | ``AWS_Lambda_java8``     |
+-----------------------+--------------------------+
| ``AWS_REGION``        | ``a`` repeated 512 times |
+-----------------------+--------------------------+

7. Invalid - wrong types

+-------------------------------------+----------------------+
| ``AWS_EXECUTION_ENV``               | ``AWS_Lambda_java8`` |
+-------------------------------------+----------------------+
| ``AWS_LAMBDA_FUNCTION_MEMORY_SIZE`` | ``big``              |
+-------------------------------------+----------------------+

8. Invalid - ``AWS_EXECUTION_ENV`` does not start with ``"AWS_Lambda_"``

+-----------------------+---------+
| ``AWS_EXECUTION_ENV`` | ``EC2`` |
+-----------------------+---------+

Motivation For Change
=====================

Being able to annotate individual connections with custom data will allow users
and sysadmins to easily correlate events happening on their MongoDB deployment
to a specific application. For support engineers, it furthermore helps identify
potential problems in drivers or client platforms, and paves the way for
providing proactive support via Cloud Manager and/or Atlas to advise customers
about out of date driver versions.


Design Rationale
================

Drivers run on a multitude of platforms, languages, environments and systems.
There is no defined list of data points that may or may not be valuable to
every system. Rather than specifying such a list it was decided we would report
the basics; something that everyone can discover and consider valuable. The
obvious requirement here being the driver itself and its version. Any
additional information is generally very system specific. Scala may care to
know the Java runtime, while Python would like to know if it was built with C
extensions - and C would like to know the compiler.

Having to define dozens of arguments that may or may not be useful to one or
two drivers isn’t a good idea. Instead, we define a ``platform`` argument that is
driver dependent. This value will not have a defined format across drivers and is
therefore not generically queryable -- however, it will gain a defined schema for
each particular driver, and will therefore over time gain a defined structure
from which values can be formatted and extracted.

Backwards Compatibility
=======================

The legacy hello command currently ignores arguments (i.e. if arguments are provided, the
legacy hello command discards them without erroring out). Adding client metadata
functionality therefore has no backwards compatibility concerns.

This also allows a driver to determine if the ``hello`` command is supported. On
server versions that support the ``hello`` command, the legacy hello command with
``helloOk: true`` will respond with ``helloOk: true``. On server versions that do
not support the ``hello`` command, the ``helloOk: true`` argument is ignored and
the legacy hello response will not contain ``helloOk: true``.

Reference Implementation
========================

`C Driver <https://github.com/mongodb/mongo-c-driver/blob/master/src/libmongoc/src/mongoc/mongoc-handshake.c>`_.

Q&A
===

* The 128 bytes ``application.name`` limit, does that include BSON overhead?
   * No, just the string itself
* The 512 bytes limit, does that include BSON overhead?
   * Yes
* The 512 bytes limit, does it apply to the full initial handshake document or just the ``client`` subdocument?
   * Just the subdocument
* Should I really try to fill the 512 bytes with data?
   * Not really. The server does not attempt to normalize or compress this data in any way, so it will hold it in memory as-is per connection. 512 bytes for 20,000 connections is ~10 MB of memory the server will need.
* What happens if I pass new arguments in the legacy hello command to previous MongoDB versions?
   * Nothing. Arguments passed to the legacy hello command to prior versions of MongoDB are not treated in any special way and have no effect one way or another.
* Are there wire version bumps or anything accompanying this specification?
   * No
* Is establishing the handshake required for connecting to MongoDB 3.4?
   * No, it only augments the connection. MongoDB will not reject connections without it
* Does this affect SDAM implementations?
   * Possibly. There are a couple of gotchas. If the application.name is not in the URI...
      * The SDAM monitoring cannot be launched until the user has had the ability
        to set the application name because the application name has to be sent in the
        initial handshake. This means that the connection pool cannot be established until
        the first user-initiated command, or else some connections will have the
        application name while others won’t
      * The initial handshake must be called on all sockets, including administrative background
        sockets to MongoDB
* My language doesn't have ``uname``, but does instead provide its own variation of these values, is that OK?
   * Absolutely. As long as the value is identifiable it is fine. The exact method and values are undefined by this specification

Changelog
=========

:2019-11-13: Added section about supporting wrapping libraries
:2020-02-12: Added section about speculative authentication
:2021-04-27: Updated to define ``hello`` and legacy hello
:2022-01-13: Updated to disallow ``hello`` using ``OP_QUERY``
:2022-01-19: Require that timeouts be applied per the client-side operations timeout spec.
:2022-02-24: Rename Versioned API to Stable API
:2022-10-05: Remove spec front matter and reformat changelog.
:2023-03-13: Add ``env`` to ``client`` document
:2023-04-03: Simplify truncation for metadata
:2023-05-04: ``AWS_EXECUTION_ENV`` must start with ``"AWS_Lambda_"``
:2023-08-24: Added container awareness
:2024-04-22: Clarify that driver should not validate ``saslSupportedMechs`` content.

Wire Compression in Drivers

  • Status: Accepted

  • Minimum Server Version: 3.4

Abstract

This specification describes how to implement Wire Protocol Compression between MongoDB drivers and mongo[d|s].

Compression is achieved through a new bi-directional wire protocol opcode, referred to as OP_COMPRESSED.

Server side compressor support is checked during the initial MongoDB Handshake, and is compatible with all historical versions of MongoDB. If a client detects a compatible compressor it will use the compressor for all valid requests.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Terms

Compressor - A compression algorithm. There are currently three supported algorithms: snappy, zlib, and zstd.

Specification

MongoDB Handshake Amendment

The MongoDB Handshake Protocol describes an argument passed to the handshake command, client. This specification adds an additional argument, compression, that SHOULD be provided to the handshake command if a driver supports wire compression.

The value of this argument is an array of supported compressors.

Example:

    {
        hello: 1,
        client: {}, /* See MongoDB Handshake */
        compression: ["snappy", "zlib", "zstd"]
    }

When no compression is enabled on the client, drivers SHOULD send an empty compression argument.

Example:

    {
        hello: 1,
        client: {}, /* See MongoDB Handshake */
        compression: []
    }

Clients that want to compress their messages need to send a list of the algorithms - in the order they are specified in the client configuration - that they are willing to support to the server during the initial handshake call. For example, a client wishing to compress with the snappy algorithm should send:

    { hello: 1, ... , compression: [ "snappy" ] }

The server will respond with the intersection of its list of supported compressors and the client's. For example, if the server had both snappy and zlib enabled and the client requested snappy, it would respond with:

    { ... , compression: [ "snappy" ], ok: 1 }

If the client included both snappy and zlib, the server would respond with something like:

    { ... , compression: [ "snappy", "zlib" ], ok: 1 }

If the server has no compression algorithms in common with the client, it sends back a handshake response without a compression field. Clients MAY issue a log level event to inform the user, but MUST NOT error.

When the MongoDB server receives a compressor it can support, it MAY reply to any and all requests using the selected compressor, including the reply of the initial MongoDB Handshake. As each OP_COMPRESSED message contains the compressor ID, clients MUST NOT assume which compressor each message uses, but MUST decompress the message using the compressor identified in the OP_COMPRESSED opcode header.

When compressing, clients MUST use the first compressor in the client's configured compressors list that is also in the server's list.
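A minimal sketch of that selection rule, assuming hypothetical clientCompressors (in configured priority order) and serverCompressors (from the handshake reply) arrays:

    // TypeScript sketch: pick the first client-configured compressor that the
    // server also announced. Returns undefined when there is no overlap, in
    // which case messages are sent uncompressed.
    function selectCompressor(clientCompressors: string[],
                              serverCompressors: string[]): string | undefined {
        return clientCompressors.find((c) => serverCompressors.includes(c));
    }

    selectCompressor(["zstd", "snappy"], ["snappy", "zlib"]); // => "snappy"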

Connection String Options

Two new connection string options:

compressors

Comma separated list of compressors the client should present to the server. Unknown compressors MUST yield a warning, as per the Connection String specification, and MUST NOT be included in the handshake. By default, no compressor is configured and thus compression will not be used. When multiple compressors are provided, the list should be treated as a priority ordered list of compressors to use, with highest priority given to the first compressor in the list, and lowest priority to the last compressor in the list.

Example:

    mongodb://localhost/?compressors=zstd,snappy,zlib

zlibCompressionLevel

Integer value from -1 to 9. This configuration option only applies to the zlib compressor. When zlib is not one of the compressors enumerated by the compressors configuration option then providing this option has no meaning, but clients SHOULD NOT issue a warning.

Level   Description
-1      Default Compression (usually 6)
0       No compression
1       Best Speed
9       Best Compression

Note that this value only applies to the client side compression level, not the response.
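For example, a connection string that enables only zlib and requests its best-compression setting (hostname is illustrative) could look like:

    mongodb://localhost/?compressors=zlib&zlibCompressionLevel=9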

OP_COMPRESSED

The new opcode, called OP_COMPRESSED, has the following structure:

    struct OP_COMPRESSED {
        struct MsgHeader {
            int32  messageLength;
            int32  requestID;
            int32  responseTo;
            int32  opCode = 2012;
        };
        int32_t  originalOpcode;
        int32_t  uncompressedSize;
        uint8_t  compressorId;
        char    *compressedMessage;
    };
Field               Description
originalOpcode      Contains the value of the wrapped opcode.
uncompressedSize    The size of the deflated compressedMessage, which excludes the MsgHeader
compressorId        The ID of the compressor that compressed the message
compressedMessage   The opcode itself, excluding the MsgHeader
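As a rough, non-normative illustration of this layout, the following sketch wraps an already-serialized wire message in an OP_COMPRESSED frame using zlib (compressor ID 2); it assumes Node.js Buffers and the built-in zlib module:

    import * as zlib from "zlib";

    const OP_COMPRESSED = 2012;
    const ZLIB_COMPRESSOR_ID = 2;

    // `message` is a complete wire message: a 16-byte MsgHeader followed by the body.
    function wrapWithOpCompressed(message: Buffer, requestID: number): Buffer {
        const originalOpcode = message.readInt32LE(12); // opCode of the wrapped message
        const body = message.subarray(16);              // everything after the MsgHeader
        const compressed = zlib.deflateSync(body);

        const header = Buffer.alloc(16 + 4 + 4 + 1);
        header.writeInt32LE(header.length + compressed.length, 0); // messageLength
        header.writeInt32LE(requestID, 4);                         // requestID
        header.writeInt32LE(0, 8);                                 // responseTo
        header.writeInt32LE(OP_COMPRESSED, 12);                    // opCode = 2012
        header.writeInt32LE(originalOpcode, 16);                   // originalOpcode
        header.writeInt32LE(body.length, 20);                      // uncompressedSize
        header.writeUInt8(ZLIB_COMPRESSOR_ID, 24);                 // compressorId
        return Buffer.concat([header, compressed]);
    }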

Compressor IDs

Each compressor is assigned a predefined compressor ID.

compressorId   Handshake Value   Description
0              noop              The content of the message is uncompressed. This is realistically only used for testing
1              snappy            The content of the message is compressed using snappy.
2              zlib              The content of the message is compressed using zlib.
3              zstd              The content of the message is compressed using zstd.
4-255          reserved          Reserved for future use.

Compressible messages

Any opcode can be compressed and wrapped in an OP_COMPRESSED header. OP_COMPRESSED is strictly a wire protocol wrapper without regard to what opcode it wraps, be it OP_QUERY, OP_REPLY, OP_MSG or any other future or past opcode. The compressedMessage contains the original opcode, excluding the standard MsgHeader. The originalOpcode value therefore effectively replaces the standard MsgHeader of the compressed opcode.

There is no guarantee that a response will be compressed even though compression was negotiated for in the handshake. Clients MUST be able to parse both compressed and uncompressed responses to both compressed and uncompressed requests.

MongoDB 3.4 will always reply with a compressed response when compression has been negotiated, but future versions may not.

A client MAY choose to implement compression for only OP_QUERY, OP_REPLY, and OP_MSG, and perhaps for future opcodes, but not to implement it for OP_INSERT, OP_UPDATE, OP_DELETE, OP_GETMORE, and OP_KILLCURSORS.

Note that certain messages, such as authentication commands, MUST NOT be compressed. All other messages MUST be compressed, when compression has been negotiated and the driver has implemented compression for the opcode in use.

Messages not allowed to be compressed

In an effort to mitigate current and previous attacks, certain messages MUST NOT be compressed, such as authentication requests.

Messages using the following commands MUST NOT be compressed:

  • hello
  • legacy hello (see MongoDB Handshake Protocol for details)
  • saslStart
  • saslContinue
  • getnonce
  • authenticate
  • createUser
  • updateUser
  • copydbSaslStart
  • copydbgetnonce
  • copydb

Test Plan

There are no automated tests accompanying this specification; instead, the following is a description of test scenarios clients should implement.

In general, after implementing this functionality and the test cases, running the traditional client test suite against a server with compression enabled, and ensuring the test suite is configured to provide a valid compressor as part of the connection string, is a good idea. MongoDB-supported drivers MUST add such a variant to their CI environment.

The following cases assume a standalone MongoDB 3.4 (or later) node configured with:

    mongod --networkMessageCompressors "snappy" -vvv

Create an example application which connects to a provided connection string, runs ping: 1, and then quits the program normally.

Connection strings, and results

      {"t":{"$date":"2021-04-08T13:28:38.885-06:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"127.0.0.1:50635","uuid":"03961627-aec7-4543-8a17-9690f87273a6","connectionId":2,"connectionCount":1}}
      {"t":{"$date":"2021-04-08T13:28:38.886-06:00"},"s":"D3", "c":"EXECUTOR", "id":22983,   "ctx":"listener","msg":"Starting new executor thread in passthrough mode"}
      {"t":{"$date":"2021-04-08T13:28:38.887-06:00"},"s":"D3", "c":"-",        "id":5127801, "ctx":"thread27","msg":"Setting the Client","attr":{"client":"conn2"}}
      {"t":{"$date":"2021-04-08T13:28:38.887-06:00"},"s":"D2", "c":"COMMAND",  "id":21965,   "ctx":"conn2","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"hello":1,"client":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"4.9.0-alpha7-555-g623aa8f"},"os":{"type":"Darwin","name":"Mac OS X","architecture":"x86_64","version":"19.6.0"}},"compression":["snappy"],"apiVersion":"1","apiStrict":true,"$db":"admin"}}}
      {"t":{"$date":"2021-04-08T13:28:38.888-06:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn2","msg":"client metadata","attr":{"remote":"127.0.0.1:50635","client":"conn2","doc":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"4.9.0-alpha7-555-g623aa8f"},"os":{"type":"Darwin","name":"Mac OS X","architecture":"x86_64","version":"19.6.0"}}}}
      {"t":{"$date":"2021-04-08T13:28:38.889-06:00"},"s":"D3", "c":"NETWORK",  "id":22934,   "ctx":"conn2","msg":"Starting server-side compression negotiation"}
      {"t":{"$date":"2021-04-08T13:28:38.889-06:00"},"s":"D3", "c":"NETWORK",  "id":22937,   "ctx":"conn2","msg":"supported compressor","attr":{"compressor":"snappy"}}
      {"t":{"$date":"2021-04-08T13:28:38.889-06:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn2","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"hello":1,"client":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"4.9.0-alpha7-555-g623aa8f"},"os":{"type":"Darwin","name":"Mac OS X","architecture":"x86_64","version":"19.6.0"}},"compression":["snappy"],"apiVersion":"1","apiStrict":true,"$db":"admin"},"numYields":0,"reslen":351,"locks":{},"remote":"127.0.0.1:50635","protocol":"op_query","durationMillis":1}}
      {"t":{"$date":"2021-04-08T13:28:38.890-06:00"},"s":"D2", "c":"QUERY",    "id":22783,   "ctx":"conn2","msg":"Received interrupt request for unknown op","attr":{"opId":596,"knownOps":[]}}
      {"t":{"$date":"2021-04-08T13:28:38.890-06:00"},"s":"D3", "c":"-",        "id":5127803, "ctx":"conn2","msg":"Released the Client","attr":{"client":"conn2"}}
      {"t":{"$date":"2021-04-08T13:28:38.890-06:00"},"s":"D3", "c":"-",        "id":5127801, "ctx":"conn2","msg":"Setting the Client","attr":{"client":"conn2"}}
      {"t":{"$date":"2021-04-08T13:28:38.891-06:00"},"s":"D3", "c":"NETWORK",  "id":22927,   "ctx":"conn2","msg":"Decompressing message","attr":{"compressor":"snappy"}}
      {"t":{"$date":"2021-04-08T13:28:38.891-06:00"},"s":"D2", "c":"COMMAND",  "id":21965,   "ctx":"conn2","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"whatsmyuri":1,"apiStrict":false,"$db":"admin","apiVersion":"1"}}}
      {"t":{"$date":"2021-04-08T13:28:38.892-06:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn2","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"whatsmyuri":1,"apiStrict":false,"$db":"admin","apiVersion":"1"},"numYields":0,"reslen":63,"locks":{},"remote":"127.0.0.1:50635","protocol":"op_msg","durationMillis":0}}
      {"t":{"$date":"2021-04-08T13:28:38.892-06:00"},"s":"D2", "c":"QUERY",    "id":22783,   "ctx":"conn2","msg":"Received interrupt request for unknown op","attr":{"opId":597,"knownOps":[]}}
      {"t":{"$date":"2021-04-08T13:28:38.892-06:00"},"s":"D3", "c":"NETWORK",  "id":22925,   "ctx":"conn2","msg":"Compressing message","attr":{"compressor":"snappy"}}
      {"t":{"$date":"2021-04-08T13:28:38.893-06:00"},"s":"D3", "c":"-",        "id":5127803, "ctx":"conn2","msg":"Released the Client","attr":{"client":"conn2"}}
      {"t":{"$date":"2021-04-08T13:28:38.893-06:00"},"s":"D3", "c":"-",        "id":5127801, "ctx":"conn2","msg":"Setting the Client","attr":{"client":"conn2"}}
      {"t":{"$date":"2021-04-08T13:28:38.895-06:00"},"s":"D3", "c":"NETWORK",  "id":22927,   "ctx":"conn2","msg":"Decompressing message","attr":{"compressor":"snappy"}}
      {"t":{"$date":"2021-04-08T13:28:38.895-06:00"},"s":"D2", "c":"COMMAND",  "id":21965,   "ctx":"conn2","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"buildinfo":1.0,"apiStrict":false,"$db":"admin","apiVersion":"1"}}}
      {"t":{"$date":"2021-04-08T13:28:38.896-06:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn2","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"buildinfo":1.0,"apiStrict":false,"$db":"admin","apiVersion":"1"},"numYields":0,"reslen":2606,"locks":{},"remote":"127.0.0.1:50635","protocol":"op_msg","durationMillis":0}}
      {"t":{"$date":"2021-04-08T13:28:38.896-06:00"},"s":"D2", "c":"QUERY",    "id":22783,   "ctx":"conn2","msg":"Received interrupt request for unknown op","attr":{"opId":598,"knownOps":[]}}
      {"t":{"$date":"2021-04-08T13:28:38.897-06:00"},"s":"D3", "c":"NETWORK",  "id":22925,   "ctx":"conn2","msg":"Compressing message","attr":{"compressor":"snappy"}}
      {"t":{"$date":"2021-04-08T13:28:38.897-06:00"},"s":"D3", "c":"-",        "id":5127803, "ctx":"conn2","msg":"Released the Client","attr":{"client":"conn2"}}
      {"t":{"$date":"2021-04-08T13:28:38.897-06:00"},"s":"D3", "c":"-",        "id":5127801, "ctx":"conn2","msg":"Setting the Client","attr":{"client":"conn2"}}
      {"t":{"$date":"2021-04-08T13:28:38.898-06:00"},"s":"D3", "c":"NETWORK",  "id":22927,   "ctx":"conn2","msg":"Decompressing message","attr":{"compressor":"snappy"}}
      {"t":{"$date":"2021-04-08T13:28:38.899-06:00"},"s":"D2", "c":"COMMAND",  "id":21965,   "ctx":"conn2","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"endSessions":[{"id":{"$uuid":"c4866af5-ed6b-4f01-808b-51a3f8aaaa08"}}],"$db":"admin","apiVersion":"1","apiStrict":true}}}
      {"t":{"$date":"2021-04-08T13:28:38.899-06:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn2","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"endSessions":[{"id":{"$uuid":"c4866af5-ed6b-4f01-808b-51a3f8aaaa08"}}],"$db":"admin","apiVersion":"1","apiStrict":true},"numYields":0,"reslen":38,"locks":{},"remote":"127.0.0.1:50635","protocol":"op_msg","durationMillis":0}}
      {"t":{"$date":"2021-04-08T13:28:38.900-06:00"},"s":"D2", "c":"QUERY",    "id":22783,   "ctx":"conn2","msg":"Received interrupt request for unknown op","attr":{"opId":599,"knownOps":[]}}
      {"t":{"$date":"2021-04-08T13:28:38.900-06:00"},"s":"D3", "c":"NETWORK",  "id":22925,   "ctx":"conn2","msg":"Compressing message","attr":{"compressor":"snappy"}}
      {"t":{"$date":"2021-04-08T13:28:38.900-06:00"},"s":"D3", "c":"-",        "id":5127803, "ctx":"conn2","msg":"Released the Client","attr":{"client":"conn2"}}
      {"t":{"$date":"2021-04-08T13:28:38.901-06:00"},"s":"D3", "c":"-",        "id":5127801, "ctx":"conn2","msg":"Setting the Client","attr":{"client":"conn2"}}
      {"t":{"$date":"2021-04-08T13:28:38.901-06:00"},"s":"D2", "c":"NETWORK",  "id":22986,   "ctx":"conn2","msg":"Session from remote encountered a network error during SourceMessage","attr":{"remote":"127.0.0.1:50635","error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"}}}
      {"t":{"$date":"2021-04-08T13:28:38.902-06:00"},"s":"D1", "c":"-",        "id":23074,   "ctx":"conn2","msg":"User assertion","attr":{"error":"HostUnreachable: Connection closed by peer","file":"src/mongo/transport/service_state_machine.cpp","line":410}}
      {"t":{"$date":"2021-04-08T13:28:38.902-06:00"},"s":"W",  "c":"EXECUTOR", "id":4910400, "ctx":"conn2","msg":"Terminating session due to error","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"}}}
      {"t":{"$date":"2021-04-08T13:28:38.902-06:00"},"s":"I",  "c":"NETWORK",  "id":5127900, "ctx":"conn2","msg":"Ending session","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection closed by peer"}}}
      {"t":{"$date":"2021-04-08T13:28:38.903-06:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn2","msg":"Connection ended","attr":{"remote":"127.0.0.1:50635","uuid":"03961627-aec7-4543-8a17-9690f87273a6","connectionId":2,"connectionCount":0}}
      {"t":{"$date":"2021-04-08T13:28:38.903-06:00"},"s":"D3", "c":"-",        "id":5127803, "ctx":"conn2","msg":"Released the Client","attr":{"client":"conn2"}}

The result of the program should have been:

      { "ok" : 1 }
      {"t":{"$date":"2021-04-20T09:57:26.823-06:00"},"s":"D2", "c":"COMMAND",  "id":21965,   "ctx":"conn5","msg":"About to run the command","attr":{"db":"admin","commandArgs":{"hello":1,"client":{"driver":{"name":"mongo-csharp-driver","version":"2.12.2.0"},"os":{"type":"macOS","name":"Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64","architecture":"x86_64","version":"19.6.0"},"platform":".NET 5.0.3"},"compression":[],"$readPreference":{"mode":"secondaryPreferred"},"$db":"admin"}}}
      {"t":{"$date":"2021-04-20T09:57:26.823-06:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn5","msg":"client metadata","attr":{"remote":"127.0.0.1:54372","client":"conn5","doc":{"driver":{"name":"mongo-csharp-driver","version":"2.12.2.0"},"os":{"type":"macOS","name":"Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64","architecture":"x86_64","version":"19.6.0"},"platform":".NET 5.0.3"}}}
      {"t":{"$date":"2021-04-20T09:57:26.824-06:00"},"s":"D3", "c":"NETWORK",  "id":22934,   "ctx":"conn5","msg":"Starting server-side compression negotiation"}
      {"t":{"$date":"2021-04-20T09:57:26.824-06:00"},"s":"D3", "c":"NETWORK",  "id":22936,   "ctx":"conn5","msg":"No compressors provided"}
      {"t":{"$date":"2021-04-20T09:57:26.825-06:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn5","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"hello":1,"client":{"driver":{"name":"mongo-csharp-driver","version":"2.12.2.0"},"os":{"type":"macOS","name":"Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64","architecture":"x86_64","version":"19.6.0"},"platform":".NET 5.0.3"},"compression":[],"$readPreference":{"mode":"secondaryPreferred"},"$db":"admin"},"numYields":0,"reslen":319,"locks":{},"remote":"127.0.0.1:54372","protocol":"op_query","durationMillis":1}}

e.g., empty compression: [] array. No operations should have been compressed.

The results of the program should have been:

      WARNING: Unsupported compressor: 'snoopy'
      { "ok" : 1 }
      {"t":{"$date":"2021-04-08T13:28:38.898-06:00"},"s":"D3", "c":"NETWORK",  "id":22927,   "ctx":"conn2","msg":"Decompressing message","attr":{"compressor":"snappy"}}

The results of the program should have been:

      { "ok" : 1 }
      {"t":{"$date":"2021-04-08T13:28:38.898-06:00"},"s":"D3", "c":"NETWORK",  "id":22927,   "ctx":"conn2","msg":"Decompressing message","attr":{"compressor":"zlib"}}

The results of the program should have been:

      { "ok" : 1 }
  • Create example program that authenticates to the server using SCRAM-SHA-1, then creates another user (MONGODB-CR), then runs hello followed by serverStatus.

  • Reconnect to the same server using the created MONGODB-CR credentials. Observe that the only command that was decompressed on the server was serverStatus, while the server replied with OP_COMPRESSED for at least the serverStatus command.

Motivation For Change

Drivers provide the canonical interface to MongoDB. Most tools for MongoDB are written with the aid of MongoDB drivers. There exist a lot of tools for MongoDB that import massive datasets which could stand to gain a lot from compression. Even day-to-day applications stand to gain from reduced bandwidth utilization at low cpu costs, especially when doing large reads off the network.

Not all use cases fit compression, but we will allow users to decide if wire compression is right for them.

Design rationale

Snappy has minimal cost and provides a reasonable compression ratio, but it is not expected to be available for all languages MongoDB Drivers support. Supporting zlib is therefore important to the ecosystem, but for languages that do support snappy we expect it to be the default choice. While snappy has no knobs to tune, zlib does have support for specifying the compression level (tuned from speed to compression). As we don’t anticipate adding support for compression libraries with complex knobs to tune, this specification has opted not to define a complex configuration structure and only defines the currently relevant zlibCompressionLevel. When other compression libraries are supported, adding support for configuring that library (if any is needed) should be handled on a case by case basis.

More recently, the MongoDB server added Zstandard (zstd) support for another modern alternative to zlib.

Backwards Compatibility

The new argument provided to the MongoDB Handshake has no backwards compatibility implications, as servers that do not expect it will simply ignore it. Such a server will therefore never reply with a list of acceptable compressors, which in turn means a client CANNOT use the OP_COMPRESSED opcode.

Reference Implementation

Future Work

Possible future improvements include defining an API to determine the compressor and configuration per operation, rather than requiring users to create two different client pools (one with compression and one without) when they only expect to compress, or skip compressing, a small number of operations.

Q & A

  • Statistics?

  • How to try this/enable it?

    • mongod --networkMessageCompressors "snappy"
  • The server MAY reply with compressed data even if the request was not compressed?

    • Yes, and this is in fact the behaviour of MongoDB 3.4
  • Can drivers compress the initial MongoDB Handshake/hello request?

    • No.
  • Can the server reply to the MongoDB Handshake/hello compressed?

    • Yes, yes it can. Be aware it is completely acceptable for the server to use compression for any and all replies, using any supported compressor, when the client announced support for compression - this includes the reply to the actual MongoDB Handshake/hello where the support was announced.
  • This is billed as a MongoDB 3.6 feature -- but I hear it works with MongoDB 3.4?

    • Yes, it does. All MongoDB versions support the compression argument to the initial handshake and all MongoDB versions will reply with an intersection of compressors it supports. This works even with MongoDB 3.0, as it will not reply with any compressors. It also works with MongoDB 3.4 which will reply with snappy if it was part of the driver's list. MongoDB 3.6 will likely include zlib support.
  • Which compressors are currently supported?

    • MongoDB 3.4 supports snappy
    • MongoDB 3.6 supports snappy and zlib
    • MongoDB 4.2 supports snappy, zlib, and zstd
  • My language supports xyz compressor, should I announce them all in the handshake?

    • No. But you are allowed to if you really want to make sure you can use that compressor with MongoDB 42 and your current driver versions.
  • My language does not support xyz compressor. What do I do?

    • That is OK. You don’t have to support xyz.
  • No MongoDB supported compressors are available for my language

    • That is OK. You don’t have to support compressors you can’t support. All it means is you can’t compress the request, and since you never declared support for any compressor, you won’t be served with compressed responses either.
  • Why did the server not support zlib in MongoDB 3.4?

    • Snappy was selected for its very low performance hit, while giving reasonable compression, resulting in quite significant bandwidth reduction. Zlib characteristics are slightly different out-of-the-box and did not make sense for the initial goal of reducing bandwidth between replica set nodes.
  • If snappy is preferable to zlib, why add support for zlib in MongoDB 3.6?

    • Zlib is available on every platform known to man. Snappy is not. Having zlib support makes sense for client traffic, which could originate on any type of platform, which may or may not support snappy.

Changelog

  • 2024-02-16: Migrated from reStructuredText to Markdown.
  • 2022-10-05: Remove spec front matter and reformat changelog.
  • 2021-04-06: Use 'hello' command
  • 2019-05-13: Add zstd as supported compression algorithm
  • 2017-06-13: Don't require clients to implement legacy opcodes
  • 2017-05-10: Initial commit.

==============
SOCKS5 Support
==============

:Status: Accepted
:Minimum Server Version: N/A

.. contents::


Abstract

SOCKS5 is a standardized protocol for connecting to network services through a separate proxy server. It can be used for connecting to hosts that would otherwise be unreachable from the local network by connecting to a proxy server, which receives the intended target host’s address from the client and then connects to that address.

This specification defines driver behaviour when connecting to MongoDB services through a SOCKS5 proxy.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in `RFC 2119 <https://www.ietf.org/rfc/rfc2119.txt>`__.

Specification

Terms

SOCKS5
^^^^^^

The SOCKS protocol, version 5, as defined in `RFC1928 <https://datatracker.ietf.org/doc/html/rfc1928>`_, restricted to either no authentication or username/password authentication as defined in `RFC1929 <https://datatracker.ietf.org/doc/html/rfc1929>`_.

MongoClient Configuration

proxyHost
^^^^^^^^^

To specify to the driver to connect using a SOCKS5 proxy, a connection string option of :code:`proxyHost=host` MUST be added to the connection string or passed through an equivalent :code:`MongoClient` option. This option specifies a domain name or IPv4 or IPv6 address on which a SOCKS5 proxy is listening. This option MUST only be configurable at the level of a :code:`MongoClient`.

proxyPort
^^^^^^^^^

To specify to the driver to connect using a SOCKS5 proxy listening on a non-default port, a connection string option of :code:`proxyPort=port` MUST be added to the connection string or passed through an equivalent :code:`MongoClient` option. This option specifies a TCP port number. The default for this option MUST be :code:`1080`. This option MUST only be configurable at the level of a :code:`MongoClient`. Drivers MUST error if this option was specified and :code:`proxyHost` was not specified.

proxyUsername
^^^^^^^^^^^^^

To specify to the driver to connect using a SOCKS5 proxy requiring username/password authentication, a connection string option of :code:`proxyUsername=username` MUST be added to the connection string or passed through an equivalent :code:`MongoClient` option. This option specifies a string of non-zero length. Drivers MUST ignore this option if it specifies a zero-length string. Drivers MUST error if this option was specified and :code:`proxyHost` was not specified or :code:`proxyPassword` was not specified.

proxyPassword
^^^^^^^^^^^^^

To specify to the driver to connect using a SOCKS5 proxy requiring username/password authentication, a connection string option of :code:`proxyPassword=password` MUST be added to the connection string or passed through an equivalent :code:`MongoClient` option. This option specifies a string of non-zero length. Drivers MUST ignore this option if it specifies a zero-length string. Drivers MUST error if this option was specified and :code:`proxyHost` was not specified or :code:`proxyUsername` was not specified.
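The cross-option rules above could be validated with a small helper along these lines (a non-normative sketch; option names mirror the URI options):

.. code:: typescript

    interface ProxyOptions {
        proxyHost?: string;
        proxyPort?: number;     // defaults to 1080 when proxyHost is present
        proxyUsername?: string;
        proxyPassword?: string;
    }

    function validateProxyOptions(opts: ProxyOptions): void {
        // Zero-length credential strings are ignored entirely.
        if (opts.proxyUsername === "") delete opts.proxyUsername;
        if (opts.proxyPassword === "") delete opts.proxyPassword;

        if (opts.proxyHost === undefined) {
            if (opts.proxyPort !== undefined ||
                opts.proxyUsername !== undefined ||
                opts.proxyPassword !== undefined) {
                throw new Error("SOCKS5 proxy options require proxyHost");
            }
            return;
        }
        // Username and password must be specified together.
        if ((opts.proxyUsername === undefined) !== (opts.proxyPassword === undefined)) {
            throw new Error("proxyUsername and proxyPassword must both be specified");
        }
        opts.proxyPort ??= 1080;
    }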

Connection Pooling

Connection Establishment
^^^^^^^^^^^^^^^^^^^^^^^^

When establishing a new outgoing TCP connection, drivers MUST perform the following steps if :code:`proxyHost` was specified:

#. Connect to the SOCKS5 proxy host, using :code:`proxyHost` and :code:`proxyPort` as specified.

#. Perform a SOCKS5 handshake as specified in RFC1928.

If :code:`proxyUsername` and :code:`proxyPassword` were passed,
drivers MUST indicate in the handshake that both "no authentication"
and "username/password authentication" are supported. Otherwise,
drivers MUST indicate support for "no authentication" only.

Drivers MUST NOT attempt to perform DNS A or AAAA record resolution
of the destination hostname and instead pass the hostname to the
proxy as-is.

#. Continue with connection establishment as if the connection was one to the destination host.

Drivers MUST use the SOCKS5 proxy for connections to MongoDB services and `client-side field-level encryption KMS servers <../client-side-encryption/client-side-encryption.md#kms-provider>`__.

Drivers MUST NOT use the SOCKS5 proxy for connections to :code:`mongocryptd` processes spawned for automatic client-side field-level encryption.

Drivers MUST treat a connection failure when connecting to the SOCKS5 proxy or a SOCKS5 handshake or authentication failure the same as a network error (e.g. ECONNREFUSED).

Events

SOCKS5 proxies are fully transparent to connection monitoring events. In particular, in :code:`CommandStartedEvent`, :code:`CommandSucceededEvent`, and :code:`CommandFailedEvent`, the driver SHOULD NOT reference the SOCKS5 proxy as part of the :code:`connectionId` field or other fields.

Q&A

Why not include DNS requests in the set of proxied network communication?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While SOCKS5 as a protocol does support UDP forwarding, using this feature has a number of downsides. Notably, only a subset of SOCKS5 client libraries and SOCKS5 server implementations support UDP forwarding (e.g. the OpenSSH client’s dynamic forwarding feature does not). This would also considerably increase implementation complexity in drivers that do not use DNS libraries in which the driver is in control of how the UDP packets are sent and received.

Why not support other proxy protocols, such as Socks4/Socks4a, HTTP Connect proxies, etc.?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

SOCKS5 is a powerful, standardized and widely used proxy protocol. It is likely that almost all users which require tunneling/proxying of some sort will be able to use it, and those who require another protocol or a more advanced setup like proxy chaining, can work around that by using a local SOCKS5 intermediate proxy.

Why are the connection string parameters generic, with no explicit mention of SOCKS5?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the case that future changes will enable drivers using other proxy protocols, keeping the option names generic allows their re-use. In that case, another option would specify the protocol and SOCKS5 would be the implied default. However, since there is no reason to believe that such additions will be made in the foreseeable future, no option for specifying the proxy protocol is introduced here.

Why is support for authentication methods limited to no authentication and username/password authentication?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This matches the set of authentication methods most commonly implemented by SOCKS5 client libraries and thus reduces implementation complexity for drivers. This advantage is sufficient to ignore the possible advantages that would come with enabling other authentication methods.

Design Rationales

Alternative Designs

Changelog

:2022-10-05: Remove spec front matter

Initial DNS Seedlist Discovery

  • Status: Accepted
  • Minimum Server Version: N/A

Abstract

Presently, seeding a driver with an initial list of ReplicaSet or MongoS addresses is somewhat cumbersome, requiring a comma-delimited list of host names to attempt connections to. A standardized answer to this problem exists in the form of SRV records, which allow administrators to configure a single domain to return a list of host names. Supporting this feature would assist our users by decreasing maintenance load, primarily by removing the need to maintain seed lists at an application level.

This specification builds on the Connection String specification. It adds a new protocol scheme and modifies how the Host Information is interpreted.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

Connection String Format

The connection string parser in the driver is extended with a new protocol mongodb+srv as a logical pre-processing step before it considers the connection string and SDAM specifications. In this protocol, the comma separated list of host names is replaced with a single host name. The format is:

mongodb+srv://{hostname}.{domainname}/{options}

{options} refers to the optional elements from the Connection String specification following the Host Information. This includes the Auth database and Connection Options.

MongoClient Configuration

srvMaxHosts

This option is used to limit the number of mongos connections that may be created for sharded topologies. This option limits the number of SRV records used to populate the seedlist during initial discovery, as well as the number of additional hosts that may be added during SRV polling. This option requires a non-negative integer and defaults to zero (i.e. no limit). This option MUST only be configurable at the level of a MongoClient.

srvServiceName

This option specifies a valid SRV service name according to RFC 6335, with the exception that it may exceed 15 characters as long as the 63rd (62nd with prepended underscore) character DNS query limit is not surpassed. This option requires a string value and defaults to "mongodb". This option MUST only be configurable at the level of a MongoClient.

URI Validation

The driver MUST report an error if either the srvServiceName or srvMaxHosts URI options are specified with a non-SRV URI (i.e. scheme other than mongodb+srv). The driver MUST allow specifying the srvServiceName and srvMaxHosts URI options with an SRV URI (i.e. mongodb+srv scheme).

If srvMaxHosts is a positive integer, the driver MUST throw an error in the following cases:

  • The connection string contains a replicaSet option.
  • The connection string contains a loadBalanced option with a value of true.

When validating URI options, the driver MUST first do the SRV and TXT lookup and then perform the validation. For drivers that do SRV lookup asynchronously this may result in a MongoClient being instantiated but erroring later during operation execution.

Seedlist Discovery

Validation Before Querying DNS

It is an error to specify a port in a connection string with the mongodb+srv protocol, and the driver MUST raise a parse error and MUST NOT do DNS resolution or contact hosts.

It is an error to specify more than one host name in a connection string with the mongodb+srv protocol, and the driver MUST raise a parse error and MUST NOT do DNS resolution or contact hosts.

A driver MUST verify that in addition to the {hostname}, the {domainname} consists of at least two parts: the domain name, and a TLD. Drivers MUST raise an error and MUST NOT contact the DNS server to obtain SRV (or TXT records) if the full URI does not consist of at least three parts.

If mongodb+srv is used, a driver MUST implicitly also enable TLS. Clients can turn this off by passing tls=false in either the Connection String, or options passed in as parameters in code to the MongoClient constructor (or equivalent API for each driver), but not through a TXT record (discussed in a later section).
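A minimal sketch of the pre-DNS validation above, assuming the host portion of the URI has already been parsed into a list:

    // TypeScript sketch of the checks performed before any DNS lookup.
    function validateSrvHost(hosts: string[]): string {
        if (hosts.length !== 1) {
            throw new Error("mongodb+srv URIs must specify exactly one host name");
        }
        const host = hosts[0];
        if (/:\d+$/.test(host)) {
            throw new Error("mongodb+srv URIs must not specify a port");
        }
        if (host.split(".").length < 3) {
            throw new Error("mongodb+srv host must consist of at least three parts");
        }
        return host;
    }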

Querying DNS

In this preprocessing step, the driver will query the DNS server for SRV records on {hostname}.{domainname}, prefixed with the SRV service name and protocol. The SRV service name is provided in the srvServiceName URI option and defaults to mongodb. The protocol is always tcp. After prefixing, the URI should look like: _{srvServiceName}._tcp.{hostname}.{domainname}. This DNS query is expected to respond with one or more SRV records.

The priority and weight fields in returned SRV records MUST be ignored.

If the DNS result returns no SRV records, or no records at all, or a DNS error happens, an error MUST be raised indicating that the URI could not be used to find hostnames. The error SHALL include the reason why they could not be found.

A driver MUST verify that the host names returned through SRV records have the same parent {domainname}. Drivers MUST raise an error and MUST NOT initiate a connection to any returned host name which does not share the same {domainname}.

The driver MUST NOT attempt to connect to any hosts until the DNS query has returned its results.

If srvMaxHosts is zero or greater than or equal to the number of hosts in the DNS result, the driver MUST populate the seedlist with all hosts.

If srvMaxHosts is greater than zero and less than the number of hosts in the DNS result, the driver MUST randomly select that many hosts and use them to populate the seedlist. Drivers SHOULD use the Fisher-Yates shuffle for randomization.
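A rough sketch of the SRV lookup and seedlist population, using Node's dns/promises resolver and a Fisher-Yates shuffle when srvMaxHosts limits the result (names and error messages are illustrative):

    import { resolveSrv } from "dns/promises";

    async function discoverSeedlist(host: string, srvServiceName = "mongodb",
                                    srvMaxHosts = 0): Promise<string[]> {
        const records = await resolveSrv(`_${srvServiceName}._tcp.${host}`);
        if (records.length === 0) {
            throw new Error(`no SRV records found for ${host}`);
        }
        const parentDomain = host.split(".").slice(1).join(".");
        const seeds = records.map((r) => {
            // Every returned target must share the parent domain of the seed host.
            if (!r.name.endsWith(`.${parentDomain}`)) {
                throw new Error(`SRV target ${r.name} is not in ${parentDomain}`);
            }
            return `${r.name}:${r.port}`; // priority and weight are ignored
        });
        if (srvMaxHosts === 0 || srvMaxHosts >= seeds.length) return seeds;
        // Fisher-Yates shuffle, then keep the first srvMaxHosts entries.
        for (let i = seeds.length - 1; i > 0; i--) {
            const j = Math.floor(Math.random() * (i + 1));
            [seeds[i], seeds[j]] = [seeds[j], seeds[i]];
        }
        return seeds.slice(0, srvMaxHosts);
    }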

Default Connection String Options

As a second preprocessing step, a Client MUST also query the DNS server for TXT records on {hostname}.{domainname}. If available, a TXT record provides default connection string options. The maximum length of a TXT record string is 255 characters, but there can be multiple strings per TXT record. A Client MUST support multiple TXT record strings and concatenate them as if they were one single string in the order they are defined in each TXT record. The order of multiple character strings in each TXT record is guaranteed. A Client MUST NOT allow multiple TXT records for the same host name and MUST raise an error when multiple TXT records are encountered.

Information returned within a TXT record is a simple URI string, just like the {options} in a connection string.

A Client MUST only support the authSource, replicaSet, and loadBalanced options through a TXT record, and MUST raise an error if any other option is encountered. Although using mongodb+srv:// implicitly enables TLS, a Client MUST NOT allow the ssl option to be set through a TXT record option.

TXT records MAY be queried either before, in parallel, or after SRV records. Clients MUST query both the SRV and the TXT records before attempting any connection to MongoDB.

A Client MUST use options specified in the Connection String, and options passed in as parameters in code to the MongoClient constructor (or equivalent API for each driver), to override options provided through TXT records.

If any connection string option in a TXT record is incorrectly formatted, a Client MUST throw a parse exception.

This specification does not change the behaviour of handling unknown keys or incorrect values as is set out in the Connection String spec. Unknown keys or incorrect values in default options specified through TXT records MUST be handled in the same way as unknown keys or incorrect values directly specified through a Connection String. For example, if a driver that does not support the authSource option finds authSource=db in a TXT record, it MUST handle the unknown option according to the rules in the Connection String spec.
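A rough sketch of the TXT lookup and precedence rules, using Node's dns/promises resolver; uriOptions here stands for options already parsed from the connection string or passed in code:

    import { resolveTxt } from "dns/promises";

    const ALLOWED_TXT_OPTIONS = new Set(["authSource", "replicaSet", "loadBalanced"]);

    async function txtDefaults(host: string): Promise<Map<string, string>> {
        const records = await resolveTxt(host); // each record is an array of strings
        if (records.length > 1) {
            throw new Error(`multiple TXT records found for ${host}`);
        }
        const defaults = new Map<string, string>();
        if (records.length === 0) return defaults;
        const optionString = records[0].join(""); // concatenate strings in order
        for (const pair of optionString.split("&")) {
            const [key, value] = pair.split("=");
            if (!ALLOWED_TXT_OPTIONS.has(key) || !value) {
                throw new Error(`invalid TXT record option: ${pair}`);
            }
            defaults.set(key, value);
        }
        return defaults;
    }

    // Options from the connection string or code override TXT defaults:
    // const merged = new Map([...(await txtDefaults(host)), ...uriOptions]);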

CNAME not supported

The use of DNS CNAME records is not supported. Clients MUST NOT check for a CNAME record on {hostname}.{domainname}. A system's DNS resolver could transparently handle CNAME, but because of how clients validate records returned from SRV queries, use of CNAME could break validation. Seedlist discovery therefore does not recommend or support the use of CNAME records in concert with SRV or TXT records.

Example

If we provide the following URI:

mongodb+srv://server.mongodb.com/

The driver needs to request the DNS server for the SRV record _mongodb._tcp.server.mongodb.com. This could return:

Record                            TTL   Class    Priority Weight Port  Target
_mongodb._tcp.server.mongodb.com. 86400 IN SRV   0        5      27317 mongodb1.mongodb.com.
_mongodb._tcp.server.mongodb.com. 86400 IN SRV   0        5      27017 mongodb2.mongodb.com.

The returned host names (mongodb1.mongodb.com and mongodb2.mongodb.com) must share the same parent domain name (mongodb.com) as the provided host name (server.mongodb.com).

The driver also needs to request the DNS server for the TXT records on server.mongodb.com. This could return:

Record              TTL   Class    Text
server.mongodb.com. 86400 IN TXT   "replicaSet=replProduction&authSource=authDB"

From the DNS results, the driver now MUST treat the host information as if the following URI was used instead:

mongodb://mongodb1.mongodb.com:27317,mongodb2.mongodb.com:27017/?ssl=true&replicaSet=replProduction&authSource=authDB

If we provide the following URI with the same DNS (SRV and TXT) records:

mongodb+srv://server.mongodb.com/?authSource=otherDB

Then the default in the TXT record for authSource is not used as the value in the connection string overrides it. The Client MUST treat the host information as if the following URI was used instead:

mongodb://mongodb1.mongodb.com:27317,mongodb2.mongodb.com:27017/?ssl=true&replicaSet=replProduction&authSource=otherDB

Test Plan

See README.md in the accompanying test directory.

Additionally, see the mongodb+srv test invalid-uris.yml in the Connection String Spec tests.

Motivation

Several of our users have asked for this through tickets:

Design Rationale

The design specifically calls for a pre-processing stage of the processing of connection URLs to minimize the impact on existing functionality.

Justifications

Why Are Multiple Key-Value Pairs Allowed in One TXT Record?

One could imagine an alternative design in which each TXT record would allow only one URI option. No & character would be allowed as a delimiter within TXT records.

In this spec we allow multiple key-value pairs within one TXT record, delimited by &, because it will be common for all options to fit in a single 255-character TXT record, and it is much more convenient to configure one record in this case than to configure several.

Secondly, in some cases the order in which options occur is important. For example, readPreferenceTags can appear both multiple times, and the order in which they appear is significant. Because DNS servers may return TXT records in any order, it is only possible to guarantee the order in which readPreferenceTags keys appear by having them in the same TXT record.

Why Is There No Mention of UTF-8 Characters?

Although DNS TXT records allow any octet to exist in its value, many DNS providers do not allow non-ASCII characters to be configured. As it is unlikely that any option names or values in the connection string have non-ASCII characters, we left the behaviour of supporting UTF-8 characters as unspecified.

Reference Implementation

None yet.

Backwards Compatibility

There are no backwards compatibility concerns.

Future Work

In the future we could consider using the priority and weight fields of the SRV records.

ChangeLog

  • 2024-03-06: Migrated from reStructuredText to Markdown.

  • 2022-10-05: Revise spec front matter and reformat changelog.

  • 2021-10-14: Add srvMaxHosts MongoClient option and restructure Seedlist
    Discovery section. Improve documentation for the srvServiceName MongoClient option and add a new URI Validation section.

  • 2021-09-15: Clarify that service name only defaults to mongodb, and should
    be defined by the srvServiceName URI option.

  • 2021-04-15: Adding in behaviour for load balancer mode.

  • 2019-03-07: Clarify that CNAME is not supported

  • 2018-02-08: Clarify that {options} in the Specification section includes
    all the optional elements from the Connection String specification.

  • 2017-11-21: Add clause that using mongodb+srv:// implies enabling TLS. Add
    restriction that only authSource and replicaSet are allowed in TXT records. Add restriction that only one TXT record is supported.

  • 2017-11-17: Add new rule that indicates that host names in returned SRV records
    MUST share the same parent domain name as the given host name. Remove language and tests for non-ASCII characters.

  • 2017-11-07: Clarified that all parts of listable options such as
    readPreferenceTags are ignored if they are also present in options to the MongoClient constructor. Clarified which host names to use for SRV and TXT DNS queries.

  • 2017-11-01: Clarified that individual TXT records can have multiple strings.

  • 2017-10-31: Added a clause that specifying two host names with a
    mongodb+srv:// URI is not allowed. Added a few more test cases.

  • 2017-10-18: Removed prohibition of raising DNS related errors when parsing the URI.

  • 2017-10-04: Removed from Future Work the line about multiple MongoS
    discovery. The current specification already allows for it, as multiple host names which are all MongoS servers is already allowed under SDAM. And this specification does not modify SDAM. Added support for connection string options through TXT records.

  • 2017-09-19: Clarify that host names in mongodb+srv:// URLs work like normal
    host specifications.

  • 2017-09-01: Updated test plan with YAML tests, and moved prose tests for URI
    parsing into invalid-uris.yml in the Connection String Spec tests.

.. note:: This specification has been converted to Markdown and renamed to `server-discovery-and-monitoring.md <server-discovery-and-monitoring.md>`_.

Connection Monitoring and Pooling

  • Status: Accepted
  • Minimum Server Version: N/A

Abstract

Drivers currently support a variety of options that allow users to configure connection pooling behavior. Users are confused by drivers supporting different subsets of these options. Additionally, drivers implement their connection pools differently, making it difficult to design cross-driver pool functionality. By unifying and codifying pooling options and behavior across all drivers, we will increase user comprehension and code base maintainability.

This specification does not apply to drivers that do not support multitasking.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Definitions

Connection

A Connection (when linked) refers to the Connection type defined in the Connection Pool Members section of this specification. It does not refer to an actual TCP connection to an Endpoint. A Connection will attempt to create and wrap such a TCP connection over the course of its existence, but it is not equivalent to one nor does it wrap an active one at all times.

For the purposes of testing, a mocked Connection type could be used with the pool that never actually creates a TCP connection or performs any I/O.

Endpoint

For convenience, an Endpoint refers to either a mongod or mongos instance.

Thread

For convenience, a Thread refers to:

  • A shared-address-space process (a.k.a. a thread) in multi-threaded drivers
  • An Execution Frame / Continuation in asynchronous drivers
  • A goroutine in Go

Behavioral Description

Which Drivers this applies to

This specification is solely concerned with drivers that implement a connection pool. A driver SHOULD implement a connection pool, but is not required to.

Connection Pool Options

All drivers that implement a connection pool MUST implement and conform to the same MongoClient options. There can be slight deviation in naming to make the options idiomatic to the driver language.

Connection Pool Behaviors

All driver connection pools MUST provide an API that allows the driver to check out a connection, check in a connection back to the pool, and clear all connections in the pool. This API is for internal use only, and SHOULD NOT be documented as a public API.

Connection Pool Monitoring

All drivers that implement a connection pool MUST provide an API that allows users to subscribe to events emitted from the pool. Conceptually, event emission is instantaneous: one may talk about the instant an event is emitted, and that instant marks the start of the activity of delivering the event to a subscribed user.

Detailed Design

Connection Pool Options

Drivers that implement a Connection Pool MUST support the following ConnectionPoolOptions:

interface ConnectionPoolOptions {
  /**
   *  The maximum number of Connections that may be associated
   *  with a pool at a given time. This includes in use and
   *  available connections.
   *  If specified, MUST be an integer >= 0.
   *  A value of 0 means there is no limit.
   *  Defaults to 100.
   */
  maxPoolSize?: number;

  /**
   *  The minimum number of Connections that MUST exist at any moment
   *  in a single connection pool.
   *  If specified, MUST be an integer >= 0. If maxPoolSize is > 0
   *  then minPoolSize must be <= maxPoolSize
   *  Defaults to 0.
   */
  minPoolSize?: number;

  /**
   *  The maximum amount of time a Connection should remain idle
   *  in the connection pool before being marked as Idle (perished).
   *  If specified, MUST be a number >= 0.
   *  A value of 0 means there is no limit.
   *  Defaults to 0.
   */
  maxIdleTimeMS?: number;

  /**
   *  The maximum number of Connections a Pool may be establishing concurrently.
   *  Establishment of a Connection is a part of its life cycle
   *  starting after a ConnectionCreatedEvent and ending before a ConnectionReadyEvent.
   *  If specified, MUST be a number > 0.
   *  Defaults to 2.
   */
  maxConnecting?: number;
}

Additionally, Drivers that implement a Connection Pool MUST support the following ConnectionPoolOptions UNLESS that driver meets ALL of the following conditions:

  • The driver/language currently has an idiomatic timeout mechanism implemented
  • The timeout mechanism conforms to the aggressive requirement of timing out a thread in the WaitQueue

interface ConnectionPoolOptions {
  /**
   *  NOTE: This option has been deprecated in favor of timeoutMS.
   *
   *  The maximum amount of time a thread can wait for
   *  either an available non-perished connection (limited by `maxPoolSize`),
   *  or a pending connection (limited by `maxConnecting`).
   *  If specified, MUST be a number >= 0.
   *  A value of 0 means there is no limit.
   *  Defaults to 0.
   */
  waitQueueTimeoutMS?: number;
}

These options MUST be specified at the MongoClient level, and SHOULD be named in a manner idiomatic to the driver's language. All connection pools created by a MongoClient MUST use the same ConnectionPoolOptions.

When parsing a mongodb connection string, a user MUST be able to specify these options using the default names specified above.
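
For illustration, a minimal sketch of supplying these options, assuming a Node.js-style driver whose MongoClient accepts either a connection string or an options object (the constructor shape and host name are assumptions, not part of this specification):

// Sketch only: both forms configure every pool created by this MongoClient.
import { MongoClient } from "mongodb";

// Via the connection string, using the default option names:
const viaUri = new MongoClient(
  "mongodb://db.example.com:27017/?maxPoolSize=50&minPoolSize=5&maxIdleTimeMS=60000&maxConnecting=2"
);

// Via MongoClient options, named idiomatically for the driver's language:
const viaOptions = new MongoClient("mongodb://db.example.com:27017", {
  maxPoolSize: 50,      // cap on total Connections per pool
  minPoolSize: 5,       // keep at least 5 established Connections per pool
  maxIdleTimeMS: 60000, // close Connections that sit available for more than 60s
  maxConnecting: 2,     // at most 2 Connections being established concurrently
});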

Deprecated Options

The following ConnectionPoolOptions are considered deprecated. They MUST NOT be implemented if they do not already exist in a driver, and they SHOULD be deprecated and removed from drivers that implement them as early as possible:

interface ConnectionPoolOptions {
  /**
   *  The maximum number of threads that can simultaneously wait
   *  for a Connection to become available.
   */
  waitQueueSize?: number;

  /**
   *  An alternative way of setting waitQueueSize, it specifies
   *  the maximum number of threads that can wait per connection.
   *  waitQueueSize === waitQueueMultiple * maxPoolSize
   */
  waitQueueMultiple?: number;
}

Connection Pool Members

Connection

A driver-defined wrapper around a single TCP connection to an Endpoint. A Connection has the following properties:

  • Single Endpoint: A Connection MUST be associated with a single Endpoint. A Connection MUST NOT be associated with multiple Endpoints.
  • Single Lifetime: A Connection MUST NOT be used after it is closed.
  • Single Owner: A Connection MUST belong to exactly one Pool, and MUST NOT be shared across multiple pools.
  • Single Track: A Connection MUST limit itself to one request / response at a time. A Connection MUST NOT multiplex/pipeline requests to an Endpoint.
  • Monotonically Increasing ID: A Connection MUST have an ID number associated with it. Connection IDs within a Pool MUST be assigned in order of creation, starting at 1 and increasing by 1 for each new Connection.
  • Valid Connection: A connection MUST NOT be checked out of the pool until it has successfully and fully completed a MongoDB Handshake and Authentication as specified in the Handshake, OP_COMPRESSED, and Authentication specifications.
  • Perishable: it is possible for a Connection to become Perished. A Connection is considered perished if any of the following are true:
    • Stale: The Connection's generation does not match the generation of the parent pool
    • Idle: The Connection is currently "available" (as defined below) and has been for longer than maxIdleTimeMS.
    • Errored: The Connection has experienced an error that indicates it is no longer recommended for use. Examples include, but are not limited to:
      • Network Error
      • Network Timeout
      • Endpoint closing the connection
      • Driver-Side Timeout
      • Wire-Protocol Error
interface Connection {
  /**
   *  An id number associated with the Connection
   */
  id: number;

  /**
   *  The address of the pool that owns this Connection
   */
  address: string;

  /**
   *  An integer representing the "generation" of the pool
   *  when this Connection was created.
   */
  generation: number;

  /**
   * The current state of the Connection.
   *
   * Possible values are the following:
   *   - "pending":       The Connection has been created but has not yet been established. Contributes to
   *                      totalConnectionCount and pendingConnectionCount.
   *
   *   - "available":     The Connection has been established and is waiting in the pool to be checked
   *                      out. Contributes to both totalConnectionCount and availableConnectionCount.
   *
   *   - "in use":        The Connection has been established, checked out from the pool, and has yet
   *                      to be checked back in. Contributes to totalConnectionCount.
   *
   *   - "closed":        The Connection has had its socket closed and cannot be used for any future
   *                      operations. Does not contribute to any connection counts.
   *
   * Note: this field is mainly used for the purposes of describing state
   * in this specification. It is not required that drivers
   * actually include this field in their implementations of Connection.
   */
  state: "pending" | "available" | "in use" | "closed";
}
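
As a non-normative illustration of the Perishable property above, a pool might decide whether an available Connection is perished with a helper along the following lines. The lastAvailableAt and errored fields are hypothetical bookkeeping for this sketch; the specification only requires that staleness, idleness, and errors be detectable.

// Sketch only: not part of the Connection interface defined above.
function isPerished(
  pool: { generation: number; maxIdleTimeMS: number },
  conn: { generation: number; state: string; lastAvailableAt: number; errored: boolean }
): boolean {
  const stale = conn.generation !== pool.generation;
  const idle =
    pool.maxIdleTimeMS > 0 &&
    conn.state === "available" &&
    Date.now() - conn.lastAvailableAt > pool.maxIdleTimeMS;
  return stale || idle || conn.errored;
}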

WaitQueue

A concept that represents pending requests for Connections. When a thread requests a Connection from a Pool, the thread enters the Pool's WaitQueue. A thread stays in the WaitQueue until it either receives a Connection or times out. A WaitQueue has the following traits:

  • Thread-Safe: When multiple threads attempt to enter or exit a WaitQueue, they do so in a thread-safe manner.
  • Ordered/Fair: When Connections are made available, they are issued out to threads in the order that the threads entered the WaitQueue.
  • Timeout aggressively: Members of a WaitQueue MUST timeout if they are enqueued for longer than the computed timeout and MUST leave the WaitQueue immediately in this case.

The implementation details of a WaitQueue are left to the driver. Example implementations include:

  • A fair Semaphore
  • A Queue of callbacks

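A non-normative sketch of the "queue of callbacks" approach, assuming a promise-based driver; fairness comes from FIFO ordering, and a timed-out waiter leaves the queue immediately. Names such as enter and wakeNext are illustrative:

// Sketch only.
class SketchWaitQueue {
  private waiters: Array<(err?: Error) => void> = [];

  // Enter the queue; resolves when woken, rejects on timeout.
  enter(timeoutMS: number): Promise<void> {
    return new Promise<void>((resolve, reject) => {
      const waiter = (err?: Error) => (err ? reject(err) : resolve());
      this.waiters.push(waiter);
      if (timeoutMS > 0) {
        setTimeout(() => {
          const i = this.waiters.indexOf(waiter);
          if (i !== -1) {
            this.waiters.splice(i, 1); // leave the WaitQueue immediately on timeout
            reject(new Error("WaitQueueTimeoutError"));
          }
        }, timeoutMS);
      }
    });
  }

  // Wake the waiter at the front of the queue, e.g. when a Connection is checked in
  // or the pool is cleared (in which case an error is passed).
  wakeNext(err?: Error): void {
    const waiter = this.waiters.shift();
    if (waiter) waiter(err);
  }
}
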
Connection Pool

A driver-defined entity that encapsulates all non-monitoring Connections associated with a single Endpoint. The pool has the following properties:

  • Thread Safe: All Pool behaviors MUST be thread safe.
  • Not Fork-Safe: A Pool is explicitly not fork-safe. If a Pool detects that it is being used by a forked process, it MUST immediately clear itself and update its pid.
  • Single Owner: A Pool MUST be associated with exactly one Endpoint, and MUST NOT be shared between Endpoints.
  • Emit Events and Log Messages: A Pool MUST emit pool events and log messages when dictated by this spec (see Connection Pool Monitoring). Users MUST be able to subscribe to emitted events and log messages in a manner idiomatic to their language and driver.
  • Closeable: A Pool MUST be able to be manually closed. When a Pool is closed, the following behaviors change:
    • Checking in a Connection to the Pool automatically closes the Connection
    • Attempting to check out a Connection from the Pool results in an Error
  • Clearable: A Pool MUST be able to be cleared. Clearing the pool marks all pooled and checked out Connections as stale and lazily closes them as they are checkedIn or encountered in checkOut. Additionally, all requests are evicted from the WaitQueue and return errors that are considered non-timeout network errors.
  • Pausable: A Pool MUST be able to be paused and resumed. A Pool is paused automatically when it is cleared, and it can be resumed by being marked as "ready". While the Pool is paused, it exhibits the following behaviors:
    • Attempting to check out a Connection from the Pool results in a non-timeout network error
    • Connections are not created in the background to satisfy minPoolSize
  • Capped: a pool is capped if maxPoolSize is set to a non-zero value. If a pool is capped, then its total number of Connections (including available and in use) MUST NOT exceed maxPoolSize
  • Rate-limited: A Pool MUST limit the number of Connections being established concurrently via the maxConnecting pool option.
interface ConnectionPool {
  /**
   *  The Queue of threads waiting for a Connection to be available
   */
  waitQueue: WaitQueue;

  /**
   *  A generation number representing the SDAM generation of the pool.
   */
  generation: number;

  /**
   * A map representing the various generation numbers for various services
   * when in load balancer mode.
   */
  serviceGenerations: Map<ObjectId, [number, number]>;

  /**
   * The state of the pool.
   *
   * Possible values are the following:
   *   - "paused":        The initial state of the pool. Connections may not be checked out nor can they
   *                      be established in the background to satisfy minPoolSize. Clearing a pool
   *                      transitions it to this state.
   *
   *   - "ready":         The healthy state of the pool. It can service checkOut requests and create
   *                      connections in the background. The pool can be set to this state via the
   *                      ready() method.
   *
   *   - "closed":        The pool is destroyed. No more Connections may ever be checked out nor any
   *                      created in the background. The pool can be set to this state via the close()
   *                      method. The pool cannot transition to any other state after being closed.
   */
  state: "paused" | "ready" | "closed";

  // Any of the following connection counts may be computed rather than
  // actually stored on the pool.

  /**
   *  An integer expressing how many total Connections
   *  ("pending" + "available" + "in use") the pool currently has
   */
  totalConnectionCount: number;

  /**
   *  An integer expressing how many Connections are currently
   *  available in the pool.
   */
  availableConnectionCount: number;

  /**
   *  An integer expressing how many Connections are currently
   *  being established.
   */
  pendingConnectionCount: number;

  /**
   *  Returns a Connection for use
   */
  checkOut(): Connection;

  /**
   *  Check in a Connection back to the Connection pool
   */
  checkIn(connection: Connection): void;

  /**
   *  Mark all current Connections as stale, clear the WaitQueue, and mark the pool as "paused".
   *  No connections may be checked out or created in this pool until ready() is called again.
   *  interruptInUseConnections specifies whether the pool will force interrupt "in use" connections as part of the clear. 
   *  Default false.
   */
  clear(interruptInUseConnections: Optional<Boolean>): void;

  /**
   *  Mark the pool as "ready", allowing checkOuts to resume and connections to be created in the background.
   *  A pool can only transition from "paused" to "ready". A "closed" pool
   *  cannot be marked as "ready" via this method.
   */
  ready(): void;

  /**
   *  Marks the pool as "closed", preventing the pool from creating and returning new Connections
   */
  close(): void;
}

Connection Pool Behaviors

Creating a Connection Pool

This specification does not define how a pool is to be created, leaving it up to the driver. Creation of a connection pool is generally an implementation detail of the driver, i.e., is not a part of the public API of the driver. The SDAM specification defines when the driver should create connection pools.

When a pool is created, its state MUST initially be set to "paused". Even if minPoolSize is set, the pool MUST NOT begin being populated with Connections until it has been marked as "ready". SDAM will mark the pool as "ready" on each successful check. See Connection Pool Management section in the SDAM specification for more information.

set generation to 0
set state to "paused"
emit PoolCreatedEvent and equivalent log message

Closing a Connection Pool

When a pool is closed, it MUST first close all available Connections in that pool. This results in the following behavior changes:

  • In use Connections MUST be closed when they are checked in to the closed pool.
  • Attempting to check out a Connection MUST result in an error.
mark pool as "closed"
for connection in availableConnections:
  close connection
emit PoolClosedEvent and equivalent log message

Marking a Connection Pool as Ready

Connection Pools start off as "paused", and they are marked as "ready" by monitors after they perform successful server checks. Once a pool is "ready", it can start checking out Connections and creating Connections in the background to satisfy minPoolSize.

If the pool is already "ready" when this method is invoked, then this method MUST immediately return and MUST NOT emit a PoolReadyEvent.

mark pool as "ready"
emit PoolReadyEvent and equivalent log message
allow background thread to create connections

Note that the PoolReadyEvent MUST be emitted before the background thread is allowed to resume creating new connections, and it must be the case that no observer is able to observe actions of the background thread related to creating new connections before observing the PoolReadyEvent event.

Creating a Connection (Internal Implementation)

When creating a Connection, the initial Connection is in a "pending" state. This only creates a "virtual" Connection, and performs no I/O.

connection = new Connection()
increment totalConnectionCount
increment pendingConnectionCount
set connection state to "pending"
tConnectionCreated = current instant (use a monotonic clock if possible)
emit ConnectionCreatedEvent and equivalent log message
return connection

Establishing a Connection (Internal Implementation)

Before a Connection can be marked as either "available" or "in use", it must be established. This process involves performing the initial handshake, handling OP_COMPRESSED, and performing authentication.

try:
  connect connection via TCP / TLS
  perform connection handshake
  handle OP_COMPRESSED
  perform connection authentication
  tConnectionReady = current instant (use a monotonic clock if possible)
  emit ConnectionReadyEvent(duration = tConnectionReady - tConnectionCreated) and equivalent log message
  return connection
except error:
  close connection
  throw error # Propagate error in manner idiomatic to language.

Closing a Connection (Internal Implementation)

When a Connection is closed, it MUST first be marked as "closed", removing it from being counted as "available" or "in use". Once that is complete, the Connection can perform whatever teardown is necessary to close its underlying socket. The Driver SHOULD perform this teardown in a non-blocking manner, such as via the use of a background thread or async I/O.

original state = connection state
set connection state to "closed"

if original state is "available":
  decrement availableConnectionCount
else if original state is "pending":
  decrement pendingConnectionCount

decrement totalConnectionCount
emit ConnectionClosedEvent and equivalent log message

# The following can happen at a later time (i.e. in background
# thread) or via non-blocking I/O.
connection.socket.close()

Marking a Connection as Available (Internal Implementation)

A Connection is "available" if it is able to be checked out. A Connection MUST NOT be marked as "available" until it has been established. The pool MUST keep track of the number of currently available Connections.

increment availableConnectionCount
set connection state to "available"
add connection to availableConnections

Populating the Pool with a Connection (Internal Implementation)

"Populating" the pool involves preemptively creating and establishing a Connection which is marked as "available" for use in future operations.

Populating the pool MUST NOT block any application threads. For example, it could be performed on a background thread or via the use of non-blocking/async I/O. Populating the pool MUST NOT be performed unless the pool is "ready".

If an error is encountered while populating a connection, it MUST be handled via the SDAM machinery according to the Application Errors section in the SDAM specification.

If minPoolSize is set, the Connection Pool MUST be populated until it has at least minPoolSize total Connections. This MUST occur only while the pool is "ready". If the pool implements a background thread, it can be used for this. If the pool does not implement a background thread, the checkOut method is responsible for ensuring this requirement is met.

When populating the Pool, pendingConnectionCount has to be decremented after establishing a Connection similarly to how it is done in Checking Out a Connection to signal that another Connection is allowed to be established. Such a signal MUST become observable to any Thread after the action that marks the established Connection as "available" becomes observable to the Thread. Informally, this order guarantees that no Thread tries to start establishing a Connection when there is an "available" Connection established as a result of populating the Pool.

wait until pendingConnectionCount < maxConnecting and pool is "ready"
create connection
try:
  establish connection
  mark connection as available
except error:
  # Defer error handling to SDAM.
  topology.handle_pre_handshake_error(error)

Checking Out a Connection

A Pool MUST have a method that allows the driver to check out a Connection. Checking out a Connection involves submitting a request to the WaitQueue and, once that request reaches the front of the queue, having the Pool find or create a Connection to fulfill that request. Requests MUST be subject to a timeout which is computed per the rules in Client Side Operations Timeout: Server Selection.

To service a request for a Connection, the Pool MUST first iterate over the list of available Connections, searching for a non-perished one to be returned. If a perished Connection is encountered, such a Connection MUST be closed (as described in Closing a Connection) and the iteration of available Connections MUST continue until either a non-perished available Connection is found or the list of available Connections is exhausted.

If the list is exhausted, the total number of Connections is less than maxPoolSize, and pendingConnectionCount < maxConnecting, the pool MUST create a Connection, establish it, mark it as "in use" and return it. If totalConnectionCount == maxPoolSize or pendingConnectionCount == maxConnecting, then the pool MUST wait to service the request until neither of those two conditions is met or until a Connection becomes available, re-entering the checkOut loop in either case. This waiting MUST NOT prevent Connections from being checked into the pool. Additionally, the Pool MUST NOT service any newer checkOut requests before fulfilling the original one which could not be fulfilled. For drivers that implement the WaitQueue via a fair semaphore, a condition variable may also be needed to meet this requirement. Waiting on the condition variable SHOULD also be limited by the WaitQueueTimeout, if the driver supports one and it was specified by the user.

If the pool is "closed" or "paused", any attempt to check out a Connection MUST throw an Error. The error thrown as a result of the pool being "paused" MUST be considered a retryable error and MUST NOT be an error that marks the SDAM state unknown.

If the pool does not implement a background thread, the checkOut method is responsible for ensuring that the pool is populated with at least minPoolSize Connections.

A Connection MUST NOT be checked out until it is established. In addition, the Pool MUST NOT prevent other threads from checking out Connections while establishing a Connection.

Before a given Connection is returned from checkOut, it must be marked as "in use", and the pool's availableConnectionCount MUST be decremented.

connection = Null
tConnectionCheckOutStarted = current instant (use a monotonic clock if possible)
emit ConnectionCheckOutStartedEvent and equivalent log message
try:
  enter WaitQueue
  wait until at top of wait queue
  # Note that a lock-based implementation of the wait queue would
  # only allow one thread in the following block at a time
  while connection is Null:
    if a connection is available:
      while connection is Null and a connection is available:
        connection = next available connection
        if connection is perished:
          close connection
          connection = Null
    else if totalConnectionCount < maxPoolSize:
      if pendingConnectionCount < maxConnecting:
        connection = create connection
      else:
        # this waiting MUST NOT prevent other threads from checking Connections
        # back in to the pool.
        wait until pendingConnectionCount < maxConnecting or a connection is available
        continue

except pool is "closed":
  tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
  emit ConnectionCheckOutFailedEvent(reason="poolClosed", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
  throw PoolClosedError
except pool is "paused":
  tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
  emit ConnectionCheckOutFailedEvent(reason="connectionError", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
  throw PoolClearedError
except timeout:
  tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
  emit ConnectionCheckOutFailedEvent(reason="timeout", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
  throw WaitQueueTimeoutError
finally:
  # This must be done in all drivers
  leave wait queue

# If the Connection has not been established yet (TCP, TLS,
# handshake, compression, and auth), it must be established
# before it is returned.
# This MUST NOT block other threads from acquiring connections.
if connection state is "pending":
  try:
    establish connection
  except connection establishment error:
    tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
    emit ConnectionCheckOutFailedEvent(reason="connectionError", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
    decrement totalConnectionCount
    throw
  finally:
    decrement pendingConnectionCount
else:
  decrement availableConnectionCount
set connection state to "in use"

# If there is no background thread, the pool MUST ensure that
# there are at least minPoolSize total connections.
do asynchronously:
  while totalConnectionCount < minPoolSize:
    populate the pool with a connection

tConnectionCheckedOut = current instant (use a monotonic clock if possible)
emit ConnectionCheckedOutEvent(duration = tConnectionCheckedOut - tConnectionCheckOutStarted) and equivalent log message
return connection

Checking In a Connection

A Pool MUST have a method of allowing the driver to check in a Connection. The driver MUST NOT be allowed to check in a Connection to a Pool that did not create that Connection, and MUST throw an Error if this is attempted.

When the Connection is checked in, it MUST be closed if any of the following are true:

  • The Connection is perished.
  • The pool has been closed.

Otherwise, the Connection is marked as available.

emit ConnectionCheckedInEvent and equivalent log message
if connection is perished OR pool is closed:
  close connection
else:
  mark connection as available

Clearing a Connection Pool

Clearing the pool involves different steps depending on whether the pool is in load balanced mode or not. The traditional / non-load balanced clearing behavior MUST NOT be used by pools in load balanced mode, and the load balanced pool clearing behavior MUST NOT be used in non-load balanced pools.

Clearing a non-load balanced pool

A Pool MUST have a method of clearing all Connections when instructed. Rather than iterating through every Connection, this method should simply increment the generation of the Pool, implicitly marking all current Connections as stale. It should also transition the pool's state to "paused" to halt the creation of new connections until it is marked as "ready" again. The checkOut and checkIn algorithms will handle clearing out stale Connections. If a user is subscribed to Connection Monitoring events and/or connection log messages, a PoolClearedEvent and log message MUST be emitted after incrementing the generation / marking the pool as "paused". If the pool is already "paused" when it is cleared, then the pool MUST NOT emit a PoolCleared event or log message.

As part of clearing the pool, the WaitQueue MUST also be cleared, meaning all requests in the WaitQueue MUST fail with errors indicating that the pool was cleared while the checkOut was being performed. The error returned as a result of the pool being cleared MUST be considered a retryable error and MUST NOT be an error that marks the SDAM state unknown. Clearing the WaitQueue MUST happen eagerly so that any operations waiting on Connections can retry as soon as possible. The pool MUST NOT rely on WaitQueueTimeoutMS to clear requests from the WaitQueue.

The clearing method MUST provide the option to interrupt any in-use connections as part of the clearing (henceforth referred to as the interruptInUseConnections flag in this specification). "Interrupting a Connection" is defined as canceling whatever task the Connection is currently performing and marking the Connection as perished (e.g. by closing its underlying socket). The interrupting of these Connections MUST be performed as soon as possible but MUST NOT block the pool or prevent it from processing further requests. If the pool has a background thread, and it is responsible for interrupting in-use connections, its next run MUST be scheduled as soon as possible.

The pool MUST only interrupt in-use Connections whose generation is less than or equal to the generation of the pool at the moment of the clear (before the increment) that used the interruptInUseConnections flag. Any operations that have their Connections interrupted in this way MUST fail with a retryable error. If possible, the error SHOULD be a PoolClearedError with the following message: "Connection to <pool address> interrupted due to server monitor timeout".
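
A non-normative sketch of the clearing behavior described above, written against hypothetical pool fields and helpers (failAllWaiters, emitPoolCleared, scheduleBackgroundRun); it is not the required implementation:

// Sketch only: illustrates the generation bump, pausing, eager WaitQueue eviction,
// and suppression of the event when the pool is already paused.
interface SketchPool {
  generation: number;
  state: "paused" | "ready" | "closed";
  failAllWaiters(err: Error): void;            // evict the WaitQueue eagerly
  emitPoolCleared(interrupted: boolean): void; // event + equivalent log message
  scheduleBackgroundRun(): void;               // so in-use Connections are interrupted promptly
}

function clearPool(pool: SketchPool, interruptInUseConnections = false): void {
  const wasPaused = pool.state === "paused";
  pool.generation += 1;   // implicitly marks all current Connections as stale
  pool.state = "paused";
  if (!wasPaused) {
    pool.emitPoolCleared(interruptInUseConnections);
  }
  pool.failAllWaiters(new Error("PoolClearedError")); // retryable, non-timeout network error
  if (interruptInUseConnections) {
    pool.scheduleBackgroundRun();
  }
}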

Clearing a load balanced pool

A Pool MUST also have a method of clearing all Connections for a specific serviceId for use when in load balancer mode. This method increments the generation of the pool for that specific serviceId in the generation map. A PoolClearedEvent and log message MUST be emitted after incrementing the generation. Note that this method MUST NOT transition the pool to the "paused" state and MUST NOT clear the WaitQueue.

Load Balancer Mode

For load-balanced deployments, pools MUST maintain a map from serviceId to a tuple of (generation, connection count) where the connection count refers to the total number of connections that exist for a specific serviceId. The pool MUST remove the entry for a serviceId once the connection count reaches 0. Once the MongoDB handshake is done, the connection MUST get the generation number that applies to its serviceId from the map and update the map to increment the connection count for this serviceId.

See the Load Balancer Specification for details.
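
A rough sketch of that bookkeeping, assuming the map is keyed by the serviceId's hex string and the value is a mutable { generation, connectionCount } pair; names are illustrative:

// Sketch only.
type ServiceEntry = { generation: number; connectionCount: number };
const serviceGenerations = new Map<string, ServiceEntry>();

// After the handshake reports a serviceId: count the connection and return
// the generation that the new Connection adopts.
function onHandshakeComplete(serviceId: string): number {
  const entry = serviceGenerations.get(serviceId) ?? { generation: 0, connectionCount: 0 };
  entry.connectionCount += 1;
  serviceGenerations.set(serviceId, entry);
  return entry.generation;
}

// When a Connection for a serviceId is closed: remove the entry once the count reaches 0.
function onConnectionClosed(serviceId: string): void {
  const entry = serviceGenerations.get(serviceId);
  if (!entry) return;
  entry.connectionCount -= 1;
  if (entry.connectionCount <= 0) serviceGenerations.delete(serviceId);
}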

Forking

A Connection is explicitly not fork-safe. The proper behavior in the case of a fork is to ResetAfterFork by:

  • clearing all Connection Pools in the child process
  • closing all Connections in the child process.

Drivers that support forking MUST document that Connections to an Endpoint are not fork-safe, and document the proper way to ResetAfterFork in the driver.

Drivers MAY aggressively ResetAfterFork if the driver detects it has been forked.

Optional Behaviors

The following features of a Connection Pool SHOULD be implemented if they make sense in the driver and driver's language.

Background Thread

A Pool SHOULD have a background Thread that is responsible for monitoring the state of all available Connections. This background thread SHOULD:

  • Populate Connections to ensure that the pool always satisfies minPoolSize
  • Remove and close perished available Connections

A pool SHOULD allow immediate scheduling of the next background thread iteration after a clear is performed.

Conceptually, the aforementioned activities are organized into sequential Background Thread Runs. A Run MUST do as much work as readily available and then end instead of waiting for more work. For example, instead of waiting for pendingConnectionCount to become less than maxConnecting when satisfying minPoolSize, a Run MUST either proceed with the rest of its duties, e.g., closing available perished connections, or end.

The duration of intervals between the end of one Run and the beginning of the next Run is not specified, but the Test Format and Runner Specification may restrict this duration, or introduce other restrictions to facilitate testing.
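
A non-normative sketch of a single Run, against hypothetical pool helpers; note that it does the readily available work and then ends, rather than waiting for maxConnecting capacity or other conditions to change:

// Sketch only.
async function backgroundRun(pool: {
  state: string;
  minPoolSize: number;
  maxConnecting: number;
  totalConnectionCount: number;
  pendingConnectionCount: number;
  closePerishedAvailableConnections(): void;
  populateOneConnection(): Promise<void>; // create + establish + mark available
}): Promise<void> {
  // Remove and close perished available Connections.
  pool.closePerishedAvailableConnections();
  // Populate toward minPoolSize, but end the Run instead of waiting if capacity is unavailable.
  while (
    pool.state === "ready" &&
    pool.totalConnectionCount < pool.minPoolSize &&
    pool.pendingConnectionCount < pool.maxConnecting
  ) {
    await pool.populateOneConnection();
  }
}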

withConnection

A Pool SHOULD implement a scoped resource management mechanism idiomatic to its language to prevent Connections from never being checked in. Examples include Python's "with" statement and C#'s "using" statement. If implemented, drivers SHOULD use this mechanism as the default method of checking out and checking in Connections.
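
A minimal sketch of such a helper, assuming a pool exposing an asynchronous checkOut and a synchronous checkIn; the try/finally guarantees the Connection is returned even if the callback throws:

// Sketch only.
async function withConnection<C, T>(
  pool: { checkOut(): Promise<C>; checkIn(conn: C): void },
  fn: (conn: C) => Promise<T>
): Promise<T> {
  const conn = await pool.checkOut();
  try {
    return await fn(conn);
  } finally {
    pool.checkIn(conn); // always checked back in, even on error
  }
}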

Connection Pool Monitoring

All drivers that implement a connection pool MUST provide an API that allows users to subscribe to events emitted from the pool. If a user subscribes to Connection Monitoring events, these events MUST be emitted when specified in "Connection Pool Behaviors". Events SHOULD be created and subscribed to in a manner idiomatic to their language and driver.
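
For example, an emitter-style driver might expose subscription as follows (a sketch assuming Node.js-style listeners and the event names used by the MongoDB Node.js driver; other drivers expose subscription differently):

// Sketch only.
import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");

client.on("connectionPoolCreated", (event) => console.log("pool created:", event.address));
client.on("connectionCheckedOut", (event) => console.log("checked out:", event.connectionId));
client.on("connectionPoolCleared", (event) => console.log("pool cleared:", event.serviceId));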

Events

See the Load Balancer Specification for details on the serviceId field.

/**
 *  Emitted when a Connection Pool is created
 */
interface PoolCreatedEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  Any non-default pool options that were set on this Connection Pool.
   */
  options: {...}
}

/**
 *  Emitted when a Connection Pool is marked as ready.
 */
interface PoolReadyEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;
}

/**
 *  Emitted when a Connection Pool is cleared
 */
interface PoolClearedEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   * The service id for which the pool was cleared in load balancing mode.
   * See load balancer specification for more information about this field.
   */
  serviceId: Optional<ObjectId>;

  /**
   * A flag indicating whether the pool forcibly interrupted "in use" connections as part of the clear.
   */
  interruptInUseConnections: Optional<Boolean>;
}

/**
 *  Emitted when a Connection Pool is closed
 */
interface PoolClosedEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;
}

/**
 *  Emitted when a Connection Pool creates a Connection object.
 *  NOTE: This does not mean that the Connection is ready for use.
 */
interface ConnectionCreatedEvent { 
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  The ID of the Connection
   */
  connectionId: int64;
}

/**
 *  Emitted when a Connection has finished its setup, and is now ready to use
 */
interface ConnectionReadyEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  The ID of the Connection
   */
  connectionId: int64;

  /**
   * The time it took to establish the connection.
   * In accordance with the definition of establishment of a connection
   * specified by `ConnectionPoolOptions.maxConnecting`,
   * it is the time elapsed between emitting a `ConnectionCreatedEvent`
   * and emitting this event as part of the same checking out.
   *
   * Naturally, when establishing a connection is part of checking out,
   * this duration is not greater than
   * `ConnectionCheckedOutEvent`/`ConnectionCheckOutFailedEvent.duration`.
   *
   * A driver MAY choose the type idiomatic to the driver.
   * If the type chosen does not convey units, e.g., `int64`,
   * then the driver MAY include units in the name, e.g., `durationMS`.
   */
  duration: Duration;
}

/**
 *  Emitted when a Connection Pool closes a Connection
 */
interface ConnectionClosedEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  The ID of the Connection
   */
  connectionId: int64;

  /**
   * A reason explaining why this Connection was closed.
   * Can be implemented as a string or enum.
   * Current valid values are:
   *   - "stale":           The pool was cleared, making the Connection no longer valid
   *   - "idle":            The Connection became stale by being available for too long
   *   - "error":           The Connection experienced an error, making it no longer valid
   *   - "poolClosed":      The pool was closed, making the Connection no longer valid
   */
  reason: string|Enum;
}

/**
 *  Emitted when the driver starts attempting to check out a Connection
 */
interface ConnectionCheckOutStartedEvent {
  /**
   * The ServerAddress of the Endpoint the pool is attempting
   * to connect to.
   */
  address: string;
}

/**
 *  Emitted when the driver's attempt to check out a Connection fails
 */
interface ConnectionCheckOutFailedEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  A reason explaining why Connection check out failed.
   *  Can be implemented as a string or enum.
   *  Current valid values are:
   *   - "poolClosed":      The pool was previously closed, and cannot provide new Connections
   *   - "timeout":         The Connection check out attempt exceeded the specified timeout
   *   - "connectionError": The Connection check out attempt experienced an error while setting up a new Connection
   */
  reason: string|Enum;

  /**
   * See `ConnectionCheckedOutEvent.duration`.
   */
  duration: Duration;
}

/**
 *  Emitted when the driver successfully checks out a Connection
 */
interface ConnectionCheckedOutEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  The ID of the Connection
   */
  connectionId: int64;

  /**
   * The time it took to check out the connection.
   * More specifically, the time elapsed between
   * emitting a `ConnectionCheckOutStartedEvent`
   * and emitting this event as part of the same checking out.
   *
   * Naturally, if a new connection was not created (`ConnectionCreatedEvent`)
   * and established (`ConnectionReadyEvent`) as part of checking out,
   * this duration is usually
   * not greater than `ConnectionPoolOptions.waitQueueTimeoutMS`,
   * but MAY occasionally be greater than that,
   * because a driver does not provide hard real-time guarantees.
   *
   * A driver MAY choose the type idiomatic to the driver.
   * If the type chosen does not convey units, e.g., `int64`,
   * then the driver MAY include units in the name, e.g., `durationMS`.
   */
  duration: Duration;
}

/**
 *  Emitted when the driver checks in a Connection back to the Connection Pool
 */
interface ConnectionCheckedInEvent {
  /**
   * The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  The ID of the Connection
   */
  connectionId: int64;
}

Connection Pool Logging

Please refer to the logging specification for details on logging implementations in general, including log levels, log components, handling of null values in log messages, and structured versus unstructured logging.

Drivers MUST support logging of connection pool information via the following types of log messages. These messages MUST be logged at Debug level and use the connection log component. These messages MUST be emitted when specified in "Connection Pool Behaviors".

The log messages are intended to match the information contained in the events above. Drivers MAY implement connection logging support via an event subscriber if it is convenient to do so.

The types used in the structured message definitions below are demonstrative, and drivers MAY use similar types instead so long as the information is present (e.g. a double instead of an integer, or a string instead of an integer if the structured logging framework does not support numeric types).

Common Fields

All connection log messages MUST contain the following key-value pairs:

  • serverHost (String): The hostname, IP address, or Unix domain socket path for the endpoint the pool is for.
  • serverPort (Int): The port for the endpoint the pool is for. Optional; not present for Unix domain sockets. When the user does not specify a port and the default (27017) is used, the driver SHOULD include it here.

Pool Created Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

  • message (String): "Connection pool created"
  • maxIdleTimeMS (Int): The maxIdleTimeMS value for this pool. Optional; only required to include if the user specified a value.
  • minPoolSize (Int): The minPoolSize value for this pool. Optional; only required to include if the user specified a value.
  • maxPoolSize (Int): The maxPoolSize value for this pool. Optional; only required to include if the user specified a value.
  • maxConnecting (Int): The maxConnecting value for this pool. Optional; only required to include if the driver supports this option and the user specified a value.
  • waitQueueTimeoutMS (Int): The waitQueueTimeoutMS value for this pool. Optional; only required to include if the driver supports this option and the user specified a value.
  • waitQueueSize (Int): The waitQueueSize value for this pool. Optional; only required to include if the driver supports this option and the user specified a value.
  • waitQueueMultiple (Int): The waitQueueMultiple value for this pool. Optional; only required to include if the driver supports this option and the user specified a value.

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection pool created for {{serverHost}}:{{serverPort}} using options maxIdleTimeMS={{maxIdleTimeMS}}, minPoolSize={{minPoolSize}}, maxPoolSize={{maxPoolSize}}, maxConnecting={{maxConnecting}}, waitQueueTimeoutMS={{waitQueueTimeoutMS}}, waitQueueSize={{waitQueueSize}}, waitQueueMultiple={{waitQueueMultiple}}
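
For illustration only, the structured form of this message might carry key-value pairs along these lines (values are examples; the exact representation depends on the driver's structured logging framework):

// Example values only; not a required serialization.
const poolCreatedLogMessage = {
  message: "Connection pool created",
  serverHost: "db.example.com",
  serverPort: 27017,
  maxPoolSize: 50,
  minPoolSize: 5,
  maxIdleTimeMS: 60000,
};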

Pool Ready Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

  • message (String): "Connection pool ready"

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection pool ready for {{serverHost}}:{{serverPort}}

Pool Cleared Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

  • message (String): "Connection pool cleared"
  • serviceId (String): The hex string representation of the service ID which the pool was cleared for. Optional; only present in load balanced mode.

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection pool for {{serverHost}}:{{serverPort}} cleared for serviceId {{serviceId}}

Pool Closed Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

  • message (String): "Connection pool closed"

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection pool closed for {{serverHost}}:{{serverPort}}

Connection Created Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

  • message (String): "Connection created"
  • driverConnectionId (Int64): The driver-generated ID for the connection as defined in Connection.

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection created: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}

Connection Ready Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

  • message (String): "Connection ready"
  • driverConnectionId (Int64): The driver-generated ID for the connection as defined in Connection.
  • durationMS (Int64): ConnectionReadyEvent.duration converted to milliseconds.

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection ready: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}, established in={{durationMS}} ms

Connection Closed Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

  • message (String): "Connection closed"
  • driverConnectionId (Int64): The driver-generated ID for the connection as defined in Connection.
  • reason (String): A string describing the reason the connection was closed. The following strings MUST be used for each possible reason as defined in Events above:
    • Stale: "Connection became stale because the pool was cleared"
    • Idle: "Connection has been available but unused for longer than the configured max idle time"
    • Error: "An error occurred while using the connection"
    • Pool closed: "Connection pool was closed"
  • error (Flexible): If reason is Error, the associated error. The type and format of this value is flexible; see the logging specification for details on representing errors in log messages.

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection closed: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}. Reason: {{reason}}. Error: {{error}}

Connection Checkout Started Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

  • message (String): "Connection checkout started"

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Checkout started for connection to {{serverHost}}:{{serverPort}}

Connection Checkout Failed Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

  • message (String): "Connection checkout failed"
  • reason (String): A string describing the reason checkout failed. The following strings MUST be used for each possible reason as defined in Events above:
    • Timeout: "Wait queue timeout elapsed without a connection becoming available"
    • ConnectionError: "An error occurred while trying to establish a new connection"
    • Pool closed: "Connection pool was closed"
  • error (Flexible): If reason is ConnectionError, the associated error. The type and format of this value is flexible; see the logging specification for details on representing errors in log messages.
  • durationMS (Int64): ConnectionCheckOutFailedEvent.duration converted to milliseconds.

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Checkout failed for connection to {{serverHost}}:{{serverPort}}. Reason: {{reason}}. Error: {{error}}. Duration: {{durationMS}} ms

Connection Checked Out

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

  • message (String): "Connection checked out"
  • driverConnectionId (Int64): The driver-generated ID for the connection as defined in Connection.
  • durationMS (Int64): ConnectionCheckedOutEvent.duration converted to milliseconds.

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection checked out: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}, duration={{durationMS}} ms

Connection Checked In

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

  • message (String): "Connection checked in"
  • driverConnectionId (Int64): The driver-generated ID for the connection as defined in Connection.

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection checked in: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}

Connection Pool Errors

A connection pool throws errors in specific circumstances. These Errors MUST be emitted by the pool. Errors SHOULD be created and dispatched in a manner idiomatic to the Driver and Language.

/**
 *  Thrown when the driver attempts to check out a
 *  Connection from a closed Connection Pool
 */
interface PoolClosedError {
  message: 'Attempted to check out a Connection from closed connection pool';
  address: <pool address>;
}

/**
 *  Thrown when the driver attempts to check out a
 *  Connection from a paused Connection Pool
 */
interface PoolClearedError extends RetryableError {
  message: 'Connection pool for <pool address> was cleared because another operation failed with: <original error which cleared the pool>';
  address: <pool address>;
}

/**
 *  Thrown when a driver times out when attempting to check out
 *  a Connection from a Pool
 */
interface WaitQueueTimeoutError {
  message: 'Timed out while checking out a Connection from connection pool';
  address: <pool address>;
}

Test Plan

See tests/README

Design Rationale

Why do we set minPoolSize across all members of a replicaSet, when most traffic will be against a Primary?

Currently, we are attempting to codify our current pooling behavior with minimal changes, and minPoolSize is currently uniform across all members of a replicaSet. This has the benefit of offsetting connection swarming during a Primary Step-Down, which will be further addressed in our Advanced Pooling Behaviors.

Why do we have separate ConnectionCreated and ConnectionReady events, but only one ConnectionClosed event?

ConnectionCreated and ConnectionReady each involve different state changes in the pool.

  • ConnectionCreated adds a new "pending" Connection, meaning the totalConnectionCount and pendingConnectionCount increase by one
  • ConnectionReady establishes that the Connection is ready for use, meaning the availableConnectionCount increases by one

ConnectionClosed indicates that the Connection is no longer a member of the pool, decrementing totalConnectionCount and potentially availableConnectionCount. After this point, the Connection is no longer a part of the pool. Further hypothetical events would not indicate a change to the state of the pool, so they are not specified here.

Why are waitQueueSize and waitQueueMultiple deprecated?

These options were originally only implemented in three drivers (Java, C#, and Python), and provided little value. While these fields would allow for faster diagnosis of issues in the connection pool, they would not actually prevent an error from occurring.

Additionally, these options have the effect of prioritizing older requests over newer requests, which is not necessarily the behavior that users want. They can also result in cases where queue access oscillates back and forth between full and not full. If a driver has a full waitQueue, then all requests for Connections will be rejected. If the client is continually spammed with requests, the waitQueue could be refilled the moment it is no longer full. This is not a favorable situation to be in, partially because it violates the fairness guarantee that the waitQueue normally provides.

Because of these issues, it does not make sense to go against driver mantras and provide an additional knob. We may eventually pursue an alternative configuration to address wait queue size in Advanced Pooling Behaviors.

Users that wish to have this functionality can achieve similar results by utilizing other methods to limit concurrency. Examples include implementing either a thread pool or an operation queue with a capped size in the user application. Drivers that need to deprecate waitQueueSize and/or waitQueueMultiple SHOULD refer users to these examples.

Why is waitQueueTimeoutMS optional for some drivers?

We are anticipating eventually introducing a single client-side timeout mechanism, making us hesitant to introduce another granular timeout control. Therefore, if a driver/language already has an idiomatic way to implement their timeouts, they should leverage that mechanism over implementing waitQueueTimeoutMS.

Why must populating the pool require the use of a background thread or async I/O?

Without the use of a background thread, the pool is populated with enough connections to satisfy minPoolSize during checkOut. Connections are established as part of populating the pool though, so if Connection establishment were done in a blocking fashion, the first operations after a clearing of the pool would experience unacceptably high latency, especially for larger values of minPoolSize. Thus, populating the pool must occur on a background thread (which is acceptable to block) or via the usage of non-blocking (async) I/O.

Why should closing a connection be non-blocking?

Because idle and perished Connections are cleaned up as part of checkOut, performing blocking I/O while closing such Connections would block application threads, introducing unnecessary latency. Once a Connection is marked as "closed", it will not be checked out again, so ensuring the socket is torn down does not need to happen immediately and can happen at a later time, either via async I/O or a background thread.

Why can the pool be paused?

The distinction between the "paused" state and the "ready" state allows the pool to determine whether or not the endpoint it is associated with is available or not. This enables the following behaviors:

  1. The pool can halt the creation of background connection establishments until the endpoint becomes available again. Without the "paused" state, the pool would have no way of determining when to begin establishing background connections again, so it would just continually attempt, and often fail, to create connections until minPoolSize was satisfied, even after repeated failures. This could unnecessarily waste resources both server and driver side.
  2. The pool can evict requests that enter the WaitQueue after the pool was cleared but before the server was in a known state again. Such requests can occur when a server is selected at the same time as it becomes marked as Unknown in highly concurrent workloads. Without the "paused" state, the pool would attempt to service these requests, since it would assume they were routed to the pool because its endpoint was available, not because of a race between SDAM and Server Selection. These requests would then likely fail with potentially high latency, again wasting resources both server and driver side.

Why not emit PoolCleared events and log messages when clearing a paused pool?

If a pool is already paused when it is cleared, that means it was previously cleared and no new connections have been created since then. Thus, clearing the pool in this case is essentially a no-op, so there is no need to notify any listeners that it has occurred. The generation is still incremented, however, to ensure future errors that caused the duplicate clear will stop attempting to clear the pool again. This situation is possible if the pool is cleared by the background thread after it encounters an error establishing a connection, but the ServerDescription for the endpoint was not updated accordingly yet.

Why does the pool need to support interrupting in use connections as part of its clear logic?

If an SDAM monitor has observed a network timeout, we assume that all connections, including "in use" connections, are no longer healthy. In some cases connections will fail to detect the network timeout fast enough; for example, a server request can hang at the OS level in a TCP retry loop for up to 17 minutes before failing. Therefore these connections MUST be proactively interrupted in the case of a server monitor network timeout. Requesting an immediate background thread run will speed up this process.

Why don't we configure TCP_USER_TIMEOUT?

Ideally, a reasonable TCP_USER_TIMEOUT could help detect stale connections as an alternative to interruptInUseConnections in Clear. Unfortunately this approach is platform dependent, and not every driver can configure it easily. For example, the C# driver can configure this socket option on Linux only with target frameworks greater than or equal to .NET 5.0. On macOS, there is no direct equivalent for this option; an equivalent configuration might be found, but it would also require target frameworks greater than or equal to .NET 5.0. The advantage of using the Background Thread to manage perished connections is that it works regardless of environment setup.

Backwards Compatibility

As mentioned in Deprecated Options, some drivers currently implement the options waitQueueSize and/or waitQueueMultiple. These options will need to be deprecated and phased out of the drivers that have implemented them.

Reference Implementations

  • JAVA (JAVA-3079)
  • RUBY (RUBY-1560)

Future Development

SDAM

This specification does not dictate how SDAM Monitoring connections are managed. SDAM specifies that "A monitor SHOULD NOT use the client's regular Connection pool". Some possible solutions for this include:

  • Having each Endpoint representation in the driver create and manage a separate dedicated Connection for monitoring purposes
  • Having each Endpoint representation in the driver maintain a separate pool of maxPoolSize 1 for monitoring purposes.
  • Having each Pool maintain a dedicated Connection for monitoring purposes, with an API to expose that Connection.

Advanced Pooling Behaviors

This spec does not address all advanced pooling behaviors like predictive pooling or aggressive Connection creation. Future work may address this.

Add support for OP_MSG exhaustAllowed

Exhaust Cursors may require changes to how we close Connections in the future, specifically to add a way to close and remove from its pool a Connection which has unread exhaust messages.

Changelog

  • 2024-01-23: Migrated from reStructuredText to Markdown.

  • 2023-10-04: Commit to the currently specified requirements regarding durations in events.

  • 2023-08-04: Add durations to connection pool events.

  • 2023-04-17: Fix duplicate logging test description.

  • 2022-10-14: Add connection pool log messages and associated tests.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2022-04-05: Preemptively cancel in progress operations when SDAM heartbeats timeout.

  • 2021-11-08: Make maxConnecting configurable.

  • 2021-06-02: Formalize the behavior of a Background Thread.

  • 2021-04-12: Adding in behaviour for load balancer mode.

  • 2021-01-19: Require that timeouts be applied per the client-side operations timeout specification.

  • 2021-01-12: Clarify "clear" method behavior in load balancer mode.

  • 2020-12-17: Introduce "paused" and "ready" states. Clear WaitQueue on pool clear.

  • 2020-09-24: Introduce maxConnecting requirement.

  • 2020-09-03: Clarify Connection states and definition. Require the use of a background thread and/or async I/O. Add tests to ensure ConnectionReadyEvents are fired after ConnectionCreatedEvents.

  • 2019-06-06: Add "connectionError" as a valid reason for ConnectionCheckOutFailedEvent.


Load Balancer Support

  • Status: Accepted
  • Minimum Server Version: 5.0

Abstract

This specification defines driver behaviour when connected to MongoDB services through a load balancer.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

Terms

SDAM

An abbreviated form of "Server Discovery and Monitoring", as defined in the Server Discovery and Monitoring Specification.

Service

Any MongoDB service that can run behind a load balancer.

MongoClient Configuration

loadBalanced

To specify to the driver to operate in load balancing mode, a connection string option of loadBalanced=true MUST be added to the connection string. This boolean option specifies whether or not the driver is connecting to a MongoDB cluster through a load balancer. The default value MUST be false. This option MUST only be configurable at the level of a MongoClient.

URI Validation

When loadBalanced=true is provided in the connection string, the driver MUST throw an exception in the following cases:

  • The connection string contains more than one host/port.
  • The connection string contains a replicaSet option.
  • The connection string contains a directConnection option with a value of true.
  • The connection string contains an srvMaxHosts option with a positive integer value.

If a URI is provided with the mongodb+srv scheme, the driver MUST first do the SRV and TXT lookup and then perform the validation. For drivers that do SRV lookup asynchronously this may result in a MongoClient being instantiated but erroring later during operation execution.
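
As a non-normative sketch, the validation above might look like the following, assuming the URI has already been parsed into a host list and an options mapping (names here are illustrative):

def validate_load_balanced_uri(hosts, options):
    # Only applies when loadBalanced=true is present.
    if not options.get("loadBalanced"):
        return
    if len(hosts) > 1:
        raise ValueError("loadBalanced=true cannot be combined with multiple hosts")
    if "replicaSet" in options:
        raise ValueError("loadBalanced=true cannot be combined with replicaSet")
    if options.get("directConnection") is True:
        raise ValueError("loadBalanced=true cannot be combined with directConnection=true")
    if options.get("srvMaxHosts", 0) > 0:
        raise ValueError("loadBalanced=true cannot be combined with srvMaxHosts")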

DNS Seedlist Discovery

The connection string option for loadBalanced=true MUST be valid in a TXT record and when present MUST be validated as defined in the URI Validation section.

When a MongoClient is configured with an SRV URI and loadBalanced=true, the driver MUST NOT poll for changes in the SRV record as is done for non-load balanced sharded clusters.

Server Discovery Logging and Monitoring

Monitoring

When loadBalanced=true is specified in the URI the topology MUST start in type LoadBalanced and MUST remain as LoadBalanced indefinitely. The topology MUST contain 1 ServerDescription with a ServerType of LoadBalancer. The "address" field of the ServerDescription MUST be set to the address field of the load balancer. All other fields in the ServerDescription MUST remain unset. In this mode the driver MUST NOT start a monitoring connection. The TopologyDescription's compatible field MUST always be true.

Although there is no monitoring connection in load balanced mode, drivers MUST emit the following series of SDAM events:

  • TopologyOpeningEvent when the topology is created.
  • TopologyDescriptionChangedEvent. The previousDescription field MUST have TopologyType Unknown and no servers. The newDescription MUST have TopologyType LoadBalanced and one server with ServerType Unknown.
  • ServerOpeningEvent when the server representing the load balancer is created.
  • ServerDescriptionChangedEvent. The previousDescription MUST have ServerType Unknown. The newDescription MUST have ServerType LoadBalancer.
  • TopologyDescriptionChangedEvent. The newDescription MUST have TopologyType LoadBalanced and one server with ServerType LoadBalancer.

When the topology is closed, drivers MUST also emit a ServerClosedEvent, followed by a TopologyDescriptionChangedEvent that transitions the Topology to the UNKNOWN state, and finally a TopologyClosedEvent. Drivers MUST NOT emit any other events when operating in this mode.
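
The required startup sequence can be sketched as follows; the event payloads are reduced to the fields called out above, and real drivers would emit their own event classes:

def load_balanced_startup_events(address):
    # Non-normative sketch of the SDAM events emitted, in order, when a client
    # is created with loadBalanced=true.
    return [
        {"event": "TopologyOpeningEvent"},
        {"event": "TopologyDescriptionChangedEvent",
         "previousDescription": {"topologyType": "Unknown", "servers": []},
         "newDescription": {"topologyType": "LoadBalanced",
                            "servers": [{"address": address, "serverType": "Unknown"}]}},
        {"event": "ServerOpeningEvent", "address": address},
        {"event": "ServerDescriptionChangedEvent",
         "previousDescription": {"serverType": "Unknown"},
         "newDescription": {"serverType": "LoadBalancer"}},
        {"event": "TopologyDescriptionChangedEvent",
         "newDescription": {"topologyType": "LoadBalanced",
                            "servers": [{"address": address, "serverType": "LoadBalancer"}]}},
    ]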

Log Messages

SDAM event details described in Monitoring apply to the corresponding log messages. Please refer to the SDAM logging specification for details on SDAM logging. Drivers MUST emit the relevant SDAM log messages for the events described above.

Driver Sessions

Session Support

When the TopologyType is LoadBalanced, sessions are always supported.

Session Expiration

When in load balancer mode, drivers MUST ignore logicalSessionTimeoutMinutes and MUST NOT prune client sessions from the session pool when implemented by the driver.

Data-Bearing Server Type

A ServerType of LoadBalancer MUST be considered a data-bearing server.

Server Selection

A deployment of topology type Load Balanced contains one server of type LoadBalancer.

For read and write operations, the single server in the topology MUST always be selected.

During command construction, the LoadBalancer server MUST be treated like a mongos and drivers MUST add a $readPreference field to the command when required by Passing read preference to mongos and load balancers.

Connection Pooling

Connection Establishment

In the case of the driver having the loadBalanced=true connection string option specified, every pooled connection MUST add a loadBalanced field to the hello command in its handshake. The value of the field MUST be true. If loadBalanced=true is specified then the OP_MSG protocol MUST be used for all steps of the connection handshake.

Example:

Driver connection string contains loadBalanced=true:

{ hello: 1, loadBalanced: true }

Driver connection string contains loadBalanced=false or no loadBalanced option:

{ hello: 1 }

When the server's hello response does not contain a serviceId field, the driver MUST throw an exception with the message "Driver attempted to initialize in load balancing mode, but the server does not support this mode."
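
A minimal sketch of that check, assuming the handshake reply has been decoded into a dictionary:

def require_load_balanced_service(hello_reply):
    # The serviceId field is the signal that the server is deployed behind a
    # compatible load balancer.
    if "serviceId" not in hello_reply:
        raise RuntimeError(
            "Driver attempted to initialize in load balancing mode, "
            "but the server does not support this mode."
        )
    return hello_reply["serviceId"]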

For single threaded drivers that do not use a connection pool, the driver MUST have only 1 socket connection to the load balancer in load balancing mode.

Connection Pinning

Some features in MongoDB such as cursors and transactions require sending multiple commands to the same mongos in a sharded cluster. In load balanced mode, it is not possible to target the same mongos behind a load balancer when pooling connections. To account for this, drivers MUST pin to a single connection for these features. When using a pinned connection, the driver MUST emit only 1 ConnectionCheckOutStartedEvent, and only 1 ConnectionCheckedOutEvent or ConnectionCheckOutFailedEvent. Similarly, the driver MUST only publish 1 ConnectionCheckedInEvent.

Behaviour With Cursors

When the driver is in load balancing mode and executing any cursor-initiating command, the driver MUST NOT check the connection back into the pool unless the command fails or the server returns a cursor ID of 0 (i.e. all documents are returned in a single batch). Otherwise, the driver MUST continue to use the same connection for all subsequent getMore commands for the cursor. The driver MUST check the connection back into the pool if the server returns a cursor ID of 0 in a getMore response (i.e. the cursor is drained). When the cursor's close method is invoked, either explicitly or via an implicit resource cleanup mechanism, the driver MUST use the same connection to execute a killCursors command if necessary and then check the connection back into the pool regardless of the result.

For multi-threaded drivers, cursors with pinned connections MUST either document to the user that calling next() and close() operations on the cursor concurrently is not permitted, or explicitly prevent cursors from executing those operations simultaneously.

If a getMore fails with a network error, drivers MUST leave the connection pinned to the cursor. When the cursor's close method is invoked, drivers MUST NOT execute a killCursors command because the pinned connection is no longer valid and MUST return the connection back to the pool.
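
As a rough sketch of the checkout rules above (the pool, connection, and cursor types are illustrative and not part of this specification):

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class PinnedCursor:
    first_batch: Any
    pinned_connection: Optional[Any]  # stays checked out until drained or closed

def run_cursor_initiating_command(pool, command):
    conn = pool.check_out()
    try:
        reply = conn.run(command)
    except Exception:
        pool.check_in(conn)           # the command failed: nothing to pin
        raise
    if reply["cursor"]["id"] == 0:
        pool.check_in(conn)           # all documents returned in a single batch
        return PinnedCursor(reply, None)
    # Keep the same connection for every getMore and the final killCursors.
    return PinnedCursor(reply, conn)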

Behaviour With Transactions

When executing a transaction in load balancing mode, drivers MUST follow the rules outlined in Sharded Transactions with one exception: drivers MUST use the same connection for all commands in the transaction (excluding retries of commitTransaction and abortTransaction in some cases). Pinning to a single connection ensures that all commands in the transaction target the same service behind the load balancer. The rules for pinning to a connection and releasing a pinned connection are the same as those for server pinning in non-load balanced sharded transactions as described in When to unpin. Drivers MUST NOT use the same connection for two concurrent transactions run under different sessions from the same client.

Connection Tracking

The driver connection pool MUST track the purpose for which connections are checked out in the following 3 categories:

  • Connections checked out for cursors
  • Connections checked out for transactions
  • Connections checked out for operations not falling under the previous 2 categories

When the connection pool's maxPoolSize is reached and the pool times out waiting for a new connection the WaitQueueTimeoutError MUST include a new detailed message, "Timeout waiting for connection from the connection pool. maxPoolSize: n, connections in use by cursors: n, connections in use by transactions: n, connections in use by other operations: n".
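
A sketch of building that message from the three tracked categories (the counts would come from the pool's own bookkeeping; the function name is illustrative):

def wait_queue_timeout_message(max_pool_size, cursor_count, transaction_count, other_count):
    # Non-normative formatting sketch of the required detailed message.
    return (
        "Timeout waiting for connection from the connection pool. "
        f"maxPoolSize: {max_pool_size}, "
        f"connections in use by cursors: {cursor_count}, "
        f"connections in use by transactions: {transaction_count}, "
        f"connections in use by other operations: {other_count}"
    )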

Error Handling

Initial Handshake Errors

When establishing a new connection in load balanced mode, drivers MUST NOT perform SDAM error handling for any errors that occur before the MongoDB Handshake (i.e. the hello command) is complete. Errors during the MongoDB Handshake MUST also be ignored for SDAM error handling purposes. Once the initial handshake is complete, the connection MUST determine its generation number based on the serviceId field in the handshake response. Any errors that occur during the rest of connection establishment (e.g. errors during authentication commands) MUST go through the SDAM error handling flow but MUST NOT mark the server as Unknown, and when requiring the connection pool to be cleared, MUST only clear connections for the serviceId.

Post-Handshake Errors

When the driver is operating in load balanced mode and an application operation receives a state change error, the driver MUST NOT make any changes to the TopologyDescription or the ServerDescription of the load balancer (i.e. it MUST NOT mark the load balancer as Unknown). If the error requires the connection pool to be cleared, the driver MUST only clear connections with the same serviceId as the connection which errored.
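
One way to realize serviceId-scoped clearing is to keep a generation counter per serviceId rather than a single pool-wide generation. The sketch below is illustrative; the class and method names are not part of this specification:

from collections import defaultdict

class ServiceGenerations:
    def __init__(self):
        # serviceId -> current generation for connections to that service.
        self._generations = defaultdict(int)

    def generation(self, service_id):
        return self._generations[service_id]

    def clear(self, service_id):
        # Only connections created for this serviceId become stale; other
        # services behind the same load balancer are unaffected, and the
        # load balancer itself is never marked Unknown.
        self._generations[service_id] += 1

    def is_stale(self, connection):
        return connection.generation < self._generations[connection.service_id]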

Events

When in load balancer mode the driver MUST now include the serviceId in the CommandStartedEvent, CommandSucceededEvent, and CommandFailedEvent. The driver MAY decide how to expose this information. Drivers that have a ConnectionId object, for example, MAY choose to provide a serviceId in that object. The serviceId field is only present when in load balancer mode and connected to a service that is behind a load balancer.

Additionally the PoolClearedEvent MUST also contain a serviceId field.

Downstream Visible Behavioral Changes

Services MAY add a command line option or other configuration parameter that tells the service it is running behind a load balancer. Services MAY also dynamically determine whether they are behind a load balancer.

All services which terminate TLS MUST be configured to return a TLS certificate for a hostname which matches the hostname the client is connecting to.

All services behind a load balancer that have been started with the aforementioned option MUST add a top level serviceId field to their response to the hello command. This field MUST be a BSON ObjectId and SHOULD NOT change while the service is running. When a driver is configured to not be in load balanced mode and the service is configured behind a load balancer, the service MAY return an error to the driver's hello command indicating that the driver is not configured to use it properly.

All services that have the behaviour of reaping idle cursors after a specified period of time MAY also close the connection associated with the cursor when the cursor is reaped. Conversely, those services MAY reap a cursor when the connection associated with the cursor is closed.

All services that have the behaviour of reaping idle transactions after a specified period of time MAY also close the connection associated with the transaction when the transaction is reaped. Conversely, those services MUST abort a transaction when the connection associated with the transaction is closed.

Any applications that connect directly to services and not through the load balancer MUST connect via the regular service port as they normally would and not the port specified by the loadBalancerPort option. The loadBalanced=true URI option MUST be omitted in this case.

Q&A

Why use a connection string option instead of a new URI scheme?

Use of a connection string option would allow the driver to continue to use SRV records that pointed at a load balancer instead of a replica set without needing to change the URI provided to the MongoClient. The SRV records could also provide the default loadBalanced=true in the TXT records.

Why explicitly opt-in to this behaviour instead of letting mongos inform the driver of the load balancer?

Other versions of this design proposed a scheme in which the application does not have to opt-in to load balanced mode. Instead, the server would send a special field in hello command responses to indicate that it was running behind a load balancer and the driver would change its behavior accordingly. We opted to take an approach that required code changes instead because load balancing changes driver behavior in ways that could cause unexpected application errors, so it made sense to have applications consciously opt-in to this mode. For example, connection pinning creates new stresses on connection pools because we go from a total of numMongosServers * maxPoolSize connections to simply maxPoolSize. Furthermore, connections get pinned to open cursors and transactions, further straining resource availability. Due to this change, applications may also need to increase the configured maxPoolSize when opting into this mode.

Why does this specification instruct drivers to not check connections back into the connection pool in some circumstances?

In the case of a load balancer fronting multiple services, it is possible that a connection to the load balancer could result in a connection behind the load balancer to a different service. In order to guarantee these operations execute on the same service they need to be executed on the same socket - not checking a connection back into the pool for the entire operation guarantees this.

Why has a client-side connection reaper for idle cursors not been put into this specification?

It was discussed as a potential solution for maxed-out connection pools that drivers could behave similarly to the server and close long-running cursors after a specified time period, returning their connections to the pool. Due to the high complexity of that solution, it was determined that better error messaging when the connection pool was maxed out would suffice: users could then easily debug when the pool ran out of connections and fix their applications or adjust their pool options accordingly.

Why are we requiring mongos servers to add a new serviceId field in hello responses rather than reusing the existing topologyVersion.processId?

This option was previously discussed, but we opted to add a new hello response field in order to not mix intentions.

Why does this specification not address load balancer restarts or maintenance?

The Layer 4 load balancers that would be used for this feature lack the ability, which a Layer 7 load balancer could potentially have, to understand the MongoDB wire protocol and respond to monitoring requests.

Design Rationales

Services cannot dynamically switch between running behind a load balancer and not running behind one. Based on that, this design forces the application to opt in to this behaviour and to make potential changes to their applications that require restarts. If this were to change, see the alternative designs below.

Alternative Designs

Service PROXY Detection

An alternative to the driver using a connection string option to put it into load balancing mode would be for the service the driver is connected to to inform the driver it is behind a load balancer. A possible solution for this would be for all services to understand the PROXY protocol such as Data Lake does, and to alter their hello responses to inform the driver they are behind a load balancer, potentially with the IP address of the load balancer itself.

The benefit of this solution would be that no changes would be required on the application side, and no application restart would be needed. A single request to the service through the load balancer could automatically trigger the change in the hello response and cause the driver to switch into load balancing mode, pointing at the load balancer's IP address. This solution would also give services the ability to record the original IP addresses of the applications connecting to them, since they are provided the PROXY protocol's header bytes.

The additional complexity of this alternative on the driver side is that instead of starting in a single mode and remaining there for the life of the application, the driver would need to deal with additional state changes based on the results of the server monitors. From a service perspective, every service would need to be updated to understand the PROXY protocol header bytes prepended to the initial connection and modify their states and hello responses accordingly. Additionally load balancers would need to have additional configuration as noted in the reference section below, and only load balancers that support the PROXY protocol would be supported.

Changelog

  • 2024-04-25: Clarify that TopologyDescriptionChangedEvent must be emitted on topology close
  • 2024-03-06: Migrated from reStructuredText to Markdown.
  • 2022-10-05: Remove spec front matter and reformat changelog.
  • 2022-01-18: Clarify that OP_MSG must be used in load balanced mode.
  • 2021-12-22: Clarify that pinned connections in transactions are exclusive.
  • 2021-10-14: Note that loadBalanced=true conflicts with srvMaxHosts.

Server Monitoring

  • Status: Accepted
  • Minimum Server Version: 2.4

Abstract

This spec defines how a driver monitors a MongoDB server. In summary, the client monitors each server in the topology. The scope of server monitoring is to provide the topology with updated ServerDescriptions based on hello or legacy hello command responses.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

Terms

See the terms in the main SDAM spec.

check

The client checks a server by attempting to call hello or legacy hello on it, and recording the outcome.

client

A process that initiates a connection to a MongoDB server. This includes mongod and mongos processes in a replica set or sharded cluster, as well as drivers, the shell, tools, etc.

scan

The process of checking all servers in the deployment.

suitable

A server is judged "suitable" for an operation if the client can use it for a particular operation. For example, a write requires a standalone, primary, or mongos. Suitability is fully specified in the Server Selection Spec.

significant topology change

A change in the server's state that is relevant to the client's view of the server, e.g. a change in the server's replica set member state, or its replica set tags. In SDAM terms, a significant topology change on the server means the client's ServerDescription is out of date. Standalones and mongos do not currently experience significant topology changes but they may in the future.

regular hello or legacy hello command

A default {hello: 1} or legacy hello command where the server responds immediately.

streamable hello or legacy hello command

The hello or legacy hello command feature which allows the server to stream multiple replies back to the client.

RTT

Round trip time. The client's measurement of the duration of one hello or legacy hello call. The RTT is used to support localThresholdMS from the Server Selection spec and timeoutMS from the Client Side Operations Timeout Spec.

FaaS

A Function-as-a-Service (FaaS) environment like AWS Lambda.

serverMonitoringMode

The serverMonitoringMode option configures which server monitoring protocol to use. Valid modes are "stream", "poll", or "auto". The default value MUST be "auto":

  • With "stream" mode, the client MUST use the streaming protocol when the server supports it or fall back to the polling protocol otherwise.
  • With "poll" mode, the client MUST use the polling protocol.
  • With "auto" mode, the client MUST behave the same as "poll" mode when running on a FaaS platform or the same as "stream" mode otherwise. The client detects that it's running on a FaaS platform via the same rules for generating the client.env handshake metadata field in the MongoDB Handshake spec.

Multi-threaded or asynchronous drivers MUST implement this option. See Why disable the streaming protocol on FaaS platforms like AWS Lambda? and Why introduce a knob for serverMonitoringMode?

Monitoring

The client monitors servers using the hello or legacy hello commands. In MongoDB 4.4+, a monitor uses the Streaming Protocol to continuously stream hello or legacy hello responses from the server. In MongoDB <= 4.2, a monitor uses the Polling Protocol pausing heartbeatFrequencyMS between checks. Clients check servers sooner in response to certain events.

If a server API version is requested, then the driver must use hello for monitoring. If a server API version is not requested, the initial handshake using the legacy hello command must include helloOk: true. If the response contains helloOk: true, then the driver must use the hello command for monitoring. If the response does not contain helloOk: true, then the driver must use the legacy hello command for monitoring.

The socket used to check a server MUST use the same connectTimeoutMS as regular sockets. Multi-threaded clients SHOULD set monitoring sockets' socketTimeoutMS to the connectTimeoutMS. (See socket timeout for monitoring is connectTimeoutMS. Drivers MAY let users configure the timeouts for monitoring sockets separately if necessary to preserve backwards compatibility.)

The client begins monitoring a server when the server is added to the TopologyDescription (see the main SDAM spec).

The following subsections specify how monitoring works, first in multi-threaded or asynchronous clients, and second in single-threaded clients. This spec provides detailed requirements for monitoring because it intends to make all drivers behave consistently.

Multi-threaded or asynchronous monitoring

Servers are monitored in parallel

All servers' monitors run independently, in parallel: If some monitors block calling hello or legacy hello over slow connections, other monitors MUST proceed unimpeded.

The natural implementation is a thread per server, but the decision is left to the implementer. (See thread per server.)

Servers are monitored with dedicated sockets

A monitor SHOULD NOT use the client's regular connection pool to acquire a socket; it uses a dedicated socket that does not count toward the pool's maximum size.

Drivers MUST NOT authenticate on sockets used for monitoring nor include SCRAM mechanism negotiation (i.e. saslSupportedMechs), as doing so would make monitoring checks more expensive for the server.

Servers are checked periodically

Each monitor checks its server and notifies the client of the outcome so the client can update the TopologyDescription.

After each check, the next check SHOULD be scheduled heartbeatFrequencyMS later; a check MUST NOT run while a previous check is still in progress.

Requesting an immediate check

At any time, the client can request that a monitor check its server immediately. (For example, after a "not writable primary" error. See error handling.) If the monitor is sleeping when this request arrives, it MUST wake and check as soon as possible. If a hello or legacy hello call is already in progress, the request MUST be ignored. If the previous check ended less than minHeartbeatFrequencyMS ago, the monitor MUST sleep until the minimum delay has passed, then check the server.

Application operations are unblocked when a server is found

Each time a check completes, threads waiting for a suitable server are unblocked. Each unblocked thread MUST proceed if the new TopologyDescription now contains a suitable server.

Clients update the topology from each handshake

When a monitor check creates a new connection, the connection handshake response MUST be used to satisfy the check and update the topology.

When a client successfully calls hello or legacy hello to handshake a new connection for application operations, it SHOULD use the hello or legacy hello reply to update the ServerDescription and TopologyDescription, the same as with a hello or legacy hello reply on a monitoring socket. If the hello or legacy hello call fails, the client SHOULD mark the server Unknown and update its TopologyDescription, the same as a failed server check on a monitoring socket.

Clients use the streaming protocol when supported

When a monitor discovers that the server supports the streamable hello or legacy hello command and the client does not have streaming disabled, it MUST use the streaming protocol.

Single-threaded monitoring

cooldownMS

After a single-threaded client gets a network error trying to check a server, the client skips re-checking the server until cooldownMS has passed.

This avoids spending connectTimeoutMS on each unavailable server during each scan.

This value MUST be 5000 ms, and it MUST NOT be configurable.

Scanning

Single-threaded clients MUST scan all servers synchronously, inline with regular application operations. Before each operation, the client checks if heartbeatFrequencyMS has passed since the previous scan ended, or if the topology is marked "stale"; if so it scans all the servers before selecting a server and performing the operation.

Selection failure triggers an immediate scan. When a client that uses single-threaded monitoring fails to select a suitable server for any operation, it scans the servers, then attempts selection again, to see if the scan discovered suitable servers. It repeats, waiting minHeartbeatFrequencyMS after each scan, until a timeout.
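
A rough sketch of this loop, assuming illustrative topology and selector APIs and a caller-supplied overall timeout:

import time

def select_server_single_threaded(topology, selector, heartbeat_frequency_ms,
                                  timeout_s, min_heartbeat_frequency_ms=500):
    deadline = time.monotonic() + timeout_s
    # Scan first if the topology is stale or the previous scan ended too long ago.
    if topology.stale or (time.monotonic() - topology.last_scan_end) * 1000 >= heartbeat_frequency_ms:
        topology.scan_all()
    while True:
        server = selector(topology)
        if server is not None:
            return server
        if time.monotonic() >= deadline:
            raise TimeoutError("No suitable server found before the timeout")
        # Wait minHeartbeatFrequencyMS after the previous scan, then rescan.
        time.sleep(min_heartbeat_frequency_ms / 1000.0)
        topology.scan_all()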

Scanning order

If the topology is a replica set, the client attempts to contact the primary as soon as possible to get an authoritative list of members. Otherwise, the client attempts to check all members it knows of, in order from the least-recently to the most-recently checked.

When all servers have been checked the scan is complete. New servers discovered during the scan MUST be checked before the scan is complete. Sometimes servers are removed during a scan so they are not checked, depending on the order of events.

The scanning order is expressed in this pseudocode:

scanStartTime = now()
# You'll likely need to convert units here.
beforeCoolDown = scanStartTime - cooldownMS

while true:
    serversToCheck = all servers with lastUpdateTime before scanStartTime

    remove from serversToCheck any Unknowns with lastUpdateTime > beforeCoolDown

    if no serversToCheck:
        # This scan has completed.
        break

    if a server in serversToCheck is RSPrimary:
        check it
    else if there is a PossiblePrimary:
        check it
    else if any servers are not of type Unknown or RSGhost:
        check the one with the oldest lastUpdateTime
        if several servers have the same lastUpdateTime, choose one at random
    else:
        check the Unknown or RSGhost server with the oldest lastUpdateTime
        if several servers have the same lastUpdateTime, choose one at random

This algorithm might be better understood with an example:

  1. The client is configured with one seed and TopologyType Unknown. It begins a scan.
  2. When it checks the seed, it discovers a secondary.
  3. The secondary's hello or legacy hello response includes the "primary" field with the address of the server that the secondary thinks is primary.
  4. The client creates a ServerDescription with that address, type PossiblePrimary, and lastUpdateTime "infinity ago". (See updateRSWithoutPrimary.)
  5. On the next iteration, there is still no RSPrimary, so the new PossiblePrimary is the top-priority server to check.
  6. The PossiblePrimary is checked and replaced with an RSPrimary. The client has now acquired an authoritative host list. Any new hosts in the list are added to the TopologyDescription with lastUpdateTime "infinity ago". (See updateRSFromPrimary.)
  7. The client continues scanning until all known hosts have been checked.

Another common case might be scanning a pool of mongoses. When the client first scans its seed list, they all have the default lastUpdateTime "infinity ago", so it scans them in random order. This randomness provides some load-balancing if many clients start at once. A client's subsequent scans of the mongoses are always in the same order, since their lastUpdateTimes are always in the same order by the time a scan ends.

minHeartbeatFrequencyMS

If a client frequently rechecks a server, it MUST wait at least minHeartbeatFrequencyMS milliseconds since the previous check ended, to avoid pointless effort. This value MUST be 500 ms, and it MUST NOT be configurable (no knobs).

heartbeatFrequencyMS

The interval between server checks, counted from the end of the previous check until the beginning of the next one.

For multi-threaded and asynchronous drivers it MUST default to 10 seconds and MUST be configurable. For single-threaded drivers it MUST default to 60 seconds and MUST be configurable. It MUST be called heartbeatFrequencyMS unless this breaks backwards compatibility.

For both multi- and single-threaded drivers, the driver MUST NOT permit users to configure it less than minHeartbeatFrequencyMS (500ms).

(See heartbeatFrequencyMS in the main SDAM spec.)

Awaitable hello or legacy hello Server Specification

As of MongoDB 4.4 the hello or legacy hello command can wait to reply until there is a topology change or a maximum time has elapsed. Clients opt in to this "awaitable hello" feature by passing new parameters "topologyVersion" and "maxAwaitTimeMS" to the hello or legacy hello commands. Exhaust support has also been added, which clients can enable in the usual manner by setting the OP_MSG exhaustAllowed flag.

Clients use the awaitable hello feature as the basis of the streaming heartbeat protocol to learn much sooner about stepdowns, elections, reconfigs, and other events.

topologyVersion

A server that supports awaitable hello or legacy hello includes a "topologyVersion" field in all hello or legacy hello replies and State Change Error replies. The topologyVersion is a subdocument with two fields, "processId" and "counter":

{
    topologyVersion: {processId: <ObjectId>, counter: <int64>},
    ( ... other fields ...)
}

processId

An ObjectId maintained in memory by the server. It is reinitialized by the server using the standard ObjectId logic each time this server process starts.

counter

An int64 state change counter, maintained in memory by the server. It begins at 0 when the server starts, and it is incremented whenever there is a significant topology change.

maxAwaitTimeMS

To enable awaitable hello or legacy hello, the client includes a new int64 field "maxAwaitTimeMS" in the hello or legacy hello request. This field determines the maximum duration in milliseconds a server will wait for a significant topology change before replying.

Feature Discovery

To discover if the connected server supports awaitable hello or legacy hello, a client checks the most recent hello or legacy hello command reply. If the reply includes "topologyVersion" then the server supports awaitable hello or legacy hello.

Awaitable hello or legacy hello Protocol

To initiate an awaitable hello or legacy hello command, the client includes both maxAwaitTimeMS and topologyVersion in the request, for example:

{
    hello: 1,
    maxAwaitTimeMS: 10000,
    topologyVersion: {processId: <ObjectId>, counter: <int64>},
    ( ... other fields ...)
}

Clients MAY additionally set the OP_MSG exhaustAllowed flag to enable streaming hello or legacy hello. With streaming hello or legacy hello, the server MAY send multiple hello or legacy hello responses without waiting for further requests.

A server that implements the new protocol follows these rules:

  • Always include the server's topologyVersion in hello, legacy hello, and State Change Error replies.
  • If the request includes topologyVersion without maxAwaitTimeMS or vice versa, return an error.
  • If the request omits topologyVersion and maxAwaitTimeMS, reply immediately.
  • If the request includes topologyVersion and maxAwaitTimeMS, then reply immediately if the server's topologyVersion.processId does not match the request's, otherwise reply when the server's topologyVersion.counter is greater than the request's, or maxAwaitTimeMS elapses, whichever comes first.
  • Following the OP_MSG spec, if the request omits the exhaustAllowed flag, the server MUST NOT set the moreToCome flag on the reply. If the request's exhaustAllowed flag is set, the server MAY set the moreToCome flag on the reply. If the server sets moreToCome, it MUST continue streaming replies without awaiting further requests. Between replies it MUST wait until the server's topologyVersion.counter is incremented or maxAwaitTimeMS elapses, whichever comes first. If the reply includes ok: 0 the server MUST NOT set the moreToCome flag.
  • On a topology change that changes the horizon parameters, the server will close all application connections.

Example awaitable hello conversation:

| Client                                                      | Server                         |
| ----------------------------------------------------------- | ------------------------------ |
| hello handshake ->                                           |                                |
|                                                              | <- reply with topologyVersion  |
| hello as OP_MSG with maxAwaitTimeMS and topologyVersion ->   |                                |
|                                                              | wait for change or timeout     |
|                                                              | <- OP_MSG with topologyVersion |
| ...                                                          |                                |

Example streaming hello conversation (awaitable hello with exhaust):

| Client                                                                     | Server                                        |
| --------------------------------------------------------------------------- | --------------------------------------------- |
| hello handshake ->                                                           |                                               |
|                                                                              | <- reply with topologyVersion                 |
| hello as OP_MSG with exhaustAllowed, maxAwaitTimeMS, and topologyVersion ->  |                                               |
|                                                                              | wait for change or timeout                    |
|                                                                              | <- OP_MSG with moreToCome and topologyVersion |
|                                                                              | wait for change or timeout                    |
|                                                                              | <- OP_MSG with moreToCome and topologyVersion |
|                                                                              | ...                                           |
|                                                                              | <- OP_MSG without moreToCome                  |
| ...                                                                          |                                               |

Streaming Protocol

The streaming protocol is used to monitor MongoDB 4.4+ servers and optimally reduces the time it takes for a client to discover server state changes. Multi-threaded or asynchronous drivers MUST use the streaming protocol when connected to a server that supports the awaitable hello or legacy hello commands. This protocol requires an extra thread and an extra socket for each monitor to perform RTT calculations.

Streaming disabled

The streaming protocol MUST be disabled when either:

  • the client is configured with serverMonitoringMode=poll, or
  • the client is configured with serverMonitoringMode=auto and a FaaS platform is detected, or
  • the server does not support streaming (e.g. MongoDB < 4.4).

When the streaming protocol is disabled the client MUST use the polling protocol and MUST NOT start an extra thread or connection for Measuring RTT.

See Why disable the streaming protocol on FaaS platforms like AWS Lambda?.

Streaming hello or legacy hello

The streaming hello or legacy hello protocol uses awaitable hello or legacy hello with the OP_MSG exhaustAllowed flag to continuously stream hello or legacy hello responses from the server. Drivers MUST set the OP_MSG exhaustAllowed flag with the awaitable hello or legacy hello command and MUST process each hello or legacy hello response. (I.e., they MUST process responses strictly in the order they were received.)

A client follows these rules when processing the hello or legacy hello exhaust response:

  • If the response indicates a command error, or a network error or timeout occurs, the client MUST close the connection and restart the monitoring protocol on a new connection. (See Network or command error during server check.)
  • If the response is successful (includes "ok:1") and includes the OP_MSG moreToCome flag, then the client begins reading the next response.
  • If the response is successful (includes "ok:1") and does not include the OP_MSG moreToCome flag, then the client initiates a new awaitable hello or legacy hello with the topologyVersion field from the previous response.

Socket timeout

Clients MUST use connectTimeoutMS as the timeout for the connection handshake. When connectTimeoutMS=0, the timeout is unlimited and MUST remain unlimited for awaitable hello and legacy hello replies. Otherwise, connectTimeoutMS is non-zero and clients MUST use connectTimeoutMS + heartbeatFrequencyMS as the timeout for awaitable hello and legacy hello replies.
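
In other words, the reply timeout for awaitable hello can be derived as in this small sketch (the function name is illustrative):

def awaitable_hello_timeout_ms(connect_timeout_ms, heartbeat_frequency_ms):
    # connectTimeoutMS=0 means "no timeout", which stays unlimited for
    # awaitable hello and legacy hello replies.
    if connect_timeout_ms == 0:
        return None
    return connect_timeout_ms + heartbeat_frequency_ms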

Measuring RTT

When using the streaming protocol, clients MUST issue a hello or legacy hello command to each server to measure RTT every heartbeatFrequencyMS. The RTT command MUST be run on a dedicated connection to each server. Clients MUST NOT use dedicated connections to measure RTT when the streaming protocol is not used. (See Monitors MUST use a dedicated connection for RTT commands.)

Clients MUST update the RTT from the hello or legacy hello duration of the initial connection handshake. Clients MUST NOT update RTT based on streaming hello or legacy hello responses.

Clients MUST ignore the response to the hello or legacy hello command when measuring RTT. Errors encountered when running a hello or legacy hello command MUST NOT update the topology. (See Why don't clients mark a server unknown when an RTT command fails?)

Clients MUST track the minimum RTT out of the (at most) last 10 samples. Clients MUST report the minimum RTT as 0 until at least 2 samples have been gathered.

When constructing a ServerDescription from a streaming hello or legacy hello response, clients MUST set the average and minimum round trip times from the RTT task as the "roundTripTime" and "minRoundTripTime" fields, respectively.

See the pseudocode in the RTT thread section for an example implementation.

SDAM Monitoring

Clients MUST publish a ServerHeartbeatStartedEvent before attempting to read the next hello or legacy hello exhaust response. (See Why must streaming hello or legacy hello clients publish ServerHeartbeatStartedEvents?)

Clients MUST NOT publish any events when running an RTT command. (See Why don't streaming hello or legacy hello clients publish events for RTT commands?)

Heartbeat frequency

In the polling protocol, a client sleeps between each hello or legacy hello check (for at least minHeartbeatFrequencyMS and up to heartbeatFrequencyMS). In the streaming protocol, after processing an "ok:1" hello or legacy hello response, the client MUST NOT sleep and MUST begin the next check immediately.

Clients MUST set maxAwaitTimeMS to heartbeatFrequencyMS.

hello or legacy hello Cancellation

When a client is closed, clients MUST cancel all hello and legacy hello checks; a monitor blocked waiting for the next streaming hello or legacy hello response MUST be interrupted such that threads may exit promptly without waiting maxAwaitTimeMS.

When a client marks a server Unknown from Network error when reading or writing, clients MUST cancel the hello or legacy hello check on that server and close the current monitoring connection. (See Drivers cancel in-progress monitor checks.)

Polling Protocol

The polling protocol is used to monitor MongoDB < 4.4 servers or when [streaming is disabled](#streaming-disabled). The client checks a server with a hello or legacy hello command and then sleeps for heartbeatFrequencyMS before running another check.

Marking the connection pool as ready (CMAP only)

When a monitor completes a successful check against a server, it MUST mark the connection pool for that server as "ready", and doing so MUST be synchronized with the update to the topology (e.g. by marking the pool as ready in onServerDescriptionChanged). This is required to ensure a server does not get selected while its pool is still paused. See the Connection Pool definition in the CMAP specification for more details on marking the pool as "ready".

Error handling

Network or command error during server check

When a server check fails due to a network error (including a network timeout) or a command error (ok: 0), the client MUST follow these steps:

  1. Close the current monitoring connection.
  2. Mark the server Unknown.
  3. Clear the connection pool for the server (See Clear the connection pool on both network and command errors). For CMAP compliant drivers, clearing the pool MUST be synchronized with marking the server as Unknown (see Why synchronize clearing a server's pool with updating the topology?). If this was a network timeout error, then the pool MUST be cleared with interruptInUseConnections = true (see Why does the pool need to support interrupting in use connections as part of its clear logic?)
  4. If this was a network error and the server was in a known state before the error, the client MUST NOT sleep and MUST begin the next check immediately. (See retry hello or legacy hello calls once and JAVA-1159.)
  5. Otherwise, wait for heartbeatFrequencyMS (or minHeartbeatFrequencyMS if a check is requested) before restarting the monitoring protocol on a new connection.
    • Note that even in the streaming protocol, a monitor in this state will wait for an application operation to [request an immediate check](#requesting-an-immediate-check) or for the heartbeatFrequencyMS timeout to expire before beginning the next check.

See the pseudocode in the Monitor thread section.

Note that this rule applies only to server checks during monitoring. It does not apply when multi-threaded clients update the topology from each handshake.

Implementation notes

This section intends to provide generous guidance to driver authors. It is complementary to the reference implementations. Words like "should", "may", and so on are used more casually here.

Monitor thread

Most platforms can use an event object to control the monitor thread. The event API here is assumed to be like the standard Python Event. heartbeatFrequencyMS is configurable; minHeartbeatFrequencyMS is always 500 milliseconds:

class Monitor(Thread):
  def __init__():
      # Monitor options:
      serverAddress = serverAddress
      connectTimeoutMS = connectTimeoutMS
      heartbeatFrequencyMS = heartbeatFrequencyMS
      minHeartbeatFrequencyMS = 500
      stableApi = stableApi
      if serverMonitoringMode == "stream":
          streamingEnabled = True
      elif serverMonitoringMode == "poll":
          streamingEnabled = False
      else:  # serverMonitoringMode == "auto"
          streamingEnabled = not isFaas()

      # Internal Monitor state:
      connection = Null
      # Server API versioning implies that the server supports hello.
      helloOk = stableApi != Null
      description = default ServerDescription
      lock = Mutex()
      rttMonitor = RttMonitor(serverAddress, stableApi)

  def run():
      while this monitor is not stopped:
          previousDescription = description
          try:
              description = checkServer(previousDescription)
          except CheckCancelledError:
              if this monitor is stopped:
                  # The client was closed.
                  return
              # The client marked this server Unknown and cancelled this
              # check during "Network error when reading or writing".
              # Wait before running the next check.
              wait()
              continue

          with client.lock:
              topology.onServerDescriptionChanged(description, connection pool for server)
              if description.error != Null:
                  # Clear the connection pool only after the server description is set to Unknown.
                  clear(interruptInUseConnections: isNetworkTimeout(description.error)) connection pool for server

          # Immediately proceed to the next check if the previous response
          # was successful and included the topologyVersion field, or the
          # previous response included the moreToCome flag, or the server
          # has just transitioned to Unknown from a network error.
          serverSupportsStreaming = description.type != Unknown and description.topologyVersion != Null
          connectionIsStreaming = connection != Null and connection.moreToCome
          transitionedWithNetworkError = isNetworkError(description.error) and previousDescription.type != Unknown
          if streamingEnabled and serverSupportsStreaming and not rttMonitor.started:
              # Start the RttMonitor.
              rttMonitor.run()
          if (streamingEnabled and (serverSupportsStreaming or connectionIsStreaming)) or transitionedWithNetworkError:
              continue

          wait()

  def setUpConnection():
      # Take the mutex to avoid a data race because this code writes to the connection field and a concurrent
      # cancelCheck call could be reading from it.
      with lock:
          # Server API versioning implies that the server supports hello.
          helloOk = stableApi != Null
          connection = new Connection(serverAddress)
          set connection timeout to connectTimeoutMS

      # Do any potentially blocking operations after releasing the mutex.
      create the socket and perform connection handshake

  def checkServer(previousDescription):
      try:
          # The connection is null if this is the first check. It's closed if there was an error during the previous
          # check or the previous check was cancelled.

          if helloOk:
              helloCommand = hello
          else:
              helloCommand = legacy hello

          if not connection or connection.isClosed():
              setUpConnection()
              rttMonitor.addSample(connection.handshakeDuration)
              response = connection.handshakeResponse
          elif connection.moreToCome:
              response = read next helloCommand exhaust response
          elif streamingEnabled and previousDescription.topologyVersion:
              # Initiate streaming hello or legacy hello
              if connectTimeoutMS != 0:
                  set connection timeout to connectTimeoutMS+heartbeatFrequencyMS
              response = call {helloCommand: 1, helloOk: True, topologyVersion: previousDescription.topologyVersion, maxAwaitTimeMS: heartbeatFrequencyMS}
          else:
              # The server does not support topologyVersion or streamingEnabled=False.
              response = call {helloCommand: 1, helloOk: True}

          # If the server supports hello, then response.helloOk will be true
          # and hello will be used for subsequent monitoring commands.
          # If the server does not support hello, then response.helloOk will be undefined
          # and legacy hello will be used for subsequent monitoring commands.
          helloOk = response.helloOk

          return ServerDescription(response, rtt=rttMonitor.average(), minRtt=rttMonitor.min())
      except Exception as exc:
          close connection
          rttMonitor.reset()
          return ServerDescription(type=Unknown, error=exc)

  def wait():
      start = gettime()

      # Can be awakened by requestCheck().
      event.wait(heartbeatFrequencyMS)
      event.clear()

      waitTime = gettime() - start
      if waitTime < minHeartbeatFrequencyMS:
          # Cannot be awakened.
          sleep(minHeartbeatFrequencyMS - waitTime)

Requesting an immediate check:

def requestCheck():
    event.set()

hello or legacy hello Cancellation:

def cancelCheck():
    # Take the mutex to avoid reading the connection value while setUpConnection is writing to it.
    # Copy the connection value in the lock but do the actual cancellation outside.
    with lock:
        tempConnection = connection

    if tempConnection:
      interrupt connection read
      close tempConnection

RTT thread

The requirements in the Measuring RTT section can be satisfied with an additional thread that periodically runs the hello or legacy hello command on a dedicated connection, for example:

class RttMonitor(Thread):
  def __init__():
      # Options:
      serverAddress = serverAddress
      connectTimeoutMS = connectTimeoutMS
      heartbeatFrequencyMS = heartbeatFrequencyMS
      stableApi = stableApi

      # Internal state:
      connection = Null
      # Server API versioning implies that the server supports hello.
      helloOk = stableApi != Null
      lock = Mutex()
      movingAverage = MovingAverage()
      # Track the min RTT seen in the most recent 10 samples.
      recentSamples = deque(maxlen=10)

  def reset():
      with lock:
          movingAverage.reset()
          recentSamples.clear()

  def addSample(rtt):
      with lock:
          movingAverage.update(rtt)
          recentSamples.append(rtt)

  def average():
      with lock:
          return movingAverage.get()

  def min():
      with lock:
          # Need at least 2 RTT samples.
          if len(recentSamples) < 2:
              return 0
          return min(recentSamples)

  def run():
      while this monitor is not stopped:
          try:
              rtt = pingServer()
              addSample(rtt)
          except Exception as exc:
              # Don't call reset() here. The Monitor thread is responsible
              # for resetting the average RTT.
              close connection
              connection = Null
              helloOk = stableApi != Null

          # Can be awakened when the client is closed.
          event.wait(heartbeatFrequencyMS)
          event.clear()

  def setUpConnection():
      # Server API versioning implies that the server supports hello.
      helloOk = stableApi != Null
      connection = new Connection(serverAddress)
      set connection timeout to connectTimeoutMS
      perform connection handshake

  def pingServer():
      if helloOk:
          helloCommand = hello
      else:
          helloCommand = legacy hello

      if not connection:
          setUpConnection()
          return RTT of the connection handshake

      start = time()
      response = call {helloCommand: 1, helloOk: True}
      rtt = time() - start
      helloOk = response.helloOk
      return rtt

Design Alternatives

Alternating hello or legacy hello to check servers and RTT without adding an extra connection

The streaming hello or legacy hello protocol is optimal in terms of latency; clients are always blocked waiting for the server to stream updated hello or legacy hello information, so they learn of server state changes as soon as possible. However, streaming hello or legacy hello has two downsides:

  1. Streaming hello or legacy hello requires a new connection to each server to calculate the RTT.
  2. Streaming hello or legacy hello requires a new thread (or threads) to calculate the RTT of each server.

To address these concerns we designed the alternating hello or legacy hello protocol. This protocol would have alternated between awaitable hello or legacy hello and regular hello or legacy hello. The awaitable hello or legacy hello replaces the polling protocol's client side sleep and allows the client to receive updated hello or legacy hello responses sooner. The regular hello or legacy hello allows the client to maintain accurate RTT calculations without requiring any extra threads or sockets.

We reject this design because streaming hello or legacy hello is strictly better at reducing the client's time-to-recovery. We determined that one extra connection per server per MongoClient is reasonable for all drivers. Applications that upgrade may see a modest increase in connections and memory usage on the server. We don't expect this increase to be problematic; however, we have several projects planned for future MongoDB releases to make the streaming hello or legacy hello protocol cheaper server-side which should mitigate the cost of the extra monitoring connections.

Use TCP smoothed round-trip time instead of measuring RTT explicitly

TCP sockets internally maintain a "smoothed round-trip time" or SRTT. Drivers could use this SRTT instead of measuring RTT explicitly via hello or legacy hello commands. The server could even include this value on all hello or legacy hello responses. We reject this idea for a few reasons:

  • Not all programming languages have an API to access the TCP socket's RTT.
  • On Windows, RTT access requires Admin privileges.
  • TCP's SRTT would likely differ substantially from RTT measurements in the current protocol. For example, the SRTT can be reset on retransmission timeouts.

Rationale

Thread per server

Mongos uses a monitor thread per replica set, rather than a thread per server. A thread per server is impractical if mongos is monitoring a large number of replica sets. But a driver only monitors one.

In mongos, threads trying to do reads and writes join the effort to scan the replica set. Such threads are more likely to be abundant in mongos than in drivers, so mongos can rely on them to help with monitoring.

In short: mongos has different scaling concerns than a multi-threaded or asynchronous driver, so it allocates threads differently.

Socket timeout for monitoring is connectTimeoutMS

When a client waits for a server to respond to a connection, the client does not know if the server will respond eventually or if it is down. Users can help the client guess correctly by supplying a reasonable connectTimeoutMS for their network: on some networks a server is probably down if it hasn't responded in 10 ms, on others a server might still be up even if it hasn't responded in 10 seconds.

The socketTimeoutMS, on the other hand, must account for both network latency and the operation's duration on the server. Applications should typically set a very long or infinite socketTimeoutMS so they can wait for long-running MongoDB operations.

Multi-threaded clients use distinct sockets for monitoring and for application operations. A socket used for monitoring does two things: it connects and calls hello or legacy hello. Both operations are fast on the server, so only network latency matters. Thus both operations SHOULD use connectTimeoutMS, since that is the value users supply to help the client guess if a server is down, based on users' knowledge of expected latencies on their networks.

A monitor SHOULD NOT use the client's regular connection pool

If a multi-threaded driver's connection pool enforces a maximum size and monitors use sockets from the pool, there are two bad options: either monitors compete with the application for sockets, or monitors have the exceptional ability to create sockets even when the pool has reached its maximum size. The former risks starving the monitor. The latter is more complex than it is worth. (A lesson learned from PyMongo 2.6's pool, which implemented this option.)

Since this rule is justified for drivers that enforce a maximum pool size, this spec recommends that all drivers follow the same rule for the sake of consistency.

Monitors MUST use a dedicated connection for RTT commands

When using the streaming protocol, a monitor needs to maintain an extra dedicated connection to periodically update its average round trip time in order to support localThresholdMS from the Server Selection spec.

It could pop a connection from its regular pool, but we rejected this option for a few reasons:

  • Under contention the RTT task may block application operations from completing in a timely manner.
  • Under contention the application may block the RTT task from completing in a timely manner.
  • Under contention the RTT task may often result in an extra connection anyway because the pool creates new connections under contention up to maxPoolSize.
  • This would be inconsistent with the rule that a monitor SHOULD NOT use the client's regular connection pool.

The client could open and close a new connection for each RTT check. We rejected this design, because if we ping every heartbeatFrequencyMS (default 10 seconds) then the cost to the client and the server of creating and destroying the connection might exceed the cost of keeping a dedicated connection open.

Instead, the client must use a dedicated connection reserved for RTT commands. Despite the cost of the additional connection per server, we chose this option as the safest and least likely to result in surprising behavior under load.

Monitors MUST use the hello or legacy hello command to measure RTT

In the streaming protocol, clients could use the "ping", "hello", or legacy hello commands to measure RTT. This spec chooses "hello" or legacy hello for consistency with the polling protocol as well as consistency with the initial RTT provided by the connection handshake, which also uses the hello or legacy hello commands. Additionally, mongocryptd does not allow the ping command but does allow hello or legacy hello.
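
As a non-normative illustration, an RTT task might look like the following sketch (the connection interface and the moving-average weighting are assumptions, not requirements of this spec):

```python
import time

class RttMonitor:
    """Measures RTT with the hello command on a dedicated connection."""

    def __init__(self, connection, alpha=0.2):
        self.connection = connection   # dedicated connection, not from the pool
        self.alpha = alpha             # weight for an exponential moving average
        self.average_rtt = None

    def run_once(self):
        start = time.monotonic()
        try:
            self.connection.run_command({"hello": 1})
        except Exception:
            # Errors here MUST NOT mark the server Unknown; the Monitor will
            # discover the server's real state on its own.
            return
        rtt = time.monotonic() - start
        if self.average_rtt is None:
            self.average_rtt = rtt
        else:
            self.average_rtt = self.alpha * rtt + (1 - self.alpha) * self.average_rtt
```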

Why not use awaitedTimeMS in the server response to calculate RTT in the streaming protocol?

One approach to calculating RTT in the streaming protocol would be to have the server return an awaitedTimeMS in its hello or legacy hello response. A driver could then determine the RTT by calculating the difference between the initial request, or last response, and the awaitedTimeMS.

We rejected this design because of a number of issues with the unreliability of clocks in distributed systems. Clocks skew between local and remote systems. This approach mixes two notions of time: the local clock times the whole operation while the remote clock times the wait. This means that if these clocks tick at different rates, or there are anomalies like clock changes, you will get bad results. To make matters worse, you will be comparing times from multiple servers that could each have clocks ticking at different rates. This approach will bias toward servers with the fastest ticking clock, since it will seem like it spends the least time on the wire.

Additionally, systems using NTP will experience clock "slew". ntpd "slews" time by up to 500 parts-per-million to have the local time gradually approach the "true" time without big jumps - over a 10 second window that means a 5ms difference. If both sides are slewing in opposite directions, that can result in an effective difference of 10ms. Both of these differences are close enough to localThresholdMS (default 15ms) to significantly affect which servers are viable in NEAREST calculations.

Ensuring that all measurements use the same clock obviates the need for a more complicated solution, and mitigates the above mentioned concerns.

Why don't clients mark a server unknown when an RTT command fails?

In the streaming protocol, clients use the hello or legacy hello command on a dedicated connection to measure a server's RTT. However, errors encountered when running the RTT command MUST NOT mark a server Unknown. We reached this decision because the dedicated RTT connection does not come from a connection pool and thus does not have a generation number associated with it. Without a generation number we cannot handle errors from the RTT command without introducing race conditions. Introducing such a generation number would add complexity to this design without much benefit. It is safe to ignore these errors because the Monitor will soon discover the server's state regardless (either through an updated streaming response, an error on the streaming connection, or by handling an error on an application connection).

Drivers cancel in-progress monitor checks

When an application operation fails with a non-timeout network error, drivers cancel that monitor's in-progress check.

We assume that a non-timeout network error on one application connection implies that all other connections to that server are also bad. This means that it is redundant to continue reading on the current monitoring connection. Instead, we cancel the current monitor check, close the monitoring connection, and start a new check soon. Note that we rely on the connection/pool generation number checking to avoid races and ensure that the monitoring connection is only closed once.

This approach also handles the rare case where the client sees a network error on an application connection but the monitoring connection is still healthy. If we did not cancel the monitor check in this scenario, then the server would remain in the Unknown state until the next hello or legacy hello response (up to maxAwaitTimeMS). A potential real world example of this behavior is when Azure closes an idle connection in the application pool.

Retry hello or legacy hello calls once

A monitor's connection to a server is long-lived and used only for hello or legacy hello calls. So if a server has responded in the past, a network error on the monitor's connection means that there was a network glitch, or a server restart since the last check, or that the server is truly down. To handle the case that the server is truly down, the monitor makes the server unselectable by marking it Unknown. To handle the case of a transient network glitch or restart, the monitor immediately runs the next check without waiting.

Clear the connection pool on both network and command errors

A monitor clears the connection pool when a server check fails with a network or command error (Network or command error during server check). When the check fails with a network error it is likely that all connections to that server are also closed. (See JAVA-1252.) When the check fails with a network timeout error, a monitor MUST set interruptInUseConnections to true. See Why does the pool need to support closing in-use connections as part of its clear logic?

When the server is shutting down, it may respond to hello or legacy hello commands with ShutdownInProgress errors before closing connections. In this case, the monitor clears the connection pool because all connections will be closed soon. Other command errors are unexpected but are handled identically.
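
A hedged sketch of this rule (method names are assumptions, not part of this spec):

```python
# On a failed server check: mark the server Unknown and clear its pool.
# interruptInUseConnections is set to True only for network timeout errors.
def handle_check_failure(server, error, is_network_timeout):
    server.mark_unknown(error)
    server.pool.clear(interrupt_in_use_connections=is_network_timeout)
```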

Why must streaming hello or legacy hello clients publish ServerHeartbeatStartedEvents?

The SDAM Monitoring spec guarantees that every ServerHeartbeatStartedEvent has either a correlating ServerHeartbeatSucceededEvent or ServerHeartbeatFailedEvent. This is consistent with Command Monitoring on exhaust cursors where the driver publishes a fake CommandStartedEvent before reading the next getMore response.

Why don't streaming hello or legacy hello clients publish events for RTT commands?

In the streaming protocol, clients MUST NOT publish any events (server, topology, command, CMAP, etc..) when running an RTT command. We considered introducing new RTT events (ServerRTTStartedEvent, ServerRTTSucceededEvent, ServerRTTFailedEvent) but it's not clear that there is a demand for this. Applications can still monitor changes to a server's RTT by listening to TopologyDescriptionChangedEvents.

What is the purpose of the "awaited" field on server heartbeat events?

ServerHeartbeatSucceededEvents published from awaitable hello or legacy hello responses will regularly have 10 second durations. The spec introduces the "awaited" field on server heartbeat events so that applications can differentiate a slow heartbeat in the polling protocol from a normal awaitable hello or legacy hello heartbeat in the new protocol.

Why disable the streaming protocol on FaaS platforms like AWS Lambda?

The streaming protocol relies on the assumption that the client can read the server's heartbeat responses in a timely manner, otherwise the client will be acting on stale information. In many FaaS platforms, like AWS Lambda, host applications will be suspended and resumed many minutes later. This behavior causes a build up of heartbeat responses and the client can end up spending a long time in a catch up phase processing outdated responses. This problem was discovered in DRIVERS-2246.

Additionally, the streaming protocol requires an extra connection and thread per monitored server which is expensive on platforms like AWS Lambda. The extra connection is particularly inefficient when thousands of AWS instances and thus thousands of clients are used.

We decided to make polling the default behavior when running on FaaS platforms like AWS Lambda to improve scalability, performance, and reliability.

Why introduce a knob for serverMonitoringMode?

The serverMonitoringMode knob provides a workaround in cases where the polling protocol would be a better choice but the driver is not running on a FaaS platform. It also provides a workaround in case the FaaS detection logic becomes outdated or inaccurate.

Changelog

  • 2024-05-02: Migrated from reStructuredText to Markdown.

  • 2020-02-20: Extracted server monitoring from SDAM into this new spec.

  • 2020-03-09: A monitor check that creates a new connection MUST use the connection's handshake to update the topology.

  • 2020-04-20: Add streaming heartbeat protocol.

  • 2020-05-20: Include rationale for why we don't use awaitedTimeMS

  • 2020-06-11: Support connectTimeoutMS=0 in streaming heartbeat protocol.

  • 2020-12-17: Mark the pool for a server as "ready" after performing a successful check. Synchronize pool clearing with SDAM updates.

  • 2021-06-21: Added support for hello/helloOk to handshake and monitoring.

  • 2021-06-24: Remove optimization mention that no longer applies

  • 2022-01-19: Add 90th percentile RTT tracking.

  • 2022-02-24: Rename Versioned API to Stable API

  • 2022-04-05: Preemptively cancel in progress operations when SDAM heartbeats timeout.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2022-11-17: Add minimum RTT tracking and remove 90th percentile RTT.

  • 2023-10-05: Add serverMonitoringMode and default to the polling protocol on FaaS. Clients MUST NOT use dedicated connections to measure RTT when using the polling protocol.



Polling SRV Records for mongos Discovery

  • Status: Accepted
  • Minimum Server Version: N/A



Abstract

Currently the [Initial DNS Seedlist Discovery](../initial-dns-seedlist-discovery/initial-dns-seedlist-discovery.md) functionality provides a static seedlist when a MongoClient is constructed. Periodically polling the DNS SRV records would allow for the mongos proxy list to be updated without having to change client configuration.

This specification builds on top of the original Initial DNS Seedlist Discovery specification, and modifies the [Server Discovery and Monitoring](../server-discovery-and-monitoring/server-discovery-and-monitoring.md) specification's definition of monitoring a set of mongos servers in a Sharded TopologyType.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

Specification

Terms

rescan, rescanning

A rescan is the periodic scan of all DNS SRV records to discover a new set of mongos hosts.

rescanSRVIntervalMS

An internal value representing how often the DNS SRV records should be queried for.

Implementation

If the initial topology was created through a mongodb+srv:// URI, then drivers MUST implement this specification by periodically rescanning the SRV DNS records. There MUST NOT be an option to turn this behaviour off.

Drivers MUST NOT implement this specification if they do not adhere fully to the [Initial DNS Seedlist Discovery](../initial-dns-seedlist-discovery/initial-dns-seedlist-discovery.md) specification.

This feature is only available when the Server Discovery has determined that the TopologyType is Sharded, or Unknown. Drivers MUST NOT rescan SRV DNS records when the Topology is not Sharded (i.e. Single, ReplicaSetNoPrimary, or ReplicaSetWithPrimary).

The discovery of a set of mongos servers is explained in the [seedlist discovery](../initial-dns-seedlist-discovery/initial-dns-seedlist-discovery.md#seedlist-discovery) section of the original specification. The behaviour of the periodic rescan is similar, but not identical, to the behaviour of initial seedlist discovery. A periodic rescan MUST follow these rules:

  • The driver will query the DNS server for SRV records on {hostname}.{domainname}, prefixed with the SRV service name and protocol. The SRV service name is provided in the [srvServiceName](../initial-dns-seedlist-discovery/initial-dns-seedlist-discovery.md#srvservicename) URI option and defaults to mongodb. The protocol is always tcp. After prefixing, the lookup name should look like: `_{srvServiceName}._tcp.{hostname}.{domainname}`.

  • A driver MUST verify that the host names returned through SRV records have the same parent {domainname}. When this verification fails, a driver:

    • MUST NOT add such a non-compliant host name to the topology
    • MUST NOT raise an error
    • SHOULD log the non-compliance, including the host name
    • MUST NOT initiate a connection to any such host
  • If the DNS request returns no verified hosts in SRV records, no SRV records at all, or a DNS error happens, the driver:

    • MUST NOT change the topology
    • MUST NOT raise an error
    • SHOULD log this situation, including the reason why the DNS records could not be found, if possible
    • MUST temporarily set rescanSRVIntervalMS to heartbeatFrequencyMS until at least one verified SRV record is obtained.
  • For all verified host names, as returned through the DNS SRV query, the driver (see the sketch after this list):

    • MUST remove all hosts that are part of the topology, but are no longer in the returned set of valid hosts
    • MUST NOT remove all hosts and then re-add the ones that were returned. Hosts that have not changed MUST be left alone and unchanged.
    • If [srvMaxHosts](../initial-dns-seedlist-discovery/initial-dns-seedlist-discovery.md#srvmaxhosts) is zero or greater than or equal to the number of valid hosts, each valid new host MUST be added to the topology as Unknown.
    • If [srvMaxHosts](../initial-dns-seedlist-discovery/initial-dns-seedlist-discovery.md#srvmaxhosts) is greater than zero and less than the number of valid hosts, valid new hosts MUST be randomly selected and added to the topology as Unknown until the topology has srvMaxHosts hosts. Drivers MUST use the same randomization algorithm as they do for [initial selection](../initial-dns-seedlist-discovery/initial-dns-seedlist-discovery.md#querying-dns).
  • Priorities and weights in SRV records MUST continue to be ignored, and MUST NOT dictate which mongos server is used for new connections.
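
The following non-normative sketch illustrates one possible shape of a rescan, assuming a resolver that returns (host, ttl) pairs and a topology object that can add and remove hosts; all names are illustrative only:

```python
import random

def rescan_srv(resolver, topology, hostname, domainname,
               srv_service_name="mongodb", srv_max_hosts=0):
    query = f"_{srv_service_name}._tcp.{hostname}.{domainname}"
    try:
        records = resolver.resolve_srv(query)            # [(host, ttl), ...]
    except Exception:
        return None                                      # leave the topology unchanged
    # Keep only hosts that share the parent domain (simplified verification).
    verified = {host for host, _ in records if host.endswith("." + domainname)}
    if not verified:
        return None                                      # leave the topology unchanged
    # Remove hosts no longer present; hosts that have not changed are left alone.
    for host in set(topology.hosts) - verified:
        topology.remove_host(host)
    new_hosts = list(verified - set(topology.hosts))
    if 0 < srv_max_hosts < len(verified):
        random.shuffle(new_hosts)
        new_hosts = new_hosts[: max(0, srv_max_hosts - len(topology.hosts))]
    for host in new_hosts:
        topology.add_host_as_unknown(host)
    return min(ttl for _, ttl in records)                # feeds rescanSRVIntervalMS
```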

The rescan needs to happen periodically. As SRV records contain a TTL value, this value can be used to indicate when a rescan needs to happen. Different SRV records can have different TTL values. The rescanSRVIntervalMS value MUST be set to the lowest of the individual TTL values associated with the different SRV records in the most recent rescan, but MUST NOT be lower than 60 seconds. If a driver is unable to access the TTL values of SRV records, it MUST rescan every 60 seconds.
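
A minimal sketch of that calculation (assuming the TTLs, in seconds, are available; names are illustrative):

```python
MIN_RESCAN_INTERVAL_MS = 60_000  # rescanSRVIntervalMS MUST NOT be lower than 60 seconds

def compute_rescan_interval_ms(ttls_seconds):
    # ttls_seconds: TTL values of the SRV records from the most recent rescan.
    if not ttls_seconds:
        return MIN_RESCAN_INTERVAL_MS  # no TTL information: rescan every 60 seconds
    return max(min(ttls_seconds) * 1000, MIN_RESCAN_INTERVAL_MS)
```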

Drivers SHOULD endeavour to rescan and obtain a new list of mongos servers every rescanSRVIntervalMS value. The rescanSRVIntervalMS period SHOULD be calculated from the end of the previous rescan (or the end of the initial DNS seedlist discovery scan).


Multi-Threaded Drivers

A threaded driver MUST use a separate monitoring thread for scanning the DNS records so that DNS lookups don't block other operations.

Single-Threaded Drivers

The rescan MUST happen before scanning all servers as part of the normal [scanning](../server-discovery-and-monitoring/server-discovery-and-monitoring.md#scanning) functionality, but only if rescanSRVIntervalMS has passed.


Test Plan

See README.rst in the accompanying [test directory](tests).


Motivation for Change

The original [Initial DNS Seedlist Discovery](../initial-dns-seedlist-discovery/initial-dns-seedlist-discovery.md) specification only regulates the initial list of mongos hosts to be used instead of a single hostname from a connection URI. Although this makes the initial configuration of a set of mongos servers a lot easier, it does not provide a method for updating the list of mongos servers in the topology.

Since the introduction of the mongodb+srv:// schema to provide an initial seedlist, some users have requested additional functionality to be able to update the configured list of mongos hosts that make up the initially seeded topology:

  • https://jira.mongodb.org/browse/JAVA-2927

Design Rationale

From the scope document

Should DNS polling use heartbeatFrequencyMS or DNS cache TTLs?

We have selected to use the lowest TTL among all DNS SRV records, with the caveat that the rescan frequency is not lower than 60 seconds.

Should DNS polling also have a "fast polling" mode when no servers are available?

We have not opted to have a "fast polling" mode, but we did include a provision that a rescan needs to happen when DNS records are not available. In that case, a rescan would happen every heartbeatFrequencyMS. The rationale is that polling DNS very frequently makes little sense due to DNS caching, which typically honours the TTL already; but when there are no TTL values to reference we still need a fallback frequency.

For the design

No option to turn off periodic rescanning

The design does not allow for an option to turn off the periodic rescanning of SRV records on the basis that we try to have as few options as possible: the "no knobs" philosophy.

Backwards Compatibility

This specification changes the behaviour of server monitoring by introducing a repeating DNS lookup of the SRV records. Although this is an improvement to the `mongodb+srv://` scheme, it can nonetheless break expectations for users who were familiar with the old behaviour. We do not expect this to negatively impact users.

Reference Implementation

Reference implementations are made for the following drivers:

  • Perl
  • C#

Security Implication

This specification has no security implications beyond the ones associated with the original [Initial DNS Seedlist Discovery](../initial-dns-seedlist-discovery/initial-dns-seedlist-discovery.md) specification.

Future work

No future work is expected.

Changelog

  • 2022-10-05: Revise spec front matter and reformat changelog.

  • 2021-10-14: Specify behavior for `srvMaxHosts` MongoClient option.

  • 2021-09-15: Clarify that service name only defaults to `mongodb`, and should be defined by the `srvServiceName` URI option.

Server Selection

  • Status: Accepted
  • Minimum Server Version: 2.4

Abstract

MongoDB deployments may offer more than one server that can service an operation. This specification describes how MongoDB drivers and mongos shall select a server for either read or write operations. It includes the definition of a "read preference" document, configuration options, and algorithms for selecting a server for different deployment topologies.

Meta

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Motivation for Change

This specification builds upon the prior "Driver Read Preference" specification, which had a number of omissions, flaws or other deficiencies:

  1. Mandating features that implied monotonicity for situations where monotonicity is not guaranteed
  2. Mandating features that are not supported by mongos
  3. Neglecting to specify a single, standard way to calculate average latency times
  4. Specifying complex command-helper rules
  5. Omitting rules for applying read preferences to a single server or to select among multiple mongos servers
  6. Omitting test cases for verification of spec compliance

This revision addresses these problems as well as improving structure and specificity.

Additionally, it adds specifications for server selection more broadly:

  • Selection of a server for write operations
  • Server selection retry and timeout

Specification

Scope and general requirements

This specification describes how MongoDB drivers and mongos select a server for read and write operations, including commands, OP_QUERY, OP_INSERT, OP_UPDATE, and OP_DELETE. For read operations, it describes how drivers and mongos shall interpret a read preference document.

This specification does not apply to OP_GET_MORE or OP_KILL_CURSORS operations on cursors, which need to go to the same server that received an OP_QUERY and returned a cursor ID.

For operations that are part of a sharded transaction this specification only applies to the initial operation which starts the transaction on a mongos. This specification does not apply to subsequent operations that are part of the sharded transaction because all operations in a sharded transaction need to go to the same mongos server.

Drivers and mongos MUST conform to the semantics of this document, but SHOULD use language-appropriate data models or variable names.

This specification does not apply to commands issued for server monitoring or authentication.

Terms

Available
Describes a server that is believed to be reachable over the network and able to respond to requests. A server of type Unknown or PossiblePrimary is not available; other types are available.

Client
Software that communicates with a MongoDB deployment. This includes both drivers and mongos.

Candidate
Describes servers in a deployment that enter the selection process, determined by the read preference mode parameter and the servers' type. Depending on the mode, candidate servers might only include secondaries or might apply to all servers in the deployment.

Deployment
One or more servers that collectively provide access to a single logical set of MongoDB databases.

Command
An OP_QUERY operation targeting the '$cmd' collection namespace.

Direct connection
A driver connection mode that sends all database operations to a single server without regard for type.

Eligible
Describes candidate servers that also meet the criteria specified by the tag_sets and maxStalenessSeconds read preference parameters.

Hedged Read
A server mode in which the same query is dispatched in parallel to multiple replica set members.

Immediate topology check
For a multi-threaded or asynchronous client, this means waking all server monitors for an immediate check. For a single-threaded client, this means a (blocking) scan of all servers.

Latency window
When choosing between several suitable servers, the latency window is the range of acceptable RTTs from the shortest RTT to the shortest RTT plus the local threshold. E.g. if the shortest RTT is 15ms and the local threshold is 200ms, then the latency window ranges from 15ms to 215ms.

Local threshold
The maximum acceptable difference in milliseconds between the shortest RTT and the longest RTT of servers suitable to be selected.

Mode
One of several enumerated values used as part of a read preference, defining which server types are candidates for reads and the semantics for choosing a specific one.

Primary
Describes a server of type RSPrimary.

Query
An OP_QUERY operation targeting a regular (non '$cmd') collection namespace.

Read preference
The parameters describing which servers in a deployment can receive read operations, including mode, tag_sets, maxStalenessSeconds, and hedge.

RS
Abbreviation for "replica set".

RTT
Abbreviation for "round trip time".

Round trip time
The time in milliseconds to execute a hello or legacy hello command and receive a response for a given server. This spec differentiates between the RTT of a single hello or legacy hello command and a server's average RTT over several such commands.

Secondary
A server of type RSSecondary.

Staleness
A worst-case estimate of how far a secondary's replication lags behind the primary's last write.

Server
A mongod or mongos process.

Server selection
The process by which a server is chosen for a database operation out of all potential servers in a deployment.

Server type
An enumerated type indicating whether a server is up or down, whether it is a mongod or mongos, whether it belongs to a replica set and, if so, what role it serves in the replica set. See the Server Discovery and Monitoring spec for more details.

Suitable
Describes a server that meets all specified criteria for a read or write operation.

Tag
A single key/value pair describing either (1) a user-specified characteristic of a replica set member or (2) a desired characteristic for the target of a read operation. The key and value have no semantic meaning to the driver; they are arbitrary user choices.

Tag set
A document of zero or more tags. Each member of a replica set can be configured with zero or one tag set.

Tag set list
A list of zero or more tag sets. A read preference might have a tag set list used for selecting servers.

Topology
The state of a deployment, including its type, which servers are members, and the server types of members.

Topology type
An enumerated type indicating the semantics for monitoring servers and selecting servers for database operations. See the Server Discovery and Monitoring spec for more details.

Assumptions

  1. Unless they explicitly override these priorities, we assume our users prefer their applications to be, in order:
    • Predictable: the behavior of the application should not change based on the deployment type, whether single mongod, replica set or sharded cluster.
    • Resilient: applications will adapt to topology changes, if possible, without raising errors or requiring manual reconfiguration.
    • Low-latency: all else being equal, faster responses to queries and writes are preferable.
  2. Clients know the state of a deployment based on some form of ongoing monitoring, following the rules defined in the Server Discovery and Monitoring spec.
    • They know which members are up or down, what their tag sets are, and their types.
    • They know average round trip times to each available member.
    • They detect reconfiguration and the addition or removal of members.
  3. The state of a deployment could change at any time, in between any network interaction.
    • Servers might or might not be reachable; they can change type at any time, whether due to partitions, elections, or misconfiguration.
    • Data rollbacks could occur at any time.

MongoClient Configuration

Selecting a server requires the following client-level configuration options:

localThresholdMS

This defines the size of the latency window for selecting among multiple suitable servers. The default is 15 (milliseconds). It MUST be configurable at the client level. It MUST NOT be configurable at the level of a database object, collection object, or at the level of an individual query.

In the prior read preference specification, localThresholdMS was called secondaryAcceptableLatencyMS by drivers. Drivers MUST support the new name for consistency, but MAY continue to support the legacy name to avoid a backward-breaking change.

mongos currently uses localThreshold and MAY continue to do so.

serverSelectionTimeoutMS

This defines the maximum time to block for server selection before throwing an exception. The default is 30,000 (milliseconds). It MUST be configurable at the client level. It MUST NOT be configurable at the level of a database object, collection object, or at the level of an individual query.

The actual timeout for server selection can be less than serverSelectionTimeoutMS. See Timeouts for rules to compute the exact value.

This default value was chosen to be sufficient for a typical server primary election to complete. As the server improves the speed of elections, this number may be revised downward.

Users that can tolerate long delays for server selection when the topology is in flux can set this higher. Users that want to "fail fast" when the topology is in flux can set this to a small number.

A serverSelectionTimeoutMS of zero MAY have special meaning in some drivers; zero's meaning is not defined in this spec, but all drivers SHOULD document the meaning of zero.

serverSelectionTryOnce

Single-threaded drivers MUST provide a "serverSelectionTryOnce" mode, in which the driver scans the topology exactly once after server selection fails, then either selects a server or raises an error.

The serverSelectionTryOnce option MUST be true by default. If it is set false, then the driver repeatedly searches for an appropriate server until the selection process times out (pausing minHeartbeatFrequencyMS between attempts, as required by the Server Discovery and Monitoring spec).

Users of single-threaded drivers MUST be able to control this mode in one or both of these ways:

  • In code, pass true or false for an option called serverSelectionTryOnce, spelled idiomatically for the language, to the MongoClient constructor.
  • Include "serverSelectionTryOnce=true" or "serverSelectionTryOnce=false" in the URI. The URI option is spelled the same for all drivers.

Conflicting usage of the URI option and the code symbol is an error.

Multi-threaded drivers MUST NOT provide this mode. (See single-threaded server selection implementation and the rationale for a "try once" mode.)

heartbeatFrequencyMS

This controls when topology updates are scheduled. See heartbeatFrequencyMS in the Server Discovery and Monitoring spec for details.

socketCheckIntervalMS

Only for single-threaded drivers.

The default socketCheckIntervalMS MUST be 5000 (5 seconds), and it MAY be configurable. If a socket has been idle for at least this long, it must be checked before being used again.

See checking an idle socket after socketCheckIntervalMS and what is the purpose of socketCheckIntervalMS?.

idleWritePeriodMS

A constant, how often an idle primary writes a no-op to the oplog. See idleWritePeriodMS in the Max Staleness spec for details.

smallestMaxStalenessSeconds

A constant, 90 seconds. See "Smallest allowed value for maxStalenessSeconds" in the Max Staleness Spec.

serverSelector

Implementations MAY allow configuration of an optional, application-provided function that augments the server selection rules. The function takes as a parameter a list of server descriptions representing the suitable servers for the read or write operation, and returns a list of server descriptions that should still be considered suitable.

Read Preference

A read preference determines which servers are considered suitable for read operations. Read preferences are interpreted differently based on topology type. See topology-type-specific server selection rules for details.

When no servers are suitable, the selection might be retried or will eventually fail following the rules described in the Rules for server selection section.

Components of a read preference

A read preference consists of a mode and optional tag_sets, maxStalenessSeconds, and hedge. The mode prioritizes between primaries and secondaries to produce either a single suitable server or a list of candidate servers. If tag_sets and maxStalenessSeconds are set, they determine which candidate servers are eligible for selection. If hedge is set, it configures how server hedged reads are used.

The default mode is 'primary'. The default tag_sets is a list with an empty tag set: [{}]. The default maxStalenessSeconds is -1 or null, depending on the language. The default hedge is unset.

Each is explained in greater detail below.

mode

For a deployment with topology type ReplicaSetWithPrimary or ReplicaSetNoPrimary, the mode parameter controls whether primaries or secondaries are deemed suitable. Topology types Single and Sharded have different selection criteria and are described elsewhere.

Clients MUST support these modes:

primary
Only an available primary is suitable.

secondary
All secondaries (and only secondaries) are candidates, but only eligible candidates (i.e. after applying tag_sets and maxStalenessSeconds) are suitable.

primaryPreferred
If a primary is available, only the primary is suitable. Otherwise, all secondaries are candidates, but only eligible secondaries are suitable.

secondaryPreferred
All secondaries are candidates. If there is at least one eligible secondary, only eligible secondaries are suitable. Otherwise, when there are no eligible secondaries, the primary is suitable.

nearest
The primary and all secondaries are candidates, but only eligible candidates are suitable.

Note on other server types: The Server Discovery and Monitoring spec defines several other server types that could appear in a replica set. Such types are never candidates, eligible or suitable.

maxStalenessSeconds

The maximum replication lag, in wall clock time, that a secondary can suffer and still be eligible.

The default is no maximum staleness.

A maxStalenessSeconds of -1 MUST mean "no maximum". Drivers are also free to use None, null, or other representations of "no value" to represent "no max staleness".

Drivers MUST raise an error if maxStalenessSeconds is a positive number and the mode field is 'primary'.

A driver MUST raise an error if the TopologyType is ReplicaSetWithPrimary or ReplicaSetNoPrimary and either of these conditions is false:

maxStalenessSeconds * 1000 >= heartbeatFrequencyMS + idleWritePeriodMS
maxStalenessSeconds >= smallestMaxStalenessSeconds

heartbeatFrequencyMS is defined in the Server Discovery and Monitoring spec, and idleWritePeriodMS is defined to be 10 seconds in the Max Staleness spec.
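
A hedged sketch of these client-side checks (constant values are taken from the text above; the function name is illustrative):

```python
IDLE_WRITE_PERIOD_MS = 10_000          # defined in the Max Staleness spec
SMALLEST_MAX_STALENESS_SECONDS = 90    # smallest allowed value for maxStalenessSeconds

def validate_max_staleness(max_staleness_seconds, heartbeat_frequency_ms):
    if max_staleness_seconds is None or max_staleness_seconds == -1:
        return  # "no maximum" is always valid
    if max_staleness_seconds * 1000 < heartbeat_frequency_ms + IDLE_WRITE_PERIOD_MS:
        raise ValueError("maxStalenessSeconds is too small for heartbeatFrequencyMS")
    if max_staleness_seconds < SMALLEST_MAX_STALENESS_SECONDS:
        raise ValueError("maxStalenessSeconds must be at least 90 seconds")
```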

See "Smallest allowed value for maxStalenessSeconds" in the Max Staleness Spec.

mongos MUST reject a read with maxStalenessSeconds provided and a mode of 'primary'.

mongos MUST reject a read with maxStalenessSeconds that is not a positive integer.

mongos MUST reject a read if maxStalenessSeconds is less than smallestMaxStalenessSeconds, with error code 160 (SERVER-24421).

During server selection, drivers (but not mongos) with minWireVersion < 5 MUST raise an error if maxStalenessSeconds is a positive number, and any available server's maxWireVersion is less than 5.

After filtering servers according to mode, and before filtering with tag_sets, eligibility MUST be determined from maxStalenessSeconds as follows:

  • If maxStalenessSeconds is not a positive number, then all servers are eligible.

  • Otherwise, calculate staleness. Non-secondary servers (including Mongos servers) have zero staleness. If TopologyType is ReplicaSetWithPrimary, a secondary's staleness is calculated using its ServerDescription "S" and the primary's ServerDescription "P":

    (S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
    

    (All datetime units are in milliseconds.)

    If TopologyType is ReplicaSetNoPrimary, a secondary's staleness is calculated using its ServerDescription "S" and the ServerDescription of the secondary with the greatest lastWriteDate, "SMax":

    SMax.lastWriteDate - S.lastWriteDate + heartbeatFrequencyMS
    

    Servers with staleness less than or equal to maxStalenessSeconds are eligible.

See the Max Staleness Spec for overall description and justification of this feature.
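
For illustration, the two staleness estimates above might be computed as follows (field names are assumed to mirror the ServerDescription fields; all values are in milliseconds):

```python
def staleness_with_primary(s, p, heartbeat_frequency_ms):
    # s: the secondary's ServerDescription; p: the primary's ServerDescription.
    return ((s.last_update_time - s.last_write_date)
            - (p.last_update_time - p.last_write_date)
            + heartbeat_frequency_ms)

def staleness_no_primary(s, smax, heartbeat_frequency_ms):
    # smax: the secondary with the greatest lastWriteDate.
    return smax.last_write_date - s.last_write_date + heartbeat_frequency_ms
```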

tag_sets

The read preference tag_sets parameter is an ordered list of tag sets used to restrict the eligibility of servers, such as for data center awareness.

Clients MUST raise an error if a non-empty tag set is given in tag_sets and the mode field is 'primary'.

A read preference tag set (T) matches a server tag set (S), or equivalently a server tag set (S) matches a read preference tag set (T), if T is a subset of S (i.e. T ⊆ S).

For example, the read preference tag set "{ dc: 'ny', rack: '2' }" matches a secondary server with tag set "{ dc: 'ny', rack: '2', size: 'large' }".

A tag set that is an empty document matches any server, because the empty tag set is a subset of any tag set. This means the default tag_sets parameter ([{}]) matches all servers.

Tag sets are applied after filtering servers by mode and maxStalenessSeconds, and before selecting one server within the latency window.

Eligibility MUST be determined from tag_sets as follows:

  • If the tag_sets list is empty then all candidate servers are eligible servers. (Note, the default of [{}] means an empty list probably won't often be seen, but if the client does not forbid an empty list, this rule MUST be implemented to handle that case.)
  • If the tag_sets list is not empty, then tag sets are tried in order until a tag set matches at least one candidate server. All candidate servers matching that tag set are eligible servers. Subsequent tag sets in the list are ignored.
  • If the tag_sets list is not empty and no tag set in the list matches any candidate server, no servers are eligible servers. (A sketch of this matching logic follows this list.)
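
The following non-normative sketch shows the subset matching rule and the ordered filtering above (all names are illustrative):

```python
def tag_set_matches(read_pref_tag_set, server_tag_set):
    # T matches S if T is a subset of S; an empty T matches every server.
    return all(server_tag_set.get(key) == value
               for key, value in read_pref_tag_set.items())

def filter_by_tag_sets(tag_sets, candidates):
    # candidates: list of (server, server_tag_set) pairs already filtered by
    # mode and maxStalenessSeconds.
    if not tag_sets:
        return candidates
    for tag_set in tag_sets:
        eligible = [(server, tags) for server, tags in candidates
                    if tag_set_matches(tag_set, tags)]
        if eligible:
            return eligible  # subsequent tag sets are ignored
    return []  # no tag set matched any candidate server
```
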
hedge

The read preference hedge parameter is a document that configures how the server will perform hedged reads. It consists of the following keys:

  • enabled: Enables or disables hedging

Hedged reads are automatically enabled in MongoDB 4.4+ when using a nearest read preference. To explicitly enable hedging, the hedge document must be passed. An empty document uses server defaults to control hedging, but the enabled key may be set to true or false to explicitly enable or disable hedged reads.

Drivers MAY allow users to specify an empty hedge document if they accept documents for read preference options. Any driver that exposes a builder API for read preference objects MUST NOT allow an empty hedge document to be constructed. In this case, the user MUST specify a value for enabled, which MUST default to true. If the user does not call a hedge API method, drivers MUST NOT send a hedge option to the server.

Read preference configuration

Drivers MUST allow users to configure a default read preference on a MongoClient object. Drivers MAY allow users to configure a default read preference on a Database or Collection object.

A read preference MAY be specified as an object, document or individual mode, tag_sets, and maxStalenessSeconds parameters, depending on what is most idiomatic for the language.

If more than one object has a default read preference, the default of the most specific object takes precedence. I.e. Collection is preferred over Database, which is preferred over MongoClient.

Drivers MAY allow users to set a read preference on queries on a per-operation basis similar to how hint or batchSize are set. E.g., in Python:

db.collection.find({}, read_preference=ReadPreference.SECONDARY)
db.collection.find(
    {},
    read_preference=ReadPreference.NEAREST,
    tag_sets=[{'dc': 'ny'}],
    maxStalenessSeconds=120,
    hedge={'enabled': True})

Passing read preference to mongos and load balancers

If a server of type Mongos or LoadBalancer is selected for a read operation, the read preference is passed to the selected mongos through the use of $readPreference (as a Global Command Argument for OP_MSG or a query modifier for OP_QUERY) and, for OP_QUERY only, the SecondaryOk wire protocol flag, according to the following rules.

For OP_MSG:
  • For mode 'primary', drivers MUST NOT set $readPreference
  • For all other read preference modes (i.e. 'secondary', 'primaryPreferred', ...), drivers MUST set $readPreference
For OP_QUERY:

If the read preference contains only a mode parameter and the mode is 'primary' or 'secondaryPreferred', for maximum backwards compatibility with older versions of mongos, drivers MUST only use the value of the SecondaryOk wire protocol flag (i.e. set or unset) to indicate the desired read preference and MUST NOT use a $readPreference query modifier.

Therefore, when sending queries to a mongos or load balancer, the following rules apply:

  • For mode 'primary', drivers MUST NOT set the SecondaryOk wire protocol flag and MUST NOT use $readPreference
  • For mode 'secondary', drivers MUST set the SecondaryOk wire protocol flag and MUST also use $readPreference
  • For mode 'primaryPreferred', drivers MUST set the SecondaryOk wire protocol flag and MUST also use $readPreference
  • For mode 'secondaryPreferred', drivers MUST set the SecondaryOk wire protocol flag. If the read preference contains a non-empty tag_sets parameter, maxStalenessSeconds is a positive integer, or the hedge parameter is non-empty, drivers MUST use $readPreference; otherwise, drivers MUST NOT use $readPreference
  • For mode 'nearest', drivers MUST set the SecondaryOk wire protocol flag and MUST also use $readPreference

The $readPreference query modifier sends the read preference as part of the query. The read preference field tag_sets is represented in a $readPreference document using the field name tags.
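
A hedged sketch of these OP_QUERY rules (the function and parameter names are illustrative, not part of this spec):

```python
def op_query_read_pref_flags(mode, tag_sets=None, max_staleness_seconds=None, hedge=None):
    # tag_sets / max_staleness_seconds / hedge: pass only values the user
    # explicitly supplied beyond the defaults.
    secondary_ok = mode != 'primary'
    if mode == 'primary':
        use_read_preference = False
    elif mode == 'secondaryPreferred':
        # $readPreference is only needed when extra parameters force an explicit document.
        use_read_preference = (bool(tag_sets)
                               or (max_staleness_seconds or 0) > 0
                               or bool(hedge))
    else:  # 'secondary', 'primaryPreferred', 'nearest'
        use_read_preference = True
    return secondary_ok, use_read_preference
```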

When sending a read operation via OP_QUERY and any $ modifier is used, including the $readPreference modifier, the query MUST be provided using the $query modifier like so:

{
    $query: {
        field1: 'query_value',
        field2: 'another_query_value'
    },
    $readPreference: {
        mode: 'secondary',
        tags: [ { 'dc': 'ny' } ],
        maxStalenessSeconds: 120,
        hedge: { enabled: true }
    }
}

Document structure

A valid $readPreference document for mongos or load balancer has the following requirements:

  1. The mode field MUST be present exactly once with the mode represented in camel case:

    • 'primary'
    • 'secondary'
    • 'primaryPreferred'
    • 'secondaryPreferred'
    • 'nearest'
  2. If the mode field is "primary", the tags, maxStalenessSeconds, and hedge fields MUST be absent.

    Otherwise, for other mode values, the tags field MUST either be absent or be present exactly once and have an array value containing at least one document. It MUST contain only documents, no other type.

    The maxStalenessSeconds field MUST either be absent or be present exactly once with an integer value.

    The hedge field MUST either be absent or be a document.

A mongos or load balancer service receiving a query with $readPreference SHOULD validate the mode, tags, maxStalenessSeconds, and hedge fields according to rules 1 and 2 above, but SHOULD ignore unrecognized fields for forward-compatibility rather than throwing an error.

Use of read preferences with commands

Because some commands are used for writes, deployment changes, or other state-changing side effects, the use of read preference by a driver depends on the command and how it is invoked:

  1. Write commands: insert, update, delete, findAndModify

    Write commands are considered write operations and MUST follow the corresponding Rules for server selection for each topology type.

  2. Generic command method: typically command or runCommand

    The generic command method MUST act as a read operation for the purposes of server selection.

    The generic command method has a default read preference of mode 'primary'. The generic command method MUST ignore any default read preference from client, database or collection configuration. The generic command method SHOULD allow an optional read preference argument.

    If an explicit read preference argument is provided as part of the generic command method call, it MUST be used for server selection, regardless of the name of the command. It is up to the user to use an appropriate read preference, e.g. not calling renameCollection with a mode of 'secondary'.

    N.B.: "used for server selection" does not supersede rules for server selection on "Standalone" topologies, which ignore any requested read preference.

  3. Command-specific helper: methods that wrap database commands, like count, distinct, listCollections or renameCollection.

    Command-specific helpers MUST act as read operations for the purposes of server selection, with read preference rules defined by the following three categories of commands:

    • "must-use-primary": these commands have state-modifying effects and will only succeed on a primary. An example is renameCollection.

      These command-specific helpers MUST use a read preference mode of 'primary', MUST NOT take a read preference argument and MUST ignore any default read preference from client, database or collection configuration. Languages with dynamic argument lists MUST throw an error if a read preference is provided as an argument.

      Clients SHOULD rely on the server to return a "not writable primary" or other error if the command is "must-use-primary". Clients MAY raise an exception before sending the command if the topology type is Single and the server type is not "Standalone", "RSPrimary" or "Mongos", but the identification of the set of 'must-use-primary' commands is out of scope for this specification.

    • "should-use-primary": these commands are intended to be run on a primary, but would succeed -- albeit with possibly stale data -- when run against a secondary. An example is listCollections.

      These command-specific helpers MUST use a read preference mode of 'primary', MUST NOT take a read preference argument and MUST ignore any default read preference from client, database or collection configuration. Languages with dynamic argument lists MUST throw an error if a read preference is provided as an argument.

      Clients MUST NOT raise an exception if the topology type is Single.

    • "may-use-secondary": these commands run against primaries or secondaries, according to users' read preferences. They are sometimes called "query-like" commands.

      The current list of "may-use-secondary" commands includes:

      • aggregate without a write stage (e.g. $out, $merge)
      • collStats
      • count
      • dbStats
      • distinct
      • find
      • geoNear
      • geoSearch
      • group
      • mapReduce where the out option is { inline: 1 }
      • parallelCollectionScan

      Associated command-specific helpers SHOULD take a read preference argument and otherwise MUST use the default read preference from client, database, or collection configuration.

      For pre-5.0 servers, an aggregate command is "must-use-primary" if its pipeline contains a write stage (e.g. $out, $merge); otherwise, it is "may-use-secondary". For 5.0+ servers, secondaries can execute an aggregate command with a write stage and all aggregate commands are "may-use-secondary". This is discussed in more detail in Read preferences and server selection in the CRUD spec.

      If a client provides a specific helper for inline mapReduce, then it is "may-use-secondary" and the regular mapReduce helper is "must-use-primary". Otherwise, the mapReduce helper is "may-use-secondary" and it is the user's responsibility to specify {inline: 1} when running mapReduce on a secondary.

    New command-specific helpers implemented in the future will be considered "must-use-primary", "should-use-primary" or "may-use-secondary" according to the specifications for those future commands. Command helper specifications SHOULD use those terms for clarity.

Rules for server selection

Server selection is a process which takes an operation type (read or write), a ClusterDescription, and optionally a read preference and, on success, returns a ServerDescription for an operation of the given type.

Server selection varies depending on whether a client is multi-threaded/asynchronous or single-threaded because a single-threaded client cannot rely on the topology state being updated in the background.

Timeouts

Multi-threaded drivers and single-threaded drivers with serverSelectionTryOnce set to false MUST enforce a timeout for the server selection process. The timeout MUST be computed as described in Client Side Operations Timeout: Server Selection.

Multi-threaded or asynchronous server selection

A driver that uses multi-threaded or asynchronous monitoring MUST unblock waiting operations as soon as server selection completes, even if not all servers have been checked by a monitor. Put differently, the client MUST NOT block server selection while waiting for server discovery to finish.

For example, if the client is discovering a replica set and the application attempts a read operation with mode 'primaryPreferred', the operation MUST proceed immediately if a suitable secondary is found, rather than blocking until the client has checked all members and possibly discovered a primary.

The number of threads allowed to wait for server selection SHOULD be either (a) the same as the number of threads allowed to wait for a connection from a pool; or (b) governed by a global or client-wide limit on number of waiting threads, depending on how resource limits are implemented by a driver.

operationCount

Multi-threaded or async drivers MUST keep track of the number of operations that a given server is currently executing (the server's operationCount). This value MUST be incremented once a server is selected for an operation and MUST be decremented once that operation has completed, regardless of its outcome. Where this value is stored is left as an implementation detail of the driver; some example locations include the Server type that also owns the connection pool for the server (if there exists such a type in the driver's implementation) or on the pool itself. Incrementing or decrementing a server's operationCount MUST NOT wake up any threads that are waiting for a topology update as part of server selection. See operationCount-based selection within the latency window (multi-threaded or async) for the rationale behind the way this value is used.

Server Selection Algorithm

For multi-threaded clients, the server selection algorithm is as follows:

  1. Record the server selection start time and log a "Server selection started" message.
  2. If the topology wire version is invalid, raise an error and log a "Server selection failed" message.
  3. Find suitable servers by topology type and operation type. If a list of deprioritized servers is provided, and the topology is a sharded cluster, these servers should be selected only if there are no other suitable servers. The server selection algorithm MUST ignore the deprioritized servers if the topology is not a sharded cluster.
  4. Filter the suitable servers by calling the optional, application-provided server selector.
  5. If there are any suitable servers, filter them according to Filtering suitable servers based on the latency window and continue to the next step; otherwise, log a "Waiting for suitable server to become available" message if one has not already been logged for this operation, and goto Step #9.
  6. Choose two servers at random from the set of suitable servers in the latency window. If there is only 1 server in the latency window, just select that server and goto Step #8.
  7. Of the two randomly chosen servers, select the one with the lower operationCount. If both servers have the same operationCount, select arbitrarily between the two of them. (A sketch of steps 6-8 follows this list.)
  8. Increment the operationCount of the selected server and return it. Log a "Server selection succeeded" message. Do not go onto later steps.
  9. Request an immediate topology check, then block the server selection thread until the topology changes or until the server selection timeout has elapsed
  10. If server selection has timed out, raise a [server selection error](#server-selection-error) and log a "Server selection failed" message.
  11. Goto Step #2
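
A non-normative sketch of steps 6-8 (attribute names are assumptions):

```python
import random

def select_within_latency_window(servers_in_window):
    # servers_in_window: non-empty list of suitable servers already filtered
    # according to the latency window.
    if len(servers_in_window) == 1:
        selected = servers_in_window[0]
    else:
        first, second = random.sample(servers_in_window, 2)
        selected = first if first.operation_count <= second.operation_count else second
    selected.operation_count += 1   # decremented once the operation completes
    return selected
```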

Single-threaded server selection

Single-threaded drivers do not monitor the topology in the background. Instead, they MUST periodically update the topology during server selection as described below.

When serverSelectionTryOnce is true, server selection timeouts have no effect; a single immediate topology check will be done if the topology starts stale or if the first selection attempt fails.

When serverSelectionTryOnce is false, then the server selection loops until a server is successfully selected or until the selection timeout is exceeded.

Therefore, for single-threaded clients, the server selection algorithm is as follows:

  1. Record the server selection start time and log a "Server selection started" message.
  2. Record the maximum time as start time plus the computed timeout
  3. If the topology has not been scanned in heartbeatFrequencyMS milliseconds, mark the topology stale
  4. If the topology is stale, proceed as follows:
    • record the target scan time as last scan time plus minHeartBeatFrequencyMS
    • if serverSelectionTryOnce is false and the target scan time would exceed the maximum time, raise a [server selection error](#server-selection-error) and log a "Server selection failed" message.
    • if the current time is less than the target scan time, sleep until the target scan time
    • do a blocking immediate topology check (which must also update the last scan time and mark the topology as no longer stale)
  5. If the topology wire version is invalid, raise an error and log a "Server selection failed" message.
  6. Find suitable servers by topology type and operation type. If a list of deprioritized servers is provided, and the topology is a sharded cluster, these servers should be selected only if there are no other suitable servers. The server selection algorithm MUST ignore the deprioritized servers if the topology is not a sharded cluster.
  7. Filter the suitable servers by calling the optional, application-provided server selector.
  8. If there are any suitable servers, filter them according to Filtering suitable servers based on the latency window, return one at random from the filtered servers, and log a "Server selection succeeded" message; otherwise, mark the topology stale and continue to step #9.
  9. If serverSelectionTryOnce is true and the last scan time is newer than the selection start time, raise a [server selection error](#server-selection-error) and log a "Server selection failed" message; otherwise, log a "Waiting for suitable server to become available" message if one has not already been logged for this operation, and goto Step #4
  10. If the current time exceeds the maximum time, raise a [server selection error](#server-selection-error) and log a "Server selection failed" message.
  11. Goto Step #4

Before using a socket to the selected server, drivers MUST check whether the socket has been used in socketCheckIntervalMS milliseconds. If the socket has been idle for longer, the driver MUST update the ServerDescription for the selected server. After updating, if the server is no longer suitable, the driver MUST repeat the server selection algorithm and select a new server.
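
As a non-normative sketch (attribute and method names are assumptions), the idle-socket check could look like:

```python
import time

def ensure_fresh_socket(server, sock, socket_check_interval_ms=5000):
    idle_ms = (time.monotonic() - sock.last_used) * 1000
    if idle_ms >= socket_check_interval_ms:
        server.check()                 # run hello and update the ServerDescription
        if not server.is_suitable():
            return None                # caller must rerun server selection
    return sock
```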

Because single-threaded selection can do a blocking immediate check, the server selection timeout is not a hard deadline. The actual maximum server selection time for any given request can vary from the timeout minus minHeartbeatFrequencyMS to the timeout plus the time required for a blocking scan.

Single-threaded drivers MUST document that when serverSelectionTryOnce is true, selection may take up to the time required for a blocking scan, and when serverSelectionTryOnce is false, selection may take up to the timeout plus the time required for a blocking scan.

Topology type: Unknown

When a deployment has topology type "Unknown", no servers are suitable for read or write operations.

Topology type: Single

A deployment of topology type Single contains only a single server of any type. Topology type Single signifies a direct connection intended to receive all read and write operations.

Therefore, read preference is ignored during server selection with topology type Single. The single server is always suitable for reads if it is available. Depending on server type, the read preference is communicated to the server differently:

  • Type Mongos: the read preference is sent to the server using the rules for Passing read preference to mongos and load balancers.
  • Type Standalone: clients MUST NOT send the read preference to the server
  • For all other types, using OP_QUERY: clients MUST always set the SecondaryOk wire protocol flag on reads to ensure that any server type can handle the request.
  • For all other types, using OP_MSG: If no read preference is configured by the application, or if the application read preference is Primary, then $readPreference MUST be set to { "mode": "primaryPreferred" } to ensure that any server type can handle the request. If the application read preference is set otherwise, $readPreference MUST be set following Document structure.

The single server is always suitable for write operations if it is available.

Topology type: LoadBalanced

During command construction, drivers MUST add a $readPreference field to the command when required by Passing read preference to mongos and load balancers; see the Load Balancer Specification for details.

Topology types: ReplicaSetWithPrimary or ReplicaSetNoPrimary

A deployment with topology type ReplicaSetWithPrimary or ReplicaSetNoPrimary can have a mix of server types: RSPrimary (only in ReplicaSetWithPrimary), RSSecondary, RSArbiter, RSOther, RSGhost, Unknown or PossiblePrimary.

Read operations

For the purpose of selecting a server for read operations, the same rules apply to both ReplicaSetWithPrimary and ReplicaSetNoPrimary.

To select from the topology a server that matches the user's Read Preference:

If mode is 'primary', select the primary server.

If mode is 'secondary' or 'nearest':

  1. Select all secondaries if mode is 'secondary', or all secondaries and the primary if mode is 'nearest'.
  2. From these, filter out servers staler than maxStalenessSeconds if it is a positive number.
  3. From the remaining servers, select servers matching the tag_sets.
  4. From these, select one server within the latency window.

(See the algorithm for filtering by staleness, the algorithm for filtering by tag_sets, and Filtering suitable servers based on the latency window for details on each step, and Why is maxStalenessSeconds applied before tag_sets? for the rationale behind the ordering.)

If mode is 'secondaryPreferred', attempt the selection algorithm with mode 'secondary' and the user's maxStalenessSeconds and tag_sets. If no server matches, select the primary.

If mode is 'primaryPreferred', select the primary if it is known, otherwise attempt the selection algorithm with mode 'secondary' and the user's maxStalenessSeconds and tag_sets.
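
The following is a minimal, non-normative sketch of the mode handling above. The function names and the data shapes are assumptions: each server is a dict with "type", "tags", and an estimated "staleness" in seconds, and the staleness filter is reduced to a simple predicate. The final latency-window step is covered in Filtering suitable servers based on the latency window.

    def filter_by_tag_sets(candidates, tag_sets):
        # Use the first tag set in the list that matches at least one server.
        for tag_set in tag_sets:
            matching = [s for s in candidates if tag_set.items() <= s["tags"].items()]
            if matching:
                return matching
        return []

    def suitable_for_read(servers, mode, tag_sets=None, max_staleness=None):
        tag_sets = tag_sets or [{}]
        primary = [s for s in servers if s["type"] == "RSPrimary"]
        secondaries = [s for s in servers if s["type"] == "RSSecondary"]

        def eligible(candidates):
            if max_staleness is not None and max_staleness > 0:
                candidates = [s for s in candidates if s["staleness"] <= max_staleness]
            return filter_by_tag_sets(candidates, tag_sets)

        if mode == "primary":
            return primary
        if mode == "secondary":
            return eligible(secondaries)
        if mode == "nearest":
            return eligible(secondaries + primary)
        if mode == "secondaryPreferred":
            return eligible(secondaries) or primary
        if mode == "primaryPreferred":
            return primary or eligible(secondaries)
        raise ValueError("unknown read preference mode: " + mode)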

For all read preferences modes except 'primary', clients MUST set the SecondaryOk wire protocol flag (OP_QUERY) or $readPreference global command argument (OP_MSG) to ensure that any suitable server can handle the request. If the read preference mode is 'primary', clients MUST NOT set the SecondaryOk wire protocol flag (OP_QUERY) or $readPreference global command argument (OP_MSG).

Write operations

If the topology type is ReplicaSetWithPrimary, only an available primary is suitable for write operations.

If the topology type is ReplicaSetNoPrimary, no servers are suitable for write operations.

Topology type: Sharded

A deployment of topology type Sharded contains one or more servers of type Mongos or Unknown.

For read operations, all servers of type Mongos are suitable; the mode, tag_sets, and maxStalenessSeconds read preference parameters are ignored for selecting a server, but are passed through to mongos. See Passing read preference to mongos and load balancers.

For write operations, all servers of type Mongos are suitable.

If more than one mongos is suitable, drivers MUST select a suitable server within the latency window (see Filtering suitable servers based on the latency window).

Round Trip Times and the Latency Window

Calculation of Average Round Trip Times

For every available server, clients MUST track the average RTT of server monitoring hello or legacy hello commands.

An Unknown server has no average RTT. When a server becomes unavailable, its average RTT MUST be cleared. Clients MAY implement this idiomatically (e.g. nil, -1, etc.).

When there is no average RTT for a server, the average RTT MUST be set equal to the first RTT measurement (i.e. the first hello or legacy hello command after the server becomes available).

After the first measurement, average RTT MUST be computed using an exponentially-weighted moving average formula, with a weighting factor (alpha) of 0.2. If the prior average is denoted old_rtt, then the new average (new_rtt) is computed from a new RTT measurement (x) using the following formula:

    alpha = 0.2
    new_rtt = alpha * x + (1 - alpha) * old_rtt

A weighting factor of 0.2 was chosen to put about 85% of the weight of the average RTT on the 9 most recent observations.
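
For illustration only, a small class implementing the moving average described above; the class name and the milliseconds unit are assumptions, not requirements.

    class MovingAverageRTT:
        ALPHA = 0.2  # weighting factor from the formula above

        def __init__(self):
            self.average = None  # "no average RTT" until the first sample

        def add_sample(self, rtt_ms):
            if self.average is None:
                self.average = rtt_ms  # first measurement seeds the average
            else:
                self.average = self.ALPHA * rtt_ms + (1 - self.ALPHA) * self.average

        def reset(self):
            self.average = None  # cleared when the server becomes unavailable

    avg = MovingAverageRTT()
    for sample in (12.0, 15.0, 11.0):
        avg.add_sample(sample)
    print(round(avg.average, 2))  # 12.28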

Filtering suitable servers based on the latency window

Server selection results in a set of zero or more suitable servers. If more than one server is suitable, a server MUST be selected from among those within the latency window.

The localThresholdMS configuration parameter controls the size of the latency window used to select a suitable server.

The shortest average round trip time (RTT) from among suitable servers anchors one end of the latency window (A). The other end is determined by adding localThresholdMS (B = A + localThresholdMS).

A server MUST be selected from among suitable servers that have an average RTT (RTT) within the latency window (i.e. A ≤ RTT ≤ B). In other words, the suitable server with the shortest average RTT is always a possible choice. Other servers could be chosen if their average RTTs are no more than localThresholdMS more than the shortest average RTT.
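
A short, non-normative sketch of the window computation follows; the function name and the (server, RTT) pair representation are assumptions, and 15 ms is used only as an example localThresholdMS value.

    def within_latency_window(suitable, local_threshold_ms):
        # `suitable` is a list of (server, average_rtt_ms) pairs.
        if not suitable:
            return []
        shortest = min(rtt for _, rtt in suitable)     # A
        upper_bound = shortest + local_threshold_ms    # B = A + localThresholdMS
        return [server for server, rtt in suitable if rtt <= upper_bound]

    servers = [("a:27017", 10.0), ("b:27017", 22.0), ("c:27017", 40.0)]
    print(within_latency_window(servers, local_threshold_ms=15))  # ['a:27017', 'b:27017']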

See either Single-threaded server selection or Multi-threaded or asynchronous server selection for information on how to select a server from among those within the latency window.

Checking an Idle Socket After socketCheckIntervalMS

Only for single-threaded drivers.

If a server is selected that has an existing connection that has been idle for socketCheckIntervalMS, the driver MUST check the connection with the "ping" command. If the ping succeeds, use the selected connection. If not, set the server's type to Unknown and update the Topology Description according to the Server Discovery and Monitoring Spec, and attempt once more to select a server.

The logic is expressed in this pseudocode. The algorithm for the "getServer" function is suggested below, in Single-threaded server selection implementation:

    def getConnection(criteria):
        # Get a server for writes, or a server matching read prefs, by
        # running the server selection algorithm.
        server = getServer(criteria)
        if not server:
            throw server selection error

        connection = server.connection
        if connection is NULL:
            # connect to server and return connection
        else if connection has been idle < socketCheckIntervalMS:
            return connection
        else:
            try:
                use connection for "ping" command
                return connection
            except network error:
                close connection
                mark server Unknown and update Topology Description

                # Attempt *once* more to select.
                server = getServer(criteria)
                if not server:
                    throw server selection error

                # connect to server and return connection

See What is the purpose of socketCheckIntervalMS?.

Requests and Pinning Deprecated

The prior read preference specification included the concept of a "request", which pinned a server to a thread for subsequent, related reads. Requests and pinning are now deprecated. See What happened to pinning? for the rationale for this change.

Drivers with an existing request API MAY continue to provide it for backwards compatibility, but MUST document that pinning for the request does not guarantee monotonic reads.

Drivers MUST NOT automatically pin the client or a thread to a particular server without an explicit start_request (or comparable) method call.

Outside a legacy "request" API, drivers MUST use server selection for each individual read operation.

Logging

Please refer to the logging specification for details on logging implementations in general, including log levels, log components, and structured versus unstructured logging.

Drivers MUST support logging of server selection information via the following log messages. These messages MUST use the serverSelection log component.

The types used in the structured message definitions below are demonstrative, and drivers MAY use similar types instead so long as the information is present (e.g. a double instead of an integer, or a string instead of an integer if the structured logging framework does not support numeric types.)

Common Fields

The following key-value pairs MUST be included in all server selection log messages:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| selector | String | String representation of the selector being used to select the server. This can be a read preference or an application-provided custom selector. The exact content is flexible depending on what the driver is able to log. At minimum, when the selector is a read preference this string MUST contain all components of the read preference, and when it is an application-provided custom selector the string MUST somehow indicate that it is a custom selector. |
| operationId | Int | The driver-generated operation ID. Optional; only present if the driver generates operation IDs and this command has one. |
| operation | String | The name of the operation for which a server is being selected. When server selection is being performed to select a server for a command, this MUST be the command name. |
| topologyDescription | String | String representation of the current topology description. The format is flexible and could be e.g. the toString() implementation for a driver's topology type, or an extended JSON representation of the topology object. |

"Server selection started" message

This message MUST be logged at debug level. It MUST be emitted on the occasions specified either in Multi-threaded or asynchronous server selection or Single-threaded server selection, depending on which algorithm the driver implements.

This message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Server selection started" |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Server selection started for operation {{operation}} with ID {{operationId}}. Selector: {{selector}}, topology description: {{topologyDescription}}

"Server selection succeeded" message

This message MUST be logged at debug level. It MUST be emitted on the occasions specified either in Multi-threaded or asynchronous server selection or Single-threaded server selection, depending on which algorithm the driver implements.

This message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Server selection succeeded" |
| serverHost | String | The hostname, IP address, or Unix domain socket path for the selected server. |
| serverPort | Int | The port for the selected server. Optional; not present for Unix domain sockets. When the user does not specify a port and the default (27017) is used, the driver SHOULD include it here. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Server selection succeeded for operation {{operation}} with ID {{operationId}}. Selected server: {{serverHost}}:{{serverPort}}. Selector: {{selector}}, topology description: {{topologyDescription}}

"Server selection failed" message

This message MUST be logged at debug level. It MUST be emitted on the occasions specified either in Multi-threaded or asynchronous server selection or Single-threaded server selection, depending on which algorithm the driver implements.

This message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Server selection failed" |
| failure | Flexible | Representation of the error the driver will throw regarding server selection failing. The type and format of this value is flexible; see the logging specification for details on representing errors in log messages. Drivers MUST take care to not include any information in this field that is already included in the log message; e.g. the topology description should not be duplicated within this field. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Server selection failed for operation {{operation}} with ID {{operationId}}. Failure: {{failure}}. Selector: {{selector}}, topology description: {{topologyDescription}}

"Waiting for suitable server to become available" message

This message MUST be logged at info level. It MUST be emitted on the occasions specified either in Multi-threaded or asynchronous server selection or Single-threaded server selection, depending on which algorithm the driver implements.

In order to avoid generating redundant log messages, the driver MUST take care to only emit this message once per operation. We only log the message once because the only values that can change over time are:

  • The remaining time: given the remaining time and the timestamp recorded in the initial message, the time remaining at any later point can be inferred from that original message.
  • The topology description: rather than logging these changes on a per-operation basis, users should observe them with a single set of messages for the entire client via SDAM log messages.

This message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Waiting for suitable server to become available" |
| remainingTimeMS | Int | The remaining time left until server selection will time out. This MAY be omitted if the driver supports disabling server selection timeout altogether. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Waiting for server to become available for operation {{operation}} with ID {{operationId}}. Remaining time: {{remainingTimeMS}} ms. Selector: {{selector}}, topology description: {{topologyDescription}}.

Implementation Notes

These are suggestions. As always, driver authors should balance cross-language standardization with backwards compatibility and the idioms of their language.

Modes

Modes ('primary', 'secondary', ...) are constants declared in whatever way is idiomatic for the programming language. The constant values may be ints, strings, or whatever. However, when attaching modes to $readPreference camel case must be used as described above in Passing read preference to mongos and load balancers.
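
One idiomatic possibility (an illustration, not a requirement) is an enum whose values are the camelCase strings attached to $readPreference:

    from enum import Enum

    class ReadPreferenceMode(Enum):
        PRIMARY = "primary"
        PRIMARY_PREFERRED = "primaryPreferred"
        SECONDARY = "secondary"
        SECONDARY_PREFERRED = "secondaryPreferred"
        NEAREST = "nearest"

    print({"$readPreference": {"mode": ReadPreferenceMode.SECONDARY_PREFERRED.value}})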

primaryPreferred and secondaryPreferred

'primaryPreferred' is equivalent to selecting a server with read preference mode 'primary' (without tag_sets or maxStalenessSeconds), or, if that fails, falling back to selecting with read preference mode 'secondary' (with tag_sets and maxStalenessSeconds, if provided).

'secondaryPreferred' is the inverse: selecting with mode 'secondary' (with tag_sets and maxStalenessSeconds) and falling back to selecting with mode 'primary' (without tag_sets or maxStalenessSeconds).

Depending on the implementation, this may result in cleaner code.

nearest

The term 'nearest' is unfortunate, as it implies a choice based on geographic locality or absolute lowest latency, neither of which are true.

Instead, and unlike the other read preference modes, 'nearest' does not favor either primaries or secondaries; instead all servers are candidates and are filtered by tag_sets and maxStalenessSeconds.

To always select the server with the lowest RTT, users should use mode 'nearest' without tag_sets or maxStalenessSeconds and set localThresholdMS to zero.

To distribute reads across all members evenly regardless of RTT, users should use mode 'nearest' without tag_sets or maxStalenessSeconds and set localThresholdMS very high so that all servers fall within the latency window.

In both cases, tag_sets and maxStalenessSeconds could be used to further restrict the set of eligible servers, if desired.
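
For example, the two configurations above can be expressed in the connection string (the host name is a placeholder, and the very large localThresholdMS value simply makes the window wide enough to include every suitable server):

    mongodb://example.host/?readPreference=nearest&localThresholdMS=0
    mongodb://example.host/?readPreference=nearest&localThresholdMS=1000000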

Tag set lists

Tag set lists can be configured in the driver in whatever way is natural for the language.

Multi-threaded server selection implementation

The following example uses a single lock for clarity. Drivers are free to implement whatever concurrency model best suits their design.

The following is pseudocode for multi-threaded or asynchronous server selection:

    def getServer(criteria):
        client.lock.acquire()

        now = gettime()
        endTime = now + computed server selection timeout

        log a "server selection started" message
        while true:
            # The topologyDescription keeps track of whether any server has an
            # invalid wire version range
            if not topologyDescription.compatible:
                client.lock.release()
                log a "server selection failed" message
                throw invalid wire protocol range error with details

            if maxStalenessSeconds is set:
                if client minWireVersion < 5 and "<any available server's maxWireVersion < 5>":
                    client.lock.release()
                    throw error

                if topologyDescription.type in (ReplicaSetWithPrimary, ReplicaSetNoPrimary):
                    if (maxStalenessSeconds * 1000 < heartbeatFrequencyMS + idleWritePeriodMS
                            or maxStalenessSeconds < smallestMaxStalenessSeconds):
                        client.lock.release()
                        throw error

            servers = all servers in topologyDescription matching criteria

            if serverSelector is not null:
                servers = serverSelector(servers)

            if servers is not empty:
                in_window = servers within the latency window
                if len(in_window) == 1:
                    selected = in_window[0]
                else:
                    server1, server2 = random two entries from in_window
                    if server1.operation_count <= server2.operation_count:
                        selected = server1
                    else:
                        selected = server2
                selected.operation_count += 1
                client.lock.release()
                return selected

            request that all monitors check immediately
            if the message was not logged already for this operation: 
                log a "waiting for suitable server to become available" message

            # Wait for a new TopologyDescription. condition.wait() releases
            # client.lock while waiting and reacquires it before returning.
            # While a thread is waiting on client.condition, it is awakened
            # early whenever a server check completes.
            timeout_left = endTime - gettime()
            client.condition.wait(timeout_left)

            if now after endTime:
                client.lock.release()
                throw server selection error

Single-threaded server selection implementation

The following is pseudocode for single-threaded server selection:

    def getServer(criteria):
        startTime = gettime()
        loopEndTime = startTime
        maxTime = startTime + computed server selection timeout
        nextUpdateTime = topologyDescription.lastUpdateTime
                       + heartbeatFrequencyMS/1000

        if nextUpdateTime < startTime:
            topologyDescription.stale = true

        while true:

            if topologyDescription.stale:
                scanReadyTime = topologyDescription.lastUpdateTime
                              + minHeartbeatFrequencyMS/1000

                if ((not serverSelectionTryOnce) && (scanReadyTime > maxTime)):
                    throw server selection error with details

                # using loopEndTime below is a proxy for "now" but avoids
                # the overhead of another gettime() call
                sleepTime = scanReadyTime - loopEndTime

                if sleepTime > 0:
                    sleep sleepTime

                rescan all servers
                topologyDescription.lastUpdateTime = gettime()
                topologyDescription.stale = false

            # topologyDescription keeps a record of whether any
            # server has an incompatible wire version range
            if not topologyDescription.compatible:
                topologyDescription.stale = true
                throw invalid wire version range error with details

            if maxStalenessSeconds is set:
                if client minWireVersion < 5 and "<any available server's maxWireVersion < 5>":
                    throw error

                if topologyDescription.type in (ReplicaSetWithPrimary, ReplicaSetNoPrimary):
                    if (maxStalenessSeconds * 1000 < heartbeatFrequencyMS + idleWritePeriodMS
                            or maxStalenessSeconds < smallestMaxStalenessSeconds):
                        throw error

            servers = all servers in topologyDescription matching criteria

            if serverSelector is not null:
                servers = serverSelector(servers)

            if servers is not empty:
                in_window = servers within the latency window
                return random entry from in_window
            else:
                topologyDescription.stale = true

            loopEndTime = gettime()

            if serverSelectionTryOnce:
                if topologyDescription.lastUpdateTime > startTime:
                    throw server selection error with details
            else if loopEndTime > maxTime:
                throw server selection error with details

            if the message was not logged already: 
                log a "waiting for suitable server to become available" message

Server Selection Errors

Drivers should use server descriptions and their error attributes (if set) to return useful error messages.

For example, when there are no members matching the ReadPreference:

  • "No server available for query with ReadPreference primary"
  • "No server available for query with ReadPreference secondary"
  • "No server available for query with ReadPreference " + mode + ", tag set list " + tag_sets + ", and maxStalenessSeconds " + maxStalenessSeconds

Or, if authentication failed:

  • "Authentication failed: [specific error message]"

Here is a sketch of some pseudocode for handling error reporting when errors could be different across servers:

    if there are any available servers:
        error_message = "No servers are suitable for " + criteria
    else if all ServerDescriptions' errors are the same:
        error_message = a ServerDescription.error value
    else:
        error_message = ', '.join(all ServerDescriptions' errors)

Cursors

Cursor operations OP_GET_MORE and OP_KILL_CURSORS do not go through the server selection process. Cursor operations must be sent to the original server that received the query and sent the OP_REPLY. For exhaust cursors, the same socket must be used for OP_GET_MORE until the cursor is exhausted.

Sharded Transactions

Operations that are part of a sharded transaction (after the initial command) do not go through the server selection process. Sharded transaction operations MUST be sent to the original mongos server on which the transaction was started.

The 'text' command and mongos

Note: As of MongoDB 2.6, mongos doesn't distribute the "text" command to secondaries, see SERVER-10947.

However, the "text" command is deprecated in 2.6, so this command-specific helper may become deprecated before this is fixed.

Test Plan

The server selection test plan is given in a separate document that describes the tests and supporting data files: Server Selection Tests

Design Rationale

Use of topology types

The prior version of the read preference spec had only a loose definition of server or topology types. The Server Discovery and Monitoring spec defines these terms explicitly and they are used here for consistency and clarity.

Consistency with mongos

In order to ensure that behavior is consistent regardless of topology type, read preference behaviors are limited to those that mongos can proxy.

For example, mongos ignores read preference 'secondary' when a shard consists of a single server. Therefore, this spec calls for topology type Single to ignore read preferences for consistency.

The spec has been written with the intention that it can apply to both drivers and mongos and the term "client" has been used when behaviors should apply to both. Behaviors that are specific to drivers are largely limited to those for communicating with a mongos.

New localThresholdMS configuration option name

Because this does not apply only to secondaries and does not limit absolute latency, the name secondaryAcceptableLatencyMS is misleading.

The mongos option name localThreshold is also misleading because it has nothing to do with locality, and it lacks the MS units suffix that other time-related configuration options use.

However, given a choice between the two, localThreshold is a more general term. For drivers, we add the MS suffix for clarity about units and consistency with other configuration options.

Random selection within the latency window (single-threaded)

When more than one server is judged to be suitable, the spec calls for random selection to ensure a fair distribution of work among servers within the latency window.

It would be hard to ensure a fair round-robin approach given the potential for servers to come and go. Making newly available servers either first or last could lead to unbalanced work. Random selection has a better fairness guarantee and keeps the design simpler.

operationCount-based selection within the latency window (multi-threaded or async)

As operation execution slows down on a node (e.g. due to degraded server-side performance or increased network latency), checked-out pooled connections to that node will begin to remain checked out for longer periods of time. Assuming at least constant incoming operation load, more connections will then need to be opened against the node to service new operations that it gets selected for, further straining it and slowing it down. This can lead to runaway connection creation scenarios that can cripple a deployment ("connection storms"). As part of DRIVERS-781, the random choice portion of multi-threaded server selection was changed to more evenly spread out the workload among suitable servers in order to prevent any single node from being overloaded. The new steps achieve this by approximating an individual server's load via the number of concurrent operations that node is processing (operationCount) and then routing operations to servers with less load. This should reduce the number of new operations routed towards nodes that are busier and thus increase the number routed towards nodes that are servicing operations faster or are simply less busy. The previous random selection mechanism did not take load into account and could assign work to nodes that were under too much stress already.

As an added benefit, the new approach gives preference to nodes that have recently been discovered and are thus more likely to be alive (e.g. during a rolling restart). Narrowing to two random choices first, however, ensures that new servers aren't overly preferred, preventing a "thundering herd" situation. Additionally, the maxConnecting provisions included in the CMAP specification prevent drivers from crippling new nodes with connection storms.

This approach is based on the "Power of Two Random Choices with Least Connections" load balancing algorithm.
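
A non-normative sketch of that selection step, assuming servers within the latency window are represented as (name, operationCount) pairs:

    import random

    def choose_in_window(in_window):
        # Pick two distinct servers at random and prefer the one with the
        # lower operationCount; a single candidate is returned unchanged.
        if len(in_window) == 1:
            return in_window[0]
        server1, server2 = random.sample(in_window, 2)
        return server1 if server1[1] <= server2[1] else server2

    in_window = [("a:27017", 3), ("b:27017", 12), ("c:27017", 5)]
    print(choose_in_window(in_window))  # one of the three, biased toward lower counts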

An alternative approach to this would be to prefer selecting servers that already have available connections. While that approach could help reduce latency, it does not achieve the benefits of routing operations away from slow servers or of preferring newly introduced servers. Additionally, that approach could lead to the same node being selected repeatedly rather than spreading the load out among all suitable servers.

The SecondaryOk wire protocol flag

In server selection, there is a race condition that could exist between what a selected server type is believed to be and what it actually is.

The SecondaryOk wire protocol flag solves the race problem by communicating to the server whether a secondary is acceptable. The server knows its type and can return a "not writable primary" error if SecondaryOk is false and the server is a secondary.

However, because topology type Single is used for direct connections, we want read operations to succeed even against a secondary, so the SecondaryOk wire protocol flag must be sent to mongods with topology type Single.

(If the server type is Mongos, follow the rules for Passing read preference to mongos and load balancers, even for topology type Single.)

General command method going to primary

The list of commands that can go to secondaries changes over time and depends not just on the command but on parameters. For example, the mapReduce command may or may not be able to be run on secondaries depending on the value of the out parameter.

It significantly simplifies implementation for the general command method to always go to the primary unless an explicit read preference is set, and to rely on users of the general command method to provide a read preference appropriate to the command.

The command-specific helpers will need to implement a check of read preferences against the semantics of the command and its parameters, but keeping this logic close to the command rather than in a generic method is a better design than either delegating this check to the generic method, duplicating the logic in the generic method, or coupling both to another validation method.

Average round trip time calculation

Using an exponentially-weighted moving average avoids having to store and rotate an arbitrary number of RTT observations. All observations count towards the average. The weighting makes recent observations count more heavily while smoothing volatility.

Verbose errors

Error messages should be sufficiently verbose to allow users and/or support engineers to determine the reasons for server selection failures from log or other error messages.

"Try once" mode

Single-threaded drivers in languages like PHP and Perl are typically deployed as many processes per application server. Each process must independently discover and monitor the MongoDB deployment.

When no suitable server is available (due to a partition or misconfiguration), it is better for each request to fail as soon as its process detects a problem, instead of waiting and retrying to see if the deployment recovers.

Minimizing response latency is important for maximizing request-handling capacity and for user experience (e.g. a quick fail message instead of a slow web page).

However, when a request arrives and the topology information is already stale, or no suitable server is known, making a single attempt to update the topology to service the request is acceptable.

A user of a single-threaded driver who prefers resilience in the face of topology problems, rather than short response times, can turn the "try once" mode off. The driver then rescans the topology every minHeartbeatFrequencyMS until a suitable server is found or the timeout expires.

What is the purpose of socketCheckIntervalMS?

Single-threaded clients need to make a compromise: if they check servers too frequently it slows down regular operations, but if they check too rarely they cannot proactively avoid errors.

Errors are more disruptive for single-threaded clients than for multi-threaded. If one thread in a multi-threaded process encounters an error, it warns the other threads not to use the disconnected server. But single-threaded clients are deployed as many independent processes per application server, and each process must throw an error until all have discovered that a server is down.

The compromise specified here balances the cost of frequent checks against the disruption of many errors. The client preemptively checks individual sockets that have not been used in the last socketCheckIntervalMS, which is more frequent by default than heartbeatFrequencyMS defined in the Server Discovery and Monitoring Spec.

The client checks the socket with a "ping" command, rather than "hello" or legacy hello, because it is not checking the server's full state as in the Server Discovery and Monitoring Spec, it is only verifying that the connection is still open. We might also consider a select or poll call to check if the socket layer considers the socket closed, without requiring a round-trip to the server. However, this technique usually will not detect an uncleanly shutdown server or a network outage.

Backwards Compatibility

In general, backwards breaking changes have been made in the name of consistency with mongos and avoiding misleading users about monotonicity.

  • Features removed:

    • Automatic pinning (see What happened to pinning?)
    • Auto retry (replaced by the general server selection algorithm)
    • mongos "high availability" mode (effectively, mongos pinning)
  • Other features and behaviors have changed explicitly

    • Ignoring read preferences for topology type Single
    • Default read preference for the generic command method
  • Changes with grandfather clauses

    • Alternate names for localThresholdMS
    • Pinning for legacy request APIs
  • Internal changes with little user-visibility

    • Clarifying calculation of average RTT

Questions and Answers

What happened to pinning?

The prior read preference spec, which was implemented in the versions of the drivers and mongos released concomitantly with MongoDB 2.2, stated that a thread / client should remain pinned to an RS member as long as that member matched the current mode, tags, and acceptable latency. This increased the odds that reads would be monotonic (assuming no rollback), but had the following surprising consequence:

  1. Thread / client reads with mode 'secondary' or 'secondaryPreferred', gets pinned to a secondary
  2. Thread / client reads with mode 'primaryPreferred', driver / mongos sees that the pinned member (a secondary) matches the mode (which allows for a secondary) and reads from secondary, even though the primary is available and preferable

The old spec also had the inverse problem: reading from the primary with 'secondaryPreferred', except for mongos, which was changed at the last minute before release with SERVER-6565.

This left application developers with two problems:

  1. 'primaryPreferred' and 'secondaryPreferred' acted surprisingly and unpredictably within requests
  2. There was no way to specify a common need: read from a secondary if possible with 'secondaryPreferred', then from primary if possible with 'primaryPreferred', all within a request. Instead an application developer would have to do the second read with 'primary', which would unpin the thread but risk unavailability if only secondaries were up.

Additionally, mongos 2.4 introduced the releaseConnectionsAfterResponse option (RCAR), mongos 2.6 made it the default and mongos 2.8 will remove the ability to turn it off. This means that pinning to a mongos offers no guarantee that connections to shards are pinned. Since we can't provide the same guarantees for replica sets and sharded clusters, we removed automatic pinning entirely and deprecated "requests". See SERVER-11956 and SERVER-12273.

Regardless, even for replica sets, pinning offers no monotonicity because of the ever-present possibility of rollbacks. Through MongoDB 2.6, secondaries did not close sockets on rollback, so a rollback could happen between any two queries without any indication to the driver.

Therefore, an inconsistent feature that doesn't actually do what people think it does has no place in the spec and has been removed. Should the server eventually implement some form of "sessions", this spec will need to be revised accordingly.

Why change from mongos High Availability (HA) to random selection?

Mongos HA has similar problems with pinning, in that one can wind up pinned to a high-latency mongos even if a lower-latency mongos later becomes available.

Selection within the latency window avoids this problem and makes server selection exactly analogous to having multiple suitable servers from a replica set. This is easier to explain and implement.

What happened to auto-retry?

The old auto-retry mechanism was closely connected to server pinning, which has been removed. It also mandated exactly three attempts to carry out a query on different servers, with no way to disable or adjust that value, and only for the first query within a request.

To the extent that auto-retry was trying to compensate for unavailable servers, the Server Discovery and Monitoring spec and new server selection algorithm provide a more robust and configurable way to direct all queries to available servers.

After a server is selected, several error conditions could still occur that make the selected server unsuitable for sending the operation, such as:

  • the server could have shut down the socket (e.g. a primary stepping down),
  • a connection pool could be empty, requiring new connections; those connections could fail to connect or could fail the server handshake

Once an operation is sent over the wire, several additional error conditions could occur, such as:

  • a socket timeout could occur before the server responds
  • the server might send an RST packet, indicating the socket was already closed
  • for write operations, the server might return a "not writable primary" error

This specification does not require nor prohibit drivers from attempting automatic recovery for various cases where it might be considered reasonable to do so, such as:

  • repeating server selection if, after selection, a socket is determined to be unsuitable before a message is sent on it
  • for a read operation, after a socket error, selecting a new server meeting the read preference and resending the query
  • for a write operation, after a "not writable primary" error, selecting a new server (to locate the primary) and resending the write operation

Driver-common rules for retrying operations (and configuring such retries) could be the topic of a different, future specification.

Why is maxStalenessSeconds applied before tag_sets?

The intention of read preference's list of tag sets is to allow a user to prefer the first tag set but fall back to members matching later tag sets. In order to know whether to fall back or not, we must first filter by all other criteria.

Say you have two secondaries:

  • Node 1, tagged {'tag': 'value1'}, estimated staleness 5 minutes
  • Node 2, tagged {'tag': 'value2'}, estimated staleness 1 minute

And a read preference:

  • mode: "secondary"
  • maxStalenessSeconds: 120 (2 minutes)
  • tag_sets: [{'tag': 'value1'}, {'tag': 'value2'}]

If tag sets were applied before maxStalenessSeconds, we would select Node 1 since it matches the first tag set, then filter it out because it is too stale, and be left with no eligible servers.

The user's intent in specifying two tag sets was to fall back to the second set if needed, so we filter by maxStalenessSeconds first, then tag_sets, and select Node 2.
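
The example above, worked through in a small non-normative script (the field names are illustrative), shows that filtering by staleness first leaves Node 2 selected:

    nodes = [
        {"name": "node1", "tags": {"tag": "value1"}, "staleness": 300},  # 5 minutes
        {"name": "node2", "tags": {"tag": "value2"}, "staleness": 60},   # 1 minute
    ]
    tag_sets = [{"tag": "value1"}, {"tag": "value2"}]
    max_staleness = 120

    fresh = [n for n in nodes if n["staleness"] <= max_staleness]  # node1 filtered out
    matching = []
    for tag_set in tag_sets:                                       # first matching set wins
        matching = [n for n in fresh if tag_set.items() <= n["tags"].items()]
        if matching:
            break
    print([n["name"] for n in matching])  # ['node2']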

References

Changelog

  • 2015-06-26: Updated single-threaded selection logic with "stale" and serverSelectionTryOnce.

  • 2015-08-10: Updated single-threaded selection logic to ensure a scan always
    happens at least once under serverSelectionTryOnce if selection fails. Removed the general selection algorithm and put full algorithms for each of the single- and multi-threaded sections. Added a requirement that single-threaded drivers document selection time expectations.

  • 2016-07-21: Updated for Max Staleness support.

  • 2016-08-03: Clarify selection algorithm, in particular that maxStalenessMS
    comes before tag_sets.

  • 2016-10-24: Rename option from "maxStalenessMS" to "maxStalenessSeconds".

  • 2016-10-25: Change minimum maxStalenessSeconds value from 2 *
    heartbeatFrequencyMS to heartbeatFrequencyMS + idleWritePeriodMS (with proper conversions of course).

  • 2016-11-01: Update formula for secondary staleness estimate with the
    equivalent, and clearer, expression of this formula from the Max Staleness Spec

  • 2016-11-21: Revert changes that would allow idleWritePeriodMS to change in the
    future, require maxStalenessSeconds to be at least 90.

  • 2017-06-07: Clarify socketCheckIntervalMS behavior, single-threaded drivers
    must retry selection after checking an idle socket and discovering it is broken.

  • 2017-11-10: Added application-configured server selector.

  • 2017-11-12: Specify read preferences for OP_MSG with direct connection, and
    delete the obsolete comment about direct connections to secondaries getting "not writable primary" errors by design.

  • 2018-01-22: Clarify that $query wrapping is only for OP_QUERY

  • 2018-01-22: Clarify that $out on aggregate follows the "$out Aggregation
    Pipeline Operator" spec and warns if read preference is not primary.

  • 2018-01-29: Remove reference to '$out Aggregation spec'. Clarify runCommand
    selection rules.

  • 2018-12-13: Update tag_set example to use only String values

  • 2019-05-20: Added rule to not send read preference to standalone servers

  • 2019-06-07: Clarify language for aggregate and mapReduce commands that write

  • 2020-03-17: Specify read preferences with support for server hedged reads

  • 2020-10-10: Consider server load when selecting servers within the latency window.

  • 2021-04-07: Adding in behaviour for load balancer mode.

  • 2021-05-12: Removed deprecated URI option in favour of readPreference=secondaryPreferred.

  • 2021-05-13: Updated to use modern terminology.

  • 2021-08-05: Updated $readPreference logic to describe OP_MSG behavior.

  • 2021-09-03: Clarify that wire version check only applies to available servers.

  • 2021-09-28: Note that 5.0+ secondaries support aggregate with write stages
    (e.g. $out and $merge). Clarify setting SecondaryOk wire protocol flag or $readPreference global command argument for replica set topology.

  • 2022-01-19: Require that timeouts be applied per the client-side operations timeout spec

  • 2022-10-05: Remove spec front matter, move footnote, and reformat changelog.

  • 2022-11-09: Add log messages and tests.

  • 2023-08-26: Add list of deprioritized servers for sharded cluster topology.

  • 2024-02-07: Migrated from reStructuredText to Markdown.

[^1]: mongos 3.4 refuses to connect to mongods with maxWireVersion < 5, so it does no additional wire version checks related to maxStalenessSeconds.

Max Staleness

  • Status: Accepted
  • Minimum Server Version: 3.4

Abstract

Read preference gains a new option, "maxStalenessSeconds".

A client (driver or mongos) MUST estimate the staleness of each secondary, based on lastWriteDate values provided in server hello responses, and select only those secondaries whose staleness is less than or equal to maxStalenessSeconds.

Most of the implementation of the maxStalenessSeconds option is specified in the Server Discovery And Monitoring Spec and the Server Selection Spec. This document supplements those specs by collecting information specifically about maxStalenessSeconds.

Meta

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Motivation for Change

Users have often asked for ways to avoid reading from stale secondaries. An application with a geographically distributed replica set may want to prefer nearby members to minimize latency, while at the same time avoiding extremely laggy secondaries to mitigate the risk of very stale reads.

Goals

  • Provide an approximate means of limiting the staleness of secondary reads.
  • Provide a client-side knob to adjust the tradeoff between network-local reads and data recency.
  • Be robust in the face of clock skew between the client and servers, and skew between the primary and secondaries.
  • Avoid "inadvertent primary read preference": prevent a maxStalenessSeconds setting so small it forces all reads to the primary regardless of actual replication lag.
  • Specify how mongos routers and shards track the opTimes of Config Servers as Replica Sets ("CSRS").

Non-Goals

  • Provide a global server-side configuration of max acceptable staleness (see rejected ideas).
  • Support small values for max staleness.
  • Make a consistency guarantee resembling readConcern "afterOpTime".
  • Specify how maxStalenessSeconds interacts with readConcern "afterOpTime" in drivers (distinct from the goal for routers and shards).
  • Compensate for the duration of server checks in staleness estimations.

Specification

API

"maxStalenessSeconds" is a new read preference option, with a positive integer value. It MUST be configurable similar to other read preference options like "readPreference" and "tag_sets". Clients MUST also recognize it in the connection string:

mongodb://host/?readPreference=secondary&maxStalenessSeconds=120

Clients MUST consider "maxStalenessSeconds=-1" in the connection string to mean "no maximum staleness".

A connection string combining a positive maxStalenessSeconds with read preference mode "primary" MUST be considered invalid; this includes connection strings with no explicit read preference mode.

By default there is no maximum staleness.

A driver connected to a replica set requires that maxStalenessSeconds be absent, or be at least smallestMaxStalenessSeconds (90 seconds) and at least heartbeatFrequencyMS + idleWritePeriodMS. The exact mechanism for enforcement is defined in the Server Selection Spec.
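
A non-normative sketch of that validation rule follows; the constant and function names are assumptions that mirror the spec's terms:

    SMALLEST_MAX_STALENESS_SECONDS = 90
    IDLE_WRITE_PERIOD_MS = 10_000

    def validate_max_staleness(max_staleness_seconds, heartbeat_frequency_ms):
        if max_staleness_seconds is None or max_staleness_seconds == -1:
            return  # no maximum staleness configured
        if max_staleness_seconds < SMALLEST_MAX_STALENESS_SECONDS:
            raise ValueError("maxStalenessSeconds must be at least 90")
        if max_staleness_seconds * 1000 < heartbeat_frequency_ms + IDLE_WRITE_PERIOD_MS:
            raise ValueError("maxStalenessSeconds must be at least "
                             "heartbeatFrequencyMS + idleWritePeriodMS, in seconds")

    validate_max_staleness(120, heartbeat_frequency_ms=10_000)   # passes
    # validate_max_staleness(80, heartbeat_frequency_ms=10_000)  # would raise: below 90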

Besides configuring maxStalenessSeconds in the connection string, the API for configuring it in code is not specified; drivers are free to use None, null, -1, or other representations of "no value" to represent "no max staleness".

Replica Sets

Replica set primaries and secondaries implement the following features to support maxStalenessSeconds.

idleWritePeriodMS

An idle primary writes a no-op to the oplog every 10 seconds to refresh secondaries' lastWriteDate values (see SERVER-23892 and primary must write periodic no-ops). This spec refers to this period as idleWritePeriodMS with constant value 10,000.

lastWrite

A primary's or secondary's hello response contains a "lastWrite" subdocument with these fields (SERVER-8858):

  • lastWriteDate: a BSON UTC datetime, the wall-clock time of the primary when it most recently recorded a write to the oplog.
  • opTime: an opaque value representing the position in the oplog of the most recently seen write. Needed for sharding, not used for the maxStalenessSeconds read preference option.

Wire Version

The maxWireVersion MUST be incremented to 5 to indicate that the server includes maxStalenessSeconds features (SERVER-23893).

Client

A client (driver or mongos) MUST estimate the staleness of each secondary, based on lastWriteDate values provided in server hello responses, and select for reads only those secondaries whose estimated staleness is less than or equal to maxStalenessSeconds.

If any server's maxWireVersion is less than 5 and maxStalenessSeconds is a positive number, every attempt at server selection throws an error.

When there is a known primary, a secondary S's staleness is estimated with this formula:

(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS

Where "P" and "S" are the primary's and secondary's ServerDescriptions. All datetimes are in milliseconds. The staleness estimate could be temporarily negative.

When there is no known primary, a secondary S's staleness is estimated with this formula:

SMax.lastWriteDate - S.lastWriteDate + heartbeatFrequencyMS

Where "SMax" is the secondary with the greatest lastWriteDate.

Explanation of Staleness Estimate With Primary

  1. When the client checks the primary, it gets the delta between the primary's lastWriteDate and the client clock. Call this "Client_to_Primary".
  2. When the client checks a secondary, it gets the delta between the secondary's lastWriteDate and the client clock. Call this "Client_to_Secondary".
  3. The difference of these two is an estimate of the delta between the primary's and secondary's lastWriteDate.

Thus:

staleness = Client_to_Secondary - Client_to_Primary
= (S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate)

Finally, add heartbeatFrequencyMS:

(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS

This adjusts for the pessimistic assumption that S stops replicating right after S.lastUpdateTime, so it will be heartbeatFrequencyMS more stale by the time it is checked again. This means S must be fresh enough at S.lastUpdateTime to be eligible for reads from now until the next check, even if it stops replicating.

See the Server Discovery and Monitoring Spec and Server Selection Spec for details of client implementation.

Routers and shards

Background: Shard servers and mongos servers in a sharded cluster with CSRS use readConcern "afterOpTime" for consistency guarantees when querying the shard config.

Besides tracking lastWriteDate, routers and shards additionally track the opTime of CSRS members if they have maxWireVersion 5 or greater. (See Server Discovery and Monitoring Spec for details.)

When a router or shard selects a CSRS member to read from with readConcern like:

readConcern: { afterOpTime: OPTIME }

... then it follows this selection logic:

  1. Make a list of known CSRS data members.
  2. Filter out those whose last known opTime is older than OPTIME.
  3. If no servers remain, select the primary.
  4. Otherwise, select randomly one of the CSRS members whose roundTripTime is within localThresholdMS of the member with the fastest roundTripTime.

Step 4 is the standard localThresholdMS logic from the Server Selection Spec.

This algorithm helps routers and shards select a secondary that is likely to satisfy readConcern "afterOpTime" without blocking.

This feature is only for routers and shards, not drivers. See Future Work.

Reference Implementation

The C Driver (CDRIVER-1363) and Perl Driver (PERL-626).

Estimating Staleness: Example With a Primary and Continuous Writes

Consider a primary P and a secondary S, and a client with heartbeatFrequencyMS set to 10 seconds. Say that the primary's clock is skewed 50 seconds behind the client's.

The client checks P and S at time 60 (meaning 60 seconds past midnight) by the client's clock. The primary reports its lastWriteDate is 10.

Then, S reports its lastWriteDate is 0. The client estimates S's staleness as:

(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
= (60 - 0) - (60 - 10) + 10
= 20 seconds

(Values converted from milliseconds to seconds for the sake of discussion.)

Note that the secondary appears only 10 seconds stale at this moment, but the client adds heartbeatFrequencyMS, pessimistically assuming that the secondary will not replicate at all between now and the next check. If the current staleness plus heartbeatFrequencyMS is still less than maxStalenessSeconds, then we can safely read from the secondary from now until the next check.

The client re-checks P and S 10 seconds later, at time 70 by the client's clock. S responds first with a lastWriteDate of 5: it has fallen 5 seconds further behind. The client updates S's lastWriteDate and lastUpdateTime. The client now estimates S's staleness as:

(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
= (70 - 5) - (60 - 10) + 10
= 25 seconds

Say that P's response arrives 10 seconds later, at client time 80, and reports its lastWriteDate is 30. S's staleness is still 25 seconds:

(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
= (70 - 5) - (80 - 30) + 10
= 25 seconds

The same story as a table:

| Client clock | Primary clock | Event | S.lastUpdateTime | S.lastWriteDate | P.lastUpdateTime | P.lastWriteDate | S staleness |
| ------------ | ------------- | ----- | ---------------- | --------------- | ---------------- | --------------- | ----------- |
| 60 | 10 | P and S respond | 60 | 0 | 60 | 10 | 20 seconds |
| 70 | 20 | S responds | 70 | 5 | 60 | 10 | 25 seconds |
| 80 | 30 | P responds | 70 | 5 | 80 | 30 | 25 seconds |

Estimating Staleness: Example With No Primary

Consider a replica set with secondaries S1 and S2, and no primary. S2 lags 15 seconds farther behind S1 and has not yet caught up. The client has heartbeatFrequencyMS set to 10 seconds.

When the client checks the two secondaries, S1's lastWriteDate is 20 and S2's lastWriteDate is 5.

Because S1 is the secondary with the maximum lastWriteDate, "SMax", its staleness estimate equals heartbeatFrequencyMS:

SMax.lastWriteDate - S.lastWriteDate + heartbeatFrequencyMS = 20 - 20 + 10 = 10

(Since max staleness must be at least heartbeatFrequencyMS + idleWritePeriodMS, S1 is eligible for reads no matter what.)

S2's staleness estimate is:

SMax.lastWriteDate - S.lastWriteDate + heartbeatFrequencyMS
= 20 - 5 + 10
= 25

Estimating Staleness: Example of Worst-Case Accuracy With Idle Replica Set

Consider a primary P and a secondary S, and a client with heartbeatFrequencyMS set to 500 ms. There is no clock skew. (Previous examples show that skew has no effect.)

The primary has been idle for 10 seconds and writes a no-op to the oplog at time 50 (meaning 50 seconds past midnight), and again at time 60.

Before the secondary can replicate the no-op at time 60, the client checks both servers. The primary reports its lastWriteDate is 60, the secondary reports 50.

The client estimates S's staleness as:

(S.lastUpdateTime - S.lastWriteDate) - (P.lastUpdateTime - P.lastWriteDate) + heartbeatFrequencyMS
= (60 - 50) - (60 - 60) + 0.5
= 10.5

The same story as a table:

| Clock | Event | S.lastUpdateTime | S.lastWriteDate | P.lastUpdateTime | P.lastWriteDate | S staleness |
| ----- | ----- | ---------------- | --------------- | ---------------- | --------------- | ----------- |
| 50 | Idle write | | 50 | | 50 | |
| 60 | Idle write begins | | 50 | | 60 | |
| 60 | Client checks P and S | 60 | 50 | 60 | 60 | 10.5 |
| 60 | Idle write completes | | 60 | | 60 | |

In this scenario the actual secondary lag is between 0 and 10 seconds. But the staleness estimate can be as large as:

staleness = idleWritePeriodMS + heartbeatFrequencyMS

To ensure the secondary is always eligible for reads in an idle replica set, we require:

maxStalenessSeconds * 1000 >= heartbeatFrequencyMS + idleWritePeriodMS

Supplemental

Python scripts in this document's source directory:

  • test_max_staleness_spo.py: Uses scipy.optimize to determine worst-case accuracy of the staleness estimate in an idle replica set.
  • test_staleness_estimate.py: Tests whether a client would correctly select a secondary from an idle replica set, given a random distribution of values for maxStalenessSeconds, heartbeatFrequencyMS, lastWriteDate, and lastUpdateTime.

Test Plan

See max-staleness-tests.md, and the YAML and JSON tests in the tests directory.

Design Rationale

Specify max staleness in seconds

Other driver options that are timespans are in milliseconds, for example serverSelectionTimeoutMS. The max staleness option is specified in seconds, however, to make it obvious to users that clients can only enforce large, imprecise max staleness values.

maxStalenessSeconds is part of Read Preferences

maxStalenessSeconds MAY be configurable at the client, database, and collection level, and per operation, the same as other read preference fields are, because users expressed that their tolerance for stale reads varies per operation.

Primary must write periodic no-ops

Consider a scenario in which the primary does not:

  1. There are no writes for an hour.
  2. A client performs a heavy read-only workload with read preference mode "nearest" and maxStalenessSeconds of 90 seconds.
  3. The primary receives a write.
  4. In the brief time before any secondary replicates the write, the client re-checks all servers.
  5. Since the primary's lastWriteDate is an hour ahead of all secondaries', the client only queries the primary.
  6. After heartbeatFrequencyMS, the client re-checks all servers and finds that the secondaries aren't lagging after all, and resumes querying them.

This apparent "replication lag spike" is just a measurement error, but it causes exactly the behavior the user wanted to avoid: a small replication lag makes the client route all queries from the secondaries to the primary.

Therefore an idle primary must execute a no-op every 10 seconds (idleWritePeriodMS) to keep secondaries' lastWriteDate values close to the primary's clock. The no-op also keeps opTimes close to the primary's, which helps mongos choose an up-to-date secondary to read from in a CSRS.

Monitoring software like MongoDB Cloud Manager that charts replication lag will also benefit when spurious lag spikes are solved.

See Estimating Staleness: Example of Worst-Case Accuracy With Idle Replica Set and SERVER-23892.

Smallest allowed value for maxStalenessSeconds

If maxStalenessSeconds is a positive number, it must be at least smallestMaxStalenessSeconds (90 seconds) and at least heartbeatFrequencyMS + idleWritePeriodMS. The exact mechanism for enforcement is defined in the Server Selection Spec.

The justification for heartbeatFrequencyMS + idleWritePeriodMS is technical: If maxStalenessSeconds is set to exactly heartbeatFrequencyMS (converted to seconds), then so long as a secondary lags even a millisecond it is ineligible. Despite the user's read preference mode, the client will always read from the primary.

This is an example of "inadvertent primary read preference": a maxStalenessSeconds setting so small it forces all reads to the primary regardless of actual replication lag. We want to prohibit this effect (see goals).

We also want to ensure that a secondary in an idle replica set is always considered eligible for reads with maxStalenessSeconds. See Estimating Staleness: Example of Worst-Case Accuracy With Idle Replica Set.

Requiring maxStalenessSeconds to be at least 90 seconds is a design choice. If the only requirement were that maxStalenessSeconds be at least heartbeatFrequencyMS + idleWritePeriodMS, then the smallest value would be 20 seconds for multi-threaded drivers (10 second idleWritePeriodMS plus multi-threaded drivers' default 10 second heartbeatFrequencyMS), 70 seconds for single-threaded drivers (whose default heartbeatFrequencyMS is 60 seconds), and 40 seconds for mongos (whose replica set monitor checks servers every 30 seconds).

The smallest configurable value for heartbeatFrequencyMS is 0.5 seconds, so maxStalenessSeconds could be as small as 10.5 when using a driver connected to a replica set, but mongos provides no such flexibility.

Therefore, this spec also requires that maxStalenessSeconds is at least 90:

  • To provide a minimum for all languages and topologies that is easy to document and explain
  • To avoid application breakage when moving from replica set to sharded cluster, or when using the same URI with different drivers
  • To emphasize that maxStalenessSeconds is a low-precision heuristic
  • To avoid the arbitrary-seeming minimum of 70 seconds imposed by single-threaded drivers

All servers must have wire version 5 to support maxStalenessSeconds

Clients with minWireVersion < 5 MUST throw an error if maxStalenessSeconds is set, and any available server in the topology has maxWireVersion less than 5.

An available server is defined in the Server Selection specification.

Servers began reporting lastWriteDate in wire protocol version 5, and clients require some or all servers' lastWriteDate in order to estimate any servers' staleness. The exact requirements of the formula vary according to TopologyType, so this spec makes a simple ruling: if any server is running an outdated version, maxStalenessSeconds cannot be supported.
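
As a non-normative sketch, the check might look like the following, assuming a list of available server descriptions that each expose a maxWireVersion value (the names here are illustrative, not required by the spec):

def check_max_staleness_supported(available_servers, max_staleness_seconds):
    # -1 (or an unset value) means "no max staleness"; nothing to enforce.
    if max_staleness_seconds is None or max_staleness_seconds == -1:
        return
    for server in available_servers:
        if server.max_wire_version < 5:
            raise ValueError(
                "maxStalenessSeconds is set, but an available server does not "
                "support wire version 5")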

Rejected ideas

Add all secondaries' opTimes to primary's hello response

Not needed; each secondary's self-report of its opTime is just as good as the primary's.

Use opTimes from command responses besides hello

An idea was to add opTime to command responses that don't already include it (e.g., "find"), and use these opTimes to update ServerDescriptions more frequently than the periodic hello calls.

But while a server is not being used (e.g., while it is too stale, or while it does not match some other part of the Read Preference), only its periodic hello responses can update its opTime. Therefore, heartbeatFrequencyMS sets a lower bound on maxStalenessSeconds, so there is no benefit in recording each server's opTime more frequently. On the other hand there would be costs: effort adding opTime to all command responses, lock contention getting the opTime on the server and recording it on the client, complexity in the spec and the client code.

Use current time in staleness estimate

A proposed staleness formula estimated the secondary's worst possible staleness:

P.lastWriteDate + (now - P.lastUpdateTime) - S.lastWriteDate

In this proposed formula, the place occupied by "S.lastUpdateTime" in the actual formula is replaced with "now", at the moment in the server selection process when staleness is being estimated.

This formula attempted a worst-case estimate right now: it assumed the primary kept writing after the client checked it, and that the secondary replicated nothing since the client last checked the secondary. The formula was rejected because it would slosh load to and from the secondary during the interval between checks.

For example: Say heartbeatFrequencyMS is 10 seconds and maxStalenessSeconds is set to 25 seconds, and immediately after a secondary is checked its staleness is estimated at 20 seconds. It is eligible for reads until 5 seconds after the check, then it becomes ineligible, causing all queries to be directed to the primary until the next check, 5 seconds later.

Server-side Configuration

We considered a deployment-wide "max staleness" setting that servers would communicate to clients in hello, e.g., "120 seconds is the max staleness." The read preference configuration would then be simplified: "maxStalenessSeconds" would be gone, replaced by "staleOk: true" (presumably the default) and "staleOk: false".

Based on Customer Advisory Board feedback, configuring staleness per-operation on the client side is more useful. We should merely avoid closing the door on a future server-side configuration feature.

References

Complaints about stale reads, and proposed solutions:

Future Work

Future feature to support readConcern "afterOpTime"

If a future spec allows applications to use readConcern "afterOptime", clients should prefer secondaries that have already replicated to that opTime, so reads do not block. This is an extension of the mongos logic for CSRS to applications.

Future feature to support server-side configuration

For this spec, we chose to control maxStalenessSeconds in client code. A future spec could allow database administrators to configure from the server side how much replication lag makes a secondary too stale to read from. (See Server-side Configuration above.) This could be implemented atop the current feature: if a server communicates its staleness configuration in its hello response like:

{ hello: true, maxStalenessSeconds: 30 }

... then a future client can use the value from the server as its default maxStalenessSeconds when there is no client-side setting.

Changelog

  • 2024-04-30: Migrated from reStructuredText to Markdown.

  • 2022-10-05: Remove spec front matter and revise changelog.

  • 2021-09-08: Updated tests to support driver removal of support for server versions older than 3.6.

  • 2021-09-03: Clarify that wire version check only applies to available servers.

  • 2021-04-06: Updated to use hello command.

  • 2016-09-29: Specify "no max staleness" in the URI with "maxStalenessMS=-1" instead of "maxStalenessMS=0".

  • 2016-10-24: Rename option from "maxStalenessMS" to "maxStalenessSeconds".

  • 2016-10-25: Change minimum maxStalenessSeconds value from 2 *
    heartbeatFrequencyMS to heartbeatFrequencyMS + idleWritePeriodMS (with proper conversions of course).

  • 2016-11-21: Revert changes that would allow idleWritePeriodMS to change in the
    future, require maxStalenessSeconds to be at least 90.

Retryable Reads

  • Status: Accepted
  • Minimum Server Version: 3.6

Abstract

This specification is about the ability for drivers to automatically retry any read operation that has not yet received any results—due to a transient network error, a "not writable primary" error after a replica set failover, etc.

This specification will

  • outline how an API for retryable read operations will be implemented in drivers
  • define an option to enable retryable reads for an application.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

Terms

Retryable Error

An error is considered retryable if it meets any of the criteria defined under Retryable Writes: Terms: Retryable Error, minus the final criterion about write concern errors. For convenience, the relevant criteria have been adapted to retryable reads and reproduced below.

An error is considered retryable if it meets any of the following criteria:

  • any network exception (e.g. socket timeout or error)
  • a server error response with any of the following codes:
    | Error Name                         | Error Code |
    | ---------------------------------- | ---------- |
    | ExceededTimeLimit                  | 262        |
    | InterruptedAtShutdown              | 11600      |
    | InterruptedDueToReplStateChange    | 11602      |
    | NotWritablePrimary                 | 10107      |
    | NotPrimaryNoSecondaryOk            | 13435      |
    | NotPrimaryOrSecondary              | 13436      |
    | PrimarySteppedDown                 | 189        |
    | ReadConcernMajorityNotAvailableYet | 134        |
    | ShutdownInProgress                 | 91         |
    | HostNotFound                       | 7          |
    | HostUnreachable                    | 6          |
    | NetworkTimeout                     | 89         |
    | SocketException                    | 9001       |
  • a PoolClearedError
  • Any of the above retryable errors that occur during a connection handshake (including the authentication step). For example, a network error or ShutdownInProgress error encountered when running the hello or saslContinue commands.
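
For illustration only, the classification above can be sketched as follows; the helper name and error attributes are assumptions, not part of the spec, and the same classification applies to errors raised during the connection handshake:

RETRYABLE_READ_ERROR_CODES = {
    262, 11600, 11602, 10107, 13435, 13436, 189, 134, 91, 7, 6, 89, 9001,
}

def is_retryable_read_error(error):
    # Network exceptions and PoolClearedError are always retryable.
    if error.is_network_error or error.is_pool_cleared_error:
        return True
    # Otherwise, retry only on the server error codes listed above.
    return error.code in RETRYABLE_READ_ERROR_CODES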

MongoClient Configuration

This specification introduces the following client-level configuration option.

retryReads

This boolean option determines whether retryable behavior will be applied to all read operations executed within the MongoClient. This option MUST default to true. As with retryable writes, this option MUST NOT be configurable at the level of an individual read operation, collection object, or database object. Drivers that expose a "high" and "core" API (e.g. Java and C# driver) MUST NOT expose a configurable option at the level of an individual read operation, collection object, or database object in "high", but MAY expose the option in "core."

Naming Deviations

As with retryable writes, drivers MUST use the defined name of retryReads for the connection string parameter to ensure portability of connection strings across applications and drivers. If drivers solicit MongoClient options through another mechanism (e.g. an options dictionary provided to the MongoClient constructor), drivers SHOULD use the defined name but MAY deviate to comply with their existing conventions. For example, a driver may use retry_reads instead of retryReads. For any other names in the spec, drivers SHOULD use the defined name but MAY deviate to comply with their existing conventions.
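
For example, the connection string parameter is always spelled retryReads, while a driver's own options mechanism may use an idiomatic spelling (the client constructor below is hypothetical, not any particular driver's API):

uri = "mongodb://localhost:27017/?retryReads=false"   # portable across drivers
# client = SomeDriverClient(uri)
# client = SomeDriverClient("mongodb://localhost:27017", retry_reads=False)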

Requirements for Retryable Reads

Supported Server Versions

Drivers MUST verify server eligibility by ensuring that maxWireVersion is at least 6, because retryable reads require a MongoDB 3.6 standalone, replica set, or sharded cluster, and the MongoDB 3.6 server wire version is 6 as defined in the Server Wire version and Feature List specification.

The minimum server version is 3.6 because

  1. It gives us version parity with retryable writes.
  2. It forces the retry attempt(s) to use the same implicit session, which makes it easier to track operations and kill any errant longer-running operation.
  3. It limits the scope of the implementation (OP_QUERY will not need to be supported).

Supported Read Operations

Drivers MUST support retryability for the following operations:

  • All read operations defined in the CRUD specification i.e.

    • Collection.find()

      • This includes the find operations backing the GridFS API.
    • Collection.aggregate()

      • Only if the pipeline does not include a write stage (e.g. $out, $merge)
    • Collection.distinct()

    • Collection.count()

      • Only required if the driver already provides count()
    • Collection.estimatedDocumentCount()

    • Collection.countDocuments()

  • All read operation helpers in the change streams specification i.e.

    • Collection.watch()
    • Database.watch()
    • MongoClient.watch()
  • All enumeration commands e.g.

    • MongoClient.listDatabases()
    • Database.listCollections()
    • Collection.listIndexes()
  • Any read operations not defined in the aforementioned specifications:

    • Any read operation helpers e.g. Collection.findOne()

Drivers SHOULD support retryability for the following operations:

  • Any driver that provides generic command runners for read commands (with logic to inherit a client-level read concern) SHOULD implement retryability for the read-only command runner.

Most of the above methods are defined in the following specifications:

Unsupported Read Operations

Drivers MUST NOT retry the following operations:

  • Collection.mapReduce()
    • This is due to the "Early Failure on Socket Disconnect" feature not supporting mapReduce.
    • N.B. If mapReduce is executed via a generic command runner for read commands, drivers SHOULD NOT inspect the command to prevent mapReduce from retrying.
  • Cursor.getMore()
  • The generic runCommand helper, even if it is passed a read command.
    • N.B.: This applies only to a generic command runner, which is agnostic about the read/write nature of the command.

Implementing Retryable Reads

Executing Retryable Read Commands

Executing retryable read commands is extremely similar to executing retryable write commands. The following explanation for executing retryable read commands has been adapted from the explanation for executing retryable write commands.

1. Selecting the initial server

The driver selects the initial server for the command as usual. When selecting a server for the first attempt of a retryable read command, drivers MUST allow a server selection error to propagate. In this case, the caller is able to infer that no attempt was made.

2. Determining whether retry should be allowed

A driver then determines if it should attempt to retry next.

2a. When not to allow retry

Drivers MUST attempt to execute the read command exactly once and allow any errors to propagate under any of the following conditions:

  • if retryable reads are not enabled or
  • if the selected server does not support retryable reads or
  • if the session is in a transaction

By allowing the error to propagate, the caller is able to infer that one attempt was made.

2b. When to allow retry

Drivers MUST only attempt to retry a read command if

  • retryable reads are enabled and
  • the selected server supports retryable reads and
  • the previous attempt yields a retryable error

3. Deciding to allow retry, encountering the initial retryable error, and selecting a server

If the driver decides to allow retry and the previous attempt of a retryable read command encounters a retryable error, the driver MUST update its topology according to the Server Discovery and Monitoring spec (see SDAM: Error Handling) and capture this original retryable error. Drivers should then proceed with selecting a server for a retry attempt.

3a. Selecting the server for retry

In a sharded cluster, the server on which the operation failed MUST be provided to the server selection mechanism as a deprioritized server.

If the driver cannot select a server for a retry attempt or the newly selected server does not support retryable reads, retrying is not possible and drivers MUST raise the previous retryable error. In both cases, the caller is able to infer that an attempt was made.

3b. Sending an equivalent command for a retry attempt

After server selection, a driver MUST send a valid command to the newly selected server that is equivalent1 to the initial command sent to the first server. If the driver determines that the newly selected server may not be able to support a command equivalent to the initial command, drivers MUST NOT retry and MUST raise the previous retryable error.

The above requirement can be fulfilled in one of two ways:

  1. During a retry attempt, the driver SHOULD recreate the command while adhering to that operation's specification's server/wire version requirements. If an error occurs while recreating the command, then the driver MUST raise the original retryable error.

    For example, if the wire version dips from W0 to W1 after server selection, and the spec for operation O states that field F should be omitted for wire version W1, then field F should be omitted. If the spec for operation O instead requires the driver to error out when field F is defined while talking to a server with wire version W1, then the driver must error out and raise the original retryable error.

  2. Alternatively, if a driver chooses not to recreate the command as described above, then a driver MUST NOT retry if the server/wire version dips after server selection and MUST raise the original retryable error.

    For example, if the wire version dips after server selection, the driver can choose to not retry and simply raise the original retryable error because there is no guarantee that the lower versioned server can support the original command.

3c. If a retry attempt fails

If a retry attempt also fails, Client Side Operations Timeout (CSOT) is enabled, and the timeout has not yet expired, then the driver MUST jump back to step 2b above in order to allow multiple retry attempts.

Otherwise, drivers MUST update their topology according to the SDAM spec (see SDAM: Error Handling). If an error would not allow the caller to infer that an attempt was made (e.g. connection pool exception originating from the driver), the previous error should be raised. If a retry failed due to another retryable error or some other error originating from the server, that error should be raised instead as the caller can infer that an attempt was made and the second error is likely more relevant (with respect to the current topology state).

If a driver associates server information (e.g. the server address or description) with an error, the driver MUST ensure that the reported server information corresponds to the server that originated the error.

4. Implementation constraints

When retrying a read command, drivers MUST NOT resend the original wire protocol message (see: Can drivers resend the same wire protocol message on retry attempts?).

Pseudocode

The following pseudocode for executing retryable read commands has been adapted from the pseudocode for executing retryable write commands and reflects the flow described above.

/**
 * Checks if a connection supports retryable reads.
 */
function isRetryableReadsSupported(connection) {
  return connection.MaxWireVersion >= RETRYABLE_READS_MIN_WIRE_VERSION;
}

/**
 * Executes a read command in the context of a MongoClient where retryable
 * reads have been enabled. The session parameter may be an implicit or
 * explicit client session (depending on how the CRUD method was invoked).
 */
function executeRetryableRead(command, session) {
  Exception previousError = null;
  retrying = false;
  Server previousServer = null;
  while true {
    if (previousError != null) {
      retrying = true;
    }
    try {
      if (previousServer == null) {
        server = selectServer();
      } else {
        // If a previous attempt was made, deprioritize the previous server
        // where the command failed.
        deprioritizedServers = [ previousServer ];
        server = selectServer(deprioritizedServers);
      }
    } catch (ServerSelectionException exception) {
      if (previousError == null) {
        // If this is the first attempt, propagate the exception.
        throw exception;
      }
      // For retries, propagate the previous error.
      throw previousError;
    }

    try {
      connection = server.getConnection();
    } catch (PoolClearedException poolClearedError) {
      /* PoolClearedException indicates the operation did not even attempt to
       * create a connection, let alone execute the operation. This means we
       * are always safe to attempt a retry. We do not need to update SDAM,
       * since whatever error caused the pool to be cleared will do so itself. */
      if (previousError == null) {
        previousError = poolClearedError;
      }
      /* CSOT is enabled and the operation has timed out. */
      if (timeoutMS != null && isExpired(timeoutMS)) {
        throw previousError;
      }
      continue;
    }

    if ( !isRetryableReadsSupported(connection) || session.inTransaction()) {
      /* If this is the first loop iteration and we determine that retryable
       * reads are not supported, execute the command once and allow any
       * errors to propagate */

      if (previousError == null) {
        return executeCommand(connection, command);
      }

      /* If the server selected for retrying is too old, throw the previous error.
       * The caller can then infer that an attempt was made and failed. This case
       * is very rare, and likely means that the cluster is in the midst of a
       * downgrade. */
      throw previousError;
    }

    /* NetworkException and NotWritablePrimaryException are both retryable errors. If
     * caught, remember the exception, update SDAM accordingly, and proceed with
     * retrying the operation.
     *
     * Exceptions that originate from the driver (e.g. no socket available
     * from the connection pool) are treated as fatal. Any such exception
     * that occurs on the previous attempt is propagated as-is. On retries,
     * the error from the previous attempt is raised as it will be more
     * relevant for the user. */
    try {
      return executeCommand(connection, command);
    } catch (NetworkException networkError) {
      updateTopologyDescriptionForNetworkError(server, networkError);
      previousError = networkError;
      previousServer = server;
    } catch (NotWritablePrimaryException notPrimaryError) {
      updateTopologyDescriptionForNotWritablePrimaryError(server, notPrimaryError);
      previousError = notPrimaryError;
      previousServer = server;
    } catch (DriverException error) {
      if ( previousError != null ) {
        throw previousError;
      }
      throw error;
    }

    if (timeoutMS == null) {
      /* If CSOT is not enabled, allow any retryable error from the second
       * attempt to propagate to our caller, as it will be just as relevant
       * (if not more relevant) than the original error. */
      if (retrying) {
        throw previousError;
      }
    } else if (isExpired(timeoutMS)) {
      /* CSOT is enabled and the operation has timed out. */
      throw previousError;
    }
  }
}

Logging Retry Attempts

As with retryable writes, drivers MAY choose to log retry attempts for read operations. This specification does not define a format for such log messages.

Command Monitoring

As with retryable writes, in accordance with the Command Logging and Monitoring specification, drivers MUST guarantee that each CommandStartedEvent has either a correlating CommandSucceededEvent or CommandFailedEvent and that every "command started" log message has either a correlating "command succeeded" log message or "command failed" log message. If the first attempt of a retryable read operation encounters a retryable error, drivers MUST fire a CommandFailedEvent and emit a "command failed" log message for the retryable error and fire a separate CommandStartedEvent and emit a separate "command started" log message when executing the subsequent retry attempt. Note that the second CommandStartedEvent and "command started" log message may have a different connectionId, since a server is reselected for a retry attempt.

Documentation

  1. Drivers MUST document all read operations that support retryable behavior.
  2. Drivers MUST document that the operations in Unsupported Read Operations do not support retryable behavior.
  3. Driver release notes MUST make it clear to users that they may need to adjust custom retry logic to prevent an application from inadvertently retrying for too long (see Backwards Compatibility for details).
  4. Drivers implementing retryability for their generic command runner for read commands MUST document that mapReduce will be retried if it is passed as a command to the command runner. These drivers also MUST document the potential for degraded performance given that the "Early Failure on Socket Disconnect" feature does not support mapReduce.

Test Plan

See the README for tests.

At a high level, the test plan will cover executing supported read operations within a MongoClient where retryable reads have been enabled, ensuring that reads are retried.

Motivation for Change

Drivers currently have an API for the retryability of write operations but not for read operations. The driver API needs to be extended to include support for retryable behavior for read operations.

Design Rationale

The design of this specification is based on the Retryable Writes specification. It modifies the driver API as little as possible to introduce the concept of retryable behavior for read operations.

Alternative retry strategies (e.g. exponential back-off, incremental intervals, regular intervals, immediate retry, randomization) were considered, but the behavior of a single, immediate retry attempt was chosen in the interests of simplicity as well as consistency with the design for retryable writes.

See the future work section for potential upcoming changes to retry mechanics.

Backwards Compatibility

The API changes to support retryable reads extend the existing API but do not introduce any backward breaking changes. Existing programs that do not make use of retryable reads will continue to compile and run correctly.

N.B.: Applications with custom retry logic that choose to enable retryable reads may need to redo their custom retry logic to ensure that the reads are retried as desired. e.g. if an application has custom logic that retries reads n times and enables retryable reads, then the application could end up retrying reads up to 2n times.

The note above also applies if an application upgrades to a version of the driver that enables retryable reads by default.

Rejected Designs

  1. To improve performance on servers without "Early Failure on Socket Disconnect", we considered using killSessions to automatically kill the previous attempt before running a retry. We decided against this because after killing the session, parts of it still may be running if there are any errors. Additionally, killing sessions takes time because a kill has to talk to every non-config mongod in the cluster (i.e. all the primaries and secondaries of each shard). In addition, in order to protect the system against getting overloaded with these requests, every server allows no more than one killsession operation at a time. Operations that attempt to killsessions while a killsession is running are batched together and run simultaneously after the current one finishes.

Reference Implementation

The C# and Python drivers will provide the reference implementations. See CSHARP-2429 and PYTHON-1674.

Security Implications

None.

Future work

  1. A later specification may allow operations (including read) to be retried any number of times during a singular timeout period.
  2. Any future changes to the applicable parts of the retryable writes specification may also need to be reflected in the retryable reads specification, and vice versa.
  3. We may revisit the decision not to retry Cursor.getMore() (see Q&A).
  4. Once DRIVERS-560 is resolved, tests will be added to allow testing Retryable Reads on MongoDB 3.6. See the test plan for additional information.

Q&A

Why is retrying Cursor.getMore() not supported?

Cursor.getMore() cannot be retried because of the inability for the client to discern if the cursor was advanced. In other words, since the driver does not know if the original getMore() succeeded or not, the driver cannot reliably know if results might be inadvertently skipped.

For example, if a transient network error occurs as a driver requests the second batch of results via a getMore() and the driver were to silently retry the getMore(), it is possible that the server had actually received the initial getMore(). In such a case, the server will advance the cursor once more and return the third batch instead of the desired second batch.

Furthermore, even if the driver could detect such a scenario, it is impossible to return previously iterated data from a cursor because the server currently only allows forward iteration.

It is worth noting that the "Cursors survive primary stepdown" feature avoids this issue in certain common circumstances, so we may revisit the decision to disallow retrying getMore() in the future.

Why are read operations only retried once by default?

Read operations are only retried once for the same reasons that writes are also only retried once. For convenience's sake, that reasoning has been adapted for reads and reproduced below:

The spec concerns itself with retrying read operations that encounter a retryable error (i.e. no response due to network error or a response indicating that the node is no longer a primary). A retryable error may be classified as either a transient error (e.g. dropped connection, replica set failover) or persistent outage. If a transient error results in the server being marked as "unknown", a subsequent retry attempt will allow the driver to rediscover the primary within the designated server selection timeout period (30 seconds by default). If server selection times out during this retry attempt, we can reasonably assume that there is a persistent outage. In the case of a persistent outage, multiple retry attempts are fruitless and would waste time. See How To Write Resilient MongoDB Applications for additional discussion on this strategy.

However when Client Side Operations Timeout is enabled, the driver will retry multiple times until the operation succeeds, a non-retryable error is encountered, or the timeout expires. Retrying multiple times provides greater resilience to cascading failures such as rolling server restarts during planned maintenance events.

Can drivers resend the same wire protocol message on retry attempts?

No. This is in contrast to the answer supplied in the retryable writes specification. However, when retryable writes were implemented, no driver actually chose to resend the same wire protocol message. Today, if a driver attempted to resend the same wire protocol message, this could violate the rules for gossiping $clusterTime: specifically the rule that a driver must send the highest seen $clusterTime.

Additionally, there would be a behavioral difference between a driver resending the same wire protocol message and one that does not. For example, a driver that creates a new wire protocol message could exhibit the following characteristics:

  1. The second attempt to send the read command could have a higher $clusterTime.
  2. If the initial attempt failed with a server error, then the session's operationTime would be advanced and the next read would include a larger readConcern.afterClusterTime.

A driver that resends the same wire protocol message would not exhibit the above characteristics. Thus, in order to avoid this behavioral difference and not violate the rules about gossiping $clusterTime, drivers MUST NOT resend the same wire protocol message.

Why isn't MongoDB 4.2 required?

MongoDB 4.2 was initially considered as a requirement for retryable reads because MongoDB 4.2 implements support for "Early Failure on Socket Disconnect," changing the semantics of socket disconnect to prevent ops from doing work that no client is interested in. This prevents applications from seeing degraded performance when an expensive read is retried. Upon further discussion, we decided that "Early Failure on Socket Disconnect" should not be required to retry reads because the resilience benefit of retryable reads outweighs the minor risk of degraded performance. Additionally, any customers experiencing degraded performance can simply disable retryableReads.

Changelog

  • 2024-04-30: Migrated from reStructuredText to Markdown.

  • 2023-12-05: Add that any server information associated with retryable
    exceptions MUST reflect the originating server, even in the presence of retries.

  • 2023-11-30: Add ReadConcernMajorityNotAvailableYet to the list of error codes
    that should be retried.

  • 2023-11-28: Add ExceededTimeLimit to the list of error codes that should
    be retried.

  • 2023-08-26: Require that in a sharded cluster the server on which the
    operation failed MUST be provided to the server selection mechanism as a deprioritized server.

  • 2023-08-21: Update Q&A that contradicts SDAM transient error logic

  • 2022-11-09: CLAM must apply both events and log messages.

  • 2022-10-18: When CSOT is enabled multiple retry attempts may occur.

  • 2022-10-05: Remove spec front matter, move footnote, and reformat changelog.

  • 2022-01-25: Note that drivers should retry handshake network failures.

  • 2021-04-26: Replaced deprecated terminology; removed requirement to parse error
    message text as MongoDB 3.6+ servers will always return an error code

  • 2021-03-23: Require that PoolClearedErrors are retried

  • 2019-06-07: Mention $merge stage for aggregate alongside $out

  • 2019-05-29: Renamed InterruptedDueToStepDown to InterruptedDueToReplStateChange

1

The first and second commands will be identical unless variations in parameters exist between wire/server versions.

Retryable Writes

  • Status: Accepted
  • Minimum Server Version: 3.6

Abstract

MongoDB 3.6 will implement support for server sessions, which are shared resources within a cluster identified by a session ID. Drivers compatible with MongoDB 3.6 will also implement support for client sessions, which are always associated with a server session and will allow for certain commands to be executed within the context of a server session.

Additionally, MongoDB 3.6 will utilize server sessions to allow some write commands to specify a transaction ID to enforce at-most-once semantics for the write operation(s) and allow for retrying the operation if the driver fails to obtain a write result (e.g. network error or "not writable primary" error after a replica set failover). This specification will outline how an API for retryable write operations will be implemented in drivers. The specification will define an option to enable retryable writes for an application and describe how a transaction ID will be provided to write commands executed therein.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

Terms

Transaction ID
The transaction ID identifies the transaction as part of which the command is running. In a write command where the client has requested retryable behavior, it is expressed by the top-level lsid and txnNumber fields. The lsid component is the corresponding server session ID, which is a BSON value defined in the Driver Session specification. The txnNumber component is a monotonically increasing (per server session), positive 64-bit integer.

ClientSession
Driver object representing a client session, which is defined in the Driver Session specification. This object is always associated with a server session; however, drivers will pool server sessions so that creating a ClientSession will not always entail creation of a new server session. The name of this object MAY vary across drivers.

Retryable Error
An error is considered retryable if it has a RetryableWriteError label in its top-level "errorLabels" field. See Determining Retryable Errors for more information.

Additional terms may be defined in the Driver Session specification.

Naming Deviations

This specification defines the name for a new MongoClient option, retryWrites. Drivers MUST use the defined name for the connection string parameter to ensure portability of connection strings across applications and drivers.

If drivers solicit MongoClient options through another mechanism (e.g. options dictionary provided to the MongoClient constructor), drivers SHOULD use the defined name but MAY deviate to comply with their existing conventions. For example, a driver may use retry_writes instead of retryWrites.

For any other names in the spec, drivers SHOULD use the defined name but MAY deviate to comply with their existing conventions.

MongoClient Configuration

This specification introduces the following client-level configuration option.

retryWrites

This boolean option determines whether retryable behavior will be applied to all supported write operations executed within the MongoClient. This option MUST default to true.

This option MUST NOT be configurable at the level of a database object, collection object, or at the level of an individual write operation.

Requirements for Retryable Writes

Supported Server Versions

Like sessions, retryable writes require a MongoDB 3.6 replica set or shard cluster operating with feature compatibility version 3.6 (i.e. the {setFeatureCompatibilityVersion: 3.6} administrative command has been run on the cluster). Drivers MUST verify server eligibility by ensuring that maxWireVersion is at least six, the logicalSessionTimeoutMinutes field is present in the server's hello or legacy hello response, and the server type is not standalone.

Retryable writes are only supported by storage engines that support document-level locking. Notably, that excludes the MMAPv1 storage engine which is available in both MongoDB 3.6 and 4.0. Since retryWrites defaults to true, Drivers MUST raise an actionable error message when the server returns code 20 with errmsg starting with "Transaction numbers". The replacement error message MUST be:

This MongoDB deployment does not support retryable writes. Please add
retryWrites=false to your connection string.

If the server selected for the first attempt of a retryable write operation does not support retryable writes, drivers MUST execute the write as if retryable writes were not enabled. Drivers MUST NOT include a transaction ID in the write command and MUST NOT retry the command under any circumstances.

In a sharded cluster, it is possible that mongos may appear to support retryable writes but one or more shards in the cluster do not (e.g. replica set shard is configured with feature compatibility version 3.4, a standalone is added as a new shard). In these rare cases, a write command that fans out to a shard that does not support retryable writes may partially fail and an error may be reported in the write result from mongos (e.g. writeErrors array in the bulk write result). This does not constitute a retryable error. Drivers MUST relay such errors to the user.

Supported Write Operations

MongoDB 3.6 will support retryability for some, but not all, write operations.

Supported single-statement write operations include insertOne(), updateOne(), replaceOne(), deleteOne(), findOneAndDelete(), findOneAndReplace(), and findOneAndUpdate().

Supported multi-statement write operations include insertMany() and bulkWrite(). The ordered option may be true or false. For both the collection-level and client-level bulkWrite() methods, a bulk write batch is only retryable if it does not contain any multi: true writes (i.e. UpdateMany and DeleteMany). Drivers MUST evaluate eligibility for each write command sent as part of the bulkWrite() (after order and batch splitting) individually. Drivers MUST NOT alter existing logic for order and batch splitting in an attempt to maximize retryability for operations within a bulk write.

These methods above are defined in the CRUD specification.

Later versions of MongoDB may add support for additional write operations.

Drivers MUST document operations that support retryable behavior and the conditions for which retryability is determined (see: How will users know which operations are supported?). Drivers are not required to exhaustively document all operations that do not support retryable behavior.

Unsupported Write Operations

Write commands specifying an unacknowledged write concern (e.g. {w: 0}) do not support retryable behavior. Drivers MUST NOT add a transaction ID to any write command with an unacknowledged write concern executed within a MongoClient where retryable writes have been enabled. Drivers MUST NOT retry these commands.

Write commands where a single statement might affect multiple documents will not be initially supported by MongoDB 3.6, although this may change in the future. This includes an update command where any statement in the updates sequence specifies a multi option of true or a delete command where any statement in the deletes sequence specifies a limit option of 0. In the context of the CRUD specification, this includes the updateMany() and deleteMany() methods and, in some cases, bulkWrite(). Drivers MUST NOT add a transaction ID to any single- or multi-statement write commands that include one or more multi-document write operations. Drivers MUST NOT retry these commands if they fail to return a response. With regard to bulkWrite(), drivers MUST evaluate eligibility for each write command sent as part of the bulkWrite() (after order and batch splitting) individually.

Write commands other than insert, update, delete, or findAndModify will not be initially supported by MongoDB 3.6, although this may change in the future. This includes, but is not limited to, an aggregate command using a write stage (e.g. $out, $merge). Drivers MUST NOT add a transaction ID to these commands and MUST NOT retry these commands if they fail to return a response.
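
A non-normative sketch of these eligibility rules for a single write command document (the helper is illustrative; field names follow the server's command syntax):

def command_supports_txn_number(command, write_concern):
    # Unacknowledged writes are never retryable.
    if write_concern is not None and write_concern.get("w") == 0:
        return False
    if "update" in command:
        return all(not stmt.get("multi", False) for stmt in command["updates"])
    if "delete" in command:
        return all(stmt.get("limit", 1) != 0 for stmt in command["deletes"])
    # Only insert, update, delete, and findAndModify may carry a transaction ID.
    return ("insert" in command) or ("findAndModify" in command)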

Retryable Writes Within Transactions

In MongoDB 4.0 the only supported retryable write commands within a transaction are commitTransaction and abortTransaction. Therefore drivers MUST NOT retry write commands within transactions even when retryWrites has been set to true on the MongoClient. In addition, drivers MUST NOT add the RetryableWriteError label to any error that occurs during a write command within a transaction (excepting commitTransaction and abortTransaction), even when retryWrites has been set to true on the MongoClient.

Implementing Retryable Writes

Determining Retryable Errors

When connected to a MongoDB instance that supports retryable writes (versions 3.6+), the driver MUST treat all errors with the RetryableWriteError label as retryable. This error label can be found in the top-level "errorLabels" field of the error.

RetryableWriteError Labels

The RetryableWriteError label might be added to an error in a variety of ways:

  • When the driver encounters a network error establishing an initial connection to a server, it MUST add a RetryableWriteError label to that error if the MongoClient performing the operation has the retryWrites configuration option set to true.

  • When the driver encounters a network error communicating with any server version that supports retryable writes, it MUST add a RetryableWriteError label to that error if the MongoClient performing the operation has the retryWrites configuration option set to true.

  • When a CMAP-compliant driver encounters a PoolClearedError during connection check out, it MUST add a RetryableWriteError label to that error if the MongoClient performing the operation has the retryWrites configuration option set to true.

  • For server versions 4.4 and newer, the server will add a RetryableWriteError label to errors or server responses that it considers retryable before returning them to the driver. As new server versions are released, the errors that are labeled with the RetryableWriteError label may change. Drivers MUST NOT add a RetryableWriteError label to any error derived from a 4.4+ server response (i.e. any error that is not a network error).

  • When receiving a command result with an error from a pre-4.4 server that supports retryable writes, the driver MUST add a RetryableWriteError label to errors that meet the following criteria if the retryWrites option is set to true on the client performing the relevant operation:

    • a mongod or mongos response with any of the following error codes in the top-level code field:

      | Error Name                      | Error Code |
      | ------------------------------- | ---------- |
      | InterruptedAtShutdown           | 11600      |
      | InterruptedDueToReplStateChange | 11602      |
      | NotWritablePrimary              | 10107      |
      | NotPrimaryNoSecondaryOk         | 13435      |
      | NotPrimaryOrSecondary           | 13436      |
      | PrimarySteppedDown              | 189        |
      | ShutdownInProgress              | 91         |
      | HostNotFound                    | 7          |
      | HostUnreachable                 | 6          |
      | NetworkTimeout                  | 89         |
      | SocketException                 | 9001       |
      | ExceededTimeLimit               | 262        |
    • a mongod response with any of the previously listed codes in the writeConcernError.code field.

    Drivers MUST NOT add a RetryableWriteError label based on the following:

    • any writeErrors[].code fields in a mongod or mongos response
    • the writeConcernError.code field in a mongos response

    The criteria for retryable errors is similar to the discussion in the SDAM spec's section on Error Handling, but includes additional error codes. See What do the additional error codes mean? for the reasoning behind these additional errors.

To understand why the driver should only add the RetryableWriteError label to an error when the retryWrites option is true on the MongoClient performing the operation, see Why does the driver only add the RetryableWriteError label to errors that occur on a MongoClient with retryWrites set to true?

Note: During a retryable write operation on a sharded cluster, mongos may retry the operation internally, in which case it will not add a RetryableWriteError label to any error that occurs after those internal retries to prevent excessive retrying.

For more information about error labels, see the Transactions specification.
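
Putting the pre-4.4 criteria above into a short, non-normative sketch (the response is assumed to be the parsed command reply; network errors and PoolClearedError are labeled separately, as described in the earlier bullets):

RETRYABLE_WRITE_ERROR_CODES = {
    11600, 11602, 10107, 13435, 13436, 189, 91, 7, 6, 89, 9001, 262,
}

def should_add_retryable_write_label(response, retry_writes, from_mongos):
    if not retry_writes:
        return False
    if response.get("code") in RETRYABLE_WRITE_ERROR_CODES:
        return True
    if from_mongos:
        # Never label based on a mongos writeConcernError (or any writeErrors).
        return False
    write_concern_error = response.get("writeConcernError") or {}
    return write_concern_error.get("code") in RETRYABLE_WRITE_ERROR_CODES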

Generating Transaction IDs

The server requires each retryable write operation to provide a unique transaction ID in its command document. The transaction ID consists of a server session ID and a monotonically increasing transaction number. The session ID is obtained from the ClientSession object, which will have either been passed to the write operation from the application or constructed internally for the operation. Drivers will be responsible for maintaining a monotonically increasing transaction number for each server session used by a ClientSession object. Drivers that pool server sessions MUST preserve the transaction number when reusing a server session from the pool with a new ClientSession (this can be tracked as another property on the driver's object for the server session).

Drivers MUST ensure that each retryable write command specifies a transaction number larger than any previously used transaction number for its session ID.

Since ClientSession objects are not thread safe and may only be used by one thread at a time, drivers should not need to worry about race conditions when incrementing the transaction number.
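
A minimal sketch of this bookkeeping, with assumed names rather than any particular driver's API:

class ServerSession:
    def __init__(self, lsid):
        self.lsid = lsid        # the server session ID document
        self.txn_number = 0     # preserved when the session is reused from the pool

    def next_txn_number(self):
        # Monotonically increasing, positive 64-bit integer per server session.
        self.txn_number += 1
        return self.txn_number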

Behavioral Changes for Write Commands

Drivers MUST automatically add a transaction ID to all supported write commands executed via a specific CRUD method (e.g. updateOne()) or write command method (e.g. executeWriteCommand()) within a MongoClient where retryable writes have been enabled and when the selected server supports retryable writes.

If your driver offers a generic command method on your database object (e.g. runCommand()), it MUST NOT check the user's command document to determine if it is a supported write operation and MUST NOT automatically add a transaction ID. The method should send the user's command document to the server as-is.

This specification does not affect write commands executed within a MongoClient where retryable writes have not been enabled.

Constructing Write Commands

When constructing a supported write command that will be executed within a MongoClient where retryable writes have been enabled, drivers MUST increment the transaction number for the corresponding server session and include the server session ID and transaction number in top-level lsid and txnNumber fields, respectively. lsid is a BSON value (discussed in the Driver Session specification). txnNumber MUST be a positive 64-bit integer (BSON type 0x12).

The following example illustrates a possible write command for an updateOne() operation:

{
  update: "coll",
  lsid: { ... },
  txnNumber: 100,
  updates: [
    { q: { x: 1 }, u: { $inc: { y: 1 } } },
  ],
  ordered: true
}

When constructing multiple write commands for a multi-statement write operation (i.e. insertMany() and bulkWrite()), drivers MUST increment the transaction number for each supported write command in the batch.
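
A hypothetical helper in the spirit of addTransactionIdToCommand from the pseudocode later in this section (it assumes the ServerSession sketch above; each supported command in a batch gets its own, freshly incremented txnNumber):

def add_transaction_id(command, server_session):
    retryable = dict(command)                              # copy, don't mutate
    retryable["lsid"] = server_session.lsid
    retryable["txnNumber"] = server_session.next_txn_number()
    return retryable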

Executing Retryable Write Commands

When selecting a writable server for the first attempt of a retryable write command, drivers MUST allow a server selection error to propagate. In this case, the caller is able to infer that no attempt was made.

If retryable writes is not enabled or the selected server does not support retryable writes, drivers MUST NOT include a transaction ID in the command and MUST attempt to execute the write command exactly once and allow any errors to propagate. In this case, the caller is able to infer that an attempt was made.

If retryable writes are enabled and the selected server supports retryable writes, drivers MUST add a transaction ID to the command. Drivers MUST only attempt to retry a write command if the first attempt yields a retryable error. Drivers MUST NOT attempt to retry a write command on any other error.

If the first attempt of a write command including a transaction ID encounters a retryable error, the driver MUST update its topology according to the SDAM spec (see: Error Handling) and capture this original retryable error.

Drivers MUST then retry the operation as many times as necessary until any one of the following conditions is reached:

  • the operation succeeds.
  • the operation fails with a non-retryable error.
  • CSOT is enabled and the operation times out.
  • CSOT is not enabled and one retry was attempted.

For each retry attempt, drivers MUST select a writable server. In a sharded cluster, the server on which the operation failed MUST be provided to the server selection mechanism as a deprioritized server.

If the driver cannot select a server for a retry attempt or the selected server does not support retryable writes, retrying is not possible and drivers MUST raise the retryable error from the previous attempt. In both cases, the caller is able to infer that an attempt was made.

If a retry attempt also fails, drivers MUST update their topology according to the SDAM spec (see: Error Handling). If an error would not allow the caller to infer that an attempt was made (e.g. connection pool exception originating from the driver) or the error is labeled "NoWritesPerformed", the error from the previous attempt should be raised. If all server errors are labeled "NoWritesPerformed", then the first error should be raised.

If a driver associates server information (e.g. the server address or description) with an error, the driver MUST ensure that the reported server information corresponds to the server that originated the error.

The above rules are implemented in the following pseudo-code:

/**
 * Checks if a server supports retryable writes.
 */
function isRetryableWritesSupported(server) {
  if (server.getMaxWireVersion() < RETRYABLE_WIRE_VERSION) {
    return false;
  }

  if ( ! server.hasLogicalSessionTimeoutMinutes()) {
    return false;
  }

  if (server.isStandalone()) {
    return false;
  }

  return true;
}

/**
 * Executes a write command in the context of a MongoClient where retryable
 * writes have been enabled. The session parameter may be an implicit or
 * explicit client session (depending on how the CRUD method was invoked).
 */
function executeRetryableWrite(command, session) {
  /* Allow ServerSelectionException to propagate to our caller, which can then
   * assume that no attempts were made. */
  server = selectServer("writable");

  /* If the server does not support retryable writes, execute the write as if
   * retryable writes are not enabled. */
  if ( ! isRetryableWritesSupported(server)) {
    return executeCommand(server, command);
  }

  /* Incorporate lsid and txnNumber fields into the command document. These
   * values will be derived from the implicit or explicit session object. */
  retryableCommand = addTransactionIdToCommand(command, session);

  Exception previousError = null;
  retrying = false;
  while true {
    try {
      return executeCommand(server, retryableCommand);
    } catch (Exception currentError) {
      handleError(currentError);

      /* If the error has a RetryableWriteError label, remember the exception
       * and proceed with retrying the operation.
       *
       * IllegalOperation (code 20) with errmsg starting with "Transaction
       * numbers" MUST be re-raised with an actionable error message.
       */
      if (!currentError.hasErrorLabel("RetryableWriteError")) {
        if ( currentError.code == 20 && currentError.errmsg.startsWith("Transaction numbers") ) {
          currentError.errmsg = "This MongoDB deployment does not support retryable...";
        }
        throw currentError;
      }

      /*
       * If the "previousError" is "null", then the "currentError" is the
       * first error encountered during the retry attempt cycle. We must
       * persist the first error in the case where all succeeding errors are
       * labeled "NoWritesPerformed", which would otherwise raise "null" as
       * the error.
       */
      if (previousError == null) {
        previousError = currentError;
      }

      /*
       * For exceptions that originate from the driver (e.g. no socket available
       * from the connection pool), we should raise the previous error if there
       * was one.
       */
      if (currentError is not DriverException && ! currentError.hasErrorLabel("NoWritesPerformed")) {
        previousError = currentError;
      }
    }

    /*
     * We try to select server that is not the one that failed by passing the
     * failed server as a deprioritized server.
     * If we cannot select a writable server, do not proceed with retrying and
     * throw the previous error. The caller can then infer that an attempt was
     * made and failed. */
    try {
      deprioritizedServers = [ server ];
      server = selectServer("writable", deprioritizedServers);
    } catch (Exception ignoredError) {
      throw previousError;
    }

    /* If the server selected for retrying is too old, throw the previous error.
     * The caller can then infer that an attempt was made and failed. This case
     * is very rare, and likely means that the cluster is in the midst of a
     * downgrade. */
    if ( ! isRetryableWritesSupported(server)) {
      throw previousError;
    }

    if (timeoutMS == null) {
      /* If CSOT is not enabled, allow any retryable error from the second
       * attempt to propagate to our caller, as it will be just as relevant
       * (if not more relevant) than the original error. */
      if (retrying) {
        throw previousError;
      }
    } else if (isExpired(timeoutMS)) {
      /* CSOT is enabled and the operation has timed out. */
      throw previousError;
    }
    retrying = true;
  }
}

handleError in the above pseudocode refers to the function defined in the Error handling pseudocode section of the SDAM specification.

When retrying a write command, drivers MUST resend the command with the same transaction ID. Drivers MUST NOT resend the original wire protocol message if doing so would violate rules for gossipping the cluster time (see: Can drivers resend the same wire protocol message on retry attempts?).

In the case of a multi-statement write operation split across multiple write commands, a failed retry attempt will also interrupt execution of any additional write operations in the batch (regardless of the ordered option). This is no different than if a retryable error had been encountered without retryable behavior enabled or supported by the driver. Drivers are encouraged to provide access to an intermediary write result (e.g. BulkWriteResult, InsertManyResult) through the BulkWriteException, in accordance with the CRUD specification.

Logging Retry Attempts

Drivers MAY choose to log retry attempts for write operations. This specification does not define a format for such log messages.

Command Monitoring

In accordance with the Command Logging and Monitoring specification, drivers MUST guarantee that each CommandStartedEvent has either a correlating CommandSucceededEvent or CommandFailedEvent and that every "command started" log message has either a correlating "command succeeded" log message or "command failed" log message. If the first attempt of a retryable write operation encounters a retryable error, drivers MUST fire a CommandFailedEvent and emit a "command failed" log message for the retryable error and fire a separate CommandStartedEvent and "command started" log message when executing the subsequent retry attempt. Note that the second CommandStartedEvent and "command started" log message may have a different connectionId, since a writable server is reselected for the retry attempt.

Each attempt of a retryable write operation SHOULD report a different requestId so that events for each attempt can be properly correlated with one another.

The Command Logging and Monitoring specification states that the operationId field is a driver-generated, 64-bit integer and may be "used to link events together such as bulk write operations." Each attempt of a retryable write operation SHOULD report the same operationId; however, drivers SHOULD NOT use the operationId field to relay information about a transaction ID. A bulk write operation may consist of multiple write commands, each of which may specify a unique transaction ID.

Test Plan

See the README for tests.

At a high level, the test plan will cover the following scenarios for executing supported write operations within a MongoClient where retryable writes have been enabled:

  • Executing the same write operation (and transaction ID) multiple times should yield an identical write result.
  • Test at-most-once behavior by observing that subsequent executions of the same write operation do not incur further modifications to the collection data.
  • Exercise supported single-statement write operations (i.e. deleteOne, insertOne, replaceOne, updateOne, and findAndModify).
  • Exercise supported multi-statement insertMany and bulkWrite operations, which contain only supported single-statement write operations. Both ordered and unordered execution should be tested.

Additional prose tests for other scenarios are also included.

Motivation for Change

Drivers currently have no API for specifying at-most-once semantics and retryable behavior for write operations. The driver API needs to be extended to support this behavior.

Design Rationale

The design of this specification piggy-backs that of the Driver Session specification in that it modifies the driver API as little as possible to introduce the concept of at-most-once semantics and retryable behavior for write operations. A transaction ID will be included in all supported write commands executed within the scope of a MongoClient where retryable writes have been enabled.

Drivers expect the server to yield an error if a transaction ID is included in an unsupported write command. This requires drivers to maintain an allow list and track which write operations support retryable behavior for a given server version (see: Why must drivers maintain an allow list of supported operations?).

While this approach will allow applications to take advantage of retryable write behavior with minimal code changes, it also presents a documentation challenge. Users must understand exactly what can and will be retried (see: How will users know which operations are supported?).

Backwards Compatibility

The API changes to support retryable writes extend the existing API but do not introduce any backward breaking changes. Existing programs that do not make use of retryable writes will continue to compile and run correctly.

Reference Implementation

The C# and C drivers will provide reference implementations. JIRA links will be added here at a later point.

Future Work

Supporting at-most-once semantics and retryable behavior for updateMany and deleteMany operations may become possible once the server implements support for multi-document transactions.

A separate specification for retryable read operations could complement this specification. Retrying read operations would not require client or server sessions and could be implemented independently of retryable writes.

Q & A

What do the additional error codes mean?

The errors HostNotFound, HostUnreachable, NetworkTimeout, and SocketException may be returned by a mongos when it encounters problems routing an operation to a shard. These errors may be transient or localized to that mongos.

Why are write operations only retried once by default?

The spec concerns itself with retrying write operations that encounter a retryable error (i.e. no response due to network error or a response indicating that the node is no longer a primary). A retryable error may be classified as either a transient error (e.g. dropped connection, replica set failover) or persistent outage. In the case of a transient error, the driver will mark the server as "unknown" per the SDAM spec. A subsequent retry attempt will allow the driver to rediscover the primary within the designated server selection timeout period (30 seconds by default). If server selection times out during this retry attempt, we can reasonably assume that there is a persistent outage. In the case of a persistent outage, multiple retry attempts are fruitless and would waste time. See How To Write Resilient MongoDB Applications for additional discussion on this strategy.

However, when Client Side Operations Timeout is enabled, the driver will retry multiple times until the operation succeeds, a non-retryable error is encountered, or the timeout expires. Retrying multiple times provides greater resilience to cascading failures such as rolling server restarts during planned maintenance events.

What if the transaction number overflows?

Since server sessions are pooled and session lifetimes are configurable on the server, it is theoretically possible for the transaction number to overflow if it reaches the limits of a signed 64-bit integer. The spec does not address this scenario. Drivers may decide to handle this as they wish. For example, they may raise a client-side error if a transaction number would overflow, eagerly remove sessions with sufficiently high transaction numbers from the pool in an attempt to limit such occurrences, or simply rely on the server to raise an error when a transaction number is reused.
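
One possible client-side guard, sketched below, is not mandated by the spec; the function name and structure are illustrative only:

INT64_MAX = 2**63 - 1

def next_txn_number(current_txn_number):
    # Raise client-side rather than reuse a transaction number after overflow.
    if current_txn_number >= INT64_MAX:
        raise OverflowError("transaction number would exceed a signed 64-bit integer")
    return current_txn_number + 1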

Why are unacknowledged write concerns unsupported?

The server does not consider the write concern when deciding if a write operation supports retryable behavior. Technically, operations with an unacknowledged write concern can specify a transaction ID and be retried. However, the spec elects not to support unacknowledged write concerns due to various ways that drivers may issue write operations with unacknowledged write concerns.

When using OP_QUERY to issue a write command to the server, a command response is always returned. A write command with an unacknowledged write concern (i.e. {w: 0}) will return a response of {ok: 1}. If a retryable error is encountered (either a network error or "not writeable primary" response), the driver could attempt to retry the operation by executing it again with the same transaction ID.

Some drivers fall back to legacy opcodes (e.g. OP_INSERT) to execute write operations with an unacknowledged write concern. In the future, OP_MSG may allow the server to avoid returning any response for write operations sent with an unacknowledged write concern. In both of these cases, there is no response for which the driver might encounter a retryable error and decide to retry the operation.

Rather than depend on an implementation detail to determine if retryable behavior might apply, the spec has chosen to not support retryable behavior for unacknowledged write concerns and guarantee a consistent user experience across all drivers.

Why must drivers maintain an allow list of supported operations?

Requiring that drivers maintain an allow list of supported write operations is unfortunate. It both adds complexity to the driver's implementation and limits the driver's ability to immediately take advantage of new server functionality (i.e. the driver must be upgraded to support additional write operations).

Several other alternatives were discussed:

  • The server could inform drivers which write operations support retryable behavior in its hello or legacy hello response. This would be a form of feature discovery, for which there is no established protocol. It would also add complexity to the connection handshake.
  • The server could ignore a transaction ID on the first observed attempt of an unsupported write command and only yield an error on subsequent attempts. This would require the server to create a transaction record for unsupported writes to avoid the risk of applying a write twice and ensuring that retry attempts could be differentiated. It also poses a significant problem for sharding if a multi-document write does not reach all shards, since those shards would not know to create a transaction record.
  • The driver could allow more fine-grained control over retryable write behavior by supporting a retryWrites option on the database and collection objects. This would allow users to enable retryWrites on a MongoClient and disable it as needed to execute unsupported write operations, or vice versa. Since we expect the retryWrites option to become less relevant once transactions are implemented, we would prefer not to add the option throughout the driver API.

How will users know which operations are supported?

The initial list of supported operations is already quite permissive. Most CRUD operations are supported apart from updateMany(), deleteMany(), and aggregate() with a write stage (e.g. $out, $merge). Other write operations (e.g. renameCollection) are rare.

That said, drivers will need to clearly document exactly which operations support retryable behavior. In the case of bulkWrite(), which may or may not support retryability, drivers should discuss how eligibility is determined.

Can drivers resend the same wire protocol message on retry attempts?

Since retry attempts entail sending the same command and transaction ID to the server, drivers might consider resending the same wire protocol message in order to avoid constructing a new message and computing its checksum. The server will not complain if it receives two messages with the same requestId, as the field is only used for logging and populating the responseTo field in its replies to the client. That said, re-using a wire protocol message might violate rules for gossipping the cluster time and might also have implications for Command Monitoring, since the original write command and its retry attempt may report the same requestId.

Why can't drivers split bulk write commands to maximize retryability?

In Supported Write Operations, the spec prohibits drivers from altering the existing logic that splits bulkWrite()'s requests parameter into write commands in an attempt to segregate unsupported, multi-document write operations and maximize retryability for other, supported write operations. The reasoning behind this prohibition is that such behavior would conflict with a primary goal of the bulk API in reducing the number of command round-trips to the server.

retryWrites originally defaulted to false, why does it now default to true?

Since the initial release of retryable writes in MongoDB 3.6, testing has shown that the overhead for supported operations is sufficiently small that there is no risk in changing the default. Additionally, the fact that some operations continue to be unsupported for retryable writes (updateMany and deleteMany) does not seem to pose a problem in practice.

Why do drivers have to parse errmsg to determine storage engine support?

There is no reliable way to determine the storage engine in use for shards in a sharded cluster, and replica sets (and shards) can have mixed deployments using different storage engines on different members. This is especially true when a replica set or sharded cluster is being upgraded from one storage engine to another. This could be common when upgrading to MongoDB 4.2, where MMAPv1 is no longer supported.

The server returns error code 20 (IllegalOperation) when the storage engine doesn't support document-level locking and txnNumbers. Error code 20 is used for a large number of different error cases in the server so we need some other way to differentiate this error case from any other. The error code and errmsg are the same in MongoDB 3.6 and 4.0, and the same from a replica set or sharded cluster (mongos just forwards the error from the shard's replica set).

Why does the driver only add the RetryableWriteError label to errors that occur on a MongoClient with retryWrites set to true?

The driver does this to maintain consistency with the MongoDB server. Servers that support the RetryableWriteError label (MongoDB version 4.4 and newer) only add the label to an error when the client has added a txnNumber to the command, which only happens when the retryWrites option is true on the client. For the driver to add the label even if retryWrites is not true would be inconsistent with the server and potentially confusing to developers.

Changelog

  • 2024-05-08: Add guidance for client-level bulkWrite() retryability.

  • 2024-05-02: Migrated from reStructuredText to Markdown.

  • 2024-04-29: Fix the link to the Driver Sessions spec.

  • 2024-01-16: Do not use writeConcernError.code in pre-4.4 mongos response to
    determine retryability. Do not use writeErrors[].code in pre-4.4 server responses to determine retryability.

  • 2023-12-06: Clarify that writes are not retried within transactions.

  • 2023-12-05: Add that any server information associated with retryable
    exceptions MUST reflect the originating server, even in the presence of retries.

  • 2023-10-02: When CSOT is not enabled, one retry attempt occurs.

  • 2023-08-26: Require that in a sharded cluster the server on which the
    operation failed MUST be provided to the server selection mechanism as a deprioritized server.

  • 2022-11-17: Add logic for persisting "currentError" as "previousError" on first
    retry attempt, avoiding raising "null" errors.

  • 2022-11-09: CLAM must apply both events and log messages.

  • 2022-10-18: When CSOT is enabled multiple retry attempts may occur.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2022-01-25: Note that drivers should retry handshake network failures.

  • 2021-11-02: Clarify that error labels are only specified in a top-level field
    of an error.

  • 2021-04-26: Replaced deprecated terminology

  • 2021-03-24: Require that PoolClearedErrors be retried

  • 2020-09-01: State that the driver should only add the RetryableWriteError label
    to network errors when connected to a 4.4+ server.

  • 2020-02-25: State that the driver should only add the RetryableWriteError label
    when retryWrites is on, and make it clear that mongos will sometimes perform internal retries and not return the RetryableWriteError label.

  • 2020-02-10: Remove redundant content in Tests section.

  • 2020-01-14: Add ExceededTimeLimit to the list of error codes that should
    receive a RetryableWriteError label.

  • 2019-10-21: Change the definition of "retryable write" to be based on the
    RetryableWriteError label. Stop requiring drivers to parse errmsg to categorize retryable errors for pre-4.4 servers.

  • 2019-07-30: Drivers must rewrite error messages for error code 20 when
    txnNumber is not supported by the storage engine.

  • 2019-06-07: Mention $merge stage for aggregate alongside $out

  • 2019-05-29: Renamed InterruptedDueToStepDown to InterruptedDueToReplStateChange

  • 2019-03-06: retryWrites now defaults to true.

  • 2019-03-05: Prohibit resending wire protocol messages if doing so would violate
    rules for gossipping the cluster time.

  • 2018-06-07: WriteConcernFailed is not a retryable error code.

  • 2018-04-25: Evaluate retryable eligibility of bulkWrite() commands individually.

  • 2018-03-14: Clarify that retryable writes may fail with a FCV 3.4 shard.

  • 2017-11-02: Drivers should not raise errors if selected server does not support
    retryable writes and instead fall back to non-retryable behavior. In addition to wire protocol version, drivers may check for logicalSessionTimeoutMinutes to determine if a server supports sessions and retryable writes.

  • 2017-10-26: Errors when retrying may be raised instead of the original error
    provided they allow the user to infer that an attempt was made.

  • 2017-10-23: Drivers must document operations that support retryability.

  • 2017-10-23: Raise the original retryable error if server selection or wire
    protocol checks fail during the retry attempt. Encourage drivers to provide intermediary write results after an unrecoverable failure during a bulk write.

  • 2017-10-18: Standalone servers do not support retryable writes.

  • 2017-10-18: Also retry writes after a "not writable primary" error.

  • 2017-10-08: Renamed txnNum to txnNumber and noted that it must be a
    64-bit integer (BSON type 0x12).

  • 2017-08-25: Drivers will maintain an allow list so that only supported write
    operations may be retried. Transaction IDs will not be included in unsupported write commands, irrespective of the retryWrites option.

  • 2017-08-18: retryWrites is now a MongoClient option.

Client Side Operations Timeout

  • Status: Accepted
  • Minimum Server Version: 2.6

Abstract

This specification outlines a new timeoutMS option to govern the amount of time that a single operation can execute before control is returned to the user. This timeout applies to all of the work done to execute the operation, including but not limited to server selection, connection checkout, and server-side execution.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

Terms

min(a, b)
Shorthand for "the minimum of a and b" where a and b are numeric values. For any cases where 0 means "infinite" (e.g. timeoutMS), min(0, other) MUST evaluate to other.

MongoClient Configuration

This specification introduces a new configuration option and deprecates some existing options.

timeoutMS

This 64-bit integer option specifies the per-operation timeout value in milliseconds. The default value is unset, which means this feature is not enabled, i.e. the existing timeout behavior is unchanged (including serverSelectionTimeoutMS, connectTimeoutMS, socketTimeoutMS, etc.). An explicit value of 0 means infinite, though some client-side timeouts like serverSelectionTimeoutMS will still apply. Drivers MUST error if a negative value is specified. This value MUST be configurable at the level of a MongoClient, MongoDatabase, MongoCollection, or a single operation. However, if the option is specified at any level, it cannot be later changed to unset. At each level, the value MUST be inherited from the previous level if it is not explicitly specified. Additionally, some entities like ClientSession and GridFSBucket either inherit timeoutMS from their parent entities or provide options to override it. The behavior for these entities is described in individual sections of this specification.
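
As an illustration of these inheritance and override rules, in the same pseudocode style as the example in the design rationale (method and parameter names are illustrative and vary by driver):

client = MongoClient(uri, timeoutMS=5000)           # client-level default
db = client.database("app")                         # inherits timeoutMS=5000
coll = db.collection("orders", timeoutMS=10000)     # collection-level override
coll.insert_one({"status": "new"}, timeoutMS=200)   # per-operation override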

Drivers for languages that provide an idiomatic API for expressing durations of time (e.g. TimeSpan in .NET) MAY choose to leverage these APIs for the timeoutMS option rather than using int64. Drivers that choose to do so MUST also follow the semantics for special values defined by those types. Such drivers MUST also ensure that there is a way to explicitly set timeoutMS to infinite in the API.

See timeoutMS cannot be changed to unset once it's specified.

Backwards Breaking Considerations

This specification deprecates many existing timeout options and introduces a new exception type that is used to communicate timeout expiration. If drivers need to make backwards-breaking changes to support timeoutMS, the backwards breaking behavior MUST be gated behind the presence of the timeoutMS option. If the timeoutMS option is not set, drivers MUST continue to honor existing timeouts such as socketTimeoutMS. Backwards breaking changes include any changes to exception types thrown by stable API methods or changes to timeout behavior. Drivers MUST document these changes.

In a subsequent major release, drivers SHOULD drop support for legacy timeout behavior and only continue to support the timeout options that are not deprecated by this specification. Once legacy options are removed, drivers MUST make the backwards-breaking behavioral changes described in this specification regardless of whether or not timeoutMS is set by the application.

See the Errors section for explanations of the backwards-breaking changes to error reporting.

Deprecations

The following configuration timeout options MUST be deprecated in favor of timeoutMS:

  • socketTimeoutMS
  • waitQueueTimeoutMS
  • wTimeoutMS

The following options for CRUD methods MUST be deprecated in favor of timeoutMS:

  • maxTimeMS
  • maxCommitTimeMS

Timeout Behavior

The timeoutMS option specifies the best-effort maximum amount of time a single operation can take before control is returned to the application. Drivers MUST keep track of the remaining time before the timeout expires as an operation progresses.

Operations

The timeoutMS option applies to all operations defined in the following specifications:

In addition, it applies to all operations on cursor objects that may perform blocking work (e.g. methods to iterate or close a cursor, any method that reads documents from a cursor into an array, etc).

Validation and Overrides

When executing an operation, drivers MUST ignore any deprecated timeout options if timeoutMS is set on the operation or is inherited from the collection/database/client levels. In addition to being set at these levels, the timeout for an operation can also be expressed via an explicit ClientSession (see Convenient Transactions API). In this case, the timeout on the session MUST be used as the timeoutMS value for the operation. Drivers MUST raise a validation error if an explicit session with a timeout is used and the timeoutMS option is set at the operation level for operations executed as part of a withTransaction callback.

See timeoutMS overrides deprecated timeout options.

Errors

If the timeoutMS option is not set and support for deprecated timeout options has not been dropped but a timeout is encountered (e.g. server selection times out), drivers MUST continue to return existing errors. This ensures that error-handling code in existing applications does not break unless the user opts into using timeoutMS.

If the timeoutMS option is set and the timeout expires, drivers MUST abort all blocking work and return control to the user with an error. This error MUST be distinguished in some way (e.g. custom exception type) to make it easier for users to detect when an operation fails due to a timeout. If the timeout expires during a blocking task, drivers MUST expose, via this new error type, the underlying error returned from the task. The stringified version of the new error type MUST include the stringified version of the underlying error as a substring. For example, if server selection expires and returns a ServerSelectionTimeoutException, drivers must allow users to access that exception from this new error type. If there is no underlying error, drivers MUST add information about when the timeout expiration was detected to the stringified version of the timeout error.
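
A minimal sketch of such an error type, using a hypothetical name MongoOperationTimeoutError and wrapping an underlying cause so that its message appears in the stringified form:

class MongoOperationTimeoutError(Exception):
    def __init__(self, message, cause=None):
        # Include the underlying error's message as a substring, and keep the
        # original error accessible for callers that need to inspect it.
        text = f"{message}: {cause}" if cause is not None else message
        super().__init__(text)
        self.cause = cause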

Error Transformations

When using the new timeout error type, drivers MUST transform timeout errors from external sources into the new error. One such error is the MaxTimeMSExpired server error. When checking for this error, drivers MUST only check that the error code is 50 and MUST NOT check the code name or error message. This error can be present in a top-level response document where the ok value is 0, as part of an error in the writeErrors array, or in a nested writeConcernError document. For example, all three of the following server responses would match this criteria:

{ok: 0, code: 50, codeName: "MaxTimeMSExpired", errmsg: "operation time limit exceeded"}

{ok: 1, writeErrors: [{code: 50, codeName: "MaxTimeMSExpired", errmsg: "operation time limit exceeded"}]}

{ok: 1, writeConcernError: {code: 50, codeName: "MaxTimeMSExpired"}}
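
A sketch of the matching logic, assuming the server response has already been decoded into a document; per the criteria above, only the numeric code 50 is checked:

def contains_max_time_ms_expired(response):
    # Top-level error.
    if response.get("ok") == 0 and response.get("code") == 50:
        return True
    # Error within the writeErrors array.
    for write_error in response.get("writeErrors", []):
        if write_error.get("code") == 50:
            return True
    # Nested write concern error.
    write_concern_error = response.get("writeConcernError")
    return write_concern_error is not None and write_concern_error.get("code") == 50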

Timeouts from other sources besides MongoDB servers MUST also be transformed into this new exception type. These include socket read/write timeouts and HTTP request timeouts.

Blocking Sections for Operation Execution

The following pieces of operation execution are considered blocking:

  1. Implicit session acquisition if an explicit session was not provided for the operation. This is only considered blocking for drivers that perform server selection to determine session support when acquiring implicit sessions.
  2. Server selection
  3. Connection checkout - If maxPoolSize has already been reached for the selected server, this is the amount of time spent waiting for a connection to be available.
  4. Connection establishment - If the pool for the selected server is empty and a new connection is needed, the following pieces of connection establishment are considered blocking:
    1. TCP socket establishment
    2. TLS handshake
      1. All messages sent over the socket as part of the TLS handshake
      2. OCSP verification - HTTP requests sent to OCSP responders.
    3. MongoDB handshake (i.e. initial connection hello)
    4. Authentication
      1. SCRAM-SHA-1, SCRAM-SHA-256, PLAIN: Execution of the command required for the SASL conversation.
      2. GSSAPI: Execution of the commands required for the SASL conversation and requests to the KDC and TGS.
      3. MONGODB-AWS: Execution of the commands required for the SASL conversation and all HTTP requests to ECS and EC2 endpoints.
      4. MONGODB-X509: Execution of the commands required for the authentication conversation.
  5. Client-side encryption
    1. Execution of listCollections commands to get collection schemas.
    2. Execution of find commands against the key vault collection to get encrypted data keys.
    3. Requests to non-local key management servers (e.g. AWS KMS) to decrypt data keys.
    4. Requests to mongocryptd servers.
  6. Socket write to send a command to the server
  7. Socket read to receive the server’s response

The timeoutMS option MUST apply to all blocking sections. Drivers MUST document any exceptions. For example, drivers that do not have full control over OCSP verification might not be able to set timeouts for HTTP requests to responders and would document that OCSP verification could result in an execution time greater than timeoutMS.

Server Selection

If timeoutMS is set, drivers MUST use min(serverSelectionTimeoutMS, remaining timeoutMS), referred to as computedServerSelectionTimeout, as the timeout for server selection and connection checkout. The server selection loop MUST fail with a timeout error once the timeout expires.

After a server has been selected, drivers MUST use the remaining computedServerSelectionTimeout value as the timeout for connection checkout. If a new connection is required, min(connectTimeoutMS, remaining computedServerSelectionTimeout) MUST be used as the timeout for TCP socket establishment. Any network requests required to create or authenticate a connection (e.g. HTTP requests to OCSP responders) MUST use min(operationTimeout, remaining computedServerSelectionTimeout) as a timeout, where operationTimeout is the specified default timeout for the network request. If there is no specified default, these operations MUST use the remaining computedServerSelectionTimeout value. All commands sent during the connection’s handshake MUST use the remaining computedServerSelectionTimeout as their timeoutMS value. Handshake commands MUST also set timeouts per the Command Execution section.
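
The following sketch shows how these minima might be computed, ignoring the time that elapses between steps; the min here treats 0 as infinite, as defined in the Terms section:

def selection_and_connect_timeouts(server_selection_timeout_ms,
                                   remaining_timeout_ms,
                                   connect_timeout_ms):
    # min() where 0 means "infinite", per the Terms section.
    csot_min = lambda a, b: b if a == 0 else (a if b == 0 else min(a, b))
    computed_server_selection_timeout = csot_min(server_selection_timeout_ms,
                                                 remaining_timeout_ms)
    # TCP establishment for a brand-new connection is additionally bounded
    # by connectTimeoutMS.
    tcp_establishment_timeout = csot_min(connect_timeout_ms,
                                         computed_server_selection_timeout)
    return computed_server_selection_timeout, tcp_establishment_timeout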

If timeoutMS is not set and support for waitQueueTimeoutMS has not been removed, drivers MUST continue to exhibit the existing timeout behavior by honoring serverSelectionTimeoutMS for server selection and waitQueueTimeoutMS for connection checkout. If a new connection is required, drivers MUST use connectTimeoutMS as the timeout for socket establishment and socketTimeoutMS as the socket timeout for all handshake commands.

See serverSelectionTimeoutMS is not deprecated and connectTimeoutMS is not deprecated.

Command Execution

If timeoutMS is set, drivers MUST append a maxTimeMS field, derived using the minRoundTripTime of the selected server, to commands executed against a MongoDB server. Note that this value MUST be retrieved during server selection using the servers field of the same TopologyDescription that was used for selection, before the selected server's description can be modified. Otherwise, drivers may be subject to a race condition where a server is reset to the default description (e.g. due to an error in the monitoring thread) after it has been selected but before the RTT is retrieved.

If the minRoundTripTime is less than the remaining timeoutMS, the value of this field MUST be remaining timeoutMS - minRoundTripTime. If not, drivers MUST return a timeout error without attempting to send the message to the server. This is done to ensure that an operation is not routed to the server if it will likely fail with a socket timeout as that could cause connection churn. The maxTimeMS field MUST be appended after all blocking work is complete.
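
A sketch of this derivation, assuming both values are expressed in milliseconds:

def derive_max_time_ms(remaining_timeout_ms, min_round_trip_time_ms):
    # Not enough budget left to cover the network round trip: fail client-side
    # instead of sending a command that would likely hit the socket timeout.
    if min_round_trip_time_ms >= remaining_timeout_ms:
        raise TimeoutError("operation timed out before the command was sent")
    return remaining_timeout_ms - min_round_trip_time_ms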

After wire message construction, drivers MUST check for timeout before writing the message to the server. If the timeout has expired or the amount of time remaining is less than the selected server's minimum RTT, drivers MUST return the connection to the pool and raise a timeout exception. Otherwise, drivers MUST set the connection’s write timeout to the remaining timeoutMS value before writing a message to the server. After the write is complete, drivers MUST check for timeout expiration before reading the server’s response. If the timeout has expired, the connection MUST be closed and a timeout exception MUST be propagated to the application. If it has not, drivers MUST set the connection’s read timeout to the remaining timeoutMS value. The timeout MUST apply to the aggregate of all reads done to receive a server response, not to individual reads. If any read or write calls on the socket fail with a timeout, drivers MUST transform the error into the new timeout exception as described in the Error Transformations section.

If timeoutMS is not set and support for socketTimeoutMS has not been removed, drivers MUST honor socketTimeoutMS as the timeout for socket reads and writes.

See maxTimeMS accounts for server RTT.

Batching

If an operation must be sent to the server in multiple batches (e.g. collection.bulkWrite()), the timeoutMS option MUST apply to the entire operation, not to each individual batch.

Retryability

If an operation requires a retry per the retryable reads or writes specifications and timeoutMS is set, drivers MUST retry operations as many times as possible before the timeout expires or a retry attempt returns a non-retryable error. Once the timeout expires, a timeout error MUST be raised.
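
A minimal sketch of the retry loop when timeoutMS is set; run_attempt and is_retryable stand in for driver internals:

import time

def execute_with_csot_retries(run_attempt, is_retryable, deadline):
    while True:
        try:
            return run_attempt()
        except Exception as exc:
            # Non-retryable errors propagate immediately.
            if not is_retryable(exc):
                raise
            # Once the timeout expires, surface a timeout error instead of retrying.
            if time.monotonic() >= deadline:
                raise TimeoutError("timeoutMS expired during retry") from exc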

See Why don’t drivers use backoff/jitter between retry attempts?.

Client Side Encryption

If automatic client-side encryption or decryption is enabled, the remaining timeoutMS value MUST be used as the timeoutMS when executing listCollections commands to retrieve collection schemas, find commands to get data from the key vault, and any commands against mongocryptd. It MUST also be used as the request timeout for HTTP requests against KMS servers to decrypt data keys. When sending a command to mongocryptd, drivers MUST NOT append a maxTimeMS field. This is to ensure that a maxTimeMS field can be safely appended to the command after it has been marked by mongocryptd and encrypted by libmongocrypt. To determine whether or not the server is a mongocryptd, drivers MUST check that the iscryptd field in the server's description is true.

For explicit encryption and decryption, the ClientEncryptionOpts options type used to construct ClientEncryption instances MUST support a new timeoutMS option, which specifies the timeout for all operations executed on the ClientEncryption object.

See maxTimeMS is not added for mongocryptd.

Background Connection Pooling

Connections created as part of a connection pool’s minPoolSize maintenance routine MUST use connectTimeoutMS as the timeout for connection establishment. After the connection is established, if timeoutMS is set at the MongoClient level, it MUST be used as the timeout for all commands sent as part of the MongoDB or authentication handshakes. The timeout MUST be refreshed after each command. These commands MUST set timeouts per the Command Execution section. If timeoutMS is not set, drivers MUST continue to honor socketTimeoutMS as the socket timeout for handshake and authentication commands.

Server Monitoring

Drivers MUST NOT use timeoutMS for commands executed by the server monitoring and RTT calculation threads.

See Monitoring threads do not use timeoutMS.

Cursors

For operations that create cursors, timeoutMS can either cap the lifetime of the cursor or be applied separately to the original operation and all next calls. To support both of these use cases, these operations MUST support a timeoutMode option. This option is an enum with possible values CURSOR_LIFETIME and ITERATION. The default value depends on the type of cursor being created. Drivers MUST error if timeoutMode is set and timeoutMS is not.

When applying the timeoutMS option to next calls on cursors, drivers MUST ensure it applies to the entire call, not individual commands. For drivers that send getMore requests in a loop when iterating tailable cursors, the timeout MUST apply to the totality of all getMore’s, not to each one individually. If a resume is required for a next call on a change stream, the timeout MUST apply to the entirety of the initial getMore and all commands sent as part of the resume attempt.

For close methods, drivers MUST allow timeoutMS to be overridden if doing so is possible in the language. If explicitly set for the operation, it MUST be honored. Otherwise, if timeoutMS was applied to the operation that created the cursor, it MUST be refreshed for the killCursors command if one is required. Note that this means timeoutMS will be refreshed for the close call even if the cursor was created with a timeoutMode of CURSOR_LIFETIME and the timeout associated with the cursor has expired. The calculated timeout MUST apply to explicit close methods that can be invoked by users as well as implicit destructors that are automatically invoked when exiting resource blocks.

See Cursor close() methods refresh timeoutMS.

Non-tailable Cursors

For non-tailable cursors, the default value of timeoutMode is CURSOR_LIFETIME. If timeoutMS is set, drivers MUST apply it to the original operation and the lifetime of the created cursor. For example, if a find is executed at time T, the find and all getMore’s on the cursor must finish by time T + timeoutMS. When executing next calls on the cursor, drivers MUST use the remaining timeout as the timeoutMS value for the operation but MUST NOT append a maxTimeMS field to getMore commands. If there are documents remaining in a previously retrieved batch, the next method MUST return them even if the timeout has expired and MUST only return a timeout error if a getMore is required.
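
A sketch of the next behavior described above, with buffered_docs and send_get_more standing in for the cursor's internal batch and its getMore machinery:

import time

def cursor_next(buffered_docs, deadline, send_get_more):
    # Documents remaining from a previous batch are returned even if the
    # timeout has already expired.
    if buffered_docs:
        return buffered_docs.pop(0)
    # A timeout error is only raised when another getMore would be required.
    if time.monotonic() >= deadline:
        raise TimeoutError("cursor timeoutMS expired")
    batch = send_get_more()
    if not batch:
        raise StopIteration  # cursor exhausted
    buffered_docs.extend(batch)
    return buffered_docs.pop(0)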

If timeoutMode is set to ITERATION, drivers MUST raise a client-side error if the operation is an aggregate with a $out or $merge pipeline stage. If the operation is not an aggregate with $out or $merge, drivers MUST honor the timeoutMS option for the initial command but MUST NOT append a maxTimeMS field to the command sent to the server. After the operation has executed, the original timeoutMS value MUST also be applied to each next call on the created cursor. Drivers MUST NOT append a maxTimeMS field to getMore commands.

See Non-tailable cursor behavior.

Tailable Cursors

Tailable cursors only support the ITERATION value for the timeoutMode option. This is the default value and drivers MUST error if the option is set to CURSOR_LIFETIME.

Tailable non-awaitData Cursors

If timeoutMS is set, drivers MUST apply it separately to the original operation and to all next calls on the resulting cursor but MUST NOT append a maxTimeMS field to any commands.

Tailable awaitData Cursors

If timeoutMS is set, drivers MUST apply it to the original operation. Drivers MUST also apply the original timeoutMS value to each next call on the resulting cursor but MUST NOT use it to derive a maxTimeMS value for getMore commands. Helpers for operations that create tailable awaitData cursors MUST also support the maxAwaitTimeMS option. Drivers MUST error if this option is set, timeoutMS is set to a non-zero value, and maxAwaitTimeMS is greater than or equal to timeoutMS. If this option is set, drivers MUST use it as the maxTimeMS field on getMore commands.
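
A sketch of this validation rule, assuming None means the option is unset and 0 means infinite for timeoutMS:

def validate_max_await_time_ms(timeout_ms, max_await_time_ms):
    # Error when both options are set, timeoutMS is non-zero, and
    # maxAwaitTimeMS >= timeoutMS: the getMore could not return in time.
    if (max_await_time_ms is not None
            and timeout_ms not in (None, 0)
            and max_await_time_ms >= timeout_ms):
        raise ValueError("maxAwaitTimeMS must be less than timeoutMS")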

See Tailable cursor behavior for rationale regarding both non-awaitData and awaitData cursors.

Change Streams

Driver watch helpers MUST support both timeoutMS and maxAwaitTimeMS options. Drivers MUST error if maxAwaitTimeMS is set, timeoutMS is set to a non-zero value, and maxAwaitTimeMS is greater than or equal to timeoutMS. These helpers MUST NOT support the timeoutMode option as change streams are an abstraction around tailable-awaitData cursors, so they implicitly use ITERATION mode. If set, drivers MUST apply the timeoutMS option to the initial aggregate operation. Drivers MUST also apply the original timeoutMS value to each next call on the change stream but MUST NOT use it to derive a maxTimeMS field for getMore commands. If the maxAwaitTimeMS option is set, drivers MUST use it as the maxTimeMS field on getMore commands.

If a next call fails with a timeout error, drivers MUST NOT invalidate the change stream. The subsequent next call MUST perform a resume attempt to establish a new change stream on the server. Any errors from the aggregate operation done to create a new change stream MUST be propagated to the application. Drivers MUST document that users can either call next again or close the existing change stream and create a new one if a previous next call times out. The documentation MUST suggest closing and re-creating the stream with a higher timeout if the timeout occurs before any events have been received because this is a signal that the server is timing out before it can finish processing the existing oplog.

See Change stream behavior.

Sessions

The SessionOptions used to construct explicit ClientSession instances MUST accept a new defaultTimeoutMS option, which specifies the timeoutMS value for the following operations executed on the session:

  1. commitTransaction
  2. abortTransaction
  3. withTransaction
  4. endSession

If this option is not specified for a ClientSession, it MUST inherit the timeoutMS of its parent MongoClient.

Session checkout

As noted in Blocking Sections for Operation Execution, implicit session checkout can be considered a blocking process for some drivers. Such drivers MUST apply the remaining timeoutMS value to this process when executing an operation. For explicit session checkout, drivers MUST apply the timeoutMS value of the MongoClient to the startSession call if set. Drivers MUST NOT allow users to override timeoutMS for startSession operations.

See timeoutMS cannot be overridden for startSession calls.

Convenient Transactions API

If timeoutMS is set, drivers MUST apply it to the entire withTransaction call. To propagate the timeout to the user-supplied callback, drivers MUST store the timeout as a field on the ClientSession object. This field SHOULD be private to ensure that a user can not modify it while a withTransaction call is in progress. Drivers that cannot make this field private MUST signal that the field should not be accessed or modified by users if there is an idiomatic way to do so in the language (e.g. underscore-prefixed variable names in Python) and MUST document that modification of the field can cause unintended correctness issues for applications. Drivers MUST document that the remaining timeout will not be applied to callback operations that do not use the ClientSession. Drivers MUST also document that overriding timeoutMS for operations executed using the explicit session inside the provided callback will result in a client-side error, as defined in Validation and Overrides. If the callback returns an error and the transaction must be aborted, drivers MUST refresh the timeoutMS value for the abortTransaction operation.

If timeoutMS is not set, drivers MUST continue to exhibit the existing 120 second timeout behavior. Drivers MUST NOT change existing implementations to use timeoutMS=120000 for this case.

See withTransaction communicates timeoutMS via ClientSession and withTransaction refreshes the timeout for abortTransaction.

GridFS API

GridFS buckets MUST inherit timeoutMS from their parent MongoDatabase instance and all methods in the GridFS Bucket API MUST support the timeoutMS option. For methods that create streams (e.g. open_upload_stream), the option MUST cap the lifetime of the entire stream. This MUST include the time taken by any operations executed during stream construction, reads/writes, and close/abort calls. For example, if a stream is created at time T, the final close call on the stream MUST finish all blocking work before time T + timeoutMS. Methods that interact with a user-provided stream (e.g. upload_from_stream) MUST use timeoutMS as the timeout for the entire upload/download operation. If the user-provided streams do not support timeouts, drivers MUST document that the timeout for these methods may be breached if calls to interact with the stream take longer than the remaining timeout. If timeoutMS is set, all cursors created for GridFS API operations MUST internally set the timeoutMode option to CURSOR_LIFETIME.

See GridFS streams behavior.

RunCommand

The behavior of runCommand is undefined if the provided command document includes a maxTimeMS field and the timeoutMS option is set. Drivers MUST document the behavior of runCommand for this case and MUST NOT attempt to check the command document for the presence of a maxTimeMS field.

See runCommand behavior.

Test Plan

See the README.rst in the tests directory.

Motivation for Change

Users have many options to set timeouts for various parts of operation execution including, but not limited to, serverSelectionTimeoutMS, socketTimeoutMS, connectTimeoutMS, maxTimeMS, and wTimeoutMS. As a result, users are often unsure which timeout to use. Because some of these timeouts are additive, it is difficult to set a combination which ensures control will be returned to the user after a specified amount of time. To make timeouts more intuitive, changes are required to the drivers API to deprecate some of the existing timeouts and add a new one to specify the maximum execution time for an entire operation from start to finish.

In addition, automatically retrying reads and writes that failed due to transient network blips or planned maintenance scenarios has improved application resiliency but the original behavior of only retrying once still allowed some errors to be propagated to applications. Supporting a timeout for an entire operation allows drivers to retry operations multiple times while still guaranteeing that an application can get back control once the specified amount of time has elapsed.

Design Rationale

timeoutMS cannot be changed to unset once it's specified

If timeoutMS is specified at any level, it cannot be later changed to unset at a lower level. For example, a user cannot do:

client = MongoClient(uri, timeoutMS=1000)
db = client.database("foo", timeoutMS=None)

This is because drivers return existing exception types if timeoutMS is not specified, but will return new exception types and use new timeout behaviors if it is. Once the user has opted into this behavior, we should not allow them to opt out of it at a lower level. If a user wishes to set the timeout to infinite for a specific database, collection, or operation, they can explicitly set timeoutMS to 0.
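
Continuing the illustrative example above, the supported way to make a specific database unbounded is an explicit zero rather than unset:

client = MongoClient(uri, timeoutMS=1000)
db = client.database("foo", timeoutMS=0)   # explicitly infinite, not "unset"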

serverSelectionTimeoutMS is not deprecated

The original goal of the project was to expose a single timeout and deprecate all others. This was not possible, however, because executing an operation consists of two distinct parts. The first is selecting a server and checking out a connection from its pool. This should have a default timeout because failure to do this indicates that the deployment is not in a healthy state or that there was a configuration error which prevents the driver from successfully connecting. The second is server-side operation execution, which cannot have a default timeout. Some operations finish in a few milliseconds, while others can run for many hours. Adding a default would inevitably break applications. To accomplish both of these goals, serverSelectionTimeoutMS was preserved and is used to timeout the client-side section of operation execution.

connectTimeoutMS is not deprecated

Similar to the reasoning for not deprecating serverSelectionTimeoutMS, socket establishment should have a default timeout because failure to create a socket likely means that the target server is not healthy or there is a network issue. To accomplish this, the connectTimeoutMS option is not deprecated by this specification. Drivers also use connectTimeoutMS to derive a socket timeout for monitoring connections, which are not subject to timeoutMS.

timeoutMS overrides deprecated timeout options

Applying both timeoutMS and a deprecated timeout option like socketTimeoutMS at the same time would lead to confusing semantics that are difficult to document and understand. When first writing this specification, we considered having drivers error in this situation to catch mismatched timeouts as early as possible. However, because timeoutMS can be set at any level, this behavior could lead to unanticipated runtime errors if an application set timeoutMS for a specific operation and the MongoClient used in production was configured with a deprecated timeout option. To have clear semantics and avoid unexpected errors in applications, we decided that timeoutMS should override deprecated timeout options.

maxTimeMS is not added for mongocryptd

The mongocryptd server annotates the provided command to indicate encryption requirements and returns the marked up result. If the command sent to mongocryptd contained maxTimeMS, the final command sent to MongoDB would contain two maxTimeMS fields: one added by the regular MongoClient and another added by the mongocryptd client. To avoid this complication, drivers do not add this field when sending commands to mongocryptd at all. Doing so does not sacrifice any functionality because mongocryptd always runs on localhost and does not perform any blocking work, so execution or network timeouts cannot occur.

maxTimeMS accounts for server RTT

When constructing a command, drivers use the timeoutMS option to derive a value for the maxTimeMS command option and the socket timeout. The full time to round trip a command is (network RTT + server-side execution time). If both maxTimeMS and socket timeout were set to the same value, the server would never be able to respond with a MaxTimeMSExpired error because drivers would hit the socket timeout first and close the connection. This would lead to connection churn if the specified timeout is too low. To allow the server to gracefully error and avoid churn, drivers must account for the network round trip in the maxTimeMS calculation.

Monitoring threads do not use timeoutMS

Using timeoutMS in the monitoring and RTT calculation threads would require another special case in the code that derives maxTimeMS from timeoutMS because the awaitable hello requests sent to 4.4+ servers already have a maxAwaitTimeMS field. Adding maxTimeMS also does not help for non-awaitable hello commands because we expect them to execute quickly on the server. The Server Monitoring spec already mandates that drivers set and dynamically update the read/write timeout of the dedicated connections used in monitoring threads, so we rely on that to time out commands rather than adding complexity to the behavior of timeoutMS.

runCommand behavior

The behavior of runCommand varies across drivers. If the provided command document includes a maxTimeMS field and the timeoutMS option is set, some drivers would overwrite the maxTimeMS field with the value derived from timeoutMS, while others would append a second maxTimeMS field, which would cause a server error on versions 3.4+. To be prescriptive, we could mandate that drivers raise a client-side error in this case, but this would require a potentially expensive lookup in the command document. To avoid this additional cost, drivers are only required to document the behavior and suggest that timeoutMS be used instead of including a manual maxTimeMS field.

Why don't drivers use backoff/jitter between retry attempts?

Earlier versions of this specification proposed adding backoff and/or jitter between retry attempts to avoid connection storming or overloading the server, but we later deemed this unnecessary. If multiple concurrent operations select the same server for a retry and its connection pool is empty, we rely on the maxConnecting parameter introduced in DRIVERS-781 to rate limit new connection attempts, which mitigates the risk of connection storms. Even if the new server has enough connections in its pool to service the operations, recent server versions do very little resource-intensive work until execution reaches the storage layer, which is already guarded by read/write tickets, so we don’t expect the server to be overwhelmed. If we later decide that adding jitter would be useful, it may be easier to do so in the server itself via a ticket-based admission system earlier in the execution stack.

Cursor close() methods refresh timeoutMS

If a cursor times out client-side (e.g. a non-tailable cursor created with timeoutMode=CURSOR_LIFETIME), it’s imperative that drivers make a good-faith effort to close the server-side cursor even though the timeout has expired because failing to do so would leave resources open on the server for a potentially long time. It was decided that timeoutMS will be refreshed for close operations to allow the cursor to be killed server-side.

Non-tailable cursor behavior

There are two usage patterns for non-tailable cursors. The first is to read documents from a cursor into an iterable object, either by explicitly iterating the cursor in a loop or using a language construct like Python list comprehensions. To supply a timeout for the entire process, drivers use timeoutMS to cap the execution time for the initial command and all required getMore’s. This use case also matches the server behavior; if maxTimeMS is set for an operation that creates a non-tailable cursor, the server will use the time limit to cap the total server-side execution time for future getMore’s. Because this type of usage matches the server behavior and is the more common case, this is the default behavior.

The second use case is batch processing, where the user takes advantage of the lazy nature of cursors to process documents from a large collection. In this case, the user does not want all documents from the collection to be in an array because that would require too much memory. To accommodate this use case, drivers support a new timeoutMode option. Users can set the value for this option to ITERATION to have timeoutMS apply to the original command and then separately to each next call. When this option is used, drivers do not set maxTimeMS on the initial command to avoid capping the cursor lifetime in the server.

Tailable cursor behavior

Once a tailable cursor is created, it conceptually lives forever. Therefore, it only makes sense to support timeoutMode=ITERATION for these cursors and drivers error if timeoutMode=CURSOR_LIFETIME is specified.

There are two types of tailable cursors. The first, tailable non-awaitData cursors, support maxTimeMS for the original command but not for any getMore requests. However, setting maxTimeMS on the original command also incorrectly caps the server-side execution time for future getMore’s (SERVER-51153). This is undesirable behavior because it does not match the guarantees made by timeoutMode=ITERATION. To work around this, drivers honor timeoutMS for both the original operation and all getMore’s but only use it to derive client-side timeouts and do not append a maxTimeMS field to any commands. The server-side execution time is enforced via socket timeouts.

The second type is tailable awaitData cursors. The server supports the maxTimeMS option for the original command. For getMore’s, the option is supported, but instead of limiting the server-side execution time, it specifies how long the server should wait for new data to arrive if it reaches the end of the capped collection and the batch is still empty. If no new data arrives within that time limit, the server will respond with an empty batch. For these cursors, drivers support both the timeoutMS and maxAwaitTimeMS options. The timeoutMS option is used to derive client-side timeouts, while the maxAwaitTimeMS option is used as the maxTimeMS field for getMore commands. These values have distinct meanings, so supporting both yields a more robust, albeit verbose, API. Drivers error if maxAwaitTimeMS is greater than or equal to timeoutMS because in that case, getMore requests would not succeed if the batch was empty: the server would wait for maxAwaitTimeMS, but the driver would close the socket after timeoutMS.

Change stream behavior

Change streams internally behave as tailable awaitData cursors, so the behavior of the timeoutMS option is the same for both. The main difference is that change streams are resumable and drivers automatically perform resume attempts when they encounter transient errors. This allows change streams to be resilient to timeouts. If timeoutMS expires during a next call, drivers can’t auto-resume, but they can make sure the change stream is not invalidated so the user can call next again. In this case, the subsequent call would perform the resume without doing a getMore first.

withTransaction communicates timeoutMS via ClientSession

Because the withTransaction API doesn’t allow drivers to plumb down the remaining timeout into the user-provided callback, this spec requires the remaining timeout to be stored on the ClientSession. Operations in the callback that run under that ClientSession can then extract the timeout from the session and apply it. To avoid confusing validation semantics, operations error if there is a timeout on the session but also an overridden timeout for the operation. It’s possible that the ability to communicate timeouts for a block of operations via a ClientSession is useful as a general purpose API, but we’ve decided to make it private until there are other known use cases.

withTransaction refreshes the timeout for abortTransaction

If the user-provided callback to withTransaction times out, it could leave a transaction running on the server. It’s imperative that drivers make an effort to abort the open transaction because failing to do so could result in the collections and databases affected by the transaction being locked for a long period of time, which could cause applications to stall. Because timeoutMS has expired before drivers attempt to abort the transaction, we require drivers to refresh it and apply the original value to the execution of the abortTransaction operation. This can cause the entire withTransaction call to take up to 2*timeoutMS, but it was decided that this risk is worthwhile given the importance of transaction cleanup.

GridFS streams behavior

Streams created by GridFS API operations (e.g. by open_upload_stream and open_download_stream) present a challenge for this specification. These types of streams execute multiple operations, but there can be artificial gaps between operations if the application does not invoke the stream functions for long periods of time. Generally, we expect users to upload or download an entire file as quickly as possible, so we decided to have timeoutMS cap the lifetime of the created stream. The other option was to apply the entire timeoutMS value to each operation executed by the stream, but streams perform many hidden operations, so this approach could cause an upload/download to take much longer than expected.

timeoutMS cannot be overridden for startSession calls

In general, users can override timeoutMS at the level of a single operation. The startSession operation, however, only inherits timeoutMS from the MongoClient and does not allow the option to be overridden. This was a conscious API design decision because drivers are moving towards only supporting MongoDB versions 3.6 and higher, so sessions will always be supported. Adding an override for startSession would introduce a new knob and increase the API surface of drivers without providing a significant benefit.

Drivers use minimum RTT to short circuit operations

A previous version of this spec used the 90th percentile RTT to short circuit operations that might otherwise fail with a socket timeout. We decided to change this logic to avoid canceling operations that may have a high chance of succeeding and also remove a dependency on t-digest. Instead, drivers use the minimum RTT from the last 10 samples, or 0 until at least 2 samples have been recorded.

Future work

Modify GridFS streams behavior via new options

As explained in the design rationale, drivers use timeoutMS to cap the entire lifetime of streams created by GridFS operations. If we find that users are often encountering timeout errors when using these APIs due to the time spent during non-MongoDB operations (e.g. streaming data read from a GridFS stream into another data store), we could consider toggling GridFS behavior via an option similar to timeoutMode for cursors. To avoid backwards-breaking behavioral changes, the default would continue to cap the stream lifetime but there could be another mode that refreshes the timeout for each database operation. This would mimic using timeoutMode=ITERATION for cursors.

Changelog

  • 2023-12-07: Migrated from reStructuredText to Markdown.
  • 2022-11-17: Use minimum RTT for maxTimeMS calculation instead of 90th percentile RTT.
  • 2022-10-05: Remove spec front matter.
  • 2022-01-19: Initial version.

Driver Sessions Specification

  • Status: Accepted
  • Minimum Server Version: 3.6

Abstract

Version 3.6 of the server introduces the concept of logical sessions for clients. A session is an abstract concept that represents a set of sequential operations executed by an application that are related in some way. This specification is limited to how applications start and end sessions. Other specifications define various ways in which sessions are used (e.g. causally consistent reads, retryable writes, or transactions).

This specification also discusses how drivers participate in distributing the cluster time throughout a deployment, a process known as "gossipping the cluster time". While gossipping the cluster time is somewhat orthogonal to sessions, any driver that implements sessions MUST also implement gossipping the cluster time, so it is included in this specification.

Definitions

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Terms

ClientSession
The driver object representing a client session and the operations that can be performed on it. Depending on the language a driver is written in this might be an interface or a class. See also ServerSession.

Deployment
A set of servers that are all part of a single MongoDB cluster. We avoid the word "cluster" because some people interpret "cluster" to mean "sharded cluster".

Explicit session
A session that was started explicitly by the application by calling startSession and passed as an argument to an operation.

MongoClient
The root object of a driver's API. MAY be named differently in some drivers.

Implicit session
A session that was started implicitly by the driver because the application called an operation without providing an explicit session.

MongoCollection
The driver object representing a collection and the operations that can be performed on it. MAY be named differently in some drivers.

MongoDatabase
The driver object representing a database and the operations that can be performed on it. MAY be named differently in some drivers.

ServerSession
The driver object representing a server session. This type is an implementation detail and does not need to be public. See also ClientSession.

Server session ID
A server session ID is a token used to identify a particular server session. A driver can ask the server for a session ID using the startSession command or it can generate one locally (see Generating a Session ID locally).

Session
A session is an abstract concept that represents a set of sequential operations executed by an application that are related in some way. Other specifications define the various ways in which operations can be related, but examples include causally consistent reads and retryable writes.

Topology
The current configuration and state of a deployment.

Unacknowledged writes
Unacknowledged writes are write operations that are sent to the server without waiting for a reply acknowledging the write. See the "When using unacknowledged writes" section below for information on how unacknowledged writes interact with sessions.

Network error
Any network exception writing to or reading from a socket (e.g. a socket timeout or error).

Specification

Drivers currently have no concept of a session. The driver API will be expanded to provide a way for applications to start and end sessions and to execute operations in the context of a session. The goal is to expand the API in a way that introduces no backward breaking changes. Existing applications that don't use sessions don't need to be changed, and new applications that don't need sessions can continue to be written using the existing API.

To use sessions an application will call new (or overloaded) methods that take a session parameter.

Naming variations

This specification defines names for new methods and types. To the extent possible, these names SHOULD be used by drivers. However, where a driver and/or language's naming conventions differ, those naming conventions SHOULD be used. For example, a driver might name a method StartSession or start_session instead of startSession, or might name a type client_session instead of ClientSession.

High level summary of the API changes for sessions

This section is just a high level summary of the new API. Details are provided further on.

Applications start a new session like this:

options = new SessionOptions(/* various settings */);
session = client.startSession(options);

The SessionOptions will be individually defined in several other specifications. It is expected that the set of SessionOptions will grow over time as sessions are used for new purposes.

Applications use a session by passing it as an argument to operation methods. For example:

collection.InsertOne(session /* etc. */)
collection.UpdateOne(session /* etc. */)

Applications end a session like this:

session.endSession()

This specification does not deal with multi-document transactions, which are covered in their own specification.

MongoClient changes

MongoClient interface summary

class SessionOptions {
  // various other options as defined in other specifications
}

interface MongoClient {
  ClientSession startSession(SessionOptions options);
  // other existing members of MongoClient
}

Each new member is documented below.

While it is not part of the public API, MongoClient also needs a private (or internal) clusterTime member (containing either a BSON document or null) to record the highest clusterTime observed in a deployment (as described below in Gossipping the cluster time).

startSession

The startSession method starts a new ClientSession with the provided options.

It MUST NOT be possible to change the options provided to startSession after startSession has been called. This can be accomplished by making the SessionOptions class immutable or using some equivalent mechanism that is idiomatic for your language.

It is valid to call startSession with no options set. This will result in a ClientSession that has no effect on the operations performed in the context of that session, other than to include a session ID in commands sent to the server.

The SessionOptions MAY be a strongly typed class in some drivers, or MAY be a loosely typed dictionary in other drivers. Drivers MUST define SessionOptions in such a way that new options can be added in a backward compatible way (it is acceptable for backward compatibility to be at the source level).

A ClientSession MUST be associated with a ServerSession at the time startSession is called. As an implementation optimization drivers MUST reuse ServerSession instances across multiple ClientSession instances subject to the rule that a server session MUST NOT be used by two ClientSession instances at the same time (see the Server Session Pool section). Additionally, a ClientSession may only ever be associated with one ServerSession for its lifetime.

Drivers MUST NOT check for session support in startSession. Instead, if sessions are not supported, the error MUST be reported the first time the session is used for an operation (See How to Tell Whether a Connection Supports Sessions).

Explicit vs implicit sessions

An explicit session is one started explicitly by the application by calling startSession. An implicit session is one started implicitly by the driver because the application called an operation without providing an explicit session. Internally, a driver must be able to distinguish between explicit and implicit sessions, but no public API for this is necessary because an application will never see an implicit session.

The motivation for starting an implicit session for all methods that don't take an explicit session parameter is to make sure that all commands that are sent to the server are tagged with a session ID. This improves the ability of an operations team to monitor (and kill if necessary) long-running operations. Tagging an operation with a session ID is especially useful if a deployment-wide operation needs to be killed.

Authentication

When using authentication, using a session requires that only a single user be authenticated. Drivers that still support authenticating multiple users at once MAY continue to do so, but MUST NOT allow sessions to be used under such circumstances.

If startSession is called when multiple users are authenticated drivers MUST raise an error with the error message "Cannot call startSession when multiple users are authenticated."

If a driver allows authentication to be changed on the fly (presumably few still do) the driver MUST either prevent ClientSession instances from being used with a connection that doesn't have matching authentication or MUST return an error if such use is attempted.

ClientSession

ClientSession instances are not thread safe or fork safe. They can only be used by one thread or process at a time.

Drivers MUST document the thread-safety and fork-safety limitations of sessions. Drivers MUST NOT attempt to detect simultaneous use by multiple threads or processes (see Q&A for the rationale).

ClientSession interface summary:

interface ClientSession {
    MongoClient client;
    Optional<BsonDocument> clusterTime;
    SessionOptions options;
    BsonDocument sessionId;

    void advanceClusterTime(BsonDocument clusterTime);
    void endSession();
}

While it is not part of the public API, a ClientSession also has a private (or internal) reference to a ServerSession.

Each member is documented below.

client

This property returns the MongoClient that was used to start this ClientSession.

clusterTime

This property returns the most recent cluster time seen by this session. If no operations have been executed using this session this value will be null unless advanceClusterTime has been called. This value will also be null when a cluster does not report cluster times.

When a driver is gossipping the cluster time it should send the more recent of the ClientSession's and the MongoClient's clusterTime values.

options

This property returns the SessionOptions that were used to start this ClientSession.

sessionId

This property returns the session ID of this session. Note that since ServerSessions are pooled, different ClientSession instances can have the same session ID, but never at the same time.

advanceClusterTime

This method advances the clusterTime for a session. If the new clusterTime is greater than the session's current clusterTime then the session's clusterTime MUST be advanced to the new clusterTime. If the new clusterTime is less than or equal to the session's current clusterTime then the session's clusterTime MUST NOT be changed.

This method MUST NOT advance the clusterTime in MongoClient because we have no way of verifying that the supplied clusterTime is valid. If the clusterTime in MongoClient were set to an invalid value all future operations with this MongoClient would result in the server returning an error. The clusterTime in MongoClient should only be advanced with a $clusterTime received directly from a server.

endSession

This method ends a ClientSession.

In languages that have idiomatic ways of disposing of resources, drivers MAY support that in addition to or instead of endSession, whichever is more idiomatic for the language. For example, in the .NET driver ClientSession would implement IDisposable and the application could choose to call session.Dispose or put the session in a using statement instead of calling session.endSession.

A driver MUST allow multiple calls to endSession (or Dispose). All calls after the first one are ignored.

Conceptually, calling endSession implies ending the corresponding server session (by calling the endSessions command). As an implementation detail drivers SHOULD cache server sessions for reuse (see Server Session Pool).

Once a ClientSession has ended, drivers MUST report an error if any operations are attempted with that ClientSession.

ServerSession

A ServerSession is the driver object that tracks a server session. This object is an implementation detail and does not need to be public. Drivers may store this information however they choose; this data structure is defined here merely to describe the operation of the server session pool.

ServerSession interface summary

interface ServerSession {
  BsonDocument sessionId;
  DateTime lastUse;
}

sessionId

This property returns the server session ID.

lastUse

The driver MUST update the value of this property with the current DateTime every time the server session ID is sent to the server. This allows the driver to track with reasonable accuracy the server's view of when a server session was last used.

Creating a ServerSession

When a driver needs to create a new ServerSession instance the only information it needs is the session ID to use for the new session. It can either get the session ID from the server by running the startSession command, or it can generate it locally.

In either case, the lastUse field of the ServerSession MUST be set to the current time when the ServerSession is created.

Generating a session ID locally

Running the startSession command to get a session ID for a new session requires a round trip to the server. As an optimization the server allows drivers to generate new session IDs locally and to just start using them. When a server sees a new session ID that it has never seen before it simply assumes that it is a new session.

A session ID is a BsonDocument that has the following form:

interface SessionId {
  id: UUID
}

Where the UUID is encoded as a BSON binary value of subtype 4.

The id field of the session ID is a version 4 UUID that must comply with the format described in RFC 4122. Section 4.4 describes an algorithm for generating correctly-versioned UUIDs from a pseudo-random number generator.

If a driver is unable to generate a version 4 UUID it MAY instead run the startSession command and let the server generate the session ID.
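A sketch of local generation, assuming Node's crypto module and a hypothetical BSON binary wrapper (subtype 4 is required by the spec; the helper names are not):

import { randomUUID } from "node:crypto";

// Hypothetical shape for a BSON binary value; subtype 4 marks a UUID per the BSON spec.
interface BsonBinary { subtype: number; bytes: Uint8Array }
interface LocalSessionId { id: BsonBinary }

function generateSessionId(): LocalSessionId {
  // randomUUID() yields an RFC 4122 version 4 UUID; strip dashes and hex-decode to 16 bytes.
  const hex = randomUUID().replace(/-/g, "");
  const bytes = Uint8Array.from(hex.match(/.{2}/g)!.map((pair) => parseInt(pair, 16)));
  return { id: { subtype: 4, bytes } };
}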

MongoDatabase changes

All MongoDatabase methods that talk to the server MUST send a session ID with the command when connected to a deployment that supports sessions so that the server can associate the operation with a session ID.

New database methods that take an explicit session

All MongoDatabase methods that talk to the server SHOULD be overloaded to take an explicit session parameter. (See why is session an explicit parameter?.)

When overloading methods to take a session parameter, the session parameter SHOULD be the first parameter. If overloading is not possible for your language, it MAY be in a different position or MAY be embedded in an options structure.

Methods that have a session parameter MUST check that the session argument is not null and was created by the same MongoClient that this MongoDatabase came from and report an error if they do not match.

Existing database methods that start an implicit session

When an existing MongoDatabase method that does not take a session is called, the driver MUST behave as if a new ClientSession was started just for this one operation and ended immediately after this operation completes. The actual implementation will likely involve calling client.startSession, but that is not required by this spec. Regardless, please consult the startSession section to replicate the required steps for creating a session. The driver MUST NOT use the session if the checked out connection does not support sessions (see How to Tell Whether a Connection Supports Sessions) and, in all cases, MUST NOT consume a server session id until after the connection is checked out and session support is confirmed.
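A sketch of that sequencing for a single helper call (every type and helper name below is illustrative; only the ordering is significant):

interface ServerSession { sessionId: object }
interface ServerSessionPool { acquire(): ServerSession; release(session: ServerSession): void }
interface Connection { supportsSessions(): boolean; command(cmd: Record<string, unknown>): Promise<object> }
interface MongoClientInternals {
  sessionPool: ServerSessionPool;
  checkOutConnection(): Promise<Connection>;
  checkInConnection(conn: Connection): Promise<void>;
}

async function runWithImplicitSession(
  client: MongoClientInternals,
  command: Record<string, unknown>
): Promise<object> {
  const connection = await client.checkOutConnection();
  let serverSession: ServerSession | undefined;
  try {
    let toSend: Record<string, unknown> = command;
    if (connection.supportsSessions()) {
      serverSession = client.sessionPool.acquire();            // consumed only after checkout succeeds
      toSend = { ...command, lsid: serverSession.sessionId };   // never mutate the user's document
    }
    return await connection.command(toSend);
  } finally {
    await client.checkInConnection(connection);                 // check the connection in first...
    if (serverSession) {
      client.sessionPool.release(serverSession);                // ...then "end" the implicit session
    }
  }
}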

MongoCollection changes

All MongoCollection methods that talk to the server MUST send a session ID with the command when connected to a deployment that supports sessions so that the server can associate the operation with a session ID.

New collection methods that take an explicit session

All MongoCollection methods that talk to the server, with the exception of estimatedDocumentCount, SHOULD be overloaded to take an explicit session parameter. (See why is session an explicit parameter?.)

When overloading methods to take a session parameter, the session parameter SHOULD be the first parameter. If overloading is not possible for your language, it MAY be in a different position or MAY be embedded in an options structure.

Methods that have a session parameter MUST check that the session argument is not null and was created by the same MongoClient that this MongoCollection came from and report an error if they do not match.

The estimatedDocumentCount helper does not support an explicit session parameter. The underlying command, count, is not supported in a transaction, so supporting an explicit session would likely confuse application developers. The helper returns an estimate of the documents in a collection and causal consistency is unlikely to improve the accuracy of the estimate.

Existing collection methods that start an implicit session

When an existing MongoCollection method that does not take a session is called, the driver MUST behave as if a new ClientSession was started just for this one operation and ended immediately after this operation completes. The actual implementation will likely involve calling client.startSession, but that is not required by this spec. Regardless, please consult the startSession section to replicate the required steps for creating a session. The driver MUST NOT use the session if the checked out connection does not support sessions (see How to Tell Whether a Connection Supports Sessions) and, in all cases, MUST NOT consume a server session id until after the connection is checked out and session support is confirmed.

Sessions and Cursors

When an operation using a session returns a cursor, all subsequent GETMORE commands for that cursor MUST be run using the same session ID.

If a driver decides to run a KILLCURSORS command on the cursor, it also MAY be run using the same session ID. See the Exceptions below for when it is permissible to not include a session ID in a KILLCURSORS command.

Sessions and Connections

To reduce the number of ServerSessions created, the driver MUST only obtain an implicit session's ServerSession after it successfully checks out a connection. A driver SHOULD NOT attempt to release the acquired session before connection check in.

Explicit sessions MAY be changed to allocate a server session similarly.

How to Tell Whether a Connection Supports Sessions

A driver can determine whether a connection supports sessions by checking whether the logicalSessionTimeoutMinutes property of the establishing handshake response has a value or not. If it has a value, sessions are supported.

In the case of an explicit session, if sessions are not supported, the driver MUST raise an error. In the case of an implicit session, if sessions are not supported, the driver MUST ignore the session.
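A sketch of that check (logicalSessionTimeoutMinutes is the spec-defined field; the surrounding types and error text are illustrative):

// The handshake response reports logicalSessionTimeoutMinutes when sessions are supported.
interface HandshakeResponse { logicalSessionTimeoutMinutes?: number }

function connectionSupportsSessions(handshake: HandshakeResponse): boolean {
  return handshake.logicalSessionTimeoutMinutes != null;
}

// Explicit sessions error when unsupported; implicit sessions are silently dropped.
function sessionUsable(handshake: HandshakeResponse, isExplicit: boolean): boolean {
  if (connectionSupportsSessions(handshake)) return true;
  if (isExplicit) throw new Error("Sessions are not supported by this deployment");
  return false;
}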

Possible race condition when checking for session support

There is a possible race condition that can happen between the time the driver checks whether sessions are supported and subsequently sends a command to the server:

  • The server might have supported sessions at the time the connection was first opened (and reported a value for logicalSessionTimeoutMinutes in the initial response to the handshake), but have subsequently been downgraded to not support sessions. The server does not close the socket in this scenario, so the driver will conclude that the server at the other end of this connection supports sessions.

There is nothing that the driver can do about this race condition, and the server will just return an error in this scenario.

Sending the session ID to the server on all commands

When connected to a server that supports sessions, a driver MUST append the session ID to every command it sends to the server (with the exceptions noted in the following section). It does this by adding a top level lsid field to the command sent to the server. A driver MUST do this without modifying any data supplied by the application (e.g. the command document passed to runCommand):

interface ExampleCommandWithLSID {
  foo: 1;
  lsid: SessionId;
}

Exceptions to sending the session ID to the server on all commands

There are some exceptions to the rule that a driver MUST append the session ID to every command it sends to the server.

When opening and authenticating a connection

A driver MUST NOT append a session ID to any command sent during the process of opening and authenticating a connection.

When monitoring the state of a deployment

A driver MAY omit a session ID in hello and legacy hello commands sent solely for the purposes of monitoring the state of a deployment.

When sending a parallelCollectionScan command

Sessions are designed for sequential operations and parallelCollectionScan is designed for parallel operation. Because these are fundamentally incompatible goals, drivers MUST NOT append a session ID to the parallelCollectionScan command, so that the resulting cursors have no associated session ID and thus can be used in parallel.

When sending a killCursors command

A driver MAY omit a session ID in killCursors commands for two reasons. First, killCursors is only ever sent to a particular server, so operation teams wouldn't need the lsid for cluster-wide killOp. An admin can manually kill the op with its operation id in the case that it is slow. Secondly, some drivers have a background cursor reaper to kill cursors that aren't exhausted and closed. Due to GC semantics, it can't use the same lsid for killCursors as was used for a cursor's find and getMore, so there's no point in using any lsid at all.

When multiple users are authenticated and the session is implicit

The driver MUST NOT send a session ID from an implicit session when multiple users are authenticated. If possible the driver MUST NOT start an implicit session when multiple users are authenticated. Alternatively, if the driver cannot determine whether multiple users are authenticated at the point in time that an implicit session is started, then the driver MUST ignore any implicit sessions that subsequently end up being used on a connection that has multiple users authenticated.

When using unacknowledged writes

A session ID MUST NOT be used simultaneously by more than one operation. Since drivers don't wait for a response for an unacknowledged write a driver would not know when the session ID could be reused. In theory a driver could use a new session ID for each unacknowledged write, but that would result in many orphaned sessions building up at the server.

Therefore drivers MUST NOT send a session ID with unacknowledged writes under any circumstances:

  • For unacknowledged writes with an explicit session, drivers SHOULD raise an error. If a driver allows users to provide an explicit session with an unacknowledged write (e.g. for backwards compatibility), the driver MUST NOT send the session ID.
  • For unacknowledged writes without an explicit session, drivers SHOULD NOT use an implicit session. If a driver creates an implicit session for unacknowledged writes without an explicit session, the driver MUST NOT send the session ID.

Drivers MUST document the behavior of unacknowledged writes for both explicit and implicit sessions.
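A sketch of the resulting decision logic for explicit sessions (the helper and its error text are illustrative, not mandated):

interface SessionLike { sessionId: object }

function validateSessionForUnacknowledgedWrite(explicitSession?: SessionLike): void {
  if (explicitSession) {
    // Drivers SHOULD raise an error; if the combination is allowed for backwards
    // compatibility, the session ID still must not be sent with the write.
    throw new Error("Cannot use a ClientSession with an unacknowledged write");
  }
  // No explicit session: skip creating an implicit session and send no lsid at all.
}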

When wrapping commands in a $query field

If the driver is wrapping the command in a $query field for non-OP_MSG messages in order to pass a readPreference to a mongos (see ReadPreference and Mongos), the driver SHOULD NOT add the lsid as a top-level field, and MUST add the lsid as a field of the $query:

// Wrapped command:
interface WrappedCommandExample {
  $query: {
    find: { foo: 1 }
  },
  $readPreference: {}
}

// Correct application of lsid
interface CorrectLSIDUsageExample {
  $query: {
    find: { foo: 1 },
    lsid: SessionId
  },
  $readPreference: {}
}

// Incorrect application of lsid
interface IncorrectLSIDUsageExample {
  $query: {
    find: { foo: 1 }
  },
  $readPreference: {},
  lsid: SessionId
}

Server Commands

startSession

The startSession server command has the following format:

interface StartSessionCommand {
  startSession: 1;
  $clusterTime?: ClusterTime;
}

The $clusterTime field should only be sent when gossipping the cluster time. See the section "Gossipping the cluster time" for information on $clusterTime.

The startSession command MUST be sent to the admin database.

The server response has the following format:

interface StartSessionResponse {
  ok: 1;
  id: BsonDocument;
}

In case of an error, the server response has the following format:

interface StartSessionError {
  ok: 0;
  errmsg: string;
  code: number;
}

When connected to a replica set the startSession command MUST be sent to the primary if the primary is available. The startSession command MAY be sent to a secondary if there is no primary available at the time the startSession command needs to be run.

Drivers SHOULD generate session IDs locally if possible instead of running the startSession command, since running the command requires a network round trip.

endSessions

The endSessions server command has the following format:

interface EndSessionCommand {
  endSessions: Array<SessionId>;
  $clusterTime?: ClusterTime;
}

The $clusterTime field should only be sent when gossipping the cluster time. See the section "Gossipping the cluster time" for information on $clusterTime.

The endSessions command MUST be sent to the admin database.

The server response has the following format:

interface EndSessionResponse {
  ok: 1;
}

In case of an error, the server response has the following format:

interface EndSessionError {
  ok: 0;
  errmsg: string;
  code: number;
}

Drivers MUST ignore any errors returned by the endSessions command.

The endSessions command MUST be run once when the MongoClient instance is shut down. If the number of sessions is very large the endSessions command SHOULD be run multiple times to end 10,000 sessions at a time (in order to avoid creating excessively large commands).

When connected to a sharded cluster the endSessions command can be sent to any mongos. When connected to a replica set the endSessions command MUST be sent to the primary if the primary is available, otherwise it MUST be sent to any available secondary.
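A sketch of the shutdown path under these rules (the pool shape and the admin-database command runner are assumptions, not part of the spec):

// Send endSessions for every pooled server session, in batches of 10,000, at client shutdown.
async function endPooledSessions(
  pooledSessionIds: object[],
  runAdminCommand: (command: object) => Promise<unknown>
): Promise<void> {
  for (let start = 0; start < pooledSessionIds.length; start += 10_000) {
    const batch = pooledSessionIds.slice(start, start + 10_000);
    try {
      await runAdminCommand({ endSessions: batch });
    } catch {
      // Any errors returned by endSessions are ignored by design.
    }
  }
}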

Server Session Pool

Conceptually, each ClientSession can be thought of as having a new corresponding ServerSession. However, starting a server session might require a round trip to the server (which can be avoided by generating the session ID locally) and ending a session requires a separate round trip to the server. Drivers can operate more efficiently and put less load on the server if they cache ServerSession instances for reuse. To this end drivers MUST implement a server session pool containing ServerSession instances available for reuse. A ServerSession pool MUST belong to a MongoClient instance and have the same lifetime as the MongoClient instance.

When a new implicit ClientSession is started it MUST NOT attempt to acquire a server session from the server session pool immediately. When a new explicit ClientSession is started it MAY attempt to acquire a server session from the server session pool immediately. See the algorithm below for the steps to follow when attempting to acquire a ServerSession from the server session pool.

Note that ServerSession instances acquired from the server session pool might have as little as one minute left before becoming stale and being discarded server side. Drivers MUST document that if an application waits more than one minute after calling startSession to perform operations with that session it risks getting errors due to the server session going stale before it was used.

A server session is considered stale by the server when it has not been used for a certain amount of time. The default amount of time is 30 minutes, but this value is configurable on the server. Servers that support sessions will report this value in the logicalSessionTimeoutMinutes field of the reply to the hello and legacy hello commands. The smallest reported timeout is recorded in the logicalSessionTimeoutMinutes property of the TopologyDescription. See the Server Discovery And Monitoring specification for details.

When a ClientSession is ended it MUST return the server session to the server session pool. See the algorithm below for the steps to follow when returning a ServerSession instance to the server session pool.

The server session pool has no maximum size. The pool only shrinks when a server session is acquired for use or discarded.

When a MongoClient instance is closed the driver MUST proactively inform the server that the pooled server sessions will no longer be used by sending one or more endSessions commands to the server.

The server session pool is modeled as a double ended queue. The algorithms below require the ability to add and remove ServerSession instances from the front of the queue and to inspect and possibly remove ServerSession instances from the back of the queue. The front of the queue holds ServerSession instances that have been released recently and should be the first to be reused. The back of the queue holds ServerSession instances that have not been used recently and that potentially will be discarded if they are not used again before they expire.

An implicit session MUST be returned to the pool immediately following the completion of an operation. When an implicit session is associated with a cursor for use with getMore operations, the session MUST be returned to the pool immediately following a getMore operation that indicates that the cursor has been exhausted. In particular, it MUST NOT wait until all documents have been iterated by the application or until the application disposes of the cursor. For language runtimes that provide the ability to attach finalizers to objects that are run prior to garbage collection, the cursor class SHOULD return an implicit session to the pool in the finalizer if the cursor has not already been exhausted.

If a driver supports process forking, the session pool needs to be cleared on one side of the forked processes (just like sockets need to reconnect). Drivers MUST provide a way to clear the session pool without sending endSessions. Drivers MAY make this automatic when the process ID changes. If they do not, they MUST document how to clear the session pool wherever they document fork support. After clearing the session pool in this way, drivers MUST ensure that sessions already checked out are not returned to the new pool.

If a driver has a server session pool and a network error is encountered when executing any command with a ClientSession, the driver MUST mark the associated ServerSession as dirty. Dirty server sessions are discarded when returned to the server session pool. It is valid for a dirty session to be used for subsequent commands (e.g. an implicit retry attempt, a later command in a bulk write, or a later operation on an explicit session); however, it MUST remain dirty for the remainder of its lifetime regardless of whether later commands succeed.

Algorithm to acquire a ServerSession instance from the server session pool

  1. If the server session pool is empty, create a new ServerSession and use it.
  2. Otherwise remove a ServerSession from the front of the queue and examine it:
    • If the driver is in load balancer mode, use this ServerSession.
    • If it has at least one minute left before becoming stale, use this ServerSession.
    • If it has less than one minute left before becoming stale, discard it (let it be garbage collected) and return to step 1.

See the Load Balancer Specification for details on session expiration.

Algorithm to return a ServerSession instance to the server session pool

  1. Before returning a server session to the pool, a driver MUST first check the server session pool for server sessions at the back of the queue that are about to expire (meaning they will expire in less than one minute). A driver MUST stop checking server sessions once it encounters a server session that is not about to expire. Any server sessions found that are about to expire are removed from the end of the queue and discarded (or allowed to be garbage collected).
  2. Then examine the server session that is being returned to the pool and:
    • If this session is marked dirty (i.e. it was involved in a network error), discard it (let it be garbage collected).
    • If it will expire in less than one minute, discard it (let it be garbage collected).
    • If it won't expire for at least one minute, add it to the front of the queue.
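A combined sketch of both algorithms above, modeled as a deque with the front at index 0 (the expiry math, the dirty-flag handling, and all names are illustrative; newServerSession stands in for the local session ID generation described earlier):

interface PooledServerSession { sessionId: object; lastUse: number; dirty: boolean }

declare function newServerSession(): PooledServerSession; // e.g. via a locally generated session ID

class ServerSessionPool {
  private sessions: PooledServerSession[] = [];            // index 0 = front, end = back
  constructor(private timeoutMinutes: number, private loadBalanced = false) {}

  private aboutToExpire(session: PooledServerSession): boolean {
    const idleMinutes = (Date.now() - session.lastUse) / 60_000;
    return idleMinutes > this.timeoutMinutes - 1;          // under one minute of life left
  }

  acquire(): PooledServerSession {
    while (this.sessions.length > 0) {
      const candidate = this.sessions.shift()!;            // take from the front
      if (this.loadBalanced || !this.aboutToExpire(candidate)) return candidate;
      // otherwise drop the stale candidate and keep looking
    }
    return newServerSession();
  }

  release(session: PooledServerSession): void {
    // Prune sessions at the back of the queue that are about to expire.
    while (this.sessions.length > 0 && this.aboutToExpire(this.sessions[this.sessions.length - 1])) {
      this.sessions.pop();
    }
    if (session.dirty || this.aboutToExpire(session)) return; // discard
    this.sessions.unshift(session);                           // back to the front for reuse
  }
}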

Gossipping the cluster time

Drivers MUST gossip the cluster time when connected to a deployment that uses cluster times.

Gossipping the cluster time is a process in which the driver participates in distributing the logical cluster time in a deployment. Drivers learn the current cluster time (from a particular server's perspective) in responses they receive from servers. Drivers in turn forward the highest cluster time they have seen so far to any server they subsequently send commands to.

A driver detects that it MUST participate in gossipping the cluster time when it sees a $clusterTime in a response received from a server.

Receiving the current cluster time

Drivers MUST examine all responses to server commands to see if they contain a top level field named $clusterTime formatted as follows:

interface ClusterTime {
  clusterTime: Timestamp;
  signature: {
    hash: Binary;
    keyId: Int64;
  };
}

interface AnyServerResponse {
  // ... other properties ...
  $clusterTime: ClusterTime;
}

Whenever a driver receives a cluster time from a server it MUST compare it to the current highest seen cluster time for the deployment. If the new cluster time is higher than the highest seen cluster time it MUST become the new highest seen cluster time. Two cluster times are compared using only the BsonTimestamp value of the clusterTime embedded field (be sure to include both the timestamp and the increment of the BsonTimestamp in the comparison). The signature field does not participate in the comparison.
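A sketch of that comparison and update, with the BsonTimestamp decomposed into its seconds and increment components (the type names are illustrative):

// Compare cluster times using only the embedded timestamp (seconds, then increment);
// the signature field never participates in the comparison.
interface BsonTimestampLike { t: number; i: number }
interface ClusterTimeDocument { clusterTime: BsonTimestampLike; signature: { hash: unknown; keyId: unknown } }

function timestampIsGreater(a: BsonTimestampLike, b: BsonTimestampLike): boolean {
  return a.t > b.t || (a.t === b.t && a.i > b.i);
}

function observeServerResponse(
  client: { clusterTime: ClusterTimeDocument | null },
  response: { $clusterTime?: ClusterTimeDocument }
): void {
  const received = response.$clusterTime;
  if (received === undefined) return;
  if (client.clusterTime === null || timestampIsGreater(received.clusterTime, client.clusterTime.clusterTime)) {
    client.clusterTime = received;                        // new highest seen cluster time
  }
}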

Sending the highest seen cluster time

Whenever a driver sends a command to a server it MUST include the highest seen cluster time in a top level field called $clusterTime, in the same format as it was received in (but see Gossipping with mixed server versions below).

How to compute the $clusterTime to send to a server

When sending $clusterTime to the server the driver MUST send the greater of the clusterTime values from MongoClient and ClientSession. Normally a session's clusterTime will be less than or equal to the clusterTime in MongoClient, but it could be greater than the clusterTime in MongoClient if advanceClusterTime was called with a clusterTime that came from somewhere else.

A driver MUST NOT use the clusterTime of a ClientSession anywhere else except when executing an operation with this session. This rule protects the driver from the scenario where advanceClusterTime was called with an invalid clusterTime by limiting the resulting server errors to the one session. The clusterTime of a MongoClient MUST NOT be advanced by any clusterTime other than a $clusterTime received directly from a server.

The safe way to compute the $clusterTime to send to a server is:

  1. When the ClientSession is first started its clusterTime is set to null.

  2. When the driver sends $clusterTime to the server it should send the greater of the ClientSession clusterTime and the MongoClient clusterTime (either one could be null).

  3. When the driver receives a $clusterTime from the server it should advance both the ClientSession and the MongoClient clusterTime. The clusterTime of a ClientSession can also be advanced by calling advanceClusterTime.

This sequence ensures that if the clusterTime of a ClientSession is invalid only that one session will be affected. The MongoClient clusterTime is only updated with $clusterTime values known to be valid because they were received directly from a server.
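A sketch of that computation, reusing the same timestamp-only comparison (the names are illustrative):

interface TimestampLike { t: number; i: number }
interface ClusterTimeLike { clusterTime: TimestampLike; signature: unknown }

// Pick the greater of the MongoClient's and the ClientSession's cluster times; either may be null.
function greaterClusterTime(a: ClusterTimeLike | null, b: ClusterTimeLike | null): ClusterTimeLike | null {
  if (a === null) return b;
  if (b === null) return a;
  const aGreater =
    a.clusterTime.t > b.clusterTime.t ||
    (a.clusterTime.t === b.clusterTime.t && a.clusterTime.i > b.clusterTime.i);
  return aGreater ? a : b;
}

// clusterTimeToSend = greaterClusterTime(client.clusterTime, session.clusterTime)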

Tracking the highest seen cluster time does not require checking the deployment topology or the server version

Drivers do not need to check the deployment topology or the server version they are connected to in order to track the highest seen $clusterTime. They simply need to check for the presence of the $clusterTime field in responses received from servers.

Gossipping with mixed server versions

Drivers MUST check that the server they are sending a command to supports $clusterTime before adding $clusterTime to the command. A server supports $clusterTime when the maxWireVersion >= 6.

This supports the (presumably short lived) scenario where not all servers have been upgraded to 3.6.

Test Plan

See the README for tests.

Motivation

Drivers currently have no concept of a session. The driver API needs to be extended to support sessions.

Design Rationale

The goal is to modify the driver API in such a way that existing programs that don't use sessions continue to compile and run correctly. This goal is met by defining new methods (or overloads) that take a session parameter. An application does not need to be modified unless it wants to take advantage of the new features supported by sessions.

Backwards Compatibility

The API changes to support sessions extend the existing API but do not introduce any backward breaking changes. Existing programs that don't use sessions continue to compile and run correctly.

Reference Implementation (always required)

A reference implementation must be completed before any spec is given status "Final", but it need not be completed before the spec is "Accepted". While there is merit to the approach of reaching consensus on the specification and rationale before writing code, the principle of "rough consensus and running code" is still useful when it comes to resolving many discussions of spec details. A final reference implementation must include test code and documentation.

The C and C# drivers will do initial POC implementations.

Future work (optional)

Use this section to discuss any possible work for a future spec. This could cover issues where no consensus could be reached but that don't block this spec, changes that were rejected due to unclear use cases, etc.

Open questions

Q&A

Why do we say drivers MUST NOT attempt to detect unsafe multi-threaded or multi-process use of ClientSession?

Because doing so would provide an illusion of safety: it doesn't make these instances thread safe, and even if no misuse is detected while testing an application, that doesn't prove anything. The application might still be using the instances in a thread-unsafe way and just didn't happen to do so during a test run. Finally, checking for misuse would add overhead without providing any clear benefit.

Why is session an explicit parameter?

A previous draft proposed that ClientSession would be a MongoClient-like object added to the object hierarchy:

session = client.startSession(...)
database = session.getDatabase(...) // database is associated with session
collection = database.getCollection(...) // collection is associated with session
// operations on collection implicitly use session
collection.insertOne({})
session.endSession()

The central feature of this design is that a MongoCollection (or database, or perhaps a GridFS object) is associated with a session, which is then an implied parameter to any operations executed using that MongoCollection.

This API was rejected, with the justification that a ClientSession does not naturally belong to the state of a MongoCollection. MongoCollection has up to now been a stable long-lived object that could be widely shared, and in most drivers it is thread safe. Once we associate a ClientSession with it, the MongoCollection object becomes short-lived and is no longer thread safe. It is a bad sign that MongoCollection's thread safety and lifetime vary depending on how its parent MongoDatabase is created.

Instead, we require users to pass session as a parameter to each function:

session = client.startSession(...)
database = client.getDatabase(...)
collection = database.getCollection(...)
// users must explicitly pass session to operations
collection.insertOne(session, {})
session.endSession()

Why does a network error cause the ServerSession to be discarded from the pool?

When a network error is encountered when executing an operation with a ClientSession, the operation may be left running on the server. Re-using this ServerSession can lead to parallel operations which violates the rule that a session must be used sequentially. This results in multiple problems:

  1. killSessions to end an earlier operation would surprisingly also end a later operation.
  2. An otherwise unrelated operation that just happens to use that same server session will potentially block waiting for the previous operation to complete. For example, a transactional write will block a subsequent transactional write.

Why do automatic retry attempts re-use a dirty implicit session?

The retryable writes spec requires that both the original and retry attempt use the same server session. The server will block the retry attempt until the initial attempt completes at which point the retry attempt will continue executing.

For retryable reads that use an implicit session, drivers could choose to use a new server session for the retry attempt however this would lose the information that these two reads are related.

Why don't drivers run the endSessions command to cleanup dirty server sessions?

Drivers do not run the endSessions command when discarding a dirty server session because disconnects should be relatively rare and the server won't normally accumulate a large number of abandoned dirty sessions. Any abandoned sessions will be automatically cleaned up by the server after the configured logicalSessionTimeoutMinutes.

Why must drivers wait to consume a server session until after a connection is checked out?

The problem is that when the number of concurrent application requests is larger than the number of available connections, the driver may generate many more implicit sessions than connections. For example, with maxPoolSize=1 and 100 threads, 100 implicit sessions may be created. This increases the load on the server since session state is cached in memory. In the worst case this kind of workload can hit the session limit and trigger TooManyLogicalSessions.

In order to address this, drivers MUST NOT consume a server session id until after the connection is checked out. This change will limit the number of "in use" server sessions to no greater than an application's maxPoolSize.

The language here is specific about obtaining a server session as opposed to creating the implicit session, to permit drivers to take an implementation approach where the implicit session creation logic largely remains unchanged. Implicit session creation can be left as is, as long as the underlying server session isn't allocated until it is needed, which is after connection checkout succeeds and it is known that the session will be used.

It is still possible that, via explicit sessions or cursors (which hold on to the session they started with), a driver could over-allocate sessions. But those scenarios are exceptional and outside the scope of this spec.

Why should drivers NOT attempt to release a serverSession before checking back in the operation's connection?

There are a variety of cases, such as retryable operations or cursor-creating operations, where a ServerSession must remain acquired by the ClientSession after an operation is attempted. Attempting to account for all of these scenarios introduces risks that do not justify the marginal benefit of strictly limiting ServerSession allocation.

Changelog

  • 2024-05-08: Migrated from reStructuredText to Markdown.
  • 2017-09-13: If causalConsistency option is omitted assume true
  • 2017-09-16: Omit session ID when opening and authenticating a connection
  • 2017-09-18: Drivers MUST gossip the cluster time when they see a $clusterTime.
  • 2017-09-19: How to safely use initialClusterTime
  • 2017-09-29: Add an exception to the rule that KILLCURSORS commands always require a session id
  • 2017-10-03: startSession and endSessions commands MUST be sent to the admin database
  • 2017-10-03: Fix format of endSessions command
  • 2017-10-04: Added advanceClusterTime
  • 2017-10-06: Added descriptions of explicit and implicit sessions
  • 2017-10-17: Implicit sessions MUST NOT be used when multiple users authenticated
  • 2017-10-19: Possible race conditions when checking whether a deployment supports sessions
  • 2017-11-21: Drivers MUST NOT send a session ID for unacknowledged writes
  • 2018-01-10: Note that MongoClient must retain highest clusterTime
  • 2018-01-10: Update test plan for drivers without APM
  • 2018-01-11: Clarify that sessions require replica sets or sharded clusters
  • 2018-02-20: Add implicit/explicit session tests
  • 2018-02-20: Drivers SHOULD error if unacknowledged writes are used with sessions
  • 2018-05-23: Drivers MUST not use session ID with parallelCollectionScan
  • 2018-06-07: Document that estimatedDocumentCount does not support explicit sessions
  • 2018-07-19: Justify why session must be an explicit parameter to each function
  • 2018-10-11: Session pools must be cleared in child process after fork
  • 2019-05-15: A ServerSession that is involved in a network error MUST be discarded
  • 2019-10-22: Drivers may defer checking if a deployment supports sessions until the first
  • 2020-05-26: Simplify logic for determining sessions support
  • 2021-04-08: Updated to use hello and legacy hello
  • 2021-04-08: Adding in behaviour for load balancer mode.
  • 2022-01-28: Implicit sessions MUST obtain server session after connection checkout succeeds
  • 2022-03-24: ServerSession Pooling is required and clarifies session acquisition bounding
  • 2022-06-13: Move prose tests to test README and apply new ordering
  • 2022-10-05: Remove spec front matter
  • 2023-02-24: Defer checking for session support until after connection checkout

Causal Consistency Specification

  • Status: Accepted
  • Minimum Server Version: 3.6

Abstract

Version 3.6 of the server introduces support for causal consistency. This spec builds upon the Sessions Specification to define how an application requests causal consistency and how a driver interacts with the server to implement causal consistency.

Definitions

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Terms

Causal consistency
A property that guarantees that an application can read its own writes and that a later read will never observe a version of the data that is older than an earlier read.

ClientSession
The driver object representing a client session and the operations that can be performed on it.

Cluster time
The current cluster time. The server reports its view of the current cluster time in the $clusterTime field in responses from the server and the driver participates in distributing the current cluster time to all nodes (called "gossipping the cluster time") by sending the highest $clusterTime it has seen so far in messages it sends to mongos servers. The current cluster time is a logical time, but is digitally signed to prevent malicious clients from propagating invalid cluster times. Cluster time is only used in replica sets and sharded clusters.

Logical time
A time-like quantity that can be used to determine the order in which events occurred. Logical time is represented as a BsonTimestamp.

MongoClient
The root object of a driver's API. MAY be named differently in some drivers.

MongoCollection
The driver object representing a collection and the operations that can be performed on it. MAY be named differently in some drivers.

MongoDatabase
The driver object representing a database and the operations that can be performed on it. MAY be named differently in some drivers.

Operation time
The logical time at which an operation occurred. The server reports the operation time in the response to all commands, including error responses. The operation time by definition is always less than or equal to the cluster time. Operation times are tracked on a per ClientSession basis, so the operationTime of each ClientSession corresponds to the time of the last operation performed in that particular ClientSession.

ServerSession
The driver object representing a server session.

Session
A session is an abstract concept that represents a set of sequential operations executed by an application that are related in some way. This specification defines how sessions are used to implement causal consistency.

Unacknowledged writes
Unacknowledged writes are write operations that are sent to the server without waiting for a reply acknowledging the write. See the "Unacknowledged Writes" section below for information on how unacknowledged writes interact with causal consistency.

Specification

An application requests causal consistency by creating a ClientSession with options that specify that causal consistency is desired. An application then passes the session as an argument to methods in the MongoDatabase and MongoCollection classes. Any operations performed against that session will then be causally consistent.

Naming variations

This specification defines names for new methods and types. To the extent possible you SHOULD use these names in your driver. However, where your driver's and/or language's naming conventions differ you SHOULD continue to use them instead. For example, you might use StartSession or start_session instead of startSession.

High level summary of the API changes for causal consistency

Causal consistency is built on top of client sessions.

Applications will start a new client session for causal consistency like this:

options = new SessionOptions(causalConsistency = true);
session = client.startSession(options);

All read operations performed using this session will now be causally consistent.

If no value is provided for causalConsistency and snapshot reads are not requested, a value of true is implied. See the causalConsistency section.

MongoClient changes

There are no API changes to MongoClient to support causal consistency. Applications indicate whether they want causal consistency by setting the causalConsistency field in the options passed to the startSession method.

SessionOptions changes

SessionOptions change summary

class SessionOptions {
    Optional<bool> causalConsistency;

    // other options defined by other specs
}

In order to support causal consistency a new property named causalConsistency is added to SessionOptions. Applications set causalConsistency when starting a client session to indicate whether they want causal consistency. All read operations performed using that client session are then causally consistent.

Each new member is documented below.

causalConsistency

Applications set causalConsistency when starting a session to indicate whether they want causal consistency.

Note that the causalConsistency property is optional. For explicit sessions, the default value of this property is not supplied. If no value is supplied for causalConsistency the value will be inherited. Currently it is inherited from the global default which is defined to be true. In the future it might be inherited from client settings. For implicit sessions, the value of this property MUST be set to false in order to avoid potential conflicts with an operation's read concern level.

Causal consistency is provided at the session level by tracking the clusterTime and operationTime for each session. In some cases an application may wish subsequent operations in one session to be causally consistent with operations that were executed in a different session. In that case the application can call the advanceClusterTime and advanceOperationTime methods in ClientSession to advance the clusterTime and operationTime of one session to the clusterTime and operationTime from another session.
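For example, an application that wants reads in one session to observe a write made in another session could copy the times across, shown here in the same pseudocode style used elsewhere in this document:

sessionA = client.startSession(causalConsistency = true)
collection.insertOne(sessionA, ...)

sessionB = client.startSession(causalConsistency = true)
sessionB.advanceClusterTime(sessionA.clusterTime)       // copy the gossiped cluster time
sessionB.advanceOperationTime(sessionA.operationTime)   // copy the time of sessionA's last operation

// reads in sessionB are now causally consistent with the write performed in sessionA
document = collection.anyReadOperation(sessionB, ...)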

ClientSession changes

ClientSession changes summary

interface ClientSession {
    Optional<BsonTimestamp> operationTime;

    void advanceOperationTime(BsonTimestamp operationTime);

    // other members as defined in other specs
}

Each new member is documented below.

operationTime

This property returns the operation time of the most recent operation performed using this session. If no operations have been performed using this session the value will be null unless advanceOperationTime has been called. This value will also be null when the cluster does not report operation times.

advanceOperationTime

This method advances the operationTime for a session. If the new operationTime is greater than the session's current operationTime then the session's operationTime MUST be advanced to the new operationTime. If the new operationTime is less than or equal to the session's current operationTime then the session's operationTime MUST NOT be changed.

Drivers MUST NOT attempt to validate the supplied operationTime. While the server requires that operationTime be less than or equal to clusterTime we don't want to check that when advanceOperationTime is called. This allows an application to call advanceClusterTime and advanceOperationTime in any order, or perhaps to not call advanceClusterTime at all and let the clusterTime that is sent to the server be implied by the clusterTime in MongoClient.

MongoDatabase changes

There are no additional API changes to MongoDatabase beyond those specified in the Sessions Specification. All MongoDatabase methods that talk to the server have been overloaded to take a session parameter. If that session was started with causalConsistency = true then all operations using that session will be causally consistent.

MongoCollection changes

There are no additional API changes to MongoCollection beyond those specified in the Sessions Specification. All MongoCollection methods that talk to the server have been overloaded to take a session parameter. If that session was started with causalConsistency = true then all operations using that session will be causally consistent.

Server Commands

There are no new server commands related to causal consistency. Instead, causal consistency is implemented by:

  1. Saving the operationTime returned by 3.6+ servers for all operations in a property of the ClientSession object. The server reports the operationTime whether the operation succeeded or not and drivers MUST save the operationTime in the ClientSession whether the operation succeeded or not.
  2. Passing that operationTime in the afterClusterTime field of the readConcern field for subsequent causally consistent read operations (for all commands that support a readConcern)
  3. Gossiping clusterTime (described in the Driver Session Specification)

Server Command Responses

To support causal consistency the server returns the operationTime in responses it sends to the driver (for both read and write operations).

{
    ok : 1 or 0,
    ... // the rest of the command reply
    operationTime : <BsonTimestamp>
    $clusterTime : <BsonDocument> // only in deployments that support cluster times
}

The operationTime MUST be stored in the ClientSession to later be passed as the afterClusterTime field of the readConcern field in subsequent read operations. The operationTime is returned whether the command succeeded or not and MUST be stored in either case.

Drivers MUST examine all responses from the server for the presence of an operationTime field and store the value in the ClientSession.

When connected to a standalone node command replies do not include an operationTime field. All operations against a standalone node are causally consistent automatically because there is only one node.

When connected to a deployment that supports cluster times the command response also includes a field called $clusterTime that drivers MUST use to gossip the cluster time. See the Sessions Specification for details.

Causally consistent read commands

For causal consistency the driver MUST send the operationTime saved in the ClientSession as the value of the afterClusterTime field of the readConcern field:

{
    find : <string>, // or other read command
    ... // the rest of the command parameters
    readConcern :
    {
        level : ..., // from the operation's read concern (only if specified)
        afterClusterTime : <BsonTimestamp>
    }
}

For the list of commands that support causally consistent reads, see the ReadConcern spec.

The driver MUST merge the ReadConcern specified for the operation with the operationTime from the ClientSession (which goes in the afterClusterTime field) to generate the combined readConcern to send to the server. If the level property of the read concern for the operation is null then the driver MUST NOT include a level field alongside the afterClusterTime of the readConcern value sent to the server. Drivers MUST NOT attempt to verify whether the server supports causally consistent reads or not for a given read concern level. The server will return an error if a given level does not support causal consistency.

The Read and Write Concern specification states that when a user has not specified a ReadConcern or has specified the server's default ReadConcern, drivers MUST omit the ReadConcern parameter when sending the command. For causally consistent reads this requirement is modified: when the ReadConcern parameter would normally be omitted, drivers MUST still send a ReadConcern, because the readConcern document is how the afterClusterTime value is conveyed to the server.

The Read and Write Concern Specification states that drivers MUST NOT add a readConcern field to commands that are run using a generic runCommand method. The same is true for causal consistency, so commands that are run using runCommand MUST NOT have an afterClusterTime field added to them.

When executing a causally consistent read, the afterClusterTime field MUST be sent when connected to a deployment that supports cluster times, and MUST NOT be sent when connected to a deployment that does not support cluster times.
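The merging rules above might be implemented along these lines. This is a sketch: it assumes the operation's read concern exposes a level attribute, that the session's option and timestamp attribute names look as shown, and that the deployment's support for cluster times is already known.

def build_read_concern(operation_read_concern, session, cluster_times_supported: bool):
    """Return the readConcern document to send, or None to omit it entirely."""
    doc = {}
    if operation_read_concern is not None and operation_read_concern.level is not None:
        doc["level"] = operation_read_concern.level

    if (
        session is not None
        and session.options.causal_consistency
        and session.operation_time is not None
        and cluster_times_supported
    ):
        doc["afterClusterTime"] = session.operation_time

    # An empty document means the server default read concern: omit it.
    return doc or None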

Unacknowledged writes

The implementation of causal consistency relies on the operationTime returned by the server in the acknowledgement of a write. Since unacknowledged writes don't receive a response from the server (or don't wait for a response) the ClientSession's operationTime is not updated after an unacknowledged write. That means that a causally consistent read after an unacknowledged write cannot be causally consistent with the unacknowledged write. Rather than prohibiting unacknowledged writes in a causally consistent session we have decided to accept this limitation. Drivers MUST document that causally consistent reads are not causally consistent with unacknowledged writes.

Test Plan

Below is a list of test cases to write.

Note: some tests are only relevant to certain deployments. For the purpose of deciding which tests to run assume that any deployment that is version 3.6 or higher and is either a replica set or a sharded cluster supports cluster times.

  1. When a ClientSession is first created the operationTime has no value.
    • session = client.startSession()
    • assert session.operationTime has no value
  2. The first read in a causally consistent session must not send afterClusterTime to the server (because the operationTime has not yet been determined)
    • session = client.startSession(causalConsistency = true)
    • document = collection.anyReadOperation(session, ...)
    • capture the command sent to the server (using APM or other mechanism)
    • assert that the command does not have an afterClusterTime
  3. The first read or write on a ClientSession should update the operationTime of the ClientSession, even if there is an error.
    • skip this test if connected to a deployment that does not support cluster times
    • session = client.startSession() // with or without causal consistency
    • collection.anyReadOrWriteOperation(session, ...) // test with errors also if possible
    • capture the response sent from the server (using APM or other mechanism)
    • assert session.operationTime has the same value that is in the response from the server
  4. A findOne followed by any other read operation (test them all) should include the operationTime returned by the server for the first operation in the afterClusterTime parameter of the second operation
    • skip this test if connected to a deployment that does not support cluster times
    • session = client.startSession(causalConsistency = true)
    • collection.findOne(session, {})
    • operationTime = session.operationTime
    • collection.anyReadOperation(session, ...)
    • capture the command sent to the server (using APM or other mechanism)
    • assert that the command has an afterClusterTime field with a value of operationTime
  5. Any write operation (test them all) followed by a findOne operation should include the operationTime of the first operation in the afterClusterTime parameter of the second operation, including the case where the first operation returned an error.
    • skip this test if connected to a deployment that does not support cluster times
    • session = client.startSession(causalConsistency = true)
    • collection.anyWriteOperation(session, ...) // test with errors also where possible
    • operationTime = session.operationTime
    • collection.findOne(session, {})
    • capture the command sent to the server (using APM or other mechanism)
    • assert that the command has an afterClusterTime field with a value of operationTime
  6. A read operation in a ClientSession that is not causally consistent should not include the afterClusterTime parameter in the command sent to the server.
    • skip this test if connected to a deployment that does not support cluster times
    • session = client.startSession(causalConsistency = false)
    • collection.anyReadOperation(session, {})
    • operationTime = session.operationTime
    • capture the command sent to the server (using APM or other mechanism)
    • assert that the command does not have an afterClusterTime field
  7. A read operation in a causally consistent session against a deployment that does not support cluster times does not include the afterClusterTime parameter in the command sent to the server.
    • skip this test if connected to a deployment that does support cluster times
    • session = client.startSession(causalConsistency = true)
    • collection.anyReadOperation(session, {})
    • capture the command sent to the server (using APM or other mechanism)
    • assert that the command does not have an afterClusterTime field
  8. When using the default server ReadConcern the readConcern parameter in the command sent to the server should not include a level field.
    • skip this test if connected to a deployment that does not support cluster times
    • session = client.startSession(causalConsistency = true)
    • configure collection to use default server ReadConcern
    • collection.findOne(session, {})
    • operationTime = session.operationTime
    • collection.anyReadOperation(session, ...)
    • capture the command sent to the server (using APM or other mechanism)
    • assert that the command does not have a level field
    • assert that the command has an afterClusterTime field with a value of operationTime
  9. When using a custom ReadConcern the readConcern field in the command sent to the server should be a merger of the ReadConcern value and the afterClusterTime field.
    • skip this test if connected to a deployment that does not support cluster times
    • session = client.startSession(causalConsistency = true)
    • configure collection to use a custom ReadConcern
    • collection.findOne(session, {})
    • operationTime = session.operationTime
    • collection.anyReadOperation(session, ...)
    • capture the command sent to the server (using APM or other mechanism)
    • assert that the command has a level field with a value matching the custom readConcern
    • assert that the command has an afterClusterTime field with a value of operationTime
  10. Removed
  11. When connected to a deployment that does not support cluster times messages sent to the server should not include $clusterTime.
    • skip this test when connected to a deployment that does support cluster times
    • document = collection.findOne({})
    • capture the command sent to the server
    • assert that the command does not include a $clusterTime field
  12. When connected to a deployment that does support cluster times messages sent to the server should include $clusterTime.
    • skip this test when connected to a deployment that does not support cluster times
    • document = collection.findOne({})
    • capture the command sent to the server
    • assert that the command includes a $clusterTime field
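For example, prose test 4 above could be implemented roughly as follows in a Python driver with command monitoring; the command_listener fixture is a stand-in for whatever APM mechanism the driver provides:

def test_second_read_sends_after_cluster_time(client, collection, command_listener):
    # Assumes the deployment supports cluster times; otherwise skip.
    with client.start_session(causal_consistency=True) as session:
        collection.find_one({}, session=session)
        operation_time = session.operation_time

        collection.find_one({}, session=session)

        started = command_listener.started_events[-1]
        read_concern = started.command.get("readConcern", {})
        assert read_concern.get("afterClusterTime") == operation_time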

Motivation

To support causal consistency, which is only available with server version 3.6 or newer.

Design Rationale

The goal is to modify the driver API as little as possible so that existing programs that don't need causal consistency don't have to be changed. This goal is met by defining a SessionOptions field that applications use to start a ClientSession that can be used for causal consistency. Any operations performed with such a session are then causally consistent.

The operationTime is tracked on a per ClientSession basis. This allows each ClientSession to have an operationTime that is sufficiently new to guarantee causal consistency for that session, but no newer. Using an operationTime that is newer than necessary can cause reads to block longer than necessary when sent to a lagging secondary. The goal is to block for just long enough to guarantee causal consistency and no longer.

Backwards Compatibility

The API changes to support sessions extend the existing API but do not introduce any backward breaking changes. Existing programs that don't use causal consistency continue to compile and run correctly.

Reference Implementation

A reference implementation must be completed before any spec is given status "Final", but it need not be completed before the spec is "Accepted". While there is merit to the approach of reaching consensus on the specification and rationale before writing code, the principle of "rough consensus and running code" is still useful when it comes to resolving many discussions of spec details. A final reference implementation must include test code and documentation.

Q&A

Changelog

  • 2024-02-08: Migrated from reStructuredText to Markdown.

  • 2022-11-11: Require causalConsistency=false for implicit sessions.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2022-01-28: Fix formatting for prose tests

  • 2022-01-22: Remove outdated prose test #10

  • 2021-06-26: Default value for causalConsistency is influenced by snapshot reads

  • 2017-11-17: Added link to ReadConcern spec which lists commands that support readConcern

  • 2017-10-06: advanceOperationTime MUST NOT validate operationTime

  • 2017-10-05: How to handle default read concern

  • 2017-10-04: Added advanceOperationTime

  • 2017-09-28: Remove remaining references to collections being associated with
    sessions. Update spec to reflect that replica sets use $clusterTime also now.

  • 2017-09-13: Renamed "causally consistent reads" to "causal consistency". If no
    value is supplied for causallyConsistent assume true.

Snapshot Reads Specification

  • Status: Accepted
  • Minimum Server Version: 5.0

Abstract

Version 5.0 of the server introduces support for read concern level "snapshot" (non-speculative) for read commands outside of transactions, including on secondaries. This spec builds upon the Sessions Specification to define how an application requests "snapshot" level read concern and how a driver interacts with the server to implement snapshot reads.

Definitions

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Terms

ClientSession
The driver object representing a client session and the operations that can be performed on it.

MongoClient
The root object of a driver's API. MAY be named differently in some drivers.

MongoCollection
The driver object representing a collection and the operations that can be performed on it. MAY be named differently in some drivers.

MongoDatabase
The driver object representing a database and the operations that can be performed on it. MAY be named differently in some drivers.

ServerSession
The driver object representing a server session.

Session
A session is an abstract concept that represents a set of sequential operations executed by an application that are related in some way. This specification defines how sessions are used to implement snapshot reads.

Snapshot reads
Reads with read concern level snapshot that occur outside of transactions on both the primary and secondary nodes, including in sharded clusters. Snapshot reads are majority committed reads.

Snapshot timestamp
The snapshot timestamp represents the timestamp of the first supported read operation (i.e. find/aggregate/distinct) in the session. The server creates a cursor in response to a snapshot find/aggregate command and reports atClusterTime within the cursor field in the response. For the distinct command the server adds a top-level atClusterTime field to the response. The atClusterTime field represents the timestamp of the read and is guaranteed to be majority committed.

Specification

An application requests snapshot reads by creating a ClientSession with options that specify that snapshot reads are desired. An application then passes the session as an argument to methods in the MongoDatabase and MongoCollection classes. Read operations (find/aggregate/distinct) performed against that session will be read from the same snapshot.

High level summary of the API changes for snapshot reads

Snapshot reads are built on top of client sessions.

Applications will start a new client session for snapshot reads like this:

options = new SessionOptions(snapshot = true);
session = client.startSession(options);

All read operations performed using this session will be read from the same snapshot.

If no value is provided for snapshot a value of false is implied. There are no MongoDatabase, MongoClient, or MongoCollection API changes.

SessionOptions changes

SessionOptions change summary

class SessionOptions {
    Optional<bool> snapshot;

    // other options defined by other specs
}

In order to support snapshot reads a new property named snapshot is added to SessionOptions. Applications set snapshot when starting a client session to indicate whether they want snapshot reads. All read operations performed using that client session will share the same snapshot.

Each new member is documented below.

snapshot

Applications set snapshot when starting a session to indicate whether they want snapshot reads.

Note that the snapshot property is optional. The default value of this property is false.

Snapshot reads and causal consistency are mutually exclusive: if snapshot is set to true, causalConsistency must be false. Clients MUST throw an error if both snapshot and causalConsistency are set to true. Snapshot reads are supported on both primaries and secondaries.

ClientSession changes

Transactions are not allowed with snapshot sessions. Calling session.startTransaction(options) on a snapshot session MUST raise an error.
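A driver might enforce both constraints above (mutual exclusion with causal consistency, and no transactions on snapshot sessions) with checks along these lines; the option attributes and error types shown are placeholders:

def validate_session_options(options) -> None:
    # snapshot=true and causalConsistency=true are mutually exclusive.
    if options.snapshot and options.causal_consistency:
        raise ValueError("snapshot reads do not support causal consistency")


def start_transaction(session, transaction_options=None) -> None:
    # Transactions are not allowed with snapshot sessions.
    if session.options.snapshot:
        raise ValueError("transactions are not supported in snapshot sessions")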

ReadConcern changes

snapshot added to ReadConcernLevel enumeration.

Server Commands

There are no new server commands related to snapshot reads. Instead, snapshot reads are implemented by:

  1. Saving the atClusterTime returned by 5.0+ servers for the first find/aggregate/distinct operation in a private snapshotTime property of the ClientSession object. Drivers MUST save atClusterTime in the ClientSession object.
  2. Passing that snapshotTime in the atClusterTime field of the readConcern field for subsequent snapshot read operations (i.e. find/aggregate/distinct commands).

Server Command Responses

For find/aggregate commands the server returns atClusterTime within the cursor field of the response.

{
    ok : 1 or 0,
    ... // the rest of the command reply
    cursor : {
        ... // the rest of the cursor reply
        atClusterTime : <BsonTimestamp>
    }
}

For distinct commands the server returns atClusterTime as a top-level field in the response.

{
    ok : 1 or 0,
    ... // the rest of the command reply
    atClusterTime : <BsonTimestamp>
}

The atClusterTime timestamp MUST be stored in the ClientSession to later be passed as the atClusterTime field of the readConcern with a snapshot level in subsequent read operations.
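A sketch of capturing the snapshot timestamp from the first read's reply, covering both the cursor-level (find/aggregate) and top-level (distinct) locations; the attribute names are illustrative:

def capture_snapshot_time(session, reply: dict) -> None:
    # Only the first find/aggregate/distinct in the session establishes the snapshot.
    if not session.options.snapshot or session.snapshot_time is not None:
        return

    # find/aggregate report atClusterTime inside the cursor document;
    # distinct reports it at the top level of the reply.
    at_cluster_time = reply.get("cursor", {}).get("atClusterTime") or reply.get("atClusterTime")
    if at_cluster_time is not None:
        session.snapshot_time = at_cluster_time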

Server Errors

  1. The server may reply to read commands with a SnapshotTooOld(239) error if the client's atClusterTime value is not available in the server's history.
  2. The server will return an InvalidOptions(72) error if both the atClusterTime and afterClusterTime options are set.
  3. The server will return an InvalidOptions(72) error if the command does not support readConcern.level "snapshot".

Snapshot Read Commands

For snapshot reads the driver MUST first obtain atClusterTime from the server response to a find/aggregate/distinct command sent with a readConcern of level "snapshot", and store it as snapshotTime in the ClientSession object.

{
    find : <string>, // or other read command
    ... // the rest of the command parameters
    readConcern :
    {
        level : "snapshot"
    }
}

For subsequent reads in the same session, the driver MUST send the snapshotTime saved in the ClientSession as the value of the atClusterTime field of the readConcern with a snapshot level:

{
    find : <string>, // or other read command
    ... // the rest of the command parameters
    readConcern :
    {
        level : "snapshot",
        atClusterTime : <BsonTimestamp>
    }
}

The commands that support snapshot reads are:

  1. find
  2. aggregate
  3. distinct

Sending readConcern to the server on all commands

Drivers MUST set the readConcern level and atClusterTime fields (as outlined above) on all commands in a snapshot session, including commands that do not accept a readConcern (e.g. insert, update). This ensures that the server will return an error for invalid operations, such as writes, within a session configured for snapshot reads.
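Putting the above together, the readConcern attached to every command in a snapshot session might be built as in this sketch (snapshot_time is an illustrative attribute name):

def snapshot_read_concern(session) -> dict:
    read_concern = {"level": "snapshot"}
    if session.snapshot_time is not None:
        # After the first read, pin subsequent commands to the same snapshot.
        read_concern["atClusterTime"] = session.snapshot_time
    return read_concern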

Requires MongoDB 5.0+

Snapshot reads require MongoDB 5.0+. When the connected server's maxWireVersion is less than 13, drivers MUST throw an exception with the message "Snapshot reads require MongoDB 5.0 or later".
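That check can be as simple as the following sketch, assuming the selected server's description exposes its maxWireVersion:

SNAPSHOT_MIN_WIRE_VERSION = 13  # corresponds to MongoDB 5.0


def check_snapshot_support(server_description) -> None:
    if server_description.max_wire_version < SNAPSHOT_MIN_WIRE_VERSION:
        raise RuntimeError("Snapshot reads require MongoDB 5.0 or later")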

Motivation

To support snapshot reads, which are only available with server version 5.0 or newer.

Design Rationale

The goal is to modify the driver API as little as possible so that existing programs that don't need snapshot reads don't have to be changed. This goal is met by defining a SessionOptions field that applications use to start a ClientSession that can be used for snapshot reads. An alternative, more explicit approach of obtaining atClusterTime from the cursor object and passing it to the read concern object was considered initially. A session-based approach was chosen because it aligns better with the existing API and requires minimal API changes. Future extensibility for snapshot reads is also best served by a session-based approach, since no further API changes will be required.

Backwards Compatibility

The API changes to support snapshot reads extend the existing API but do not introduce any backward breaking changes. Existing programs that don't use snapshot reads continue to compile and run correctly.

Reference Implementation

C# driver will provide the reference implementation. The corresponding ticket is CSHARP-3668.

Q&A

Changelog

  • 2024-05-08: Migrated from reStructuredText to Markdown.
  • 2021-06-15: Initial version.
  • 2021-06-28: Raise client side error on < 5.0.
  • 2021-06-29: Send readConcern with all snapshot session commands.
  • 2021-07-16: Grammar revisions. Change SHOULD to MUST for startTransaction error to comply with existing tests.
  • 2021-08-09: Updated client-side error spec tests to use correct syntax for test.expectEvents
  • 2022-10-05: Remove spec front matter

Driver Transactions Specification

  • Status: Accepted
  • Minimum Server Version: 4.0

Abstract

Version 4.0 of the server introduces multi-statement transactions. This spec builds upon the Driver Sessions Specification to define how an application uses transactions and how a driver interacts with the server to implement transactions.

The API for transactions must be specified to ensure that all drivers and the mongo shell are consistent with each other, and to provide a natural interface for application developers and DBAs who use multi-statement transactions.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

Terms

This specification uses the terms defined in the Driver Sessions Specification and Retryable Writes Specification. Additional terms are defined below.

Resource Management Block

Some programming languages have a concept of a code block that automatically frees resources when control exits the block. Such patterns are known as context managers, "using" statements, RAII, etc. This spec refers to such a pattern as a resource management block, regardless of the programming language.

Read operation

Any CRUD method that uses a read preference. The listIndexes, listCollections, listDatabases, and RunCommand helpers are also considered read operations.

Write operation

All operations that write and accept a ClientSession argument. All MongoClient, Database, and Collection helpers that write, including (but not limited to) creating, updating, or deleting databases, collections, indexes, and users. Aggregate (even with a write stage) is considered a read operation; see Aggregate with write stage is a read operation.

Retryable Error

An error considered retryable by the Retryable Writes Specification.

Command Error

A server response with ok:0. A server response with ok:1 and writeConcernError or writeErrors is not considered a command error.

Network Error

Any error or timeout that occurs while selecting a server or reading from or writing to a network socket.

Error Label

Starting in MongoDB 4.0, any command error may include a top level "errorLabels" field. The field contains an array of string error labels. Drivers may also add error labels to errors that they return.

Transient Transaction Error

Any command error that includes the "TransientTransactionError" error label in the "errorLabels" field. Any network error encountered running any command other than commitTransaction in a transaction. If a network error occurs while running the commitTransaction command then it is not known whether the transaction committed or not, and thus the "TransientTransactionError" label MUST NOT be added.

Naming variations

This specification defines names for new methods and types. To the extent possible you SHOULD use these names in your driver. However, where your driver's and/or language's naming conventions differ you SHOULD continue to use them instead. For example, you might use StartTransaction or start_transaction instead of startTransaction.

A non-exhaustive list of acceptable naming deviations are as follows:

  • Using "maxCommitTimeMS" as an example, .NET would use "MaxCommitTime" where it's type is a TimeSpan structure that includes units. However, calling it "MaximumCommitTime" would not be acceptable.

Transaction API

Transactions are built on top of Driver Session API. Applications can run a transaction like this:

with client.start_session() as s:
    s.start_transaction()
    collection_one.insert_one(doc_one, session=s)
    collection_two.insert_one(doc_two, session=s)
    s.commit_transaction()

This section is an overview of the public API for transactions:

class TransactionOptions {
    /**
     * The readConcern to use for this transaction.
     */
    Optional<ReadConcern> readConcern;

    /**
     * The writeConcern to use for this transaction.
     */
    Optional<WriteConcern> writeConcern;

    /**
     * The readPreference to use for this transaction.
     */
    Optional<ReadPreference> readPreference;

    /**
     * The maximum amount of time to allow a single commitTransaction
     * command to run.
     *
     * NOTE: This option is deprecated in favor of timeoutMS.
     */
    Optional<Int64> maxCommitTimeMS;
}

class SessionOptions {
    /**
     * The default TransactionOptions to use for transactions started
     * on this session.
     */
    Optional<TransactionOptions> defaultTransactionOptions;

    // Options defined in other specifications...
}

interface ClientSession {
    /**
     * Starts a new transaction with the given options. This session's
     * defaultTransactionOptions is used when options is omitted.
     * Raises an error if this session is already in a transaction.
     *
     * The return type MAY be non-void if necessary to participate in
     * the programming language's resource management block idiom. The
     * type of the returned object, if any, MUST NOT be named
     * Transaction, see "Why is there no Transaction object?"
     */
    void startTransaction(Optional<TransactionOptions> options);

    /**
     * Commits the currently active transaction in this session.
     * Raises an error if this session has no transaction.
     */
    void commitTransaction();

    /**
     * Aborts the currently active transaction in this session.
     * Raises an error if this session has no transaction.
     */
    void abortTransaction();

    /**
     * Aborts any currently active transaction and ends this session.
     * MUST NOT raise an error.
     */
    void endSession();

    // Methods defined in other specifications...
}

Each new member is documented below.

TransactionOptions

It is expected that the set of TransactionOptions will grow over time, so TransactionOptions MUST be designed such that future options can be added without breaking backward compatibility.

readConcern

The readConcern to use for the first command, and only the first command, in a transaction. Server transactions are started lazily with the first command using this session. For supported values see Behavior of the readConcern field.

Note that the readConcern property is optional. The default value is NULL. If readConcern is NULL the value will be inherited from this session’s defaultTransactionOptions. If defaultTransactionOptions itself is NULL or the readConcern in defaultTransactionOptions is NULL, the readConcern is inherited from the MongoClient associated with this session.

If the user supplies an explicit readConcern via a method option, the driver MUST raise an error with the message "Cannot set read concern after starting a transaction." See Users cannot pass readConcern or writeConcern to operations in transactions.

writeConcern

The writeConcern to use for the commitTransaction and abortTransaction commands. Note that the writeConcern property is optional. The default value is NULL. If writeConcern is NULL the value will be inherited from this session’s defaultTransactionOptions. If defaultTransactionOptions itself is NULL or the writeConcern in defaultTransactionOptions is NULL, the writeConcern is inherited from the MongoClient associated with this session.

If the writeConcern is not the server default, then Drivers MUST add the writeConcern to the commitTransaction and abortTransaction commands. Drivers MUST NOT add the transaction’s writeConcern or any writeConcern inherited from the collection, database, or client to any preceding commands in the transaction.

If the user supplies an explicit writeConcern via a method option, the driver MUST raise an error with the message "Cannot set write concern after starting a transaction." See Users cannot pass readConcern or writeConcern to operations in transactions.

Drivers MUST raise an error if the user provides or if defaults would result in an unacknowledged writeConcern. The Driver Sessions spec disallows using unacknowledged writes in a session. The error message MUST contain "transactions do not support unacknowledged write concerns".

readPreference

The read preference to use for all read operations in this transaction.

Note that the readPreference property is optional. The default value is NULL. If readPreference is NULL the value will be inherited from this session’s defaultTransactionOptions. If defaultTransactionOptions itself is NULL or the readPreference in defaultTransactionOptions is NULL, the readPreference is inherited from the MongoClient associated with this session.

The transaction’s read preference MUST override all other user configurable read preferences, with the exception of drivers that allow an operation level read preference. In this case, the driver MUST respect the read preference specified by the user, allowing the server to report an error.

In MongoDB 4.0, transactions may only read from the primary. If a read is attempted and the transaction’s read preference is not Primary drivers MUST raise an error containing the string "read preference in a transaction must be primary". Drivers MUST NOT validate the read preference during write operations or in startTransaction. See Why is readPreference part of TransactionOptions?.

client = MongoClient("mongodb://host/?readPreference=nearest")
coll = client.db.test
with client.start_session() as s:
    with s.start_transaction():
        coll.insert_one({}, session=s)
        coll.find_one(session=s)  # Error: "read preference in a transaction must be primary"

In the future, we might relax this restriction and allow any read preference on a transaction.

maxCommitTimeMS

NOTE: This option is deprecated in favor of timeoutMS.

The maximum amount of time to allow a single commitTransaction command to run.

This option is only sent with the commitTransaction command(s) and only if the caller explicitly provides a value. The default is to not send a value.

Note, this option is an alias for the maxTimeMS commitTransaction command option.

SessionOptions changes

defaultTransactionOptions

The default TransactionOptions to use for transactions started on this session.

ClientSession changes

ClientSession is in one of five states: "no transaction", "starting transaction", "transaction in progress", "transaction committed", and "transaction aborted". It transitions among these states according to the following diagram:

[State transition diagram omitted; see the GraphViz source accompanying the original specification.]

When a ClientSession is created it starts in the "no transaction" state. Starting, committing, and aborting a transaction transitions the session between the "starting transaction", "transaction in progress", "transaction committed", and "transaction aborted" states. If the session is in the "transaction aborted" or "transaction committed" state, then any operation using the session (besides commitTransaction and abortTransaction) MUST reset the session state to "no transaction".

Note that "error" is not a state, it represents throwing an error due to an invalid operation. When such errors are thrown the session state is unchanged.

Client-side errors MUST NOT change transaction state. For example, if an invalid key or an excessively large document is provided by the application to an insert when the transaction state is "starting transaction", the transaction state MUST remain "starting transaction". If the same situation occurs when the transaction state is "transaction in progress", the state MUST remain "transaction in progress".

startTransaction

This method starts a new transaction on this session with the given TransactionOptions. When options is omitted or if particular options are not specified, drivers will use the defaultTransactionOptions from ClientSession.options or inherit them from the session's client, as described in the text above for each option. This session is in the "starting transaction" state after this method returns.

If this session is in the "starting transaction " or "transaction in progress" state, then Drivers MUST raise an error containing the message "Transaction already in progress" without modifying any session state.

startTransaction SHOULD report an error if the driver can detect that transactions are not supported by the deployment. A deployment does not support transactions when it does not support sessions, when maxWireVersion < 7, or when maxWireVersion < 8 and the topology type is Sharded; see How to Check Whether a Deployment Supports Sessions. Note that checking the maxWireVersion does not guarantee that the deployment supports transactions: for example, a MongoDB 4.0 replica set using MMAPv1 will report maxWireVersion 7 but does not support transactions. In this case, drivers rely on the deployment to report an error when a transaction is started.

Drivers MUST increment the txnNumber for the corresponding server session.
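A sketch of the state checks, support check, and txnNumber increment performed by startTransaction; the deployment_supports_transactions helper and the session attribute names are hypothetical:

def start_transaction(session, options=None) -> None:
    if session.transaction_state in ("starting transaction", "transaction in progress"):
        raise RuntimeError("Transaction already in progress")

    # Best-effort detection only; the server remains the final authority.
    if not deployment_supports_transactions(session.client.topology):
        raise RuntimeError("This MongoDB deployment does not support transactions")

    session.transaction_options = options or session.options.default_transaction_options
    session.server_session.txn_number += 1
    session.transaction_state = "starting transaction"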

In programming languages that support resource management blocks, startTransaction MAY be used to initiate such a block:

with client.start_session() as s:
    with s.start_transaction():
        collection_one.insert_one(doc1, session=s)
        s.commit_transaction()

The exact API SHOULD match the idioms of the programming language. Depending on the conventions of the programming language, exiting the block without calling commitTransaction MAY automatically abort the transaction, or MAY abort the transaction when exiting due to an exception and commit it when exiting normally. The driver MUST NOT automatically commit the transaction when exiting the block due to an exception. This means that for languages that use an RAII pattern for resource management blocks, if object destruction can't tell if the containing scope has exited normally or for an exception, object destruction MUST NOT automatically commit the transaction.

If the driver returns a type to support resource management blocks, the type MUST NOT be named "Transaction". The type MAY be named "TransactionContext", "TransactionScopeGuard" or something similar for your language. See Why is there no Transaction object?

commitTransaction

This method commits the currently active transaction on this session. Drivers MUST run a commitTransaction command with the writeConcern and, if configured, the maxCommitTimeMS from TransactionOptions. Drivers MUST report an error when the command fails or the command succeeds but contains a writeConcernError. This session is in the "transaction committed" state after this method returns — even on error.

If this session is in the "no transaction" state, then Drivers MUST raise an error containing the message "No transaction started".

If this session is in the "transaction aborted" state, then Drivers MUST raise an error containing the message "Cannot call commitTransaction after calling abortTransaction".

If this session is already in the "transaction committed" state, then Drivers MUST re-run the previous commitTransaction.

It is valid to call commitTransaction when the session is in the "starting transaction" or "transaction in progress" state. When the session is in the "starting transaction" state, meaning no operations have been performed on this transaction, drivers MUST NOT run the commitTransaction command.

commitTransaction is a retryable write command. Drivers MUST retry once after commitTransaction fails with a retryable error, including a handshake network error, according to the Retryable Writes Specification, regardless of whether retryWrites is set on the MongoClient or not.

When commitTransaction is retried, either by the driver's internal retry logic or explicitly by the user calling commitTransaction again, drivers MUST apply w: majority to the write concern of the commitTransaction command. If the transaction is using a writeConcern that is not the server default (i.e. specified via TransactionOptions during the startTransaction call or otherwise inherited), any other write concern options (e.g. wtimeout) MUST be left as-is when applying w: majority. Finally, if the modified write concern does not include a wtimeout value, drivers MUST also apply wtimeout: 10000 to the write concern in order to avoid waiting forever (or until a socket timeout) if the majority write concern cannot be satisfied. See Majority write concern is used when retrying commitTransaction.
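The write-concern adjustment applied to a retried commit could be sketched as follows, treating the write concern as a plain document:

def write_concern_for_commit_retry(write_concern: dict) -> dict:
    retry_wc = dict(write_concern) if write_concern else {}
    # Upgrade w to majority, preserving any other options (e.g. an existing wtimeout).
    retry_wc["w"] = "majority"
    # Avoid waiting forever if the majority write concern cannot be satisfied.
    retry_wc.setdefault("wtimeout", 10000)
    return retry_wc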

Drivers MUST add error labels to certain errors when commitTransaction fails. See the Error reporting changes and Error Labels sections for a precise description.

abortTransaction

This method aborts the currently active transaction on this session. Drivers MUST run an abortTransaction command with the transaction’s writeConcern. When this method completes the session moves to the "transaction aborted" state.

It is only valid to call abortTransaction when the session is in the "starting transaction" or "transaction in progress" state, otherwise drivers MUST raise an error without modifying transaction state.

If this session is in the "no transaction" state, then drivers MUST raise an error containing the message "No transaction started".

If this session is in the "transaction committed" state, then drivers MUST raise an error containing the message "Cannot call abortTransaction after calling commitTransaction".

If this session is already in the "transaction aborted" state, then drivers MUST raise an error containing the message "Cannot call abortTransaction twice".

It is valid to call abortTransaction when the session is in the "starting transaction" or "transaction in progress" state. When the session is in the "starting transaction" state, meaning, no operations have been performed on this transaction, drivers MUST NOT run the abortTransaction command.

abortTransaction is a retryable write command. Drivers MUST retry after abortTransaction fails with a retryable error according to the Retryable Writes Specification, including a handshake network error, regardless of whether retryWrites is set on the MongoClient or not.

If the operation times out or fails with a non-retryable error, drivers MUST ignore all errors from the abortTransaction command. Errors from abortTransaction are meaningless to the application because they cannot do anything to recover from the error. The transaction will ultimately be aborted by the server anyway either upon reaching an age limit or when the application starts a new transaction on this session, see Drivers ignore all abortTransaction errors.

endSession changes

This method ends a ClientSession. Drivers MUST call abortTransaction if this session is in the "transaction in progress" state in order to release resources on the server. Drivers MUST ignore any errors raised by abortTransaction while ending a session.
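A sketch of endSession under these rules; release_server_session is a hypothetical helper for returning the server session to the pool:

def end_session(session) -> None:
    try:
        if session.transaction_state == "transaction in progress":
            session.abort_transaction()
    except Exception:
        # Errors from abortTransaction are ignored when ending a session.
        pass
    finally:
        session.release_server_session()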

Error reporting changes

This spec introduces the concept of an "error label". Which labels are applied to an error may be communicated from the server to the client, or determined client-side. Any error reported by the driver in response to a server error, server selection error, or network error MUST have an API for determining whether it has a given label. In programming languages that use class inheritance hierarchies for exceptions, the presence of an error label MUST NOT affect an exception's class. Error labels MUST be expressed as a collection of text strings, and it MUST be possible for applications to check if an error has a label that is not yet specified in MongoDB 4.0. Drivers MAY define constants for error label strings that are known at this time.

Drivers MAY implement an error label API similar to the following:

try:
    session.commit_transaction()
except (OperationFailure, ConnectionFailure) as exc:
    if exc.has_error_label("UnknownTransactionCommitResult"):
        print("tried to commit, don't know the outcome")

Drivers MAY expose the list of all error labels for an exception object.

Drivers MUST add the error label "TransientTransactionError" to network errors thrown in a transaction except for network errors thrown during commitTransaction.

Transactions Wire Protocol

The server requires each operation executed within a transaction to provide an lsid and txnNumber in its command document. Each field is obtained from the ClientSession object passed to the operation from the application. Drivers will be responsible for maintaining a monotonically increasing transaction number for each ServerSession used by a ClientSession object. The txnNumber is incremented by the call to startTransaction and remains the same for all commands in the transaction.

Drivers that pool ServerSessions MUST preserve the transaction number when reusing a server session from the pool with a new ClientSession (this can be tracked as another property on the driver's object for the server session).

Drivers MUST ensure that each transaction specifies a transaction number larger than any previously used transaction number for its session ID.

Constructing commands within a transaction

Behavior of the startTransaction field

The first command within a multi-statement transaction MUST include startTransaction:true. Subsequent commands MUST NOT include the startTransaction field.

Behavior of the autocommit field

All operations within a multi-statement transaction (including commitTransaction and abortTransaction) MUST include autocommit:false, to distinguish them from single-statement retryable writes.

Behavior of the readConcern field

Any command that marks the beginning of a transaction MAY include a readConcern argument with optional level and afterClusterTime fields. Read concern levels 'local', 'majority', and 'snapshot' are all supported, although they all have the same behavior as "snapshot" in MongoDB 4.0. To support causal consistency, if readConcern afterClusterTime is specified, then the server will ensure that the transaction’s read timestamp is after the afterClusterTime.

All commands of a multi-statement transaction subsequent to the initial command MUST NOT specify a readConcern, since the readConcern argument is only needed to establish the transaction’s read timestamp. If a readConcern argument is specified on a subsequent (non-initial) command, the server will return an error.

Read concern level "snapshot" is new in MongoDB 4.0 and can only be used when starting a transaction. The server will return an error if read concern level "snapshot" is specified on a command that is not the start of a transaction. Drivers MUST rely on the server to report an error if read concern level snapshot is used incorrectly.

Behavior of the writeConcern field

The commitTransaction and abortTransaction commands are the only commands of a multi-statement transaction that allow a writeConcern argument. If a writeConcern argument is given on any other command of a transaction, the server will return an error. The writeConcern argument of the commitTransaction and abortTransaction commands has semantics analogous to existing write commands.

Behavior of the recoveryToken field

Only included for sharded transactions and only when running a commitTransaction or abortTransaction command. See the recoveryToken field section for more info.

Constructing the first command within a transaction

When constructing the first command within a transaction, drivers MUST add the lsid, txnNumber, startTransaction, and autocommit fields.

Drivers MUST include the transaction's readConcern in the first command in a transaction if and only if the readConcern is supplied and not the server's default. The readConcern MUST NOT be inherited from the collection, database, or client associated with the driver method that invokes the first command.

Drivers MUST NOT add to subsequent commands the readConcern from the transaction or any readConcern inherited from the collection, database, or client.

This is an example of an insert command that begins a server transaction:

{
    insert : "test",
    documents : [{}],
    lsid : { id : <UUID> }
    txnNumber: NumberLong(1),
    // The "level" is optional, supported values are "local", "majority"
    // and "snapshot". "afterClusterTime" is only present in causally
    // consistent sessions.
    readConcern : {
        level : "snapshot",
        afterClusterTime : Timestamp(42,1)
    },
    startTransaction : true,
    autocommit : false
}

This command uses the readConcern set on the transaction's TransactionOptions during the startTransaction call. It is not inherited from a client, database, or collection at the time of the first command.

The session transitions to the "transaction in progress" state after completing the first command within a transaction — even on error.

Constructing any other command within a transaction

When constructing any other command within a transaction, drivers MUST add the lsid, txnNumber, and autocommit fields. Drivers MUST NOT automatically add the writeConcern, readConcern, or startTransaction fields. This is an example of a find command within a transaction:

{
    find : "test",
    filter : {},
    lsid : { id : <UUID> }
    txnNumber : NumberLong(1),
    autocommit : false
}

Generic RunCommand helper within a transaction

If your driver offers a generic RunCommand method on your database object, the driver MUST add the lsid, autocommit, and txnNumber fields. If the RunCommand operation is the first operation in a transaction then the driver MUST also add the startTransaction and readConcern fields. A driver MUST do this without modifying any data supplied by the application (e.g. the command document passed to RunCommand). If the user supplies an explicit readConcern as an argument to the runCommand method in a transaction, the client MUST raise an error with the message "Cannot set read concern after starting a transaction."

The RunCommand method is considered a read operation and MUST use the transaction’s read preference.

The behavior is not defined if the command document passed to RunCommand already contains some of the transaction fields.
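A sketch of what a generic runCommand helper might do while a transaction is in progress; every attribute and helper shown (session_id, server_session, transaction_read_concern, send_command) is illustrative rather than specified:

def run_command_in_transaction(database, command: dict, session, read_concern=None) -> dict:
    if read_concern is not None:
        raise ValueError("Cannot set read concern after starting a transaction.")

    cmd = dict(command)  # never modify the document supplied by the application
    cmd["lsid"] = session.session_id
    cmd["txnNumber"] = session.server_session.txn_number
    cmd["autocommit"] = False
    if session.transaction_state == "starting transaction":
        cmd["startTransaction"] = True
        txn_read_concern = session.transaction_read_concern()
        if txn_read_concern:
            cmd["readConcern"] = txn_read_concern
    return database.client.send_command(database.name, cmd, session=session)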

Interaction with Causal Consistency

Drivers MUST add readConcern.afterClusterTime to the command that starts a transaction in a causally consistent session -- even if the command is a write. Drivers MUST NOT add readConcern.afterClusterTime to subsequent commands in a transaction.

Interaction with Retryable Writes

In MongoDB 4.0 the only supported retryable write commands within a transaction are commitTransaction and abortTransaction. Therefore drivers MUST NOT retry write commands within transactions even when retryWrites has been enabled on the MongoClient. In addition, drivers MUST NOT add the RetryableWriteError label to any error that occurs during a write command within a transaction (excepting commitTransaction and abortTransaction), even when retryWrites has been enabled on the MongoClient.

Drivers MUST retry the commitTransaction and abortTransaction commands even when retryWrites has been disabled on the MongoClient. commitTransaction and abortTransaction are retryable write commands and MUST be retried according to the Retryable Writes Specification.

Retryable writes and transactions both use the txnNumber associated with a ServerSession. For retryable writes, txnNumber would normally increment before each retryable command, whereas in a transaction, the txnNumber is incremented at the start and then stays constant, even for retryable operations within the transaction. When executing the commitTransaction and abortTransaction commands within a transaction drivers MUST use the same txnNumber used for all preceding commands in the transaction.

Server Commands

commitTransaction

The commitTransaction server command has the following format:

{
    commitTransaction : 1,
    lsid : { id : <UUID> },
    txnNumber : <Int64>,
    autocommit : false,
    writeConcern : {...},
    maxTimeMS: <Int64>,
    recoveryToken : {...}
}

abortTransaction

The abortTransaction server command has the following format:

{
    abortTransaction : 1,
    lsid : { id : <UUID> },
    txnNumber : <Int64>,
    autocommit : false,
    writeConcern : {...},
    recoveryToken : {...}
}

Both commands MUST be sent to the admin database.

The server response has the following format:

{ ok : 1 }

In case of an error, the server response has the following format:

{ ok : 0, errmsg : "...", code : <Number>, errorLabels: ["Label"] }

In case of a write concern error, the server response has the following format:

{ ok : 1, writeConcernError: {code: <Number>, errmsg : "..."} }

Sharded Transactions

MongoDB 4.2 (maxWireVersion 8) introduces support for sharded transactions. Sharded transactions support all of the same features as single replica set transactions but introduce two new driver concepts: mongos pinning and the recoveryToken field.

Mongos Pinning

Drivers MUST send all commands for a single transaction to the same mongos (excluding retries of commitTransaction and abortTransaction).

After the driver selects a mongos for the first command within a transaction, the driver MUST pin the ClientSession to the selected mongos. Drivers MUST send all subsequent commands that are part of the same transaction (excluding certain retries of commitTransaction and abortTransaction) to the same mongos.

When to unpin

Drivers MUST unpin a ClientSession in the following situations:

  1. The transaction is aborted. The session MUST be unpinned regardless of whether the abortTransaction command succeeds, fails, or was never executed at all. If the operation fails with a retryable error, the session MUST be unpinned before performing server selection for the retry.
  2. Any operation in the transaction, including commitTransaction, fails with a TransientTransactionError. Transient errors indicate that the transaction in question has already been aborted or that the pinned mongos is down/unavailable. Unpinning the session ensures that a subsequent abortTransaction (or commitTransaction) does not block waiting on a server that is unreachable.
  3. Any commitTransaction attempt fails with an UnknownTransactionCommitResult error label. If the error is also considered retryable, the session MUST be unpinned before performing server selection for the retry.
  4. A new transaction is started on the ClientSession after the previous transaction has been committed. The session MUST be unpinned before performing server selection for the first operation of the new transaction.
  5. A non-transactional operation is performed using the ClientSession. The session MUST be unpinned before performing server selection for the operation.

Note that committing a transaction on a pinned ClientSession MUST NOT unpin the session as commitTransaction may be called multiple times.
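The unpinning rules above could be centralized in a helper like this sketch; pinned_server and the keyword flags are illustrative:

def unpin_if_needed(session, *, error=None, aborted=False, starting_new_transaction=False,
                    non_transactional_op=False) -> None:
    transient = error is not None and error.has_error_label("TransientTransactionError")
    unknown_commit = error is not None and error.has_error_label("UnknownTransactionCommitResult")
    if aborted or transient or unknown_commit or starting_new_transaction or non_transactional_op:
        # Clearing the pin forces fresh mongos selection for the next operation.
        session.pinned_server = None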

Pinning in Load Balancer Mode

See the Load Balancer Specification for details.

recoveryToken field

The recoveryToken field enables the driver to recover a sharded transaction's outcome on a new mongos when the original mongos is no longer available.

Every successful (ok:1) command response in a sharded transaction includes a recoveryToken field. Drivers MUST track the most recently received recoveryToken field and MUST append this field to any subsequent commitTransaction or abortTransaction commands. Tracking the most recently returned recoveryToken allows the server to update the recoveryToken mid-transaction if needed.

Drivers MUST clear a session's cached recoveryToken when transitioning to the "no transaction" or "starting transaction" state.

Drivers can safely assume that the recoveryToken field is always a BSON document but drivers MUST NOT modify the contents of the document.
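A sketch of tracking and attaching the token; the recovery_token attribute is illustrative:

def track_recovery_token(session, reply: dict) -> None:
    # Remember the most recently received token; its contents are opaque to the driver.
    token = reply.get("recoveryToken")
    if token is not None:
        session.recovery_token = token


def attach_recovery_token(session, command: dict) -> None:
    # Only commitTransaction and abortTransaction carry the token.
    if session.recovery_token is not None:
        command["recoveryToken"] = session.recovery_token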

Error Reporting and Retrying Transactions

Error Labels

Starting in MongoDB 4.0, any command error may include a top level "errorLabels" field. The field contains an array of string error labels.

TransientTransactionError

Any command error that includes the "TransientTransactionError" error label in the "errorLabels" field. Any network error or server selection error encountered running any command besides commitTransaction in a transaction. In the case of command errors, the server adds the label; in the case of network errors or server selection errors where the client receives no server reply, the client adds the label.

Retrying transactions that fail with TransientTransactionError

If an exception with the "TransientTransactionError" label is thrown, an application can retry the entire transaction from the beginning with a reasonable expectation that it will succeed. For example:

def run_transaction(client):
    with client.start_session() as s:
        with s.start_transaction():
            collection_one.insert_one(doc1, session=s)
            collection_two.insert_one(doc2, session=s)

while True:
    try:
        return run_transaction(client)
    except (OperationFailure, ConnectionFailure) as exc:
        if exc.has_error_label("TransientTransactionError"):
            print("Transient transaction error, retrying...")
            continue
        raise

In the above example, a transaction will never be committed twice. The retry loop ends when the transaction commits successfully or the transaction fails with a non-transient error.

An example of a non-transient transaction error is DuplicateKeyError, which causes the server to abort the transaction. Retrying a transaction that causes a DuplicateKeyError will again (likely) abort the transaction, therefore such an error is not labeled "transient."

UnknownTransactionCommitResult

The commitTransaction command is considered a retryable write. The driver will automatically retry the commitTransaction after a retryable error. Although this adds a layer of protection, the driver’s retry attempt of a commitTransaction may again fail with a retryable error. In that case, both the driver and the application do not know the state of the transaction.

The driver MUST add the "UnknownTransactionCommitResult" error label when commitTransaction fails with a server selection error, network error, retryable writes error, MaxTimeMSExpired error, or write concern failed / timeout. (See A server selection error is labeled UnknownTransactionCommitResult for justification.) The approximate meaning of the UnknownTransactionCommitResult label is, "We don't know if your commit has satisfied the provided write concern." The only write concern errors that are not labeled with "UnknownTransactionCommitResult" are CannotSatisfyWriteConcern (which will be renamed to the more precise UnsatisfiableWriteConcern in 4.2, while preserving the current error code) and UnknownReplWriteConcern. These errors codes mean that the provided write concern is not valid and therefore a retry attempt would fail with the same error.

In the case that the commitTransaction fails with a retryable writes error, that error will have both an UnknownTransactionCommitResult label and a RetryableWriteError label. This is currently the only scenario in which an error can be assigned two error labels.

Retrying commitTransaction

If an exception with this label is thrown, an application can safely call commitTransaction again. If this attempt succeeds it means the transaction has committed with the provided write concern. If this attempt fails it may also have the "UnknownTransactionCommitResult" error label. For example:

def run_transaction_and_retry_commit(client):
    with client.start_session() as s:
        with s.start_transaction():
            collection_one.insert_one(doc1, session=s)
            collection_two.insert_one(doc2, session=s)
            while True:
                try:
                    s.commit_transaction()
                    break
                except (OperationFailure, ConnectionFailure) as exc:
                    if exc.has_error_label("UnknownTransactionCommitResult"):
                        print("Unknown commit result, retrying...")
                        continue
                    raise

while True:
    try:
        return run_transaction_and_retry_commit(client)
    except (OperationFailure, ConnectionFailure) as exc:
        if exc.has_error_label("TransientTransactionError"):
            print("Transient transaction error, retrying...")
            continue
        raise

Handling command errors

Drivers MUST document that command errors inside a transaction may abort the transaction on the server. An attempt to commit such a transaction will be rejected with a NoSuchTransaction error.

Test Plan

See the README for tests.

The Python driver serves as a reference implementation.

Design Rationale

The design of this specification builds on the Driver Sessions Specification and modifies the driver API as little as possible.

Drivers will rely on the server to yield an error if an unsupported command is executed within a transaction. This will free drivers from having to maintain a list of supported operations and also allow for forward compatibility when future server versions begin to support transactions for additional commands.

Drivers ignore all abortTransaction errors

If the driver has cleared its client-side transaction state, then the next operation it performs will be in a new transaction or no transaction, which will cause any lingering transaction state on the server (associated with this session) to abort. Therefore abortTransaction can be considered fail-safe, and raising an exception from it only complicates application code. Applications would have to wrap abortTransaction in an exception-handling block, but have no useful action to perform in response to the error.

abortTransaction does, however, raise an error if there is no transaction in progress. We had considered making this situation raise no error, as well. However, we want to raise an error from abortTransaction if there is no transaction, because it discourages an antipattern like this:

s.start_transaction()
try:
    coll.insert_one({}, session=s)
    s.commit_transaction()
except:
    # We don't know if it was the insert_one, the commit,
    # or some other operation that failed, so we must not
    # commit the transaction.
    s.abort_transaction()  # Raises a client-side error

If a user puts "commit" in the same exception handling block as the other operations in the transaction, they don't know whether to retry the commit or the whole transaction on error. We want such code to raise an exception. One chance we have to do that is if a commit fails with a network error and enters the exception handling block, where abortTransaction throws "Cannot call abortTransaction after commitTransaction".

Drivers add the "TransientTransactionError" label to network errors

When any command other than commitTransaction fails with a network error within a transaction, drivers add the "TransientTransactionError" label because the client doesn't know whether it has modified data in the transaction or not. It must therefore abort and retry the entire transaction to be certain that each command in the transaction has executed exactly once.

Adding the "TransientTransactionError" label allows applications to use the the same error label API for both network errors and command errors. This also allows applications to distinguish between a network error that occurs within a transaction from a network error that occurs while committing a transaction.

Transactions in GridFS

The GridFS spec has not been updated to support sessions; however, some drivers have already implemented support for it on their own. When the GridFS spec has been updated to support sessions, then drivers that implement that spec MUST also support transactions in GridFS because all APIs that allow sessions MUST support transactions.

Drivers that have already implemented session support in GridFS MUST also support transactions in GridFS. Drivers that have not implemented ClientSession support in GridFS are not required to support transactions (or sessions) in GridFS.

This spec does not require all drivers to implement transaction support in GridFS because transactions in GridFS are not very useful: transactions in 4.0 are too limited in time and space to operate on large GridFS files. Additionally, GridFS as specified already has some basic guarantees that make transactions less necessary: files are immutable and they are created "atomically", from the primary's perspective, because the file entry is only saved after all chunks are uploaded.

Causal Consistency with RunCommand helper

Causal Consistency alone only applies to commands that read, and we don't want to parse the document passed to runCommand to see if it's a command that reads. In a transaction, however, any command at all that starts a transaction must include afterClusterTime, so we can add afterClusterTime to the document passed to runCommand without adding per-command special logic to runCommand.

Calling commitTransaction with the generic runCommand helper is undefined behavior

Applications should only use the ClientSession API to manage transactions. Applications should not use a generic runCommand helper to run the commitTransaction or abortTransaction commands directly. This spec does not define the behavior of calling such commands, consistent with other driver specifications that do not define the behavior of directly calling commands for which helper methods are available. The purpose of the generic runCommand method is to execute a command directly with minimal additional client-side logic.

Dependencies

This specification depends on:

  1. Driver Sessions Specification
  2. Retryable Writes Specification

Backwards Compatibility

The API changes to support transactions extend the existing API but do not introduce any backward breaking changes. Existing programs that do not make use of transactions will continue to compile and run correctly.

Reference Implementation

The Python driver serves as a reference implementation.

Future work

  • Support retryable writes within a transaction.

  • Support transactions on secondaries. In this case, drivers would be
    required to pin a transaction to the server selected for the initial operation. All subsequent operations in the transaction would go to the pinned server.

  • Support for transactions that read from multiple nodes in a replica
    set. One interesting use case would be to run a single transaction that performs low-latency reads with readPreference "nearest" followed by some writes.

  • Support for unacknowledged transaction commits. This might be useful
    when data consistency is paramount but durability is optional. Imagine a system that increments two counters in two different collections. The system may want to use transactions to guarantee that both counters are always incremented together or not at all.

Justifications

Why is there no Transaction object?

In order to use transactions an application already has to create and manage a ClientSession object. Introducing a Transaction object would result in another object that the application needs to manage. Moreover, a server session can only have a single transaction in progress at a time. We chose not to introduce a public Transaction object so that applications only need to manage a single object and to more closely mirror how transactions work on the server.

Some drivers' startTransaction methods will return an object as part of the language's resource management block protocol. The object returned by startTransaction MUST NOT be named Transaction, in order to reserve that name for some future API extension. Additionally, by avoiding the name Transaction, we prevent users from thinking they can run multiple transactions in a session. Finally, we avoid the temptation to diverge from this spec's API by adding a commit() or abort() method to the object returned by startTransaction. Committing and aborting a transaction is the responsibility of the ClientSession object in all drivers.

Why is readPreference part of TransactionOptions?

Providing a read preference for the entire transaction makes it easier for applications that use one or more non-primary read preferences for non-transactional reads to run transactions under a single primary read preference. Applications only need to set a primary read preference on the transaction instead of changing the read preference of all operations.

Because primary is the only read preference allowed with transactions in MongoDB 4.0, this specification could have omitted TransactionOptions.readPreference, or at least defaulted the read preference to primary instead of inheriting the client's read preference. However, this would have required a breaking change circa MongoDB 4.2 when we introduce secondary reads in transactions: TransactionOptions will inherit the client's read preference in 4.2, so for the sake of future-compatibility, TransactionOptions inherits the client's read preference now.

We considered defaulting TransactionOptions.readPreference to primary in 4.0, overriding the client's read preference by default for convenience. However, for consistency with other options-inheritance rules in our specifications, transactions MUST inherit the client's read preference.

In MongoDB 4.0, the error "read preference in a transaction must be primary" is thrown whenever the application attempts a read operation in a transaction with a non-primary read preference. We considered throwing this error from startTransaction instead, to make the error more deterministic and reduce the performance burden of re-checking the TransactionOptions on each operation. However, this behavior will have to change when we introduce secondary reads in transactions. There will then be new error scenarios, such as a transaction with secondary reads followed by a write. It won't be possible in the future for startTransaction to check that the read preference is correct for all operations the application will perform in the transaction. Therefore, we specify now that the readPreference must be checked per-operation. (However, we have not completely planned how read preference validation will behave in MongoDB 4.2.)

Users cannot pass readConcern or writeConcern to operations in transactions

For drivers that allow readConcern and/or writeConcern to be passed to a particular operation, if the driver did not prohibit the readConcern parameter for operations within a transaction, the following code would be ambiguous:

client = MongoClient("mongodb://localhost/?readConcernLevel=majority")
with client.start_session() as s:
    # Transaction uses readConcern majority.
    with s.start_transaction():
        # The first command in a transaction. Which readConcern?
        client.db.collection.distinct(
            readConcern={'level': 'snapshot'},
            session=s)

In this scenario, the driver must choose which of the two possible readConcerns to use for the first command in the transaction. The server will accept either without error, so the ambiguity MUST be resolved by raising a client-side error.

We could specify that if a user passes an explicit writeConcern to an operation in a transaction, that the driver passes this writeConcern to the server. The server correctly returns an error in this scenario; there is not the same ambiguity with an explicit writeConcern as there is with an explicit readConcern passed to the first operation in a transaction. For consistency, however, we specify that an explicit writeConcern passed to an operation in a transaction provokes a client-side error, the same as for readConcern.

Another alternative is to silently ignore the readConcern and/or writeConcern that the user has explicitly provided to a particular operation in a transaction. This would be a surprising and undetectable deviation from the user's explicit intent.

On the other hand, if a user configures the write concern of a client, database, or collection, and then configures the same option on a transaction, the transaction's configuration overrides the inherited configuration:

client = MongoClient("mongodb://localhost/?w=majority")
with client.start_session() as s:
    with s.start_transaction(writeConcern={'w': 1}):
        # Uses w: 1.
        client.db.collection.insert_one(
           {'_id': 1},
           session=s)

In this case the transaction options express a more immediate user intent than the client options, so it is not surprising to override the client options.

Aggregate with write stage is a read operation

We intend to migrate away from designs that require drivers to inspect the contents of the aggregation pipeline and override user read preferences for aggregate with a write stage (e.g. $out, $merge). In general, our specifications should stop defining different behaviors based on the contents of commands.

A server selection error is labeled UnknownTransactionCommitResult

Drivers add the "UnknownTransactionCommitResult" to a server selection error from commitTransaction, even if this is the first attempt to send commitTransaction. It is true in this case that the driver knows the result: the transaction is definitely not committed. However, the "UnknownTransactionCommitResult" label properly communicates to the application that calling commitTransaction again may succeed.

FAQ

What commands can be run in a transaction?

The following commands are allowed inside transactions:

  1. find
  2. getMore
    • Note that it is not possible to start a transaction with a getMore command; the cursor must have been created within the transaction in order for the getMore to succeed.
  3. killCursors
  4. insert, including into a non-existing collection that implicitly creates it
  5. update
  6. delete
  7. findAndModify
  8. aggregate (including $lookup)
    • The $out and $merge stages are prohibited.
  9. distinct
  10. geoSearch
  11. create
  12. createIndexes on an empty collection created in the same transaction or on a non-existing collection
  13. bulkWrite

Why don’t drivers automatically retry commit after a write concern timeout error?

A write concern timeout error indicates that the command succeeded but failed to meet the specified writeConcern within the given time limit. Attempting to retry would implicitly double the application’s wtimeout value, so drivers do not automatically retry.

Note: this applies only to the driver's internal retry-once behavior. Write concern timeout errors will be labeled with "UnknownTransactionCommitResult", which signals that higher-level code may retry.

What happens when a command object passed to RunCommand already contains a transaction field (e.g. lsid, txnNumber, etc.)?

The behavior of running such commands in a transaction is undefined. Applications should not run such commands inside a transaction.

Majority write concern is used when retrying commitTransaction

Drivers should apply a majority write concern when retrying commitTransaction to guard against a transaction being applied twice.

Consider the following scenario:

  1. The driver is connected to a replica set where node A is primary.
  2. The driver sends commitTransaction to A with w:1. A commits the transaction but dies before it can reply. This constitutes a retryable error, which means the driver can retry the commitTransaction command.
  3. Node B is briefly elected.
  4. The driver retries commitTransaction on B with w:1, and B replies with a NoSuchTransaction error code and TransientTransactionError error label. This implies that the driver may retry the entire transaction.
  5. Node A revives before B has done any w:majority writes and is re-elected as primary.
  6. The driver then retries the entire transaction on A where it commits successfully.

The outcome of this scenario is that two complete executions of the transaction operations are permanently committed on node A.

Drivers can avoid this scenario if they always use a majority write concern when retrying commitTransaction. Applying a majority write concern to step four in the above scenario would lead to one of the following possible outcomes:

  • Node B replies with a failed response that does not include a TransientTransactionError error label. This does not constitute a retryable error. Control is returned to the user.
  • Node B replies with a successful response (e.g. ok:1) indicating that the retried commitTransaction has succeeded durably and the driver can continue. Control is returned to the user.
  • Node B replies with a wtimeout error. This does not constitute a retryable error. Control is returned to the user.
  • Node B replies with a failure response that includes the TransientTransactionError label, which indicates it is safe to retry the entire transaction. Drivers can trust that a server response will not include both a write concern error and TransientTransactionError label (see: SERVER-37179).

Adding a majority write concern only when retrying commitTransaction provides a good compromise between performance and durability. Applications can use w:1 for the initial transaction attempt for a performance advantage in the happy path. In the event of a retryable error, the driver can upgrade commitTransaction to use w:majority, providing additional guarantees to the user and avoiding any risk of committing the transaction twice. Note that users may still be vulnerable to rollbacks when using w:1 (as with any write operation).

While it's possible that the original write concern may provide greater guarantees than majority (e.g. w:3 in a three-node replica set, custom write concern), drivers are not in a position to make that comparison due to the possibility of hidden members or the opaque nature of custom write concerns. Excluding the edge case where writeConcernMajorityJournalDefault has been disabled, drivers can readily trust that a majority write concern is durable, which achieves the primary objective of avoiding duplicate commits.
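The following sketch illustrates this retry policy. The helper and exception names are hypothetical stand-ins for driver-internal machinery, not part of any public driver API:

class RetryableCommitError(Exception):
    """Hypothetical stand-in for a retryable error raised by commitTransaction."""

def commit_with_retry(send_commit, write_concern):
    # Hypothetical driver-internal helper: the initial attempt uses the
    # transaction's own write concern (e.g. w:1); the single retry upgrades
    # it to w:majority to guard against the transaction committing twice.
    try:
        return send_commit(write_concern)
    except RetryableCommitError:
        retry_wc = dict(write_concern)
        retry_wc["w"] = "majority"
        return send_commit(retry_wc)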

Changelog

  • 2024-05-08: Add bulkWrite to the list of commands allowed in transactions.

  • 2024-02-15: Migrated from reStructuredText to Markdown.

  • 2023-11-22: Specify that non-transient transaction errors abort the transaction
    on the server.

  • 2022-10-05: Remove spec front matter and reformat changelog

  • 2022-01-25: Mention the additional case of a retryable handshake error

  • 2022-01-19: Deprecate maxCommitTimeMS in favor of timeoutMS.

  • 2021-04-12: Adding in behaviour for load balancer mode.

  • 2020-04-07: Clarify that all abortTransaction attempts should unpin the session,
    even if the command is not executed.

  • 2020-04-07: Specify that sessions should be unpinned once a transaction is aborted.

  • 2019-10-21: Specify that a commit error can have two error labels

  • 2019-07-30: Clarify when the cached recoveryToken should be cleared.

  • 2019-06-10: Client-side errors must not change transaction state.

  • 2019-06-07: Mention $merge stage for aggregate alongside $out

  • 2019-05-13: Add support for maxTimeMS on transaction commit, MaxTimeMSExpired
    errors on commit are labelled UnknownTransactionCommitResult.

  • 2019-02-19: Add support for sharded transaction recoveryToken.

  • 2019-02-19: Clarify FAQ entry for not retrying commit on wtimeout

  • 2019-01-18: Apply majority write concern when retrying commitTransaction

  • 2018-11-13: Add mongos pinning to support sharded transaction.

  • 2018-06-18: Explicit readConcern and/or writeConcern are prohibited within
    transactions, with a client-side error.

  • 2018-06-07: The count command is not supported within transactions.

  • 2018-06-14: Any retryable writes error raised by commitTransaction must be
    labelled "UnknownTransactionCommitResult".

[1] In 4.2, a new mongos waits for the outcome of the transaction but will never itself cause the transaction to be committed. If the initial commit on the original mongos itself failed to initiate the transaction's commit sequence, then a retry attempt on a new mongos will block until the transaction is automatically timed out by the cluster. In this case, the new mongos will return a transient error indicating that the transaction was aborted.

===============================
Convenient API for Transactions
===============================

:Status: Accepted
:Minimum Server Version: 4.0

.. contents::


Abstract

Reliably committing a transaction in the face of errors can be a complicated endeavor using the MongoDB 4.0 drivers API. This specification introduces a withTransaction method on the ClientSession object that allows application logic to be executed within a transaction. This method is capable of retrying either the commit operation or entire transaction as needed (and when the error permits) to better ensure that the transaction can complete successfully.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 <https://www.ietf.org/rfc/rfc2119.txt>_.

Specification

Terms

.. _callback:

Callback A user-defined function that will be passed to the helper method defined in this specification. Depending on the implementing language, this may be a closure, function pointer, or other callable type.

ClientSession Driver object representing a client session, as defined in the Driver Session_ specification. The name of this object MAY vary across drivers.

.. _Driver Session: ../sessions/driver-sessions.md

MongoClient The root object of a driver's API. The name of this object MAY vary across drivers.

.. _TransactionOptions:

TransactionOptions Options for ClientSession.startTransaction, as defined in the Transactions_ specification. The structure of these options MAY vary across drivers (e.g. dictionary, typed class).

.. _Transactions: ../transactions/transactions.md

Naming Deviations

This specification defines the name for a new ClientSession method, withTransaction. Drivers SHOULD use the defined name but MAY deviate to comply with their existing conventions. For example, a driver may use with_transaction instead of withTransaction.

Callback Semantics

The purpose of the callback is to allow the application to specify some sequence of operations to be executed within the body of a transaction. In an ideal scenario, withTransaction will start a transaction, execute the callback, and commit the transaction. In the event of an error, the commit or the entire transaction may need to be retried, and thus the callback could be invoked multiple times.

Drivers MUST ensure that the application can access the ClientSession within the callback, since the ClientSession will be needed to associate operations with the transaction. Drivers may elect an idiomatic approach to satisfy this requirement (e.g. require the callback to receive the ClientSession as its first argument, expect the callback to access the ClientSession from its lexical scope). Drivers MAY allow the callback to support additional parameters as needed (e.g. user data parameter, error output parameter). Drivers MAY allow the callback to return a value to be propagated as the return value of withTransaction.
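For illustration, here is a minimal application-side sketch of these semantics, assuming a PyMongo-style ``with_transaction`` helper that passes the ClientSession to the callback and propagates its return value (the database and collection names are examples only):

.. code:: python

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost")
    orders = client.db.orders
    inventory = client.db.inventory

    def callback(session):
        # Operations join the transaction because the session is passed along.
        orders.insert_one({"sku": "abc123", "qty": 100}, session=session)
        inventory.update_one(
            {"sku": "abc123"}, {"$inc": {"qty": -100}}, session=session)
        return "order placed"

    with client.start_session() as session:
        # The callback may be invoked more than once if the transaction is retried.
        result = session.with_transaction(callback)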

ClientSession Changes

This specification introduces a withTransaction method on the ClientSession class:

.. code:: typescript

interface ClientSession {
    withTransaction(function<any(...)> callback,
                    Optional<TransactionOptions> options,
                    ... /* other arguments as needed */): any

    // other existing members of ClientSession
}

ClientSession.withTransaction


This method is responsible for starting a transaction, invoking a callback, and
committing a transaction. The callback is expected to execute one or more
operations with the transaction; however, that is not enforced. The callback is
allowed to execute other operations not associated with the transaction.

Since ``withTransaction`` includes logic to retry transactions and commits,
drivers MUST apply timeouts per `Client Side Operations Timeout: Convenient
Transactions API
<../client-side-operations-timeout/client-side-operations-timeout.md#convenient-transactions-api>`__.
If ``timeoutMS`` is unset for a ``withTransaction`` call, drivers MUST
enforce a 120-second timeout to limit retry behavior and safeguard
applications from long-running (or infinite) retry loops. Drivers SHOULD use
a monotonic clock to determine elapsed time.

If an UnknownTransactionCommitResult error is encountered for a commit, the
driver MUST retry the commit if and only if the error is not MaxTimeMSExpired
and the retry timeout has not been exceeded. Otherwise, the driver MUST NOT
retry the commit and allow ``withTransaction`` to propagate the error to its
caller.

If a TransientTransactionError is encountered at any point, the entire
transaction may be retried. If the retry timeout has not been exceeded, the
driver MUST retry a transaction that fails with an error bearing the
"TransientTransactionError" label. Since retrying the entire transaction will
entail invoking the callback again, drivers MUST document that the callback may
be invoked multiple times (i.e. one additional time per retry attempt) and MUST
document the risk of side effects from using a non-idempotent callback. If the
retry timeout has been exceeded, drivers MUST NOT retry the transaction and
allow ``withTransaction`` to propagate the error to its caller.

If an error bearing neither the UnknownTransactionCommitResult nor the
TransientTransactionError label is encountered at any point, the driver MUST NOT
retry and MUST allow ``withTransaction`` to propagate the error to its caller.

This method MUST receive a `callback`_ as its first parameter.  An optional
`TransactionOptions`_ MUST be provided as its second parameter (with deviations
permitted as outlined in the `CRUD`_ specification). Drivers MAY support other
parameters or options as needed (e.g. user data to pass as a parameter to the
callback).

.. _CRUD: ../crud/crud.md#deviations

~~~~~~~~~~~~~~~~~~~
Sequence of Actions
~~~~~~~~~~~~~~~~~~~

This method should perform the following sequence of actions:

1. Record the current monotonic time, which will be used to enforce the
   120-second timeout before later retry attempts.

2. Invoke `startTransaction`_ on the session. If TransactionOptions were
   specified in the call to ``withTransaction``, those MUST be used for
   ``startTransaction``. Note that ``ClientSession.defaultTransactionOptions``
   will be used in the absence of any explicit TransactionOptions.

3. If ``startTransaction`` reported an error, propagate that error to the caller
   of ``withTransaction`` and return immediately.

4. Invoke the callback. Drivers MUST ensure that the ClientSession can be
   accessed within the callback (e.g. pass ClientSession as the first parameter,
   rely on lexical scoping). Drivers MAY pass additional parameters as needed
   (e.g. user data solicited by withTransaction).

5. Control returns to ``withTransaction``. Determine the current `state`_ of the
   ClientSession and whether the callback reported an error (e.g. thrown
   exception, error output parameter).

6. If the callback reported an error:

   a. If the ClientSession is in the "starting transaction" or "transaction in
      progress" state, invoke `abortTransaction`_ on the session.

   b. If the callback's error includes a "TransientTransactionError" label and
      the elapsed time of ``withTransaction`` is less than 120 seconds, jump
      back to step two.

   c. If the callback's error includes an "UnknownTransactionCommitResult" label,
      the callback must have manually committed a transaction; propagate the
      callback's error to the caller of ``withTransaction`` and return
      immediately.

   d. Otherwise, propagate the callback's error to the caller of
      ``withTransaction`` and return immediately.

7. If the ClientSession is in the "no transaction", "transaction aborted", or
   "transaction committed" state, assume the callback intentionally aborted or
   committed the transaction and return immediately.

8. Invoke `commitTransaction`_ on the session.

9. If ``commitTransaction`` reported an error:

   a. If the ``commitTransaction`` error includes a
      "UnknownTransactionCommitResult" label and the error is not
      MaxTimeMSExpired and the elapsed time of ``withTransaction`` is less
      than 120 seconds, jump back to step eight. We will trust
      ``commitTransaction`` to apply a majority write concern on
      retry attempts (see:
      `Majority write concern is used when retrying commitTransaction`_).

   b. If the ``commitTransaction`` error includes a "TransientTransactionError"
      label and the elapsed time of ``withTransaction`` is less than 120
      seconds, jump back to step two.

   c. Otherwise, propagate the ``commitTransaction`` error to the caller of
      ``withTransaction`` and return immediately.

10. The transaction was committed successfully. Return immediately.

.. _startTransaction: ../transactions/transactions.md#starttransaction
.. _state: ../transactions/transactions.md#clientsession-changes
.. _abortTransaction: ../transactions/transactions.md#aborttransaction
.. _commitTransaction: ../transactions/transactions.md#committransaction

~~~~~~~~~~~
Pseudo-code
~~~~~~~~~~~

This method can be expressed by the following pseudo-code:

.. code:: typescript

    withTransaction(callback, options) {
        // Note: drivers SHOULD use a monotonic clock to determine elapsed time
        var startTime = Date.now(); // milliseconds since Unix epoch

        retryTransaction: while (true) {
            this.startTransaction(options); // may throw on error

            try {
                callback(this);
            } catch (error) {
                if (this.transactionState == STARTING ||
                    this.transactionState == IN_PROGRESS) {
                    this.abortTransaction();
                }

                if (error.hasErrorLabel("TransientTransactionError") &&
                    Date.now() - startTime < 120000) {
                    continue retryTransaction;
                }

                throw error;
            }

            if (this.transactionState == NO_TXN ||
                this.transactionState == COMMITTED ||
                this.transactionState == ABORTED) {
                return; // Assume callback intentionally ended the transaction
            }

            retryCommit: while (true) {
                try {
                    /* We will rely on ClientSession.commitTransaction() to
                     * apply a majority write concern if commitTransaction is
                     * being retried (see: DRIVERS-601) */
                    this.commitTransaction();
                } catch (error) {
                    /* Note: a maxTimeMS error will have the MaxTimeMSExpired
                     * code (50) and can be reported as a top-level error or
                     * inside writeConcernError, ie:
                     * {ok:0, code: 50, codeName: "MaxTimeMSExpired"}
                     * {ok:1, writeConcernError: {code: 50, codeName: "MaxTimeMSExpired"}}
                     */
                    if (!isMaxTimeMSExpiredError(error) &&
                        error.hasErrorLabel("UnknownTransactionCommitResult") &&
                        Date.now() - startTime < 120000) {
                        continue retryCommit;
                    }

                    if (error.hasErrorLabel("TransientTransactionError") &&
                        Date.now() - startTime < 120000) {
                        continue retryTransaction;
                    }

                    throw error;
                }
                break; // Commit was successful
            }
            break; // Transaction was successful
        }
    }

ClientSession must provide access to a MongoClient
--------------------------------------------------

The callback invoked by ``withTransaction`` is only guaranteed to receive a
ClientSession parameter. Drivers MUST ensure that it is possible to obtain a
MongoClient within the callback in order to execute operations within the
transaction. Per the `Driver Session`_ specification, ClientSessions should
already provide access to a client object.
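For example, a callback might obtain the client from the session itself rather than from its lexical scope; the sketch below assumes a PyMongo-style ``ClientSession.client`` property:

.. code:: python

    def callback(session):
        # The session exposes the MongoClient it was created from, so the
        # callback can run operations without capturing the client externally.
        client = session.client
        client.db.collection.insert_one({"x": 1}, session=session)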

Handling errors inside the callback
-----------------------------------

Drivers MUST document that the callback MUST NOT silently handle command errors
without allowing such errors to propagate. Command errors may abort the transaction
on the server, and an attempt to commit the transaction will be rejected with
a ``NoSuchTransaction`` error.

For example, ``DuplicateKeyError`` is an error that aborts a transaction on the
server. If the callback catches ``DuplicateKeyError`` and does not re-throw it,
the driver will attempt to commit the transaction. The server will reject the
commit attempt with a ``NoSuchTransaction`` error. This error has the
"TransientTransactionError" label and the driver will retry the commit. This
will result in an infinite loop.

Drivers MUST recommend that the callback re-throw command errors if they
need to be handled inside the callback. Drivers SHOULD also recommend using
the Core Transaction API if a user wants to handle errors in a custom way.
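The contrast below is a minimal sketch, assuming a PyMongo-style API and its ``DuplicateKeyError`` exception; only the second callback is safe to pass to ``withTransaction``:

.. code:: python

    from pymongo.errors import DuplicateKeyError

    def bad_callback(session):
        coll = session.client.db.coll
        try:
            coll.insert_one({"_id": 1}, session=session)
        except DuplicateKeyError:
            # BAD: the server has already aborted the transaction, so the
            # commit fails with NoSuchTransaction (a TransientTransactionError)
            # and withTransaction keeps retrying, looping indefinitely.
            pass

    def good_callback(session):
        coll = session.client.db.coll
        try:
            coll.insert_one({"_id": 1}, session=session)
        except DuplicateKeyError:
            # Handle the error as needed, then re-raise so withTransaction can
            # abort the transaction and surface the failure.
            raise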

Test Plan
=========

See the `README <tests/README.rst>`_ for tests.

Motivation for Change
=====================

Reliably committing a transaction in the face of errors can be a complicated
endeavor using the MongoDB 4.0 drivers API. Providing a helper method in the
driver to execute a transaction (and retry when possible) will enable our users
to make better use of transactions in their applications.

Design Rationale
================

This specification introduces a helper method on the ClientSession object that
applications may optionally employ to execute a user-defined function within a
transaction. An application does not need to be modified unless it wants to take
advantage of this helper method.

Majority write concern is used when retrying commitTransaction
--------------------------------------------------------------

Drivers should apply a majority write concern when retrying commitTransaction to
guard against a transaction being applied twice. This requirement was originally
enforced in the implementation of ``withTransaction``, but will now be handled
by the transaction spec itself in order to benefit applications irrespective of
whether they use ``withTransaction`` (see the corresponding section in the
`Transactions spec Design Rationale`_).

.. _Transactions spec Design Rationale: ../transactions/transactions.md#majority-write-concern-is-used-when-retrying-committransaction

The callback function has a flexible signature
----------------------------------------------

An original design considered requiring the callback to accept a ClientSession
as its first parameter. That could be superfluous for languages where the
callback might already have access to ClientSession through its lexical scope.
Instead, the spec simply requires that drivers ensure the callback will be able
to access the ClientSession. 

Similarly, the specification does not concern itself with the return type of the
callback function. If drivers allow the callback to return a value, they may
also choose to propagate that value as the return value of withTransaction.

An earlier design also considered using the callback's return value to indicate
whether control should break out of ``withTransaction`` (and its retry loop) and
return to the application. The design allows this to be accomplished in one of
two ways:

- The callback aborts the transaction directly and returns to
  ``withTransaction``, which will then return to its caller.

- The callback raises an error without the "TransientTransactionError" label,
  in which case ``withTransaction`` will abort the transaction and return to
  its caller.

Applications are responsible for passing ClientSession for operations within a transaction
------------------------------------------------------------------------------------------

It remains the responsibility of the application to pass a ClientSession to all
operations that should be included in a transaction. With regard to
``withTransaction``, applications are free to execute any operations within the
callback, irrespective of whether those operations are associated with the
transaction.

It is assumed that the callback will not start a new transaction on the ClientSession
-------------------------------------------------------------------------------------

Under normal circumstances, the callback should not commit the transaction nor
should it start a new transaction. The ``withTransaction`` method will inspect
the ClientSession's transaction state after the callback returns and take the
most sensible course of action; however, it will not detect whether the callback
has started a new transaction.

The callback function may be executed multiple times
----------------------------------------------------

The implementation of withTransaction is based on the original examples for
`Retry Transactions and Commit Operation`_ from the MongoDB Manual. As such, the
callback may be executed any number of times. Drivers are free to encourage
their users to design idempotent callbacks.

.. _Retry Transactions and Commit Operation: https://www.mongodb.com/docs/manual/core/transactions/#retry-transaction-and-commit-operation

The commit is retried after a write concern timeout (i.e. wtimeout) error
-------------------------------------------------------------------------

Per the Transactions specification, drivers internally retry
``commitTransaction`` once if it fails due to a retryable error (as defined in
the `Retryable Writes`_ specification). Beyond that, applications may manually
retry ``commitTransaction`` if it fails with any error bearing the
`UnknownTransactionCommitResult`_ error label. This label is applied for the
following errors:

.. _Retryable Writes: ../retryable-writes/retryable-writes.md#terms

.. _UnknownTransactionCommitResult: ../transactions/transactions.md#unknowntransactioncommitresult

- Server selection failure
- Retryable error (as defined in the `Retryable Writes`_ specification)
- Write concern failure or timeout (excluding UnsatisfiableWriteConcern and
  UnknownReplWriteConcern)
- MaxTimeMSExpired errors, ie ``{ok:0, code: 50, codeName: "MaxTimeMSExpired"}``
  and ``{ok:1, writeConcernError: {code: 50, codeName: "MaxTimeMSExpired"}}``.

A previous design for ``withTransaction`` retried for all of these errors
*except* for write concern timeouts, so as not to exceed the user's original
intention for ``wtimeout``. The current design of this specification no longer
excludes write concern timeouts, and simply retries ``commitTransaction`` within
its timeout period for all errors bearing the "UnknownTransactionCommitResult"
label.

This change was made in light of the forthcoming Client-side Operations Timeout
specification (see: `Future Work`_), which we expect will allow the current
120-second timeout for ``withTransaction`` to be customized and also obviate the
need for users to specify ``wtimeout``.

The commit is not retried after a MaxTimeMSExpired error
--------------------------------------------------------

This specification intentionally chooses not to retry commit operations after a
MaxTimeMSExpired error as doing so would exceed the user's original intention
for ``maxTimeMS``.

The transaction and commit may be retried any number of times within a timeout period
-------------------------------------------------------------------------------------

The implementation of withTransaction is based on the original examples for
`Retry Transactions and Commit Operation`_ from the MongoDB Manual. As such, the
transaction and commit may be continually retried as long as the error label
indicates that retrying is possible.

A previous design had no limits for retrying commits or entire transactions. The
callback is always able to indicate that ``withTransaction`` should return to its
caller (without further retry attempts) by aborting the transaction directly;
however, that puts the onus of avoiding very long (or infinite) retry loops on
the application. We expect the most common cause of retry loops will be due to
TransientTransactionErrors caused by write conflicts, as those can occur
regularly in a healthy application, as opposed to
UnknownTransactionCommitResult, which would typically be caused by an election. 

In order to avoid blocking the application with infinite retry loops,
``withTransaction`` will cease retrying invocations of the callback or
commitTransaction if it has exceeded a fixed timeout period of 120 seconds. This
limit is a non-configurable default and is intentionally twice the value of
MongoDB 4.0's default for the `transactionLifetimeLimitSeconds`_ parameter (60
seconds). Applications that desire longer retry periods may call
``withTransaction`` additional times as needed. Applications that desire shorter
retry periods should not use this method.

.. _transactionLifetimeLimitSeconds: https://www.mongodb.com/docs/manual/reference/parameters/#param.transactionLifetimeLimitSeconds

Backwards Compatibility
=======================

The specification introduces a new method on the ClientSession class and does
not introduce any backward breaking changes. Existing programs that do not make
use of this new method will continue to compile and run correctly.

Reference Implementation
========================

The C, Java, and Ruby drivers will provide reference implementations. The
corresponding tickets for those implementations may be found via
`DRIVERS-556`_.

.. _DRIVERS-556: https://jira.mongodb.org/browse/DRIVERS-556

Security Implication
====================

Applications that use transaction guarantees to enforce security rules will
benefit from a less error-prone API. Adding a helper method to execute a
user-defined function within a transaction has few security implications, as it
only provides an implementation of a technique already described in the MongoDB
4.0 documentation (`DRIVERS-488`_).

.. _DRIVERS-488: https://jira.mongodb.org/browse/DRIVERS-488

Future Work
===========

The forthcoming Client-side Operations Timeout specification (`DRIVERS-555`_)
may allow users to alter the default retry timeout, as a client-side timeout
could be applied to ``withTransaction`` and its retry logic. In the absence of a
client-side operation timeout, withTransaction can continue to use the
120-second default and thus preserve backwards compatibility.

.. _DRIVERS-555: https://jira.mongodb.org/browse/DRIVERS-555

Changes
=======

:2023-11-22: Document error handling inside the callback.
:2022-10-05: Remove spec front matter and reformat changelog.
:2022-01-19: withTransaction applies timeouts per the client-side operations
             timeout specification.
:2019-04-24: withTransaction does not retry when commit fails with MaxTimeMSExpired.
:2018-02-13: withTransaction should retry commits after a wtimeout

=====================
Enumerating Databases
=====================

:Status: Accepted
:Minimum Server Version: 3.6

.. contents::


Abstract

A driver can provide functionality to enumerate all databases on a server. This specification defines several methods for enumerating databases.

META

The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in `RFC 2119 <https://www.ietf.org/rfc/rfc2119.txt>`_.

Specification

Terms

MongoClient Driver object representing a connection to MongoDB. This is the root object of a driver’s API and MAY be named differently in some drivers.

MongoDatabase Driver object representing a database and the operations that can be performed on it. MAY be named differently in some drivers.

Iterable An object or data structure that is a sequence of elements that can be iterated over. This spec is flexible on what that means as different drivers will have different requirements, types, and idioms.

Document An object or data structure used by the driver to represent a BSON document. This spec is flexible on what that means as different drivers will have different requirements, types, and idioms.

Naming Deviations

This specification defines names for methods. To the extent possible, drivers SHOULD use the defined names. However, where a driver or language’s naming conventions would conflict, drivers SHOULD honor their existing conventions. For example, a driver may use list_databases instead of listDatabases.

Filters

Drivers SHOULD support the filter option when implementing the listDatabases_ database command. The filter option is a query predicate that determines which databases are listed in the command result. You can specify a condition on any of the database fields returned in the command output:

.. _listDatabases: https://www.mongodb.com/docs/manual/reference/command/listDatabases/

  • name
  • sizeOnDisk
  • empty
  • shards

For example, to list only databases whose names begin with "foo":

::

db.adminCommand({listDatabases: 1, filter: {name: /^foo/}});

AuthorizedDatabases

MongoDB 4.0.5 added an authorizedDatabases boolean option to the listDatabases_ database command, which can be used to limit the command result to only include databases the user is authorized to use. Drivers SHOULD support the new authorizedDatabases option when implementing the listDatabases_ database command.

The possible values for authorizedDatabases are:

  • unspecified (missing entirely from the command document sent to the server)
  • false
  • true

See the server's listDatabases_ documentation for an explanation of what each value means.

Comment

MongoDB 4.4 introduced a comment option to the listDatabases command. This option enables users to specify a comment as an arbitrary BSON type to help trace the operation through the database profiler, currentOp and logs. The default is to not send a value. If a comment is provided on pre-4.4 servers, the comment should still be attached and the driver should rely on the server to provide an error to the user.

::

db.getSiblingDB("admin").runCommand({listDatabases: 1, comment: "hi there"})

Driver Methods

If a driver already has a method to perform one of the listed tasks, there is no need to change it. Do not break backwards compatibility when adding new methods.

All methods SHOULD be implemented on the MongoClient object.

All methods MUST apply timeouts per the Client Side Operations Timeout <./client-side-operations-timeout/client-side-operations-timeout.md>__ specification.

Enumerating Full Database Information


The `listDatabases`_ database command returns an array of documents, each of
which contains information about a database on the MongoDB server. Additionally,
the command reports the aggregate sum of all database sizes (in bytes). Consider
the following example:

.. _listDatabases: https://www.mongodb.com/docs/manual/reference/command/listDatabases/

::

  > db.getSiblingDB("admin").runCommand({listDatabases:1})
  {
      "databases" : [
          {
              "name" : "admin",
              "sizeOnDisk" : 83886080,
              "empty" : false
          },
          {
              "name" : "local",
              "sizeOnDisk" : 83886080,
              "empty" : false
          }
      ],
      "totalSize" : 167772160,
      "ok" : 1
  }

Drivers SHOULD implement a MongoClient method that returns an Iterable of
database specifications (e.g. model object, document type), each of which
correspond to an element in the databases array of the ``listDatabases`` command
result. This method SHOULD be named ``listDatabases``.

Drivers MAY report ``totalSize`` (e.g. through an additional output variable on
the ``listDatabases`` method), but this is not necessary.

Drivers SHOULD support the ``filter``, ``authorizedDatabases`` and ``comment``
options when implementing this method.
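For example, a minimal usage sketch, assuming a PyMongo-style ``list_databases`` helper that forwards these options to the ``listDatabases`` command:

::

  from pymongo import MongoClient

  client = MongoClient("mongodb://localhost")

  # Iterate database specification documents for non-empty databases only.
  for spec in client.list_databases(filter={"empty": False},
                                    authorizedDatabases=True,
                                    comment="nightly audit"):
      print(spec["name"], spec.get("sizeOnDisk"))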

Enumerating Database Names
~~~~~~~~~~~~~~~~~~~~~~~~~~

MongoDB 3.6 introduced a ``nameOnly`` boolean option to the ``listDatabases``
database command, which limits the command result to only include database
names. Consider the following example:

::

  > db.getSiblingDB("admin").runCommand({listDatabases:1,nameOnly:true})
  {
      "databases" : [
          { "name" : "admin" },
          { "name" : "local" }
      ],
      "ok" : 1
  }

Drivers MAY implement a MongoClient method that returns an Iterable of strings,
each of which corresponds to a name in the databases array of the
``listDatabases`` command result. This method SHOULD be named
``listDatabaseNames``.

Older versions of the server that do not support the ``nameOnly`` option for the
``listDatabases`` command will ignore it without raising an error. Therefore,
drivers SHOULD always specify the ``nameOnly`` option when they only intend to
access database names from the ``listDatabases`` command result.

Drivers SHOULD support the ``filter``, ``authorizedDatabases`` and ``comment``
options when implementing this method.
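For example, a minimal sketch assuming a PyMongo-style ``list_database_names`` helper (which is assumed to send ``nameOnly: true`` on the application's behalf):

::

  from pymongo import MongoClient

  client = MongoClient("mongodb://localhost")
  names = client.list_database_names(comment="nightly audit")
  print(names)  # e.g. ["admin", "config", "local"]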

Enumerating MongoDatabase Objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Drivers MAY implement a MongoClient method that returns an Iterable of
MongoDatabase types, each of which corresponds to a name in the databases array
of the ``listDatabases`` command result. This method MAY be named
``listMongoDatabases``.

Any MongoDatabase objects returned by this method SHOULD inherit the same
MongoClient options that would otherwise be inherited by selecting an individual
MongoDatabase through MongoClient (e.g. read preference, write concern).

Drivers SHOULD specify the ``nameOnly`` option when executing the
``listDatabases`` command for this method.

Drivers SHOULD support the ``filter``, ``authorizedDatabases`` and ``comment``
options when implementing this method.

Replica Sets
------------

The ``listDatabases`` command may be run on a secondary node. Drivers MUST run
the ``listDatabases`` command only on the primary node in replica set topology,
unless directly connected to a secondary node in Single topology.

Test Plan
=========

Test Environments
-----------------

The test plan should be executed against the following servers:

* Standalone
* Replica set primary
* Replica set secondary
* Sharding router (i.e. mongos)

Test Cases
----------

The following scenarios should be run for each test environment:

* Execute the method to enumerate full database information (e.g.
  ``listDatabases()``)
  - Verify that the method returns an Iterable of Document types
  - Verify that all databases on the server are present in the result set
  - Verify that the result set does not contain duplicates
* Execute the method to enumerate database names (e.g. ``listDatabaseNames()``)
  - Verify that the method returns an Iterable of strings
  - Verify that all databases on the server are present in the result set
  - Verify that the result set does not contain duplicates
* Execute the method to enumerate MongoDatabase objects (e.g.
  ``listMongoDatabases()``)
  - Verify that the method returns an Iterable of MongoDatabase objects
  - Verify that all databases on the server are present in the result set
  - Verify that the result set does not contain duplicates

Motivation for Change
=====================

Although most drivers provide a ``listDatabases`` command helper in their API,
there was previously no spec for a database enumeration. MongoDB 3.6 introduced
a ``nameOnly`` option to the ``listDatabases`` database command. The driver API
should be expanded to support this option.

Design Rationale
================

The design of this specification is inspired by the `Collection Enumeration`_
and `Index Enumeration`_ specifications. Since most drivers already implement a
``listDatabases`` command helper in some fashion, this spec is flexible when it
comes to existing APIs.

.. _Collection Enumeration: ./enumerate-collections.rst
.. _Index Enumeration: ./enumerate-indexes.rst

Backwards Compatibility
=======================

There should be no backwards compatibility concerns. This specification merely
deals with how to enumerate databases in future versions of MongoDB and allows
flexibility for existing driver APIs.

Reference Implementation
========================

TBD

Q & A
=====

Why is reporting the total size of all databases optional?
----------------------------------------------------------

Although the ``listDatabases`` command provides two results, a ``databases``
array and ``totalSize`` integer, the array of database information documents is
the primary result. Returning a tuple or composite result type from a
``listDatabases`` driver method would complicate the general use case, as
opposed to an optional output argument (if supported by the language).
Furthermore, the ``totalSize`` value can be calculated client-side by summing
all ``sizeOnDisk`` fields in the array of database information documents.
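For instance, such a client-side computation might look like the following, assuming a PyMongo-style ``list_databases`` helper:

::

  from pymongo import MongoClient

  client = MongoClient("mongodb://localhost")
  total_size = sum(spec.get("sizeOnDisk", 0) for spec in client.list_databases())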

Changelog
=========

:2022-10-05: Remove spec front matter and reformat changelog. Also reverts the
             minimum server version to 3.6, which is where ``nameOnly`` and
             ``filter`` options were first introduced for ``listDatabases``.
:2022-08-17: Clarify the behavior of comment on pre-4.4 servers.
:2022-02-01: Support comment option in listDatabases command
:2022-01-19: Require that timeouts be applied per the client-side operations timeout spec.
:2019-11-20: Support authorizedDatabases option in listDatabases command
:2017-10-30: Support filter option in listDatabases command


=======================
Enumerating Collections
=======================

:Status: Accepted
:Minimum Server Version: 1.8

.. contents::


Abstract

A driver can contain a feature to enumerate all collections belonging to a database. This specification defines how collections should be enumerated.

META

The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in `RFC 2119 <https://www.ietf.org/rfc/rfc2119.txt>`_.

Specification

Terms

MongoClient Driver object representing a connection to MongoDB. This is the root object of a driver’s API and MAY be named differently in some drivers.

Iterable An object or data structure that is a sequence of elements that can be iterated over. This spec is flexible on what that means as different drivers will have different requirements, types, and idioms.

The driver needs to implement different methods of enumeration depending on the MongoDB server version. Since MongoDB 2.7.6, the server implements a listCollections command that MUST be used if available. For the non-MMAP storage engines that come with 2.8, this is the only way their collections can be enumerated. For earlier MongoDB server versions, the driver SHOULD simply query the system.namespaces collection.

Pre MongoDB 2.7.6 versions

On MongoDB versions prior to 2.7.6, the only way to enumerate collections for a database is by querying the system.namespaces collection::

db.system.namespaces.find()

This query returns a list of documents through a normal query cursor; each document always includes a name field and optionally an options field.

Each document describes a namespace. Besides ordinary collections, these namespaces also include special collections (containing .system. or .oplog. in the name). The returned documents also include indexes (containing a $ in the name). Index names MUST NOT be returned while enumerating collections through db.system.namespaces.find().

When returning information, drivers MUST strip the database name from the returned namespace, and only leave the collection name.

MongoDB versions 2.7.6 to 2.8.0-rc2

MAY be supported, but this is not necessary. These versions support the listCollections command but return the results as a single document rather than as a cursor. This spec does not cover these versions.

Post MongoDB 2.8.0-rc3 versions

In versions 2.8.0-rc3 and later, the listCollections command returns a cursor::

$ use test
$ db.runCommand( { listCollections: 1 } );

The command also accepts options.

The filter option acts like a query against the returned collection documents. For example, use the following to list only the collections whose names begin with test::

$ db.runCommand( { listCollections: 1, filter: { name: /^test/ } } );

Or to find all capped collections::

$ db.runCommand( { listCollections: 1, filter: { 'options.capped': true } } );

The cursor.batchSize option allows you to set how many initial collections should be returned as part of the cursor specification document that comes back from the server. This first batch is part of the returned structure in the firstBatch key (see more about return types further on).

The command returns a cursor definition structure::

{
    cursor: {
        id: <long>,
        ns: <string>,
        firstBatch: [<object>, <object>, ...]
    },
    ok: 1
}

With the cursor.id and cursor.ns fields you can retrieve further collection information structures.

The command also returns the field ok to signal whether the command was executed successfully.

This will return the first 25 collection descriptions as part of the returned document::

$ db.runCommand( { listCollections: 1, cursor : { batchSize: 25 } } );

MongoDB 4.4 introduced a comment option to the listCollections database command. This option enables users to specify a comment as an arbitrary BSON type to help trace the operation through the database profiler, currentOp and logs. The default is to not send a value. If a comment is provided on pre-4.4 servers, the comment should still be attached and the driver should rely on the server to provide an error to the user.

Example of usage of the comment option::

$ db.runCommand({"listCollections": 1, "comment": "hi there"})

Any comment set on a listCollections command is inherited by any subsequent getMore commands run on the same cursor.id returned from the listCollections command. Therefore, drivers MUST NOT attach the comment to subsequent getMore commands on a cursor.

Filters

Pre MongoDB 2.7.6 servers, which require querying system.namespaces, return the name of a collection prefixed with the database name.

Post MongoDB 2.7.6 servers, which have listCollections implemented, return the name of a collection without the database name prefixed to it.

Because of this, drivers MUST prefix filters against the name field with the database name for pre MongoDB 2.7.6 servers. Because prefixing causes issues with regular expressions, a driver MUST NOT allow a regular expression as the match value for the name field in the filter when talking to pre MongoDB 2.7.6 servers.

For example, a driver's "listCollections" method would handle the filter argument as follows::

if filter.name is set:
  if server version < 2.7.6:
    if typeof filter.name != string:
      throw "value type not accepted" (as it needs to be a static string)
    else:
      filter.name = "dbname." + filter.name

Filtering against the options field of a collection has no restrictions.
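
A minimal TypeScript sketch of this prefixing rule (the helper name and the shape of the filter are illustrative assumptions)::

// Sketch: adjust a user-supplied filter before querying system.namespaces on
// a pre-2.7.6 server. Regular expressions (and any other non-string values)
// for the name field cannot safely be prefixed and are rejected.
function adjustNameFilterForLegacyServer(dbName: string, filter: Record<string, unknown>): Record<string, unknown> {
  if (!("name" in filter)) {
    return filter;
  }
  const name = filter["name"];
  if (typeof name !== "string") {
    throw new Error("only a plain string is accepted for the name filter on pre-2.7.6 servers");
  }
  return { ...filter, name: dbName + "." + name };
}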

Return types


For servers that support the ``listCollections`` command, the return types
differ depending on server version. Versions 2.7.6 to 2.8.0-rc2 return a single
document containing all the results, but versions 2.8.0-rc3 and later return a
cursor description. The format that is returned is the same as for any other command cursor::

    {
        cursor: {
            id: <long>,
            ns: <string>,
            firstBatch: [<object>, <object>, ...]
        },
        ok: 1
    }

The number of objects in the ``firstBatch`` field depends on the
``cursor.batchSize`` option.


Drivers MAY expose methods to return collection names as an array. If your
driver already has such a method, its return type MUST NOT be changed in order
to prevent breaking backwards compatibility.

Drivers SHOULD expose (a) method(s) to return collection information through a
cursor, where the information for each collection is represented by a single
document.


Enumeration: getting collection names
-------------------------------------

With both the pre and post 2.7.6 versions having a different way to enumerate
all collections, drivers SHOULD implement their enumeration according to the
following algorithm (just like the
`shell does <https://github.com/mongodb/mongo/blob/f32ba54f971c045fb589fe4c3a37da77dc486cee/src/mongo/shell/db.js#L550>`_)::

  run the listCollections command, with the filter if given
    if res.code == 59 || res.code == 13390:
      fall back to querying system.namespaces.

    if !res.ok:
      if res.errmsg && res.errmsg.startsWith( "no such cmd" ):
        fall back to querying system.namespaces.

      else:
        throw listCollections command failed exception

    loop through res.cursor.firstBatch

    use getmore with res.cursor.id and res.cursor.ns information to loop over
    remaining results

If you need to fall back to querying ``system.namespaces``, then you need to
filter out all return documents that contain a ``$`` in the ``name`` field.

When falling back to querying ``system.namespaces`` you will also need to filter
out all the documents that contain ``".oplog.$"`` as this collection is used for
legacy replication within the local database.

Alternatively, if a driver already implements checking MongoDB versions, a
driver MAY implement the enumeration as follows::

  if server version >= 2.7.6
    run the listCollections command, with the filter if given
      if res.code == 59 || res.code == 13390:
        throw listCollections command failed exception

      if res.errmsg && res.errmsg.startsWith( "no such cmd" ):
        throw listCollections command failed exception

    loop through res.cursor.firstBatch

    use getmore with res.cursor.id and res.cursor.ns information to loop over
    remaining results

  else
    fall back to querying system.namespaces.
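
For illustration only, a rough TypeScript sketch of the fallback behaviour
described above; the ``runCommand`` and ``query`` helpers, their signatures,
and the ``listCollectionNames`` name are assumptions rather than
requirements::

  // Illustrative sketch only; `db.runCommand` and `db.query` are assumed
  // driver-internal helpers, not part of this specification.
  async function listCollectionNames(db: any, dbName: string, filter?: object): Promise<string[]> {
    const res = await db.runCommand({ listCollections: 1, ...(filter ? { filter } : {}) });

    const mustFallBack =
      res.code === 59 ||
      res.code === 13390 ||
      (!res.ok && typeof res.errmsg === "string" && res.errmsg.startsWith("no such cmd"));

    if (mustFallBack) {
      // Query system.namespaces, drop index entries ('$' in the name, which
      // also covers ".oplog.$"), and strip the database-name prefix.
      // Applying the user's filter on the fallback path is elided here.
      const docs: Array<{ name: string }> = await db.query("system.namespaces", {});
      return docs
        .map((d) => d.name)
        .filter((ns) => !ns.includes("$"))
        .map((ns) => ns.slice(dbName.length + 1));
    }

    if (!res.ok) {
      throw new Error("listCollections command failed");
    }

    // getMore handling (via res.cursor.id and res.cursor.ns) is elided here.
    return res.cursor.firstBatch.map((d: { name: string }) => d.name);
  }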

Driver methods
--------------

Drivers SHOULD use the method name ``listCollections`` for a method that
returns all collections with a cursor return type. Drivers MAY use an idiomatic
variant that fits the language the driver is for.

If a driver already has a method to perform one of the listed tasks,
there is no need to change it. Do not break backwards compatibility when
adding new methods.

All methods:

- SHOULD be on the database object.
- MUST allow a filter to be passed to include only requested collections.
- MAY allow the ``cursor.batchSize`` option to be passed.
- SHOULD allow the ``comment`` option to be passed.
- MUST use the *same* return type (i.e., array or cursor) whether a pre-2.7.6
  server, a post-2.7.6 server, or a post-2.8.0-rc3 server is being used.
- MUST apply timeouts per the `Client Side Operations Timeout
  <./client-side-operations-timeout/client-side-operations-timeout.md>`__
  specification.

All methods that return cursors MUST support the timeout options documented
in `Client Side Operations Timeout: Cursors
<./client-side-operations-timeout/client-side-operations-timeout.md#cursors>`__.

Getting Collection Names

Drivers MAY implement a MongoClient method that returns an Iterable of strings, where each string corresponds to a collection name. This method SHOULD be named listCollectionNames.

MongoDB 4.0 introduced a nameOnly boolean option to the listCollections database command, which limits the command result to only include collection names. NOTE: nameOnly is applied before any filter is applied.

Example return::

[
    "me",
    "oplog.rs",
    "replset.minvalid",
    "startup_log",
    "system.indexes",
    "system.replset"
]

Server versions between 2.7.6 (inclusive) and 4.0 (exclusive) do not support the nameOnly option for the listCollections command and will ignore it without raising an error. Therefore, drivers MUST always specify the nameOnly option when they only intend to access collection names from the listCollections command result, except that drivers MUST NOT set nameOnly if a filter specifies any keys other than name.

MongoDB 4.0 also added an authorizedCollections boolean option to the listCollections command, which can be used to limit the command result to only include collections the user is authorized to use. Drivers MAY allow users to set the authorizedCollections option on the listCollectionNames method.
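
A minimal TypeScript sketch of how the nameOnly rule above might be applied when building the command; the helper name and option shapes are illustrative assumptions::

// Sketch: build the listCollections command document for a helper that only
// needs collection names. Option names sent to the server come from this
// spec; everything else is illustrative.
interface ListCollectionNamesOptions {
  filter?: Record<string, unknown>;
  authorizedCollections?: boolean;
  comment?: unknown;
}

function buildListCollectionNamesCommand(opts: ListCollectionNamesOptions = {}): Record<string, unknown> {
  const cmd: Record<string, unknown> = { listCollections: 1 };
  if (opts.filter !== undefined) {
    cmd.filter = opts.filter;
  }
  // nameOnly MUST NOT be set if the filter uses any key other than "name".
  if (Object.keys(opts.filter ?? {}).every((key) => key === "name")) {
    cmd.nameOnly = true;
  }
  if (opts.authorizedCollections !== undefined) {
    cmd.authorizedCollections = opts.authorizedCollections;
  }
  if (opts.comment !== undefined) {
    cmd.comment = opts.comment;
  }
  return cmd;
}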

Getting Full Collection Information


Drivers MAY implement a method to return the full ``name/options`` pairs that
are returned from both ``listCollections`` (in the ``res.cursor.firstBatch``
field, and subsequent retrieved documents through getmore on the cursor
constructed from ``res.cursor.ns`` and ``res.cursor.id``), and the query
result for ``system.namespaces``.

The returned result for each variant MUST be equivalent, and each collection
that is returned MUST use the field names ``name`` and ``options``.

In MongoDB 4.4, the ``ns`` field was removed from the index specifications, so
the index specification included in the ``idIndex`` field of the collection
information will no longer contain an ``ns`` field.

- For drivers that report those index specifications in the form of documents or
  dictionaries, no special handling is necessary, but any documentation of the
  contents of the documents/dictionaries MUST indicate that the ``ns`` field
  will no longer be present in MongoDB 4.4+. If the contents of the
  documents/dictionaries are undocumented, then no special mention of the ``ns``
  field is necessary.
- For drivers that report those index specifications in the form of statically
  defined models, the driver MUST manually populate the ``ns`` field of the
  models with the appropriate namespace if the server does not report it in the
  ``listCollections`` command response. The ``ns`` field is not required to be a
  part of the models, however.

Example return (a cursor which returns documents, not a simple array)::

    {
        "name" : "me", "options" : { "flags" : 1 }
    },
    {
        "name" : "oplog.rs", "options" : { "capped" : true, "size" : 10485760, "autoIndexId" : false }
    },
    {
        "name" : "replset.minvalid", "options" : { "flags" : 1 }
    },
    {
        "name" : "startup_log", "options" : { "capped" : true, "size" : 10485760 }
    },
    {
        "name" : "system.indexes", "options" : { }
    },
    {
        "name" : "system.replset", "options" : { "flags" : 1 }
    }

When returning this information as a cursor, a driver SHOULD use the
method name ``listCollections`` or an idiomatic variant.

Drivers MAY allow the ``nameOnly`` and ``authorizedCollections`` options
to be passed when executing the ``listCollections`` command for this method.

Returning a List of Collection Objects

Drivers MAY implement a method that returns a collection object for each returned collection, if the driver has such a concept. This method MAY be named listMongoCollections.

Example return (in PHP, but abbreviated)::

array(6) {
  [0] => class MongoCollection#6 { }
  [1] => class MongoCollection#7 { }
  [2] => class MongoCollection#8 { }
  [3] => class MongoCollection#9 { }
  [4] => class MongoCollection#10 { }
  [5] => class MongoCollection#11 { }
}

Drivers MUST specify true for the nameOnly option when executing the listCollections command for this method, except drivers MUST NOT set nameOnly if a filter specifies any keys other than name.

Drivers MAY allow the authorizedCollections option to be passed when executing the listCollections command for this method.
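
For illustration, one way such a helper might be layered on top of a name-listing helper; the db.collection() constructor and the listCollectionNames helper are assumptions, not part of this spec::

// Sketch: return a collection object per collection name. Relies on an
// assumed listCollectionNames helper (such as the sketch earlier in this
// spec, which sets nameOnly when permitted) and an assumed collection
// constructor on the database object.
async function listMongoCollections(db: any, dbName: string, filter?: object): Promise<any[]> {
  const names: string[] = await listCollectionNames(db, dbName, filter);
  return names.map((name) => db.collection(name));
}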

Replica Sets


- ``listCollections`` can be run on a secondary node.
- Querying ``system.namespaces`` on a secondary node requires secondaryOk to be set.
- Drivers MUST run ``listCollections`` on the primary node when in a replica set
  topology, unless directly connected to a secondary node in Single topology.


Test Plan
=========

Configurations
--------------

- standalone node
- replica set primary node
- replica set secondary node
- mongos node

Preparation
-----------

For each of the configurations:

- Create a (new) database
- Create a collection and a capped collection
- Create an index on each of the two collections
- Insert at least one document in each of the two collections

Tests
-----

- Run the driver's method that returns a list of collection names (e.g.
  ``listCollectionNames()``):

  - verify that *all* collection names are represented in the result
  - verify that there are no duplicate collection names
  - verify that there are no returned collections that do not exist
  - verify that there are no returned collections containing a '$'

- Run the driver's method that returns a list of collection names (e.g.
  ``listCollectionNames()``), pass a filter of ``{ 'options.capped': true }``,
  and:

  - verify that *only* names of capped collections are represented in the result
  - verify that there are no duplicate collection names
  - verify that there are no returned collections that do not exist
  - verify that there are no returned collections containing a '$'


Backwards Compatibility
=======================

There should be no backwards compatibility concerns. This SPEC merely deals
with how to enumerate collections in future versions of MongoDB.


Reference Implementation
========================

The shell implements the first algorithm for falling back if the
``listCollections`` command does not exist
(`<https://github.com/mongodb/mongo/blob/f32ba54f971c045fb589fe4c3a37da77dc486cee/src/mongo/shell/db.js#L550>`_).


Changelog
=========

:2022-10-05: Remove spec front matter and reformat changelog.
:2022-09-15: Clarify the behavior of ``comment`` on pre-4.4 servers.
:2022-02-01: Add ``comment`` option to ``listCollections`` command.
:2022-01-20: Require that timeouts be applied per the client-side operations
             timeout spec.
:2021-12-17: Support ``authorizedCollections`` option in ``listCollections``
             command.
:2021-04-22: Update to use secondaryOk.
:2020-03-18: MongoDB 4.4 no longer includes ``ns`` field in ``idIndex`` field
             for ``listCollections`` responses.
:2019-03-21: The method that returns a list of collection names should be named
             ``listCollectionNames``. The method that returns a list of
             collection objects may be named ``listMongoCollections``.
:2018-07-03: Clarify that ``nameOnly`` must not be used with filters other than
             ``name``.
:2018-05-18: Support ``nameOnly`` option in ``listCollections`` command.
:2017-09-27: Clarify reason for filtering collection names containing '$'.
:2015-01-14: Clarify trimming of database name. Put preferred method name for
             listing collections with a cursor as return value.
:2014-12-18: Update with the server change to return a cursor for
             ``listCollections``.

Index Management

  • Status: Accepted
  • Minimum Server Version: 3.6

Specification

The index management spec defines a set of behaviours in the drivers for creating, removing and viewing indexes in a collection. It defines implementation details when required, but also provides flexibility in that a driver may choose to implement one or both of two distinct APIs.

Definitions

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Terms

Collection
The term Collection references the object in the driver that represents a collection on the server.

Cursor
The term Cursor references the driver's cursor object.

Iterable
The term Iterable is to describe an object that is a sequence of elements that can be iterated over.

Document
The term Document refers to the implementation in the driver's language of a BSON document.

Result
The term Result references the object that is normally returned by the driver as the result of a command execution. In situations where an actual command is not executed, but rather an insert or a query, an object that adheres to the same interface must be returned with as much information as possible that could be obtained from the operation.

Guidance

Documentation

The documentation provided in code below is merely for driver authors and SHOULD NOT be taken as required documentation for the driver.

Operations

All drivers MUST offer at least one of the sections of operations, the Standard API or the Index View API. The driver MAY elect to have both. Implementation details are noted in the comments when a specific implementation is required. Within each API, all methods are REQUIRED unless noted otherwise in the comments.

Operation Parameters

All drivers MUST include the specified parameters in each operation. This does not preclude a driver from offering more. A driver SHOULD NOT require a user to specify the options parameter if they wish to use the server defaults.

As of 3.4 (see https://jira.mongodb.org/browse/SERVER-769) the server validates options passed to the createIndexes command. Drivers should be aware when testing that passing arbitrary options the driver does not validate could cause the command to fail on the server.

Deviations

A non-exhaustive list of acceptable deviations are as follows:

  • Using named parameters in place of an options hash or class. For instance, collection.create_index({x: 1}, commit_quorum="majority").
  • When using an Options class, if multiple Options classes are structurally equatable, it is permissible to consolidate them into one with a clear name. For instance, it would be permissible to use the name CreateIndexOptions as the options for createIndex and createIndexes.

Naming

All drivers MUST name operations and parameters as defined in the following sections. Exceptions to this rule are noted in the appropriate section. Class and interface names may vary according to the driver and language best practices.

Naming Deviations

When deviating from a defined name, an author should consider if the altered name is recognizable and discoverable to the user of another driver.

A non-exhaustive list of acceptable naming deviations are as follows:

  • Using "maxTimeMS" as an example, .NET would use "MaxTime" where its type is a TimeSpan structure that includes units. However, calling it "MaximumTime" would not be acceptable.
  • Using "CreateIndexOptions" as an example, Javascript wouldn't need to name it while other drivers might prefer to call it "CreateIndexArgs" or "CreateIndexParams".
  • Acceptable naming deviations should fall within the basic style of the language. For example, createIndex would be a required name in Java, where camel-case method names are used, but in Ruby create_index would be acceptable.

Index Name Generation

When the client generates a name for an index based on the keys, the driver MUST generate the name as key-direction pairs, separated by underscores. For example, the key { name: 1, dob: -1 } MUST generate an index name of name_1_dob_-1.

Note there is one exception to this rule on the _id field. The server uses an index name with no direction, _id_, which cannot be overridden.
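
A minimal TypeScript sketch of this name-generation rule (the function name is an assumption):

// Sketch: generate an index name from a keys document, e.g.
// { name: 1, dob: -1 } -> "name_1_dob_-1". The server-generated "_id_" name
// has no direction suffix and cannot be overridden, so it is not produced by
// this rule.
function generateIndexName(keys: Record<string, number | string>): string {
  return Object.entries(keys)
    .map(([field, direction]) => `${field}_${direction}`)
    .join("_");
}

// generateIndexName({ name: 1, dob: -1 }) === "name_1_dob_-1"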

Timeouts

Drivers MUST enforce timeouts for all operations per the Client Side Operations Timeout specification. All operations that return cursors MUST support the timeout options documented in the Cursors section of that specification.

Standard API

interface Collection {

  /**
   * This is a convenience method for creating a single index. This MUST call the
   * createIndexes method and pass the provided specification document in a
   * sequence to that method with the same options.
   *
   * @return The name of the created index.
   *
   * @note Drivers MAY opt to implement this method signature, the signature that
   *   takes an IndexModel as a parameter, or for those languages with method
   *   overloading MAY decide to implement both.
   *
   * @note Drivers MAY combine the two options types into a single one. If the options are
   *   explicitly typed, the combined options type MUST be named CreateIndexOptions or an acceptable
   *   variation.
   */
  createIndex(keys: Document, indexOptions: Optional<IndexOptions>, options: Optional<CreateIndexOptions>): String;

  /**
   * @see Comments above.
   */
  createIndex(model: IndexModel, options: Optional<CreateIndexOptions>): String

  /**
   * Creates multiple indexes in the collection.
   * 
   * In all server versions, this MUST execute a createIndexes command.
   *
   * @return The names of all the indexes that were created.
   */
  createIndexes(models: Iterable<IndexModel>, options: Optional<CreateIndexesOptions>): Iterable<String>;

  /**
   * Drops a single index from the collection by the index name.
   *
   * In all server versions this MUST execute a dropIndexes command.
   *
   * @note If the string passed is '*', the driver MUST raise an error since
   *   more than one index would be dropped.
   */
  dropIndex(name: String, options: Optional<DropIndexOptions>): Result;

  /**
   * Attempts to drop a single index from the collection given the keys and options.
   *
   * In all server versions this MUST execute a dropIndexes command.
   *
   * This is OPTIONAL until partial indexes are implemented.
   *
   * @note Drivers MAY opt to implement this method signature, the signature that
   *   takes an IndexModel as a parameter, or for those languages with method
   *   overloading MAY decide to implement both.
   *
   * @note Drivers MAY combine the two options types into a single one. If the options are
   *   explicitly typed, the combined options type MUST be named DropIndexOptions or an acceptable
   *   variation.
   */
  dropIndex(keys: Document, indexOptions: IndexOptions, options: Optional<DropIndexOptions>): Result;

  /**
   * @see Comments above.
   */
  dropIndex(model: IndexModel, options: Optional<DropIndexOptions>): Result;

  /**
   * Drops all indexes in the collection.
   */
  dropIndexes(options: Optional<DropIndexesOptions>): Result;

  /**
   * Gets index information for all indexes in the collection. The behavior for 
   * enumerating indexes is described in the :ref:`Enumerating Indexes` section.
   *
   */
  listIndexes(options: Optional<ListIndexesOptions>): Cursor;
}

interface CreateIndexOptions {
  /**
   * Specifies how many data-bearing members of a replica set, including the primary, must
   * complete the index builds successfully before the primary marks the indexes as ready.
   *
   * This option accepts the same values for the "w" field in a write concern plus "votingMembers",
   * which indicates all voting data-bearing nodes.
   *
   * This option is only supported by servers >= 4.4. Drivers MUST manually raise an error if this option
   * is specified when creating an index on a pre 4.4 server. See the Q&A section for the rationale behind this.
   *
   * @note This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @since MongoDB 4.4
   */
  commitQuorum: Optional<Int32 | String>;

  /**
   * The maximum amount of time to allow the index build to take before returning an error.
   *
   * @note This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   */
  maxTimeMS: Optional<Int64>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/createIndexes/
   *
   * @since MongoDB 4.4
   */
  comment: Optional<any>;
}

interface CreateIndexesOptions {
  // same as CreateIndexOptions
}

interface DropIndexOptions {
 /**
   * The maximum amount of time to allow the index drop to take before returning an error.
   *
   * @note This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   */
  maxTimeMS: Optional<Int64>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/dropIndexes/
   *
   * @since MongoDB 4.4
   */
  comment: Optional<any>;
}

interface DropIndexesOptions {
  // same as DropIndexOptions
}

Examples

Create an index in a collection.

Ruby:

collection.create_index({ name: 1 }, { unique: true })

Java:

collection.createIndex(new Document("name", 1), new IndexOptions().unique(true));

Produces the shell equivalent (>= 2.6.0) of:

db.runCommand({
  createIndexes: "users",
  indexes: [
    { key: { name: 1 }, name: "name_1", unique: true }
  ]
});

Create multiple indexes in a collection.

Ruby:

collection.create_indexes([
  { key: { name: 1 }, unique: true },
  { key: { age: -1 }, name: "age" }
])

Java:

collection.createIndexes(asList(
  new IndexModel(new Document("name", 1), new IndexOptions().unique(true)),
  new IndexModel(new Document("age", -1), new IndexOptions().name("age"))
));

Produces the shell equivalent (>= 2.6.0) of:

db.runCommand({
  createIndexes: "users",
  indexes: [
    { key: { name: 1 }, name: "name_1", unique: true },
    { key: { age: -1 }, name: "age" }
  ]
});

Drop an index in a collection.

Ruby:

collection.drop_index("age")

Java:

collection.dropIndex("age");

Produces the shell equivalent of:

db.runCommand({ dropIndexes: "users", index: "age" });

Drop all indexes in a collection.

Ruby:

collection.drop_indexes

Java:

collection.dropIndexes();

Produces the shell equivalent of:

db.runCommand({ dropIndexes: "users", index: "*" });

List all indexes in a collection.

Ruby:

collection.list_indexes

Java:

collection.listIndexes();

Produces the shell equivalent (>= 3.0.0) of:

db.runCommand({ listIndexes: "users" });

Index View API

interface Collection {

  /**
   * Returns the index view for this collection.
   */
  indexes(options: Optional<ListIndexesOptions>): IndexView;
}

interface IndexView extends Iterable<Document> {

  /**
   * Enumerates the index information for all indexes in the collection. This should be
   * implemented as described in the :ref:`Enumerating Indexes` section, although the naming
   * requirement is dropped in favor of the driver language standard for handling iteration
   * over a sequence of objects.
   *
   * @see https://github.com/mongodb/specifications/blob/master/source/enumerate-indexes.rst
   *
   * @note For drivers that cannot make the IndexView iterable, they MUST implement a list
   *   method. See below.
   */
  iterator(): Iterator<Document>;

  /**
   * For drivers that cannot make IndexView iterable, they MUST implement this method to
   * return a list of indexes. In the case of async drivers, this MAY return a Future<Cursor>
   *  or language/implementation equivalent.
   * 
   *  If drivers are unable to make the IndexView iterable, they MAY opt to provide the options for
   *  listing indexes via the `list` method instead of the `Collection.indexes` method.
   *
   */
  list(): Cursor;

  /**
   * This is a convenience method for creating a single index. This MUST call the
   * createMany method and pass the provided specification document in a
   * sequence to that method with the same options.
   *
   * @return The name of the created index.
   *
   * @note Drivers MAY opt to implement this method signature, the signature that
   *   takes an IndexModel as a parameter, or for those languages with method
   *   overloading MAY decide to implement both.
   *
   * @note Drivers MAY combine the two options types into a single one. If the options are
   *   explicitly typed, the combined options type MUST be named CreateOneIndexOptions or an acceptable
   *   variation.
   */
  createOne(keys: Document, indexOptions: IndexOptions, options: Optional<CreateOneIndexOptions>): String;

  /**
   * @see Comments above.
   */
  createOne(model: IndexModel, options: Optional<CreateOneIndexOptions>): String

  /**
   * Creates multiple indexes in the collection.
   *
   * For all server versions this method MUST execute a createIndexes command.
   *
   * @return The names of the created indexes.
   *
   * @note Each specification's keys document becomes the "key" field in the
   *   corresponding index document sent in the createIndexes command.
   *   
   */
  createMany(models: Iterable<IndexModel>, options: Optional<CreateManyIndexesOptions>): Iterable<String>;

  /**
   * Drops a single index from the collection by the index name.
   *
   * In all server versions this MUST execute a dropIndexes command.
   *
   * @note If the string passed is '*', the driver MUST raise an error since
   *   more than one index would be dropped.
   */
  dropOne(name: String, options: Optional<DropOneIndexOptions>): Result;

  /**
   * Attempts to drop a single index from the collection given the keys and options.
   * This is OPTIONAL until partial indexes are implemented.
   *
   * In all server versions this MUST execute a dropIndexes command.
   *
   * @note Drivers MAY opt to implement this method signature, the signature that
   *   takes an IndexModel as a parameter, or for those languages with method
   *   overloading MAY decide to implement both.
   *
   * @note Drivers MAY combine the two options types into a single one. If the options are
   *   explicitly typed, the combined options type MUST be named DropOneIndexOptions or an acceptable
   *   variation.
   */
  dropOne(keys: Document, indexOptions: IndexOptions, options: Optional<DropOneIndexOptions>): Result;

  /**
   * @see Comments above.
   */
  dropOne(model: IndexModel, options: Optional<DropOneIndexOptions>): Result;

  /**
   * Drops all indexes in the collection.
   */
  dropAll(options: Optional<DropAllIndexesOptions>): Result;
}

interface CreateOneIndexOptions {
  // same as CreateIndexOptions in the Standard API
}

interface CreateManyIndexesOptions {
  // same as CreateIndexesOptions in the Standard API
}

interface DropOneIndexOptions {
  // same as DropIndexOptions in the Standard API
}

interface DropAllIndexesOptions {
  // same as DropIndexesOptions in the Standard API
}

Examples

Create an index in a collection.

Ruby:

collection.indexes.create_one({ name: 1 }, { unique: true })

Java:

collection.indexes().createOne(new Document("name", 1), new IndexOptions().unique(true));

Produces the shell equivalent (>= 2.6.0) of:

db.runCommand({
  createIndexes: "users",
  indexes: [
    { key: { name: 1 }, name: "name_1", unique: true }
  ]
});

Create multiple indexes in a collection.

Ruby:

collection.indexes.create_many([
  { key: { name: 1 }, unique: true },
  { key: { age: -1 }, name: "age" }
])

Java:

collection.indexes().createMany(asList(
  new IndexModel(new Document("name", 1), new IndexOptions().unique(true)),
  new IndexModel(new Document("age", -1), new IndexOptions().name("age"))
));

Produces the shell equivalent (>= 2.6.0) of:

db.runCommand({
  createIndexes: "users",
  indexes: [
    { key: { name: 1 }, name: "name_1", unique: true },
    { key: { age: -1 }, name: "age" }
  ]
});

Drop an index in a collection.

Ruby:

collection.indexes.drop_one("age")

Java:

collection.indexes().dropOne("age");

Produces the shell equivalent of:

db.runCommand({ dropIndexes: "users", index: "age" });

Drop all indexes in a collection.

Ruby:

collection.indexes.drop_all

Java:

collection.indexes().dropAll();

Produces the shell equivalent of:

db.runCommand({ dropIndexes: "users", index: "*" });

List all indexes in a collection.

Ruby:

collection.indexes.each do |document|
  p document
end

Java:

for (BsonDocument document: collection.indexes()) {
  /* ... */
}

Produces the shell equivalent (>= 3.0.0) of:

var indexes = db.runCommand({ listIndexes: "users" });
for (index in indexes) {
  console.log(index);
}

Common API Components

interface IndexModel {

  /**
   * Contains the required keys for the index.
   */
  keys: Document;

  /**
   * Contains the options for the index.
   */
  options: IndexOptions;
}

interface IndexOptions {

  /**
   * Optionally tells the server to build the index in the background and not block
   * other tasks.
   *
   * @note Starting in MongoDB 4.2, this option is ignored by the server.
   * @see https://www.mongodb.com/docs/manual/reference/command/createIndexes/
   * @deprecated 4.2
   */
  background: Boolean;

  /**
   * Optionally specifies the length of time, in seconds, for documents to remain in
   * a collection.
   */
  expireAfterSeconds: Int32;

  /**
   * Optionally specify a specific name for the index outside of the default generated
   * name. If none is provided then the name is generated in the format "[field]_[direction]".
   *
   * Note that if an index is created for the same key pattern with different collations,
   * a name must be provided by the user to avoid ambiguity.
   *
   * @example For an index of name: 1, age: -1, the generated name would be "name_1_age_-1".
   */
  name: String;
  
  /**
   * Optionally tells the index to only reference documents with the specified field in
   * the index.
   */
  sparse: Boolean;

  /**
   * Optionally used only in MongoDB 3.0.0 and higher. Allows users to configure the storage
   * engine on a per-index basis when creating an index.
   */
  storageEngine: Document;

  /**
   * Optionally forces the index to be unique.
   */
  unique: Boolean;

  /**
   * Optionally specifies the index version number, either 0 or 1.
   */
  version: Int32;

  /**
   * Optionally specifies the default language for text indexes.
   * Is 'english' if none is provided.
   */
  defaultLanguage: String;

  /**
   * Optionally specifies the field in the document to override the language.
   */
  languageOverride: String;

  /**
   * Optionally provides the text index version number.
   *
   * MongoDB 2.4 can only support version 1.
   *
   * MongoDB 2.6 and higher may support version 1 or 2.
   */
  textIndexVersion: Int32;

  /**
   * Optionally specifies fields in the index and their corresponding weight values.
   */
  weights: Document;

  /**
   * Optionally specifies the 2dsphere index version number.
   *
   * MongoDB 2.4 can only support version 1.
   *
   * MongoDB 2.6 and higher may support version 1 or 2.
   */
  2dsphereIndexVersion: Int32;

  /**
   * Optionally specifies the precision of the stored geo hash in the 2d index, from 1 to 32.
   */
  bits: Int32;

  /**
   * Optionally sets the maximum boundary for latitude and longitude in the 2d index.
   */
  max: Double;

  /**
   * Optionally sets the minimum boundary for latitude and longitude in the index in a
   * 2d index.
   */
  min: Double;

  /**
   * Optionally specifies the number of units within which to group the location values
   * in a geo haystack index.
   */
  bucketSize: Int32;

  /**
   * Optionally specifies a filter for use in a partial index. Only documents that match the
   * filter expression are included in the index. New in MongoDB 3.2.
   */
  partialFilterExpression: Document;

  /**
   * Optionally specifies a collation to use for the index in MongoDB 3.4 and higher.
   * If not specified, no collation is sent and the default collation of the collection
   * server-side is used.
   */
  collation: Document;

  /**
   * Optionally specifies the wildcard projection of a wildcard index.
   */
  wildcardProjection: Document;

  /**
   * Optionally specifies that the index should exist on the target collection but should not be used by the query
   * planner when executing operations.
   *
   * This option is only supported by servers >= 4.4.
   */
  hidden: Boolean;

  /**
   * Optionally specifies that this index is clustered.  This is not a valid option to provide to
   * 'createIndexes', but can appear in the options returned for an index via 'listIndexes'.  To
   * create a clustered index, create a new collection using the 'clusteredIndex' option.
   *
   * This option is only supported by servers >= 6.0.
   */
   clustered: Boolean;
}

interface ListIndexesOptions {
  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * If a comment is provided, drivers MUST attach this comment to all
   * subsequent getMore commands run on the same cursor.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/listIndexes/
   *
   * @since MongoDB 4.4
   */
  comment: Optional<any>;

  /**
   * Configures the batch size of the cursor returned from the ``listIndexes`` command.
   * 
   * @note drivers MAY choose to support batchSize on the ListIndexesOptions.
   */
  batchSize: Optional<Int32>;
}

Enumerating Indexes

For all server versions, drivers MUST run a listIndexes command when enumerating indexes.

Drivers SHOULD use the method name listIndexes for a method that returns all indexes with a cursor return type. Drivers MAY use an idiomatic variant that fits the language the driver is for. An exception is made for drivers implementing the index view API.

In MongoDB 4.4, the ns field was removed from the index specifications returned from the listIndexes command.

  • For drivers that report those index specifications in the form of documents or dictionaries, no special handling is necessary, but any documentation of the contents of the documents/dictionaries MUST indicate that the ns field will no longer be present in MongoDB 4.4+. If the contents of the documents/dictionaries are undocumented, then no special mention of the ns field is necessary.
  • For drivers that report those index specifications in the form of statically defined models, the driver MUST manually populate the ns field of the models with the appropriate namespace if the server does not report it in the listIndexes command response, as sketched below. The ns field is not required to be a part of the models, however.
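
For the statically defined models case, a minimal TypeScript sketch of populating the missing ns field; the model shape and helper name are illustrative assumptions:

// Sketch: fill in the ns field on statically defined index models when a
// 4.4+ server omits it from the listIndexes response.
interface IndexModelWithNs {
  name: string;
  key: Record<string, unknown>;
  ns?: string;
}

function populateNs(models: IndexModelWithNs[], dbName: string, collName: string): IndexModelWithNs[] {
  const fullNamespace = `${dbName}.${collName}`;
  return models.map((model) => (model.ns === undefined ? { ...model, ns: fullNamespace } : model));
}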

Getting Index Names

Drivers MAY implement a method to enumerate all indexes, and return only the index names. The helper operates the same as the following example:

Example:

> a = [];
[ ]
> db.runCommand( { listIndexes: 'poiConcat' } ).indexes.forEach(function(i) { a.push(i.name); } );
> a
[ "_id_", "ty_1", "l_2dsphere", "ts_1" ]

Search Indexes

Server 7.0 introduced three new server commands and a new aggregation stage to facilitate management of search indexes. Drivers MUST provide an API similar to the existing index management API specifically for search indexes. Drivers MAY choose to implement either the standard API or the index view API.

Search Index Management Helper Options

There are currently no supported options for any of the search index management commands. To future-proof driver implementations so that any options added in the future do not constitute a breaking change to drivers, empty options structs have been added as placeholders. If a driver's language has a mechanism to add options in a non-breaking manner (e.g., method overloading), drivers MAY omit the empty options structs from their search index management helpers.

listSearchIndexes is implemented using an aggregation pipeline. The list helper MUST support a driver's aggregation options as outlined in the CRUD specification. Drivers MAY combine the aggregation options with any future listSearchIndexes stage options, if that is idiomatic for a driver's language.
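
For illustration, a TypeScript sketch of the pipeline such a list helper might build around the server's $listSearchIndexes aggregation stage; the helper name and option handling are assumptions:

// Sketch: build the aggregation pipeline used by a listSearchIndexes helper.
// When a name is given it is passed to the $listSearchIndexes stage so that
// only that index is returned; otherwise all search indexes are listed.
function buildListSearchIndexesPipeline(name?: string): Array<Record<string, unknown>> {
  return [{ $listSearchIndexes: name !== undefined ? { name } : {} }];
}

// The driver would then run this pipeline through its normal aggregate
// helper, honoring the caller's aggregation options (batchSize, comment, ...).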

Asynchronicity

The search index commands are asynchronous and return from the server before the index is successfully updated, created or dropped. In order to determine when an index has been created or updated, users are expected to run listSearchIndexes repeatedly until the index changes appear.

An example, from Javascript:

const name = await collection.createSearchIndex({ definition: { ... fill out definition } })
while (!(await collection.listSearchIndexes({ name }).hasNext())) {
  await setTimeout(1000);
}

Where are read concern and write concern?

These commands internally proxy the search index management commands to a separate process that runs alongside an Atlas cluster. As such, read concern and write concern are not relevant for the search index management commands.

Drivers MUST NOT apply a read concern or write concern to the commands. Atlas search index management commands return an error if a readConcern or writeConcern field is present in the command.

Consistency with Existing APIs

Drivers SHOULD strive for a search index management API that is as consistent as possible with their existing index management API.

NamespaceNotFound Errors

Some drivers suppress NamespaceNotFound errors for CRUD helpers. Drivers MAY suppress NamespaceNotFound errors from the search index management helpers.

Drivers MUST suppress NamespaceNotFound errors for the dropSearchIndex helper. Drop operations should be idempotent:

await collection.dropSearchIndex('my-test-index');
// subsequent calls should behave the same for the user as the first call
await collection.dropSearchIndex('my-test-index');
await collection.dropSearchIndex('my-test-index');

Common Interfaces

interface SearchIndexModel {
  // The definition for this index.
  definition: Document;

  // The name for this index, if present.
  name: Optional<string>;
     
  // The type for this index, if present. Can be either "search" or "vectorSearch".  
  type: Optional<string>;
}

interface SearchIndexOptions {
  // The name for this index, if present.
  name: Optional<string>;
}

/**
 * The following interfaces are empty but are provided as placeholders for drivers that cannot 
 * add options in a non-breaking manner, if options are added in the future.
 */
interface CreateSearchIndexOptions {} 
interface UpdateSearchIndexOptions {}
interface ListSearchIndexOptions {}
interface DropSearchIndexOptions {}

Standard API for Search Indexes

interface Collection {
  /**
   * Convenience method for creating a single search index.
   * 
   * @return The name of the created search index
   * 
   * @note Drivers MAY opt to implement this method signature, the signature that
   *   takes a SearchIndexModel as a parameter, or for those languages with method
   *   overloading MAY decide to implement both.
   *   
   * @note Drivers MAY combine the `indexOptions` with the `createSearchIndexOptions`, if that is idiomatic for their language.
   */
  createSearchIndex(definition: Document, indexOptions: Optional<SearchIndexOptions>, createSearchIndexOptions: Optional<CreateSearchIndexOptions>): String;

  /**
   * Convenience method for creating a single index.
   * 
   * @return The name of the created search index
   * 
   * @note Drivers MAY opt to implement this method signature, the signature that
   *   takes a name and a definition as parameters, or for those languages with method
   *   overloading MAY decide to implement both.
   */
  createSearchIndex(model: SearchIndexModel, options: Optional<CreateSearchIndexOptions>): String;

  /**
   * Creates multiple search indexes on the collection.
   * 
   * @return An iterable of the newly created index names.
   */
  createSearchIndexes(models: Iterable<SearchIndexModel>, options: Optional<CreateSearchIndexOptions>): Iterable<String>;

  /**
   * Updates the search index with the given name to use the provided 
   * definition.
   */
  updateSearchIndex(name: String, definition: Document, options: Optional<UpdateSearchIndexOptions>): void;

  /**
   * Drops the search index with the given name.
   */
  dropSearchIndex(name: String, options: Optional<DropSearchIndexOptions>): void;

  /**
   * Gets index information for one or more search indexes in the collection.
   *
   * If name is not specified, information for all indexes on the specified collection will be returned.
   */
  listSearchIndexes(name: Optional<String>, aggregationOptions: Optional<AggregationOptions>, listIndexOptions: Optional<ListSearchIndexOptions>): Cursor<Document>;
}

Index View API for Search Indexes

interface Collection {
  /**
   * Returns the search index view for this collection.
   */
  searchIndexes(name: Optional<String>, aggregateOptions: Optional<AggregationOptions>, options: Optional<ListSearchIndexOptions>): SearchIndexView;
}

interface SearchIndexView extends Iterable<Document> {
  /**
   * Enumerates the index information for all search indexes in the collection. 
   *
   * @note For drivers that cannot make the IndexView iterable, they MUST implement a list
   *   method. See below.
   */
  iterator(): Iterator<Document>;

  /**
   * For drivers that cannot make SearchIndexView iterable, they MUST implement this method to
   * return a list of indexes. In the case of async drivers, this MAY return a Future<Cursor>
   *  or language/implementation equivalent.
   *  
   *  If drivers are unable to make the SearchIndexView iterable, they MAY opt to provide the options for 
   *  listing search indexes via the `list` method instead of the `Collection.searchIndexes` method.
   */
  list(): Cursor<Document>;


  /**
   * This is a convenience method for creating a single index.
   *
   * @return The name of the created index.
   *
   * @note Drivers MAY opt to implement this method signature, the signature that
   *   takes a SearchIndexModel as a parameter, or for those languages with method
   *   overloading MAY decide to implement both.
   *   
   * @note Drivers MAY combine the `indexOptions` with the `createSearchIndexOptions`, if that is idiomatic for their language.
   */
  createOne(definition: Document, indexOptions: Optional<SearchIndexOptions>, createSearchIndexOptions: Optional<CreateSearchIndexOptions>): String;

  /**
   * This is a convenience method for creating a single index.
   *
   * @return The name of the created index.
   *
   * @note Drivers MAY opt to implement this method signature, the signature that
   *   takes a name and a definition as parameters, or for those languages with method
   *   overloading MAY decide to implement both.
   */
  createOne(model: SearchIndexModel, options: Optional<CreateSearchIndexOptions>): String;

  /**
   * Creates multiple search indexes in the collection.
   *
   * @return The names of the created indexes.
   */
  createMany(models: Iterable<SearchIndexModel>, options: Optional<CreateSearchIndexOptions>): Iterable<String>;

  /**
   * Drops a single search index from the collection by the index name.
   */
  dropOne(name: String, options: Optional<DropSearchIndexOptions>): Result;

  /**
   * Updates a single search index from the collection by the index name.
   */
  updateOne(name: String, definition: Document, options: Optional<UpdateSearchIndexOptions>): Result;
}

Q & A

Q: Where is write concern?
The createIndexes and dropIndexes commands take a write concern that indicates how the write is acknowledged. Since all operations defined in this specification are performed on a collection, it's uncommon that two different index operations on the same collection would use a different write concern. As such, the most natural place to indicate write concern is on the client, the database, or the collection itself and not the operations within it.

However, it might be that a driver needs to expose write concern to a user per operation for various reasons. It is permitted to allow a write concern option, but since writeConcern is a top-level command option, it MUST NOT be specified as part of an IndexModel passed into the helper. It SHOULD be specified via the options parameter of the helper. For example, it would be ambiguous to specify write concern for one or more models passed to createIndexes(), but it would not be to specify it via the CreateIndexesOptions.

Q: What does the commitQuorum option do?
Prior to MongoDB 4.4, secondaries would simply replicate index builds once they were completed on the primary. Building indexes requires an exclusive lock on the collection being indexed, so the secondaries would be blocked from replicating all other operations while the index build took place. This would introduce replication lag correlated to however long the index build took.

Starting in MongoDB 4.4, secondaries build indexes simultaneously with the primary, and after starting an index build, the primary will wait for a certain number of data-bearing nodes, including itself, to have completed the build before it commits the index. commitQuorum configures this node requirement. Once the index is committed, all the secondaries replicate the commit too. If a secondary had already completed the index build, the commit will be quick, and no new replication lag would be introduced. If a secondary had not finished building the index before the primary committed it (e.g. if commitQuorum: 0 was used), then that secondary may lag behind the primary while it finishes building and committing the index.

The server-default value for commitQuorum is "votingMembers", which means the primary will wait for all voting data-bearing nodes to complete building the index before it commits it.
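
For illustration, this is roughly what a createIndexes command with an explicit commitQuorum could look like (the collection and index names are example values only):

// Example command document: wait for all voting data-bearing members to
// finish building the index before the primary commits it.
const createIndexesWithQuorum = {
  createIndexes: "users",
  indexes: [{ key: { name: 1 }, name: "name_1" }],
  commitQuorum: "votingMembers",
};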

Q: Why would a user want to specify a non-default commitQuorum?
Like w: "majority", commitQuorum: "votingMembers" doesn't consider non-voting data-bearing nodes such as analytics nodes. If a user wanted to ensure these nodes didn't lag behind, then they would specify commitQuorum: <total number of data-bearing nodes, including non-voting nodes>. Alternatively, if they wanted to ensure only specific non-voting nodes didn't lag behind, they could specify a custom getLastErrorMode based on the nodes' tag sets (e.g. commitQuorum: <custom getLastErrorMode name>).

Additionally, if a user has a high tolerance for replication lag, they can set a lower value for commitQuorum. This is useful for situations where certain secondaries take longer to build indexes than the primaries, and the user doesn't care if they lag behind.

Q: What is the difference between write concern and commitQuorum?
While these two options share a lot in terms of how they are specified, they configure entirely different things. commitQuorum determines how much new replication lag an index build can tolerably introduce, but it says nothing of durability. Write concern specifies the durability requirements of an index build, but it makes no guarantees about introducing replication lag.

For instance, an index built with writeConcern: { w: 1 }, commitQuorum: "votingMembers" could possibly be rolled back, but it will not introduce any new replication lag. Likewise, an index built with writeConcern: { w: "majority", j: true }, commitQuorum: 0 will not be rolled back, but it may cause the secondaries to lag. To ensure the index is both durable and will not introduce replication lag on any data-bearing voting secondary, writeConcern: { w: "majority", j: true }, commitQuorum: "votingMembers" must be used.

Also note that, since indexes are built simultaneously, higher values of commitQuorum are not as expensive as higher values of writeConcern.

Q: Why does the driver manually throw errors if the commitQuorum option is specified against a pre 4.4 server?
Starting in 3.4, the server validates all options passed to the createIndexes command, but due to a bug in versions 4.2.0-4.2.5 of the server (SERVER-47193), specifying commitQuorum does not result in an error. The option is used internally by the server on those versions, and its value could have adverse effects on index builds. To prevent users from mistakenly specifying this option, drivers manually verify it is only sent to 4.4+ servers.
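
A minimal TypeScript sketch of such a manual check; the helper name is an assumption, and wire version 9 corresponds to MongoDB 4.4:

// Sketch: drivers verify server support before sending commitQuorum, since
// 4.2.0-4.2.5 servers accept but mishandle the option.
function assertCommitQuorumSupported(maxWireVersion: number, commitQuorum?: number | string): void {
  if (commitQuorum !== undefined && maxWireVersion < 9) {
    throw new Error("commitQuorum is not supported by the connected server (requires MongoDB 4.4+)");
  }
}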

Changelog

  • 2024-03-05: Migrated from reStructuredText to Markdown.

  • 2023-11-08: Clarify that readConcern and writeConcern must not be
    applied to search index management commands.

  • 2023-07-27: Add search index management clarifications.

  • 2023-05-18: Add the search index management API.

  • 2023-05-10: Merge index enumeration and index management specs and get rid of references
    to legacy server versions.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2022-04-18: Added the clustered attribute to IndexOptions in order to
    support clustered collections.

  • 2022-02-10: Specified that getMore command must explicitly send inherited
    comment.

  • 2022-02-01: Added comment field to helper methods.

  • 2022-01-19: Require that timeouts be applied per the client-side operations
    timeout spec.

  • 2020-03-30: Added options types to various helpers. Introduced commitQuorum
    option. Added deprecation message for background option.

  • 2019-04-24: Added wildcardProjection attribute to IndexOptions in order
    to support setting a wildcard projection on a wildcard index.

  • 2017-06-07: Include listIndexes() in Q&A about maxTimeMS.

  • 2017-05-31: Add Q & A addressing write concern and maxTimeMS option.

  • 2016-10-11: Added note on 3.4 servers validation options passed to
    createIndexes. Add note on server generated name for the _id index.

  • 2016-08-08: Fixed collation language to not mention a collection default.

  • 2016-05-19: Added collation attribute to IndexOptions in order to
    support setting a collation on an index.

  • 2015-09-17: Added partialFilterExpression attribute to IndexOptions in
    order to support partial indexes. Fixed "provides" typo.

Driver CRUD API

  • Status: Accepted
  • Minimum Server Version: 2.6

Abstract

The CRUD API defines a set of related methods and structures that make up a driver's API for create, read, update, and delete operations. As not all languages/frameworks have the same abilities, parts of the spec may or may not apply. These sections have been called out.

Specification

This specification provides guidance in three areas related to the CRUD API, as well as a specification for the API itself. It does not define implementation details and provides room and flexibility for the idioms and differences in languages and frameworks.

Definitions

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Terms

Collection:
The term interface Collection will be seen in most of the sections. Each driver will likely have a class or interface defined for the concept of a collection. Operations appearing inside the interface Collection are required to be present on a driver's concept of a collection.

Iterable:
The term Iterable will be seen as a return type from some of the Read methods. It is used as a sequence of items. For instance, collection.find({}) returns a sequence of documents that can be iterated over.

Guidance

Documentation

The documentation provided in code below is merely for driver authors and SHOULD NOT be taken as required documentation for the driver.

Operations

All drivers MUST offer the operations defined in the following sections. This does not preclude a driver from offering more.

Operation Parameters

All drivers MUST offer the same options for each operation as defined in the following sections. This does not preclude a driver from offering more. A driver SHOULD NOT require a user to specify optional parameters, denoted by the Optional<> signature. Unless otherwise specified, optional values should not be sent to the server.

Deviations

A non-exhaustive list of acceptable deviations are as follows:

  • Using named parameters instead of an options hash. For instance, collection.find({x:1}, sort: {a: -1}).

  • When using an Options class, if multiple Options classes are structurally equatable, it is permissible to consolidate them into one with a clear name. For instance, it would be permissible to use the name UpdateOptions as the options for UpdateOne and UpdateMany.

  • Using a fluent style builder for find or aggregate:

    collection.find({x: 1}).sort({a: -1}).skip(10);
    

    When using a fluent-style builder, all options should be named rather than inventing a new word to include in the pipeline (like options). Required parameters are still required to be on the initiating method.

    In addition, it is imperative that documentation indicate when the order of operations is important. For instance, skip and limit in find are order-irrelevant, whereas in aggregate they are not.

Naming

All drivers MUST name operations, objects, and parameters as defined in the following sections.

Deviations are permitted as outlined below.

Deviations

When deviating from a defined name, an author should consider if the altered name is recognizable and discoverable to the user of another driver.

A non-exhaustive list of acceptable naming deviations are as follows:

  • Using "batchSize" as an example, Java would use "batchSize" while Python would use "batch_size". However, calling it "batchCount" would not be acceptable.
  • Using "maxTimeMS" as an example, .NET would use "MaxTime" where its type is a TimeSpan structure that includes units. However, calling it "MaximumTime" would not be acceptable.
  • Using "FindOptions" as an example, Javascript wouldn't need to name it while other drivers might prefer to call it "FindArgs" or "FindParams". However, calling it "QueryOptions" would not be acceptable.
  • Using "isOrdered" rather than "ordered". Some languages idioms prefer the use of "is", "has", or "was" and this is acceptable.

Timeouts

Drivers MUST enforce timeouts for all operations per the Client Side Operations Timeout specification. All operations that return cursors MUST support the timeout options documented in the Cursors section of that specification.

API

Read

[!NOTE] The term Iterable<T> is used below to indicate many of T. This spec is flexible on what that means as different drivers will have different requirements, types, and idioms.

interface Collection {

  /**
   * Runs an aggregation framework pipeline.
   *
   * Note: $out and $merge are special pipeline stages that cause no results
   * to be returned from the server. As such, the iterable here would never
   * contain documents. Drivers MAY setup a cursor to be executed upon
   * iteration against the output collection such that if a user were to
   * iterate the return value, results would be returned.
   *
   * Note: result iteration should be backed by a cursor. Depending on the implementation,
   * the cursor may back the returned Iterable instance or an iterator that it produces.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate/
   */
  aggregate(pipeline: Document[], options: Optional<AggregateOptions>): Iterable<Document>;

  /**
   * Gets the number of documents matching the filter.
   *
   * **This method is DEPRECATED and should not be implemented in new drivers.**
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/count/
   * @deprecated 4.0
   */
  count(filter: Document, options: Optional<CountOptions>): Int64;

  /**
   * Count the number of documents in a collection that match the given
   * filter. Note that an empty filter will force a scan of the entire
   * collection. For a fast count of the total documents in a collection
   * see estimatedDocumentCount.
   *
   * See "Count API Details" section below.
   */
  countDocuments(filter: Document, options: Optional<CountOptions>): Int64;

  /**
   * Gets an estimate of the count of documents in a collection using collection metadata.
   *
   * See "Count API Details" section below for implementation and documentation
   * requirements.
   */
  estimatedDocumentCount(options: Optional<EstimatedDocumentCountOptions>): Int64;

  /**
   * Finds the distinct values for a specified field across a single collection.
   *
   * Note: the results are backed by the "values" array in the distinct command's result
   * document. This differs from aggregate and find, where results are backed by a cursor.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/distinct/
   */
  distinct(fieldName: string, filter: Document, options: Optional<DistinctOptions>): Iterable<any>;

  /**
   * Finds the documents matching the model.
   *
   * Note: The filter parameter below equates to the $query meta operator. It cannot
   * contain other meta operators like $maxScan. However, do not validate this document,
   * as doing so would make it impossible to remain forwards and backwards compatible. Let the server
   * handle the validation.
   *
   * Note: If $explain is specified in the modifiers, the return value is a single
   * document. This could cause problems for static languages using strongly typed entities.
   *
   * Note: result iteration should be backed by a cursor. Depending on the implementation,
   * the cursor may back the returned Iterable instance or an iterator that it produces.
   *
   * @see https://www.mongodb.com/docs/manual/core/read-operations-introduction/
   */
  find(filter: Document, options: Optional<FindOptions>): Iterable<Document>;

}

interface Database {

  /**
   * Runs an aggregation framework pipeline on the database for pipeline stages
   * that do not require an underlying collection, such as $currentOp and $listLocalSessions.
   *
   * Note: result iteration should be backed by a cursor. Depending on the implementation,
   * the cursor may back the returned Iterable instance or an iterator that it produces.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate/#dbcmd.aggregate
   */
  aggregate(pipeline: Document[], options: Optional<AggregateOptions>): Iterable<Document>;

}

class AggregateOptions {

  /**
   * Enables writing to temporary files. When set to true, aggregation stages
   * can write data to the _tmp subdirectory in the dbPath directory.
   *
   * This option is sent only if the caller explicitly provides a value. The default
   * is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate/
   */
  allowDiskUse: Optional<Boolean>;

  /**
   * The number of documents to return per batch.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * If specified, drivers SHOULD apply this option to both the original aggregate command and subsequent
   * getMore operations on the cursor.
   *
   * Drivers MUST NOT specify a batchSize of zero in an aggregate command that includes an $out or $merge stage,
   * as that will prevent the pipeline from executing. Drivers SHOULD leave the cursor.batchSize command option
   * unset in an aggregate command that includes an $out or $merge stage.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate/
   */
  batchSize: Optional<Int32>;

  /**
   * If true, allows the write to opt-out of document level validation. This only applies
   * when the $out or $merge stage is specified.
   *
   * This option is sent only if the caller explicitly provides a true value. The default is to not send a value.
   * For servers < 3.2, this option is ignored and not sent as document validation is not available.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate/
   */
  bypassDocumentValidation: Optional<Boolean>;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate/
   */
  collation: Optional<Document>;

  /**
   * The maximum amount of time to allow the query to run.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * NOTE: This option is deprecated in favor of timeoutMS.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate/
   */
  maxTimeMS: Optional<Int64>;

  /**
   * The maximum amount of time for the server to wait on new documents to satisfy a tailable cursor
   * query.
   *
   * This option only applies to aggregations which return a TAILABLE_AWAIT cursor. Drivers
   * SHOULD always send this value; if the cursor is not a TAILABLE_AWAIT cursor, the server will
   * ignore it.
   *
   * @note this option is an alias for maxTimeMS, used on getMore commands
   * @note this option is not set on the aggregate command
   */
  maxAwaitTimeMS: Optional<Int64>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions between 3.6 and 4.2 only support string as comment,
   * and providing a non-string type will result in a server-side error.
   * Older server versions do not support comment for aggregate command at all,
   * and providing one will result in a server-side error.
   *
   * If a comment is provided, drivers MUST attach this comment to all
   * subsequent getMore commands run on the same cursor for server
   * versions 4.4 and above. For server versions below 4.4 drivers MUST NOT
   * attach a comment to getMore commands.
   */
  comment: Optional<any>;

  /**
   * The index to use for the aggregation. The hint does not apply to $lookup and $graphLookup stages.
   * Specify either the index name as a string or the index key pattern. If specified,
   * then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate/
   */
  hint: Optional<(String | Document)>;

  /**
   * Map of parameter names and values. Values must be constant or closed
   * expressions that do not reference document fields. Parameters can then be
   * accessed as variables in an aggregate expression context (e.g. "$$var").
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 5.0. Older servers >= 2.6 (and possibly earlier) will report an error for using this option.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate/
   */
  let: Optional<Document>;
}

class CountOptions {

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   */
  hint: Optional<(String | Document)>;

  /**
   * The maximum number of documents to count.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   */
  limit: Optional<Int64>;

  /**
   * The maximum amount of time to allow the operation to run.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * NOTE: This option is deprecated in favor of timeoutMS.
   */
  maxTimeMS: Optional<Int64>;

  /**
   * The number of documents to skip before counting.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   */
  skip: Optional<Int64>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 do not support comment for count command,
   * and providing one will result in a server-side error.
   */
  comment: Optional<any>;
}

class EstimatedDocumentCountOptions {

  /**
   * The maximum amount of time to allow the operation to run.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * NOTE: This option is deprecated in favor of timeoutMS.
   */
  maxTimeMS: Optional<Int64>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4.14 and above.
   * For server versions between 4.4.0 and 4.4.14, a string comment is supported.
   * Server versions below 4.4.0 do not support comment for count command,
   * which is used to implement estimatedDocumentCount. Therefore, providing a
   * comment may result in a server-side error.
   */
  comment: Optional<any>;
}

class DistinctOptions {

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/distinct/
   */
  collation: Optional<Document>;

  /**
   * The maximum amount of time to allow the query to run.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * NOTE: This option is deprecated in favor of timeoutMS.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/distinct/
   */
  maxTimeMS: Optional<Int64>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 do not support comment for distinct command,
   * and providing one will result in a server-side error.
   */
  comment: Optional<any>;
}

enum CursorType {
  /**
   * The default value. A vast majority of cursors will be of this type.
   */
  NON_TAILABLE,
  /**
   * Tailable means the cursor is not closed when the last data is retrieved.
   * Rather, the cursor marks the final object’s position. You can resume
   * using the cursor later, from where it was located, if more data were
   * received. Like any "latent cursor", the cursor may become invalid at
   * some point (CursorNotFound) – for example if the final object it
   * references were deleted.
   *
   * @see https://www.mongodb.com/docs/meta-driver/latest/legacy/mongodb-wire-protocol/#op-query
   */
  TAILABLE,
  /**
   * Combines the tailable option with awaitData, as defined below.
   *
   * Use with TailableCursor. If we are at the end of the data, block for a
   * while rather than returning no data. After a timeout period, we do return
   * as normal. The default is true.
   *
   * @see https://www.mongodb.com/docs/meta-driver/latest/legacy/mongodb-wire-protocol/#op-query
   */
  TAILABLE_AWAIT
}

class FindOptions {

  /**
   * Enables writing to temporary files on the server. When set to true, the server
   * can write temporary data to disk while executing the find operation.
   *
   * This option is sent only if the caller explicitly provides a value. The default
   * is to not send a value.
   *
   * This option is only supported by servers >= 4.4. Older servers >= 3.2 will report an error for using this option.
   * For servers < 3.2, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  allowDiskUse: Optional<Boolean>;

  /**
   * Get partial results from a mongos if some shards are down (instead of throwing an error).
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.2, the Partial wire protocol flag is used and defaults to false.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  allowPartialResults: Optional<Boolean>;

  /**
   * The number of documents to return per batch.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.2, this is combined with limit to create the wire protocol numberToReturn value.
   * If specified, drivers SHOULD apply this option to both the original query operation and subsequent
   * getMore operations on the cursor.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  batchSize: Optional<Int32>;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  collation: Optional<Document>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 only support string as comment,
   * and providing a non-string type will result in a server-side error.
   *
   * If a comment is provided, drivers MUST attach this comment to all
   * subsequent getMore commands run on the same cursor for server
   * versions 4.4 and above. For server versions below 4.4 drivers MUST NOT
   * attach a comment to getMore commands.
   */
  comment: Optional<any>;

  /**
   * Indicates the type of cursor to use. This value includes both
   * the tailable and awaitData options.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.2, the AwaitData and Tailable wire protocol flags are used and default to false.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  cursorType: Optional<CursorType>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  hint: Optional<(String | Document)>;

  /**
   * The maximum number of documents to return.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.2, this is combined with batchSize to create the wire protocol numberToReturn value.
   *
   * A negative limit implies that the caller has requested a single batch of results. For servers >= 3.2, singleBatch
   * should be set to true and limit should be converted to a positive value. For servers < 3.2, the wire protocol
   * numberToReturn value may be negative.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  limit: Optional<Int64>;

  /**
   * The exclusive upper bound for a specific index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  max: Optional<Document>;

  /**
   * The maximum amount of time for the server to wait on new documents to satisfy a tailable cursor
   * query. This only applies to a TAILABLE_AWAIT cursor. When the cursor is not a TAILABLE_AWAIT cursor,
   * this option is ignored.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.2, this option is ignored and not sent as maxTimeMS does not exist in the OP_GET_MORE wire protocol.
   *
   * Note: This option is specified as "maxTimeMS" in the getMore command and not provided as part of the
   * initial find command.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  maxAwaitTimeMS: Optional<Int64>;

  /**
   * Maximum number of documents or index keys to scan when executing the query.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   * @deprecated 4.0
   */
  maxScan: Optional<Int64>;

  /**
   * The maximum amount of time to allow the query to run.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * NOTE: This option is deprecated in favor of timeoutMS.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  maxTimeMS: Optional<Int64>;

  /**
   * The inclusive lower bound for a specific index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  min: Optional<Document>;

  /**
   * The server normally times out idle cursors after an inactivity period (10 minutes)
   * to prevent excess memory use. Set this option to prevent that.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.2, the NoCursorTimeout wire protocol flag is used and defaults to false.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  noCursorTimeout: Optional<Boolean>;

  /**
   * Enables optimization when querying the oplog for a range of ts values.
   *
   * Note: this option is intended for internal replication use only.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.2, the OplogReplay wire protocol flag is used and defaults to false.
   * For servers >= 4.4, the server will ignore this option if set (see: SERVER-36186).
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   * @deprecated 4.4
   */
  oplogReplay: Optional<Boolean>;

  /**
   * Limits the fields to return for all matching documents.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  projection: Optional<Document>;

  /**
   * If true, returns only the index keys in the resulting documents.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  returnKey: Optional<Boolean>;

  /**
   * Determines whether to return the record identifier for each document. If true, adds a field $recordId to the returned documents.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  showRecordId: Optional<Boolean>;

  /**
   * The number of documents to skip before returning.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.2, this is a wire protocol parameter that defaults to 0.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  skip: Optional<Int64>;

  /**
   * Prevents the cursor from returning a document more than once because of an intervening write operation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   * @deprecated 4.0
   */
  snapshot: Optional<Boolean>;

  /**
   * The order in which to return matching documents.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  sort: Optional<Document>;

  /**
   * Map of parameter names and values. Values must be constant or closed
   * expressions that do not reference document fields. Parameters can then be
   * accessed as variables in an aggregate expression context (e.g. "$$var").
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 5.0. Older servers >= 2.6 (and possibly earlier) will report an error for using this option.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  let: Optional<Document>;
}
Count API Details

MongoDB drivers provide two helpers for counting the number of documents in a collection, estimatedDocumentCount and countDocuments. The names were chosen to make it clear how they behave and exactly what they do. The estimatedDocumentCount helper returns an estimate of the count of documents in the collection using collection metadata, rather than counting the documents or consulting an index. The countDocuments helper counts the documents that match the provided query filter using an aggregation pipeline.

The count() helper is deprecated. It has always been implemented using the count command. The behavior of the count command differs depending on the options passed to it and may or may not provide an accurate count. When no query filter is provided the count command provides an estimate using collection metadata. Even when provided with a query filter the count command can return inaccurate results with a sharded cluster if orphaned documents exist or if a chunk migration is in progress. The countDocuments helper avoids these sharded cluster problems entirely when used with MongoDB 3.6+, and when using Primary read preference with older sharded clusters.
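
As a non-normative usage sketch, where collection stands in for a driver's collection handle:

const exactCount = collection.countDocuments({ status: "active" }); // filterable, aggregation-based count
const estimate = collection.estimatedDocumentCount();               // fast, metadata-based estimate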

estimatedDocumentCount

The estimatedDocumentCount function is implemented using the count command with no query filter, skip, limit, or other options that would alter the results. The only supported options are listed in the EstimatedDocumentCountOptions type defined above.
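
For illustration only, the command a driver might build for estimatedDocumentCount could look like the following sketch; runCommand is a hypothetical helper and Document is shorthand for a BSON document:

type Document = Record<string, unknown>;

// Sketch: a bare count command with no filter, skip, or limit; only the
// supported options (maxTimeMS, comment) are forwarded when provided.
async function estimatedDocumentCount(
  runCommand: (command: Document) => Promise<Document>,
  collectionName: string,
  options?: { maxTimeMS?: number; comment?: unknown }
): Promise<number> {
  const command: Document = { count: collectionName };
  if (options?.maxTimeMS !== undefined) command.maxTimeMS = options.maxTimeMS;
  if (options?.comment !== undefined) command.comment = options.comment;

  const reply = await runCommand(command);
  return reply.n as number; // the count command reports the estimate in the "n" field
}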

Drivers MUST document that, due to an oversight in versions 5.0.0-5.0.8 of MongoDB, the count command, which estimatedDocumentCount uses in its implementation, was not included in v1 of the Stable API, and so users of the Stable API with estimatedDocumentCount are recommended to upgrade their server version to 5.0.9+ or set apiStrict: false to avoid encountering errors.

Drivers MUST document that the count server command is used to implement estimatedDocumentCount and that users can find more information via Count: Behavior.

The 5.0-compat versions of many drivers updated their estimatedDocumentCount implementations to use the $collStats aggregation stage instead of the count command. This had the unintended consequence of breaking estimatedDocumentCount on views, and so the change was seen as a backwards-incompatible regression and reverted. The release notes for the driver versions that include the reversion from $collStats back to count MUST document the following:

  • The 5.0-compat release accidentally broke estimatedDocumentCount on views by changing its implementation to use aggregate and a $collStats stage instead of the count command.
  • The new release is fixing estimatedDocumentCount on views by reverting back to using count in its implementation.
  • Due to an oversight, the count command was omitted from the Stable API in server versions 5.0.0 - 5.0.8 and 5.1.0 - 5.3.1, so users of the Stable API with estimatedDocumentCount are recommended to upgrade their MongoDB clusters to 5.0.9 or 5.3.2 (if on Atlas) or set apiStrict: false when constructing their MongoClients.

countDocuments

The countDocuments function is implemented using the $group aggregation pipeline stage with $sum. Drivers must require applications to pass a value for filter, but an empty document is supported:

pipeline = [{'$match': filter}]
if (skip) {
  pipeline.push({'$skip': skip})
}
if (limit) {
  pipeline.push({'$limit': limit})
}
pipeline.push({'$group': {'_id': 1, 'n': {'$sum': 1}}})

The count of documents is returned in the n field, similar to the count command. countDocuments options other than filter, skip, and limit are added as options to the aggregate command.

In the event this aggregation is run against an empty collection, an empty array will be returned with no n field. Drivers MUST interpret this result as a 0 count.
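
A minimal sketch of the above, assuming a hypothetical aggregate helper that returns the resulting documents as an array:

type Document = Record<string, unknown>;

// Sketch: build the pipeline shown above and read "n" from the single $group
// result; an empty result array is interpreted as a count of zero.
async function countDocuments(
  aggregate: (pipeline: Document[], options: Document) => Promise<Document[]>,
  filter: Document,
  options: { skip?: number; limit?: number; [option: string]: unknown } = {}
): Promise<number> {
  const { skip, limit, ...aggregateOptions } = options;

  const pipeline: Document[] = [{ $match: filter }];
  if (skip) pipeline.push({ $skip: skip });
  if (limit) pipeline.push({ $limit: limit });
  pipeline.push({ $group: { _id: 1, n: { $sum: 1 } } });

  const results = await aggregate(pipeline, aggregateOptions);
  return results.length === 0 ? 0 : (results[0].n as number);
}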

Combining Limit and Batch Size for the Wire Protocol

The OP_QUERY wire protocol only contains a numberToReturn value, which drivers must calculate to get the expected limit and batch size behavior. Subsequent calls to OP_GET_MORE should use the user-specified batchSize or default to 0. If the result is larger than the max Int32 value, an error MUST be raised as the computed value is impossible to send to the server. Below is pseudo-code for calculating numberToReturn for OP_QUERY.

function calculateFirstNumberToReturn(options: FindOptions): Int32 {
  Int32 numberToReturn;
  Int32 limit = options.limit || 0;
  Int32 batchSize = options.batchSize || 0;

  if (limit < 0) {
    numberToReturn = limit;
  }
  else if (limit == 0) {
    numberToReturn = batchSize;
  }
  else if (batchSize == 0) {
    numberToReturn = limit;
  }
  else if (limit < batchSize) {
    numberToReturn = limit;
  }
  else {
    numberToReturn = batchSize;
  }

  return numberToReturn;
}

Because of this anomaly in the wire protocol, it is up to the driver to enforce the user-specified limit. Each driver MUST keep track of how many documents have been iterated and stop iterating once the limit has been reached. When the limit has been reached, if the cursor is still open, a driver MUST kill the cursor.
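
A sketch of that bookkeeping, assuming a hypothetical cursor abstraction with next, isOpen, and kill operations:

type Document = Record<string, unknown>;

// Sketch: yield documents until the user-specified limit is reached, then
// kill the server-side cursor if it is still open. A limit of zero means
// "no limit", so iteration continues until the cursor is exhausted.
async function* iterateWithLimit(
  cursor: { next(): Promise<Document | null>; isOpen(): boolean; kill(): Promise<void> },
  limit: number
): AsyncGenerator<Document> {
  let returned = 0;
  while (limit === 0 || returned < limit) {
    const doc = await cursor.next();
    if (doc === null) return; // cursor exhausted before the limit was reached
    returned += 1;
    yield doc;
  }
  if (cursor.isOpen()) {
    await cursor.kill(); // limit reached while the cursor was still open
  }
}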

Database-level aggregation

The server supports several collection-less aggregation source stages like $currentOp and $listLocalSessions. The database aggregate command requires a collection name of 1 for collection-less source stages. Driver support for database-level aggregation allows users to receive a cursor from these collection-less aggregation source stages.
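
For example, the command document for a collection-less $currentOp aggregation would name the literal 1 instead of a collection (a sketch, not a normative command shape):

const databaseAggregateCommand = {
  aggregate: 1,
  pipeline: [{ $currentOp: {} }],
  cursor: {},
};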

Write

Insert, Update, Replace, Delete, and Bulk Writes
interface Collection {

  /**
   * Executes multiple write operations.
   *
   * An error MUST be raised if the requests parameter is empty.
   *
   * For servers < 3.4, if a collation was explicitly set for any request, an error MUST be raised
   * and no documents sent.
   *
   * NOTE: see the FAQ about the previous bulk API and how it relates to this.
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   * @see https://www.mongodb.com/docs/manual/reference/command/insert/
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   * @throws InvalidArgumentException if requests is empty
   * @throws BulkWriteException
   */
  bulkWrite(requests: WriteModel[], options: Optional<BulkWriteOptions>): Optional<BulkWriteResult>;

  /**
   * Inserts the provided document. If the document is missing an identifier,
   * the driver should generate one.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/insert/
   * @throws WriteException
   */
  insertOne(document: Document, options: Optional<InsertOneOptions>): Optional<InsertOneResult>;

  /**
   * Inserts the provided documents. If any documents are missing an identifier,
   * the driver should generate them.
   *
   * An error MUST be raised if the documents parameter is empty.
   *
   * Note that this uses the bulk insert command underneath and should not
   * use OP_INSERT.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/insert/
   * @throws InvalidArgumentException if documents is empty
   * @throws BulkWriteException
   */
  insertMany(documents: Iterable<Document>, options: Optional<InsertManyOptions>): Optional<InsertManyResult>;

  /**
   * Deletes one document.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   * @throws WriteException
   */
  deleteOne(filter: Document, options: Optional<DeleteOptions>): Optional<DeleteResult>;

  /**
   * Deletes multiple documents.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   * @throws WriteException
   */
  deleteMany(filter: Document, options: Optional<DeleteOptions>): Optional<DeleteResult>;

  /**
   * Replaces a single document.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   * @throws WriteException
   */
  replaceOne(filter: Document, replacement: Document, options: Optional<ReplaceOptions>): Optional<UpdateResult>;

  /**
   * Updates one document.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   * @throws WriteException
   */
  updateOne(filter: Document, update: (Document | Document[]), options: Optional<UpdateOptions>): Optional<UpdateResult>;

  /**
   * Updates multiple documents.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   * @throws WriteException
   */
  updateMany(filter: Document, update: (Document | Document[]), options: Optional<UpdateOptions>): Optional<UpdateResult>;
}

class BulkWriteOptions {

  /**
   * If true, when a write fails, return without performing the remaining
   * writes. If false, when a write fails, continue with the remaining writes, if any.
   * Defaults to true.
   */
  ordered: Boolean;

  /**
   * If true, allows the write to opt-out of document level validation.
   *
   * This option is sent only if the caller explicitly provides a true value. The default is to not send a value.
   * For servers < 3.2, this option is ignored and not sent as document validation is not available.
   * For unacknowledged writes using OP_INSERT, OP_UPDATE, or OP_DELETE, the driver MUST raise an error if the caller explicitly provides a value.
   */
  bypassDocumentValidation: Optional<Boolean>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 do not support comment for write operations,
   * and providing one will result in a server-side error.
   */
  comment: Optional<any>;

  /**
   * Map of parameter names and values. Values must be constant or closed expressions that do not
   * reference document fields. Parameters can then be accessed as variables in an aggregate
   * expression context (e.g. "$$var").
   *
   * The let parameter must be a valid Document type for server versions 5.0 and above.
   * Server versions prior to 5.0 do not support the let parameter for bulk write operations,
   * and providing it will result in a server-side error.
   *
   * The value of let will be passed to all update and delete, but not insert, commands.
   */
  let: Optional<Document>;
}

class InsertOneOptions {

  /**
   * If true, allows the write to opt-out of document level validation.
   *
   * This option is sent only if the caller explicitly provides a true value. The default is to not send a value.
   * For servers < 3.2, this option is ignored and not sent as document validation is not available.
   * For unacknowledged writes using OP_INSERT, the driver MUST raise an error if the caller explicitly provides a value.
   */
  bypassDocumentValidation: Optional<Boolean>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 do not support comment for insert command,
   * and providing one will result in a server-side error.
   */
  comment: Optional<any>;
}

class InsertManyOptions {

  /**
   * If true, allows the write to opt-out of document level validation.
   *
   * This option is sent only if the caller explicitly provides a true value. The default is to not send a value.
   * For servers < 3.2, this option is ignored and not sent as document validation is not available.
   * For unacknowledged writes using OP_INSERT, the driver MUST raise an error if the caller explicitly provides a value.
   */
  bypassDocumentValidation: Optional<Boolean>;

  /**
   * If true, when an insert fails, return without performing the remaining
   * inserts. If false, when an insert fails, continue with the remaining inserts, if any.
   * Defaults to true.
   */
  ordered: Boolean;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 do not support comment for insert command,
   * and providing one will result in a server-side error.
   */
  comment: Optional<any>;
}

class UpdateOptions {

  /**
   * A set of filters specifying to which array elements an update should apply.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.6, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  arrayFilters: Optional<Array<Document>>;

  /**
   * If true, allows the write to opt-out of document level validation.
   *
   * This option is sent only if the caller explicitly provides a true value. The default is to not send a value.
   * For servers < 3.2, this option is ignored and not sent as document validation is not available.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   */
  bypassDocumentValidation: Optional<Boolean>;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 4.2. Older servers >= 3.4 will report an error for using this option.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_MSG and servers < 4.2, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  hint: Optional<(String | Document)>;

  /**
   * When true, creates a new document if no document matches the query.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  upsert: Optional<Boolean>;


  /**
   * Map of parameter names and values. Values must be constant or closed
   * expressions that do not reference document fields. Parameters can then be
   * accessed as variables in an aggregate expression context (e.g. "$$var").
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 5.0. Older servers >= 2.6 (and possibly earlier) will report an error for using this option.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  let: Optional<Document>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 do not support comment for update command,
   * and providing one will result in a server-side error.
   */
  comment: Optional<any>;
}

class ReplaceOptions {

  /**
   * If true, allows the write to opt-out of document level validation.
   *
   * This option is sent only if the caller explicitly provides a true value. The default is to not send a value.
   * For servers < 3.2, this option is ignored and not sent as document validation is not available.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   */
  bypassDocumentValidation: Optional<Boolean>;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 4.2. Older servers >= 3.4 will report an error for using this option.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_MSG and servers < 4.2, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  hint: Optional<(String | Document)>;

  /**
   * When true, creates a new document if no document matches the query.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  upsert: Optional<Boolean>;

  /**
   * Map of parameter names and values. Values must be constant or closed
   * expressions that do not reference document fields. Parameters can then be
   * accessed as variables in an aggregate expression context (e.g. "$$var").
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 5.0. Older servers >= 2.6 (and possibly earlier) will report an error for using this option.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  let: Optional<Document>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 do not support comment for update command,
   * and providing one will result in a server-side error.
   */
  comment: Optional<any>;
}

class DeleteOptions {

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_DELETE, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 4.4. Older servers >= 3.4 will report an error for using this option.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_DELETE, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_MSG and servers < 4.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   */
  hint: Optional<(String | Document)>;

  /**
   * Map of parameter names and values. Values must be constant or closed
   * expressions that do not reference document fields. Parameters can then be
   * accessed as variables in an aggregate expression context (e.g. "$$var").
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 5.0. Older servers >= 2.6 (and possibly earlier) will report an error for using this option.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   */
  let: Optional<Document>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 do not support comment for delete command,
   * and providing one will result in a server-side error.
   */
  comment: Optional<any>;
}
Bulk Write Models
interface WriteModel {
  // marker interface for writes that can be batched together.
}

class InsertOneModel implements WriteModel {

  /**
   * The document to insert.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/insert/
   */
  document: Document;
}

class DeleteOneModel implements WriteModel {

  /**
   * The filter to limit the deleted documents.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   */
  filter: Document;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_DELETE, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 4.4. Older servers >= 3.4 will report an error for using this option.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_DELETE, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_MSG and servers < 4.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   */
  hint: Optional<(String | Document)>;
}

class DeleteManyModel implements WriteModel {

  /**
   * The filter to limit the deleted documents.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   */
  filter: Document;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_DELETE, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 4.4. Older servers >= 3.4 will report an error for using this option.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_DELETE, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_MSG and servers < 4.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/delete/
   */
  hint: Optional<(String | Document)>;
}

class ReplaceOneModel implements WriteModel {

  /**
   * The filter to limit the replaced document.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  filter: Document;

  /**
   * The document with which to replace the matched document.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  replacement: Document;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 4.2. Older servers >= 3.4 will report an error for using this option.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_MSG and servers < 4.2, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  hint: Optional<(String | Document)>;

  /**
   * When true, creates a new document if no document matches the query.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  upsert: Optional<Boolean>;
}

class UpdateOneModel implements WriteModel {

  /**
   * The filter to limit the updated documents.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  filter: Document;

  /**
   * A document or pipeline containing update operators.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  update: (Document | Document[]);

  /**
   * A set of filters specifying to which array elements an update should apply.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.6, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  arrayFilters: Optional<Array<Document>>;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 4.2. Older servers >= 3.4 will report an error for using this option.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_MSG and servers < 4.2, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  hint: Optional<(String | Document)>;

  /**
   * When true, creates a new document if no document matches the query.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  upsert: Optional<Boolean>;
}

class UpdateManyModel implements WriteModel {

  /**
   * The filter to limit the updated documents.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  filter: Document;

  /**
   * A document or pipeline containing update operators.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  update: (Document | Document[]);

  /**
   * A set of filters specifying to which array elements an update should apply.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.6, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  arrayFilters: Optional<Array<Document>>;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 4.2. Older servers >= 3.4 will report an error for using this option.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_UPDATE, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes using OP_MSG and servers < 4.2, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  hint: Optional<(String | Document)>;

  /**
   * When true, creates a new document if no document matches the query.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  upsert: Optional<Boolean>;
}
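
As a non-normative usage sketch (the constructor style and the collection handle are placeholders for whatever idiom a driver chooses), a mixed batch might be assembled and submitted as follows:

const requests: WriteModel[] = [
  new InsertOneModel({ _id: 1, status: "new" }),
  new UpdateOneModel({ _id: 1 }, { $set: { status: "processed" } }),
  new DeleteManyModel({ status: "stale" }),
];

const result = collection.bulkWrite(requests, { ordered: true });
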
Write Results

The acknowledged property is defined for languages/frameworks without a sufficient optional type. Hence, a driver may choose to return an optional result (e.g. Optional<BulkWriteResult>) such that unacknowledged writes don't have a value and acknowledged writes do have a value.

[!NOTE] If you have a choice, consider providing the acknowledged member and raising an error if the other fields are accessed in an unacknowledged write. Instead of users receiving a null reference exception, you have the opportunity to provide an informative error message indicating the correct way to handle the situation. For instance, "The insertedCount member is not available when the write was unacknowledged. Check the acknowledged member to avoid this error."

Any result class with all parameters marked NOT REQUIRED is ultimately NOT REQUIRED as well. For instance, the InsertOneResult has all NOT REQUIRED parameters and is therefore also NOT REQUIRED, allowing a driver to use "void" as the return value for the insertOne method.
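
A sketch of the pattern suggested in the note above, using the InsertOneResult defined below as an example; the accessor style and error type are illustrative only:

// Illustrative implementation: expose acknowledged directly and raise an
// informative error when other members are read for an unacknowledged write.
class InsertOneResult {
  constructor(
    public readonly acknowledged: boolean,
    private readonly _insertedId?: unknown
  ) {}

  get insertedId(): unknown {
    if (!this.acknowledged) {
      throw new Error(
        "The insertedId member is not available when the write was unacknowledged. " +
          "Check the acknowledged member to avoid this error."
      );
    }
    return this._insertedId;
  }
}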

class BulkWriteResult {

  /**
   * Indicates whether this write result was acknowledged. If not, then all
   * other members of this result will be undefined.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  acknowledged: Boolean;

  /**
   * Number of documents inserted.
   */
  insertedCount: Int64;

  /**
   * Map of the index of the operation to the id of the inserted document.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  insertedIds: Map<Int64, any>;

  /**
   * Number of documents matched for update.
   */
  matchedCount: Int64;

  /**
   * Number of documents modified.
   */
  modifiedCount: Int64;

  /**
   * Number of documents deleted.
   */
  deletedCount: Int64;

  /**
   * Number of documents upserted.
   */
  upsertedCount: Int64;

  /**
   * Map of the index of the operation to the id of the upserted document.
   */
  upsertedIds: Map<Int64, any>;

}

/**
 * Note: this class is "NOT REQUIRED" since all of its fields are marked
 * "NOT REQUIRED". Drivers that do not implement this class SHOULD use void
 * as the return type for insertOne.
 */
class InsertOneResult {

  /**
   * Indicates whether this write result was acknowledged. If not, then all
   * other members of this result will be undefined.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  acknowledged: Boolean;

  /**
   * The identifier that was inserted. If the server generated the identifier, this value
   * will be null as the driver does not have access to that data.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  insertedId: any;

}

/**
 * Note: this class is "NOT REQUIRED" since all of its fields are marked
 * "NOT REQUIRED". Drivers that do not implement this class SHOULD use void
 * as the return type for insertMany.
 */
class InsertManyResult {

  /**
   * Indicates whether this write result was acknowledged. If not, then all
   * other members of this result will be undefined.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  acknowledged: Boolean;

  /**
   * Map of the index of the inserted document to the id of the inserted document.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  insertedIds: Map<Int64, any>;

}

class DeleteResult {

  /**
   * Indicates whether this write result was acknowledged. If not, then all
   * other members of this result will be undefined.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  acknowledged: Boolean;

  /**
   * The number of documents that were deleted.
   */
  deletedCount: Int64;

}

class UpdateResult {

  /**
   * Indicates whether this write result was acknowledged. If not, then all
   * other members of this result will be undefined.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  acknowledged: Boolean;

  /**
   * The number of documents that matched the filter.
   */
  matchedCount: Int64;

  /**
   * The number of documents that were modified.
   */
  modifiedCount: Int64;

  /**
   * The number of documents that were upserted.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property so long as
   * it is always possible to infer whether an upsert has taken place. Since
   * the "_id" of an upserted document could be null, a null "upsertedId" may
   * be ambiguous in some drivers. If so, this field can be used to indicate
   * whether an upsert has taken place.
   */
  upsertedCount: Int64;

  /**
   * The identifier of the inserted document if an upsert took place.
   */
  upsertedId: any;

}
Error Handling

Defined below are error and exception types that should be reported from the various write methods. Since error types across languages would be impossible to reconcile, the below definitions represent the fields and names for the information that should be present. Structure isn't important as long as the information is available.

Drivers SHOULD report errors however they report other server errors: by raising an exception, returning "false" and populating an error struct, or another idiom that is consistent with other server errors.

WriteConcernError
class WriteConcernError {

  /**
   * An integer value identifying the write concern error. Corresponds to the
   * "writeConcernError.code" field in the command response.
   *
   * @see https://www.mongodb.com/docs/manual/reference/method/WriteResult/
   */
  code: Int32;

  /**
   * A document identifying the write concern setting related to the error.
   * Corresponds to the "writeConcernError.errInfo" field in the command
   * response.
   *
   * @see https://www.mongodb.com/docs/manual/reference/method/WriteResult/
   */
  details: Document;

  /**
   * A description of the error. Corresponds to the
   * "writeConcernError.errmsg" field in the command response.
   *
   * @see https://www.mongodb.com/docs/manual/reference/method/WriteResult/
   */
  message: String;

}

Drivers MUST construct a WriteConcernError from a server reply as follows:

  • Set code to writeConcernError.code.
  • Set message to writeConcernError.errmsg if available.
  • Set details to writeConcernError.errInfo if available. Drivers MUST NOT parse inside errInfo.

See writeConcernError Examples in the Read/Write Concern spec for examples of how a server represents write concern errors in replies.

WriteError

Write errors for insert, update, and delete commands are reported as objects within a writeErrors array field in the command response. Drivers MUST construct a WriteError from a server reply as follows (where writeErrors[] refers to a particular element in the array):

  • Set code to writeErrors[].code.
  • Set message to writeErrors[].errmsg if available.
  • Set details to writeErrors[].errInfo if available. Drivers MUST NOT parse inside errInfo.

For single-statement writes (i.e. insertOne, updateOne, updateMany, replaceOne, deleteOne, and deleteMany), a single write error may be reported in the array and writeErrors[0].index will be zero.

For multi-statement writes (i.e. insertMany and bulkWrite), potentially many write errors may be reported in the array and the index property will be set accordingly. Since the reported index is specific to each command, drivers MUST adjust the index accordingly for BulkWriteError.index.
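As a rough illustration of that adjustment (names below are hypothetical, not part of this spec), a driver that splits a bulk write into several commands can track how many operations preceded each command and offset the server-reported index:

// Illustrative sketch: offset a server-reported writeErrors[].index by the number
// of operations sent in earlier commands of the same bulk write.
interface ServerWriteError { index: number; code: number; errmsg?: string; errInfo?: object; }

function toBulkWriteErrorIndex(operationsSentInEarlierCommands: number, serverError: ServerWriteError): number {
  // The server's index is relative to the single command it executed; adding the
  // offset makes it refer to the caller's original list of requests.
  return operationsSentInEarlierCommands + serverError.index;
}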

class WriteError {

  /**
   * An integer value identifying the write error. Corresponds to the
   * "writeErrors[].code" field in the command response.
   *
   * @see https://www.mongodb.com/docs/manual/reference/method/WriteResult/
   */
  code: Int32;

  /**
   * A document providing more information about the write error (e.g. details
   * pertaining to document validation). Corresponds to the
   * "writeErrors[].errInfo" field in the command response.
   *
   * @see https://www.mongodb.com/docs/manual/reference/method/WriteResult/
   */
  details: Document;

  /**
   * A description of the error. Corresponds to the "writeErrors[].errmsg"
   * field in the command response.
   *
   * @see https://www.mongodb.com/docs/manual/reference/method/WriteResult/
   */
  message: String;

}

class BulkWriteError : WriteError {

  /**
   * The index of the request that errored. This is derived in part from the
   * "writeErrors[].index" field in the command response; however, drivers
   * MUST adjust the index accordingly for bulk writes that execute multiple
   * write commands.
   */
  index: Int32;

  /**
   * The request that errored.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  request: Optional<WriteModel>;

}

/**
 * NOTE: Only one of writeConcernError or writeError will be populated at a time. Your driver must present the offending
 * error to the user.
 */
class WriteException {

  /**
   * The error that occurred on account of write concern failure.
   */
  writeConcernError: Optional<WriteConcernError>;

  /**
   * The error that occurred on account of a non-write concern failure.
   */
  writeError: Optional<WriteError>;

}

class BulkWriteException {

  /**
   * The requests that were sent to the server.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  processedRequests: Optional<Iterable<WriteModel>>;

  /**
   * The requests that were not sent to the server.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  unprocessedRequests: Optional<Iterable<WriteModel>>;

  /**
   * The intermediary write result for any operations that succeeded before
   * the bulk write was interrupted.
   *
   * NOT REQUIRED: Drivers may choose to not provide this property.
   */
  writeResult: Optional<BulkWriteResult>;

  /**
   * The error that occurred on account of write concern failure. If the error was write concern related, this field must be present.
   */
  writeConcernError: Optional<WriteConcernError>;

  /**
   * The error that occurred on account of a non-write concern failure. This might be empty if the error was write concern related.
   */
  writeErrors: Optional<Iterable<BulkWriteError>>;

}
Find And Modify
interface Collection {

  /**
   * Finds a single document and deletes it, returning the original. The document to return may be null.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   * @throws WriteException
   */
  findOneAndDelete(filter: Document, options: Optional<FindOneAndDeleteOptions>): Optional<Document>;

  /**
   * Finds a single document and replaces it, returning either the original or the replaced
   * document. The document to return may be null.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   * @throws WriteException
   */
  findOneAndReplace(filter: Document, replacement: Document, options: Optional<FindOneAndReplaceOptions>): Optional<Document>;

  /**
   * Finds a single document and updates it, returning either the original or the updated
   * document. The document to return may be null.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   * @throws WriteException
   */
  findOneAndUpdate(filter: Document, update: (Document | Document[]), options: Optional<FindOneAndUpdateOptions>): Optional<Document>;

}

enum ReturnDocument {
  /**
   * Indicates to return the document before the update, replacement, or insert occurred.
   */
   BEFORE,
  /**
   * Indicates to return the document after the update, replacement, or insert occurred.
   */
   AFTER
}

class FindOneAndDeleteOptions {

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 4.4. Older servers >= 4.2 will report an error for using this option.
   * For servers < 4.2, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes and servers < 4.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  hint: Optional<(String | Document)>;

  /**
   * The maximum amount of time to allow the query to run.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * NOTE: This option is deprecated in favor of timeoutMS.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  maxTimeMS: Optional<Int64>;

  /**
   * Limits the fields to return for all matching documents.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * Note: this option is mapped to the "fields" findAndModify command option.
   *
   * @see https://www.mongodb.com/docs/manual/tutorial/project-fields-from-query-results
   */
  projection: Optional<Document>;

  /**
   * Determines which document the operation modifies if the query selects multiple documents.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  sort: Optional<Document>;

  /**
   * Map of parameter names and values. Values must be constant or closed
   * expressions that do not reference document fields. Parameters can then be
   * accessed as variables in an aggregate expression context (e.g. "$$var").
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 5.0. Older servers >= 2.6 (and possibly earlier) will report an error for using this option.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  let: Optional<Document>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 do not support comment for findAndModify command,
   * and providing one will result in a server-side error.
   */
  comment: Optional<any>;
}

class FindOneAndReplaceOptions {

  /**
   * If true, allows the write to opt-out of document level validation.
   *
   * This option is sent only if the caller explicitly provides a true value. The default is to not send a value.
   * For servers < 3.2, this option is ignored and not sent as document validation is not available.
   */
  bypassDocumentValidation: Optional<Boolean>;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 4.4. Older servers >= 4.2 will report an error for using this option.
   * For servers < 4.2, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes and servers < 4.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  hint: Optional<(String | Document)>;

  /**
   * The maximum amount of time to allow the query to run.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * NOTE: This option is deprecated in favor of timeoutMS.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  maxTimeMS: Optional<Int64>;

  /**
   * Limits the fields to return for all matching documents.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * Note: this option is mapped to the "fields" findAndModify command option.
   *
   * @see https://www.mongodb.com/docs/manual/tutorial/project-fields-from-query-results
   */
  projection: Optional<Document>;

  /**
   * When ReturnDocument.After, returns the replaced or inserted document rather than the original.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * Note: this option is mapped to the "new" findAndModify boolean field. ReturnDocument.Before represents false,
   * and ReturnDocument.After represents true.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  returnDocument: Optional<ReturnDocument>;

  /**
   * Determines which document the operation modifies if the query selects multiple documents.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  sort: Optional<Document>;

  /**
   * When true, findAndModify creates a new document if no document matches the query.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  upsert: Optional<Boolean>;

  /**
   * Map of parameter names and values. Values must be constant or closed
   * expressions that do not reference document fields. Parameters can then be
   * accessed as variables in an aggregate expression context (e.g. "$$var").
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 5.0. Older servers >= 2.6 (and possibly earlier) will report an error for using this option.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  let: Optional<Document>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 do not support comment for findAndModify command,
   * and providing one will result in a server-side error.
   */
  comment: Optional<any>;
}

class FindOneAndUpdateOptions {

  /**
   * A set of filters specifying to which array elements an update should apply.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.6, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/update/
   */
  arrayFilters: Optional<Array<Document>>;

  /**
   * If true, allows the write to opt-out of document level validation.
   *
   * This option is sent only if the caller explicitly provides a true value. The default is to not send a value.
   * For servers < 3.2, this option is ignored and not sent as document validation is not available.
   */
  bypassDocumentValidation: Optional<Boolean>;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * For servers < 3.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  collation: Optional<Document>;

  /**
   * The index to use. Specify either the index name as a string or the index key pattern.
   * If specified, then the query system will only consider plans using the hinted index.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 4.4. Older servers >= 4.2 will report an error for using this option.
   * For servers < 4.2, the driver MUST raise an error if the caller explicitly provides a value.
   * For unacknowledged writes and servers < 4.4, the driver MUST raise an error if the caller explicitly provides a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  hint: Optional<(String | Document)>;

  /**
   * The maximum amount of time to allow the query to run.
   *
   * NOTE: This option is deprecated in favor of timeoutMS.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  maxTimeMS: Optional<Int64>;

  /**
   * Limits the fields to return for all matching documents.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * Note: this option is mapped to the "fields" findAndModify command option.
   *
   * @see https://www.mongodb.com/docs/manual/tutorial/project-fields-from-query-results
   */
  projection: Optional<Document>;

  /**
   * When ReturnDocument.After, returns the replaced or inserted document rather than the original.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * Note: this option is mapped to the "new" findAndModify boolean field. ReturnDocument.Before represents false,
   * and ReturnDocument.After represents true.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  returnDocument: Optional<ReturnDocument>;

  /**
   * Determines which document the operation modifies if the query selects multiple documents.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  sort: Optional<Document>;

  /**
   * When true, creates a new document if no document matches the query.
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  upsert: Optional<Boolean>;

  /**
   * Map of parameter names and values. Values must be constant or closed
   * expressions that do not reference document fields. Parameters can then be
   * accessed as variables in an aggregate expression context (e.g. "$$var").
   *
   * This option is sent only if the caller explicitly provides a value. The default is to not send a value.
   * This option is only supported by servers >= 5.0. Older servers >= 2.6 (and possibly earlier) will report an error for using this option.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/findAndModify/
   */
  let: Optional<Document>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 do not support comment for findAndModify command,
   * and providing one will result in a server-side error.
   */
  comment: Optional<any>;
}
Update vs. Replace Validation

The update family of operations require that the update document parameter MUST have only atomic modifiers. In practice, this means that introspection needs to happen on that document to enforce this. However, it is enough to only check the first element in the document. If it begins with a $ sign and the rest of the document's elements do not, the server will throw an error. Note that it is required that an update document have at least one atomic modifier.

The replace family of operations require that the replacement document parameter MUST NOT begin with an atomic modifier. In practice, this means that introspection needs to happen on that document to enforce this. However, it is enough to only check the first element in the document. If it does not begin with a $ sign but an element later on does, the server will throw an error.
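As a minimal sketch of those first-element checks (assuming plain update documents rather than aggregation pipelines; the function names are illustrative, not spec API):

// Illustrative sketch of the first-key introspection described above.
function assertUpdateDocument(update: Record<string, unknown>): void {
  const firstKey = Object.keys(update)[0];
  if (firstKey === undefined || !firstKey.startsWith("$")) {
    throw new Error("Update document must contain only atomic modifiers (e.g. $set)");
  }
}

function assertReplacementDocument(replacement: Record<string, unknown>): void {
  const firstKey = Object.keys(replacement)[0];
  if (firstKey !== undefined && firstKey.startsWith("$")) {
    throw new Error("Replacement document must not contain atomic modifiers");
  }
}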

Aggregation Pipelines with Write Stages

This section discusses special considerations for aggregation pipelines that contain write stages (e.g. $out, $merge).

Returning a cursor on the output collection

As noted in the documentation for the aggregate helper earlier in this document, $out and $merge are special pipeline stages that cause no results to be returned from the server. As such, drivers MAY set up a cursor to be executed upon iteration against the output collection and return that instead of an iterable that would otherwise have no results.

Drivers that do so for $merge MAY remind users that such a cursor may return more documents than were written by the aggregation (e.g. documents that existed in the collection prior to $merge being executed).

Read preferences and server selection

This section is only applicable if an explicit (i.e. per-operation) or inherited (e.g. from a Collection) read preference is available and it is not a primary read preference (i.e. { "mode": "primary" }).

Historically, only primaries could execute an aggregation pipeline with $out or $merge and drivers never considered a read preference for the operation. As of featureCompatibilityVersion 4.4, secondaries can now execute pipelines with $out or $merge. Since drivers do not track featureCompatibilityVersion, the decision to consider a read preference for such a pipeline will depend on the wire version(s) of the server(s) to which the driver is connected.

If there are one or more available servers and one or more of those servers is pre-5.0 (i.e. wire version < 13), drivers MUST NOT use the available read preference and MUST instead select a server using a primary read preference.

Otherwise, if there are either no available servers, all available servers are 5.0+ (i.e. wire version >= 13), or the topology type is LoadBalanced (we can assume the backing mongos is 5.0+), drivers MUST use the available read preference.

Drivers SHOULD augment their server selection algorithm such that this logic can be enforced within a single server selection attempt.

Drivers MUST discern the read preference used to select a server for the operation, which SHALL be used for specifying the $readPreference global command argument and passing read preference to mongos and load balancers (if applicable).
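The decision described above might be sketched as follows (types and names are illustrative, not spec API):

// Illustrative sketch: choose the read preference for a pipeline with $out/$merge.
interface ServerDescription { maxWireVersion: number; }

function readPreferenceForWriteStagePipeline(
  availableServers: ServerDescription[],
  requestedReadPreference: object,   // the explicit or inherited non-primary preference
  topologyIsLoadBalanced: boolean
): object {
  if (topologyIsLoadBalanced || availableServers.length === 0) {
    // Assume a 5.0+ backing mongos, or retry selection with the requested preference.
    return requestedReadPreference;
  }
  const anyPre50 = availableServers.some((s) => s.maxWireVersion < 13);
  return anyPre50 ? { mode: "primary" } : requestedReadPreference;
}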

Test Plan

See the README for tests.

Motivation

Current drivers have chosen slightly different names and semantics for the same operations and options. In addition, not all drivers offer all the same operations and methods. As such, it is difficult to transition from driver to driver making the jobs of polyglot developers, documentation authors, and support engineers more difficult.

Backwards Compatibility

This spec should be mostly backwards compatible as it is very lenient. Drivers finding a backwards compatibility problem should attempt to work around it using an acceptable deviation. In rare cases, a driver may need to break backwards compatibility. This should be done in accordance with a versioning scheme indicating that a backwards compatible break may have occurred in conjunction with release documentation and warnings.

Reference Implementation

See Test Plan

Related Terminology

If a driver needs to refer to items in the following list, the below are the accepted forms of those terms, and deviations from the Naming section are still permissible.

  • Read Preference: readPreference
  • Read Concern: readConcern
  • Write Concern: writeConcern

Q & A

Q: Why do the names of the fields differ from those defined in the MongoDB manual?
Documentation and commands often refer to same-purposed fields with different names, making it difficult to have a cohesive API. In addition, occasionally a name was correct at one point but its purpose has since expanded to the point where the initial name no longer accurately describes its function.

In addition, responses from the servers are sometimes cryptic and used for the purposes of compactness. In these cases, we felt the more verbose form was desirable for self-documentation purposes.

Q: Where is read preference?
Read preference is about selecting a server with which to perform a read operation, such as a query, a count, or an aggregate. Since all operations defined in this specification are performed on a collection, it's uncommon that two different read operations on the same collection would use a different read preference, potentially getting out-of-sync results. As such, the most natural place to indicate read preference is on the client, the database, or the collection itself and not the operations within it.

However, it might be that a driver needs to expose this selection filter to a user per operation for various reasons. As noted before, it is permitted to specify this, along with other driver-specific options, in some alternative way.

Q: Where is read concern?
Read concern is about indicating how reads are handled. Since all operations defined in this specification are performed on a collection, it's uncommon that two different read operations on the same collection would use a different read concern, potentially causing mismatched and out-of-sync data. As such, the most natural place to indicate read concern is on the client, the database, or the collection itself and not the operations within it.

However, it might be that a driver needs to expose read concern to a user per operation for various reasons. As noted before, it is permitted to specify this, along with other driver-specific options, in some alternative way.

Q: Where is write concern?
Write concern is about indicating how writes are acknowledged. Since all operations defined in this specification are performed on a collection, it's uncommon that two different write operations on the same collection would use a different write concern, potentially causing mismatched and out-of-sync data. As such, the most natural place to indicate write concern is on the client, the database, or the collection itself and not the operations within it. See the Read/Write Concern specification for the API of constructing a read/write concern and associated API.

However, it might be that a driver needs to expose write concern to a user per operation for various reasons. As noted before, it is permitted to specify this, along with other driver-specific options, in some alternative way.

Q: How do I throttle unacknowledged writes now that write concern is no longer defined on a per operation basis?
Some users used to throttle unacknowledged writes by using an acknowledged write concern every X number of operations. Going forward, the proper way to handle this is by using the bulk write API.
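For example (a hypothetical sketch, not spec API), an application can batch its documents and let each acknowledged bulk write act as natural back-pressure:

// Illustrative sketch: one acknowledged round trip per batch provides throttling
// without mixing write concerns across individual operations.
async function insertWithBackpressure(collection: any, docs: object[], batchSize = 1000): Promise<void> {
  for (let i = 0; i < docs.length; i += batchSize) {
    await collection.insertMany(docs.slice(i, i + batchSize), { ordered: false });
  }
}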

Q: What is the logic for adding "One" or "Many" into the method and model names?
If the maximum number of documents affected can only be one, we added "One" into the name. This makes it explicit that the maximum number of documents that could be affected is one vs. infinite.

In addition, the current API exposed by all our drivers has the default value for "one" or "many" set differently for update and delete. This generally causes some issues for new developers and is a minor annoyance for existing developers. The safest way to combat this without introducing discrepancies between drivers/driver versions or breaking backwards compatibility was to use multiple methods, each signifying the number of documents that could be affected.

Q: Speaking of "One", where is findOne?
If your driver wishes to offer a findOne method, that is perfectly fine. If you choose to implement findOne, please keep to the naming conventions followed by the FindOptions and keep in mind that certain things don't make sense like limit (which should be -1), tailable, awaitData, etc...

Q: What considerations have been taken for the eventual merging of query and the aggregation framework?
In the future, it is probable that a new query engine (QE) will look very much like the aggregation framework. Given this assumption, we know that both find and aggregate will be renderable in QE, each maintaining their ordering guarantees for full backwards compatibility.

Hence, the only real concern is how to initiate a query using QE. While find is preferable, it would be a backwards breaking change. It might be decided that find is what should be used, and all drivers will release major revisions with this backwards breaking change. Alternatively, it might be decided that another initiator would be used.

Q: Didn't we just build a bulk API?
Yes, most drivers did just build out a bulk API (fluent-bulk-api). While unfortunate, we felt it better for the bulk API to be consistent with the rest of the methods in the CRUD family of operations. However, the fluent-bulk-api can still be used, as this change is not backwards breaking. Any driver that implemented the fluent bulk API should deprecate it, and drivers that have not built it should not do so.

Q: What about explain?
Explain has been determined not to be a normal use case for a driver. We'd like users to use the shell for this purpose. However, explain is still possible from a driver. For find, it can be passed as a modifier. Aggregate can be run using a runCommand method, passing the explain option. In addition, server 3.0 offers an explain command that can be run using a runCommand method.
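For example (illustrative only), either of the following command documents could be sent through a generic runCommand helper:

{ aggregate: "coll", pipeline: [ { $match: {} } ], explain: true }

{ explain: { find: "coll", filter: {} }, verbosity: "queryPlanner" }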

Q: Where did modifiers go in FindOptions?
MongoDB 3.2 introduced the find command. As opposed to using the general "modifiers" field any longer, each relevant option is listed explicitly. Some options, such as "tailable" or "singleBatch" are not listed as they are derived from other fields. Upgrading a driver should be a simple procedure of deprecating the "modifiers" field and introducing the new fields. When a collision occurs, the explicitly specified field should override the value in "modifiers".

Q: Where is save?
Drivers have historically provided a save method, which was syntactic sugar for upserting a document when it contained an identifier and inserting it otherwise. While the save method may be convenient for interactive environments, such as the shell, it was intentionally excluded from the CRUD specification for language drivers for several reasons. The save method promotes a design pattern of "fetch, modify, replace" and invites race conditions in application logic. Additionally, the split nature of save makes it difficult to discern at a glance if application code will perform an insert or a potentially dangerous full-document replacement. Instead of relying on save, application code should know whether the document already has an identifier and explicitly call insertOne or replaceOne with the upsert option.

Q: Where is useCursor in AggregateOptions?
Inline aggregation results are no longer supported in server 3.5.2+. The aggregate command must be provided either the cursor document or the explain boolean. AggregateOptions does not define an explain option. If a driver does support an explain option, the cursor document should be omitted if explain is true. Otherwise a cursor document must be added to the aggregate command. Regardless, useCursor is no longer needed. Removing useCursor is a backwards breaking change, so drivers should first deprecate this option in a minor release, and remove it in a major release.

Q: Where is singleBatch in FindOptions?
Drivers have historically allowed users to request a single batch of results (after which the cursor is closed) by specifying a negative value for the limit option. For servers < 3.2, a single batch may be requested by specifying a negative value in the numberToReturn wire protocol field. For servers >= 3.2, the find command defines limit as a non-negative integer option but introduces a singleBatch boolean option. Rather than introduce a singleBatch option to FindOptions, the spec preserves the existing API for limit and instructs drivers to convert negative values accordingly for servers >= 3.2.
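A minimal sketch of that conversion for servers >= 3.2 (the function name is illustrative, not spec API):

// Illustrative sketch of converting a negative limit into limit + singleBatch.
function applyLimit(findCommand: Record<string, unknown>, limit: number): void {
  if (limit < 0) {
    // A negative limit historically meant "one batch, then close the cursor".
    findCommand.limit = Math.abs(limit);
    findCommand.singleBatch = true;
  } else if (limit > 0) {
    findCommand.limit = limit;
  }
}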

Q: Why are client-side errors raised for some unsupported options?
Server versions before 3.4 were inconsistent about reporting errors for unrecognized command options and may simply ignore them, which means a client-side error is the only way to inform users that such options are unsupported. For unacknowledged writes using OP_MSG, a client-side error is necessary because the server has no chance to return a response (even though a 3.6+ server is otherwise capable of reporting errors for unrecognized options). For unacknowledged writes using legacy opcodes (i.e. OP_INSERT, OP_UPDATE, and OP_DELETE), the message body has no field with which to express these options so a client-side error is the only mechanism to inform the user that such options are unsupported. The spec does not explicitly refer to unacknowledged writes using OP_QUERY primarily because a response document is always returned and drivers generally would not consider using OP_QUERY precisely for that reason.

Q: Why does reverting to using count instead of aggregate with $collStats for estimatedDocumentCount not require a major version bump in the drivers, even though it might break users of the Stable API?
SemVer allows for a library to include a breaking change in a minor or patch version if the change is required to fix another accidental breaking change introduced in a prior version and is not expected to further break a large number of users. Given that the original switch to $collStats was a breaking change due to it not working on views, the number of users using estimatedDocumentCount with apiStrict: true is small, and the server is back-porting the addition of count to the Stable API, it was decided that this change was acceptable to make in minor version releases of the drivers per the aforementioned allowance in the SemVer spec.

Changelog

  • 2024-02-20: Migrated from reStructuredText to Markdown.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2022-04-21: Revert to using the count command for estimatedDocumentCount

  • 2022-02-18: Add let to BulkWriteOptions.

  • 2022-02-10: Specified that getMore command must explicitly send inherited comment.

  • 2022-02-01: Add comment attribute to all helpers.

  • 2022-01-27: Use optional return types for write commands and findAndModify

  • 2022-01-19: Deprecate the maxTimeMS option and require that timeouts be applied
    per the client-side operations timeout spec.

  • 2022-01-14: Add let to ReplaceOptions

  • 2021-11-10: Revise rules for applying read preference for aggregations with
    $out and $merge. Add let to FindOptions, UpdateOptions, DeleteOptions, FindOneAndDeleteOptions, FindOneAndReplaceOptions, FindOneAndUpdateOptions

  • 2021-09-28: Support aggregations with $out and $merge on 5.0+ secondaries

  • 2021-08-31: Allow unacknowledged hints on write operations if supported by
    server (reverts previous change).

  • 2021-06-02: Introduce WriteError.details and clarify WriteError construction

  • 2021-06-01: Add let to AggregateOptions

  • 2021-01-21: Update estimatedDocumentCount to use $collStats stage for servers >= 4.9

  • 2020-04-17: Specify that the driver must raise an error for unacknowledged
    hints on any write operation, regardless of server version.

  • 2020-03-19: Clarify that unacknowledged update, findAndModify, and delete
    operations with a hint option should raise an error on older server versions.

  • 2020-03-06: Added hint option for DeleteOne, DeleteMany, and FindOneAndDelete operations.

  • 2020-01-24: Added hint option for findAndModify update/replace operations.

  • 2020-01-17: Add allowDiskUse to FindOptions.

  • 2020-01-14: Deprecate oplogReplay option for find command

  • 2020-01-10: Clarify client-side error reporting for unsupported options

  • 2020-01-10: Error if hint specified for unacknowledged update using OP_UPDATE
    or OP_MSG for servers < 4.2

  • 2019-10-28: Removed link to old language examples.

  • 2019-09-26: Added hint option for update commands.

  • 2019-06-07: Consistent treatment for aggregate $merge and $out stages

  • 2019-05-01: Specify a document or pipeline for commands with updates in server 4.2+.

  • 2019-02-20: Mark the request field of BulkWriteError as NOT REQUIRED

  • 2018-11-30: Specify maxAwaitTimeMS in AggregateOptions

  • 2018-11-15: Aggregate commands with an $out stage should not specify batchSize

  • 2018-10-25: Note how results are backed for aggregate, distinct, and find operations

  • 2018-07-25: Added upsertedCount to UpdateResult.

  • 2018-06-07: Deprecated the count helper. Added the estimatedDocumentCount and
    countDocuments helpers.

  • 2018-03-05: Deprecate snapshot option

  • 2018-03-01: Deprecate maxScan query option.

  • 2018-02-06: Note that batchSize in FindOptions and AggregateOptions should also
    apply to getMore.

  • 2018-01-26: Only send bypassDocumentValidation option if it's true, don't send false.

  • 2017-10-23: Allow BulkWriteException to provide an intermediary write result.

  • 2017-10-17: Document negative limit for FindOptions.

  • 2017-10-09: Bumped minimum server version to 2.6 and removed references to
    older versions in spec and tests.

  • 2017-10-09: Prohibit empty insertMany() and bulkWrite() operations.

  • 2017-10-09: Split UpdateOptions and ReplaceOptions. Since replaceOne()
    previously used UpdateOptions, this may have BC implications for drivers using option classes.

  • 2017-10-05: Removed useCursor option from AggregateOptions.

  • 2017-09-26: Added hint option to AggregateOptions.

  • 2017-09-25: Added comment option to AggregateOptions.

  • 2017-08-31: Added arrayFilters to bulk write update models.

  • 2017-06-29: Remove requirement of using OP_KILL_CURSOR to kill cursors.

  • 2017-06-27: Added arrayFilters to UpdateOptions and FindOneAndUpdateOptions.

  • 2017-06-26: Added FAQ entry for omission of save method.

  • 2017-05-12: Removed extra "collation" option added to several bulk write models.

  • 2017-01-09: Removed modifiers from FindOptions and added in all options.

  • 2017-01-09: Changed the value type of FindOptions.skip and FindOptions.limit to
    Int64 with a note related to calculating batchSize for opcode writes.

  • 2017-01-09: Reworded description of how default values are handled and when to
    send certain options.

  • 2016-09-23: Included collation option in the bulk write models.

  • 2016-08-05: Added in collation option.

  • 2015-11-05: Typos in comments about bypassDocumentValidation

  • 2015-10-16: Added maxAwaitTimeMS to FindOptions.

  • 2015-10-01: Moved bypassDocumentValidation into BulkWriteOptions and removed it
    from the individual write models.

  • 2015-09-16: Added bypassDocumentValidation.

  • 2015-09-16: Added readConcern notes.

  • 2015-06-17: Added limit/batchSize calculation logic.

Collation

  • Status: Accepted
  • Minimum Server Version: 3.4

Abstract

As of MongoDB server version 3.4 (maxWireVersion 5), a collation option is supported by the query system for matching and sorting on language strings in a locale-aware fashion.

A driver MUST support a Collation option for each of the relevant operations in server versions >= 3.4 (maxWireVersion 5) and MUST throw an error if a user supplies a Collation option for the operation and the selected server has maxWireVersion < 5 or if the user is using opcode-based unacknowledged writes.

The CRUD and Index Management specs include the collation option in descriptions of API elements where it is supported. This document provides more details on the specific driver behavior required to handle the collation option.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Design Rationale

Collation is not a setting on a driver Client, Collection, or Database object

MongoDB supports a server-side collation default on a collection. It is problematic to add a default Collation option on a driver’s Collection or Database object, because automatic collection and index creation would bypass any driver object's Collation setting. Additionally, this would have a number of performance implications in the context of bulk writes: any bulk write on a Collection object having a default Collation would require iterating through each operation and adding a Collation document.

After the initial release of server version 3.4, many users will want to apply Collations to all operations on an existing collection. Such users will have to supply the Collation option to each operation explicitly; however, eventually the majority of users wishing to use Collations on all operations on a collection will create a collection with a server-side default. We chose to favor user verbosity right now over abstracting the feature for short-term gains.

Drivers throw an error if a user supplies Collation and the selected server does not support the feature

Server versions earlier than 3.4 don't always throw an error if an unknown option is supplied to certain operations. Because a Collation defines how documents are matched and sorted for both read and write operations, behavior differences between server versions are significant. Drivers therefore MUST throw an error if a user specifies a Collation and the selected server has a maxWireVersion < 5 or if using opcode-based unacknowledged writes.

Specification

Collation Document Model

For naming and deviation guidance, see the CRUD specification. Defined below are the constructs for drivers. Note that the only required field is locale, but the driver should let the server validate the presence of this field.

class Collation {
  /**
   * The locale.
   */
  locale: String

  /**
   * The case level.
   */
  caseLevel: Optional<Boolean>

  /**
   * The case ordering.
   */
  caseFirst: Optional<String>

  /**
   * The number of comparison levels to use.
   */
  strength: Optional<Integer>

  /**
   * Whether to order numbers based on numerical order and not collation order.
   */
  numericOrdering: Optional<Boolean>

  /**
   * Whether spaces and punctuation are considered base characters.
   */
  alternate: Optional<String>

  /**
   * Which characters are affected by alternate: "shifted".
   */
  maxVariable: Optional<String>

  /**
   * If true, normalizes text into Unicode NFD.
   */
  normalization: Optional<Boolean>

  /**
   * Causes secondary differences to be considered in reverse order, as it is done in the French language.
   */
  backwards: Optional<Boolean>
}

Unknown Options

For forward compatibility, a driver MUST NOT raise an error when a user provides unknown options or values. The driver MUST NOT validate collation document types; the driver relies on the server to validate values and other contents of the collation document.

Generic Command Method

If a driver offers a generic RunCommand method on the database object, the driver MUST NOT validate whether the provided command document contains a "collation" subdocument, and MUST NOT check the server's wire version before sending the command including the "collation" subdocument.

API

The driver helpers that must support a collation option include the create collection helper, any CRUD API components relying on the MongoDB query system (including updates and deletes) and some index management helpers. The CRUD-related commands that take a collation option are:

  • aggregate
  • count
  • distinct
  • find (command only)
  • findAndModify
  • geoNear
  • group
  • mapReduce
  • delete (command only)
  • update (command only)

The collation option is sent to the server in the form of a BSON Document. See the CRUD specification for details on supporting the option in the CRUD API.

Driver helpers manipulating or using indexes MUST support a collation option. These include creating, deleting, and hinting an index. See the Index Management specification for details.

Require maxWireVersion 5

Drivers MUST require the server's maxWireVersion >= 5 to support Collations. When a collation is explicitly specified for a server with maxWireVersion < 5, the driver MUST raise an error.
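A minimal sketch of that pre-flight check (names are illustrative):

// Illustrative sketch: raise a client-side error before sending the command.
function checkCollationSupport(maxWireVersion: number, collation?: object): void {
  if (collation !== undefined && maxWireVersion < 5) {
    throw new Error("Collation is not supported by the selected server (requires MongoDB 3.4+)");
  }
}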

Opcode-based Unacknowledged Writes

The driver MUST NOT allow collation with opcodes, because the server doesn't support it. If a driver uses opcode-based writes when the write concern is unacknowledged, the driver MUST raise an error if a collation is explicitly set.

Setting a default collation on a collection

Drivers MUST allow the create command to accept a parameter called "collation". For example,

db.command({
    create: "myCollection",
    collation: {locale: "en_US"}
});

BulkWrite API

If maxWireVersion < 5, the driver MUST inspect each BulkWrite operation model for a collation and MUST raise an error and MUST NOT send any operations to the server if a collation is explicitly specified on an operation. For example, the user might provide BulkWrite operation models as follows:

db.collection.bulkWrite([
  {insertOne: { ... }},

  {updateOne: { filter: { name: "PING" },
                update: { $set: { name: "pong" }},
                collation: { locale: "en_US", strength: 2 }}},
  {updateMany: {..., collation: {...}}},
  {replaceOne: {..., collation: {...}}},
  {deleteOne: {..., collation: {...}}},
  {deleteMany: {..., collation: {...}}}
]);

The driver must inspect each operation for a Collation if maxWireVersion is < 5 and fail the entire bulkWrite if a collation was explicitly specified. In the example above, that means even the insertOne (without Collation) MUST NOT be sent.
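A rough sketch of that inspection (the model shape and names are illustrative, not spec API):

// Illustrative sketch: fail the whole bulk write before sending any operation.
interface BulkOperationModel { collation?: object; }

function assertBulkWriteCollationSupport(models: BulkOperationModel[], maxWireVersion: number): void {
  if (maxWireVersion >= 5) return;
  if (models.some((m) => m.collation !== undefined)) {
    throw new Error("Collation is not supported by the selected server; no operations were sent");
  }
}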

Test Plan

There is no specific test plan for driver Collation support; however drivers should test each affected CRUD, Index Management API, and collection creation/modification component to ensure that Collation is a supported option.

In addition, drivers should test that two indexes can be created with identical key patterns and different collations. A custom name must be provided for one of them. Then, the test should ensure that the correct index is dropped when the index is dropped by name (e.g. via the Index Management spec's dropOne helper).

Drivers should also test that errors are raised in each place Collation can be provided to an API method and the selected server has maxWireVersion < 5.

Backwards Compatibility

There should be no backwards compatibility concerns.

Reference Implementation


Q & A

Q: Insert doesn’t take a collation?

A: No, only queries take collation. A collation is a per-operation value; it does not affect how the data is stored.

Q: Delete and Update take a collation?

A: Yes, delete and update operations use the query system to match against a provided delete/update filter. Providing a collation when deleting a document matching ObjectID() doesn’t change anything, but matching a string value would.

Q: How do I create a collection with default collation? Does it affect my existing collection creation helper?

A: A collection with a default collation can be created using the create helper and by providing a collation option.

Changelog

  • 2024-02-15: Migrated from reStructuredText to Markdown.
  • 2022-10-05: Remove spec front matter and reformat changelog.
  • 2017-05-15: Minor markup fixes in API section.
  • 2016-08-31: Initial version.

Write Commands Specification

  • Status: Accepted
  • Minimum Server Version: 2.6



Goals

  • Method to do writes (insert/update/delete) that declares the write concern up front
  • Support for batch operations
  • As efficient as possible
  • getLastError is disallowed after anything except old style write ops

Non-Goals

  • We're not specifying nor building any generic mechanism for feature discovery here
  • We're not specifying nor building any generic way to be backwards compatible

Requests Format

We use a very similar request format for all insert, update, and delete commands, described below. A request contains parameters on how to apply the batch items and, of course, the batch items themselves.

Generic Fields

  • <write op>: mandatory, with a string type, where <write op> can be insert, update, or delete and the content string is a valid collection name against which the write should be directed. The <write op> must be the first key in the document.

    Example:

      db.runCommand({ update: "users", ...

  • writeConcern: optional, with a valid write concern BSONObj type, which contains the journaling, replication, and/or time out parameters that will be used at the end of the batch application. The default write concern for a mongod server can be set as a replica set configuration option; if there is no replica set default, the fallback is { w: 1 }.

    Example:

      ..., writeConcern: { w: 2 }, ...

    Note that the write concern will be issued at the end of the batch application.

    We also support the "unusual" write concern { w: 0 }, which means the user is uninterested in the result of the write at this time. In this protocol, there is no provision to ignore the response packet. The protocol, however, sends a very condensed response when it sees that write concern (e.g. omits everything but the ok field).

  • ordered: optional, with a boolean type. If true, applies the batch's items in the same order the items appear, i.e. sequentially. If ordered is false, the server applies the batch items in no particular order (possibly in parallel). The default value is true (sequential).

    Example:

      ..., ordered: false, ...

  • metadata: RESERVED FOR MONGOS USE, future use.

  • failFast: NOT YET USED, RESERVED FOR FUTURE USE. Optional, with a boolean type. If false, allows the batch to continue processing even if some elements in the batch have errors. If true, the batch will stop on the first error(s) it detects (write concern will not be applied). Defaults to true for ordered batches, and false for unordered batches.
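Putting the generic fields together with a batch items array (described in the next section), a complete update command might look like the following illustrative example:

{ update: "users",
  updates: [
      { q: { name: "PING" }, u: { $set: { name: "pong" } }, multi: false, upsert: false }
  ],
  ordered: false,
  writeConcern: { w: 2 }
}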

Batch Items Format

We describe the array of batch items to be applied according to which type of write operation we'd like to perform.

  • For inserts, a documents array: mandatory, with an array of objects type. Objects must be valid for insertion.

    Example:

      { insert: "coll", documents: [ , , ... ], ... }

  • For updates, an updates array: mandatory, with an array of update objects type. Update objects must contain a query expression q and an update expression u and, optionally, a boolean multi if several documents may be changed, and a boolean upsert if updates can become inserts. Both optional fields default to false.

    Example:

      { update: "coll",
        updates: [
            { q : <query>, u : <update>, multi : <multi>, upsert : <upsert> },
            ...
        ],
        ...
      }

  • For deletes, a deletes array: mandatory, with an array of delete objects type. Each delete object must contain a query expression q and a limit.

    Example:

      { delete: "coll",
        deletes : [
            { q : <query>, limit : <num> },
            ...
        ],
        ...
      }

    Note that, to avoid accidentally deleting more documents than intended, we force the limit field to be present at all times. When all documents that satisfy q should be deleted, set limit to zero, as opposed to omitting it.

    Note: The only valid values for limit are 1 and 0.

Request Size Limits

Supporting unlimited batch sizes poses two problems: the BSONObj internal size limit is 16 MiB + 16 KiB (for command overhead), and a small write operation may have a much larger response. In order to ensure a batch can be correctly processed, two limits must be respected.

Both of these limits can be found using hello():

  • maxBsonObjectSize: currently 16 MiB, this is the maximum size of writes (excluding command overhead) that should be sent to the server. Documents to be inserted, query documents for updates and deletes, and update expression documents must be <= this size. Once these documents have been assembled into a write command, the total size may exceed maxBsonObjectSize by a maximum of 16 KiB, allowing users to insert documents up to maxBsonObjectSize.

  • maxWriteBatchSize: this is the maximum number of inserts, updates, or deletes that can be included in a write batch. If more than this number of writes are included, the server cannot guarantee space in the response document to reply to the batch.

If the batch is too large in bytes or in number of writes, the command may fail.
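As an illustration (not spec wording), a driver might split an oversized batch into sub-batches that each respect maxWriteBatchSize; the byte-size limit would need analogous handling:

// Illustrative sketch: split a batch by operation count.
function splitBatch<T>(operations: T[], maxWriteBatchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < operations.length; i += maxWriteBatchSize) {
    batches.push(operations.slice(i, i + maxWriteBatchSize));
  }
  return batches;
}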

Response Format

There are two types of responses to any command:

  • a command failure, which indicates the command itself did not complete successfully. Example command failures include failure to authorize, failure to parse, operation aborted by user, and unexpected errors during execution (these should be very rare).

  • successful command execution, which for write commands may include write errors.

Command Failure Fields

All commands have the same format when they fail unexpectedly:

{ ok : 0, code : <error code>, errmsg : <human-readable string> }

When a batch write command fails this way, like other commands, no guarantees are made about the state of the writes which were sent. Particular error codes may indicate more about what occurred, but those codes are outside the scope of this spec.

General Response Fields


Again, like other commands, batch write commands return ``{ ok : 1, ... }`` when they complete
successfully.  Importantly, successful execution of a batch write command may include reporting of
unsuccessful writes (write errors) and write concern application (write concern error).

The main body of a successful response is below:

.. _ok:

* ``ok``: Mandatory field, set to (double) 1 if the command was executed. This does not mean the
  writes succeeded; for example, a duplicate key error will still set ok = 1.

.. _n:

* ``n``: Mandatory field, with a positive numeric type or zero. This field contains the aggregated
  number of documents successfully affected by the entire write command. This includes the number of
  documents inserted, upserted, updated, and deleted. We do not report the individual number of
  documents affected by each batch item; if per-item counts are needed, the application should issue
  one-item batches.

.. _writeErrors:

* ``writeErrors``: Optional field, an array of write errors. For every batch write that had an error, there
  is one BSON error document in the array describing the error.
  (See the `Error Document`_ section.)

.. _writeConcernError:

* ``writeConcernError``: Optional field, which may contain a BSON error document indicating an error occurred while
  applying the write concern (or an error indicating that the write concern was not applied).
  (See the `Error Document`_ section.)

Situational Fields
~~~~~~~~~~~~~~~~~~

We use the fields above for all responses, regardless of the request type. But some
request types require additional response information, as described below.

.. _nModified:

* ``nModified``: Optional field, with a positive numeric type or zero.  Zero is the default value.  This
  field is only and always present for batch updates.  ``nModified`` is the physical number of documents
  affected by an update, while ``n`` is the logical number of documents matched by the update's query.
  For example, if we have 100 documents like ::
  
    { bizName: "McD", employees: ["Alice", "Bob", "Carol"] }
    
  and we are adding a single new employee using $addToSet for each business document, ``n`` is useful to
  ensure all businesses have been updated, and ``nModified`` is useful to know which businesses actually
  added a new employee (a request/response pair illustrating the difference follows this field list).

.. _upserted:

* ``upserted``: Optional field, with an array type. If any upserts occurred in the batch,
  the array contains, for each upsert, a BSON document listing the ``index`` of the batch item
  and the ``_id`` of the newly upserted document in the database.

.. _lastOp:

* ``lastOp``: MONGOD ONLY.  Optional field, with a timestamp type, indicating the latest opTime on the
  server after all documents were processed.

* ``electionId``: MONGOD ONLY. Optional ObjectId field representing the last primary election Id.
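
To make the ``nModified`` distinction above concrete, here is a hypothetical request and
response, assuming 100 matching business documents of which 90 did not yet list the new
employee:

  Request: ::

    { update: "biz",
      updates: [ { q: {}, u: { $addToSet: { employees: "Dana" } }, multi: true } ] }

  Response: ::

    { "ok" : 1, "n" : 100, "nModified" : 90 }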

Error Document
~~~~~~~~~~~~~~

For a write error or a write concern error, the following fields will appear in the error
document:

.. _code:

* ``code``: Mandatory field with integer format.  Contains a numeric code corresponding to a certain
  type of error.

.. _errInfo:

* ``errInfo``: Optional field, with a BSONObj format.  This field contains structured information
  about an error that can be processed programmatically. For example, if a request returns with a
  shard version error, we may report the proper shard version as a sub-field here. For another example,
  if a write concern timeout occurred, the information previously reported on ``wtimeout`` would be
  reported here.
  The format of this field depends on the code above.

.. _errmsg:

* ``errmsg``: Mandatory field, containing a human-readable version of the error.

.. _index:

* ``index``: WRITE ERROR ONLY.  The index of the erroneous batch item relative to request batch order.
  Batch items indexes start with 0.


Examples
--------

Successful case
~~~~~~~~~~~~~~~

Note that ``ok: 1`` by itself does **not** mean that an insert, update, or delete was executed
successfully, just that the batch was processed successfully: it merely means "all operations
were executed". ``n`` reports how many items from that batch were affected by the operation.

Insert
------

  Request: ::

    { insert: "coll", documents: [ {a: 1} ] }

  Response: ::

    { "ok" : 1, "n" : 1 }


  Request: ::

    { insert: "coll", documents: [ {a: 1}, {b: 2}, {c: 3}, {d: 4} ] }

  Response: ::

    { "ok" : 1, "n" : 4 }


Delete
------

  Request: ::

    { delete: "coll", deletes: [ { q: {b: 2}, limit: 1} ] }

  Response: ::

    { "ok" : 1, "n" : 1 }


  Request: ::

    {
        delete: "coll",
        deletes:
        [
            {
                q: {a: 1},
                limit: 0
            },
            {
                q: {c: 3},
                limit: 1
            }
        ]
    }

  Response: ::

    { "ok" : 1, "n" : 3 }



Update
------

  Request: ::

    {
        update: "coll",
        "updates":
        [
            {
                q: { d: 4 },
                u: { $set: {d: 5} }
            }
        ]
    }

  Response: ::

    { "ok" : 1, "nModified" : 1, "n" : 1 }


Checking if command failed
~~~~~~~~~~~~~~~~~~~~~~~~~~

To check if a write command failed: ::

    if (ok == 0) {
        // The command itself failed (authentication failed.., syntax error)
    } else if (writeErrors is array) {
        // Couldn't write the data (duplicate key.., out of disk space..)
    } else if (writeConcernError is object) {
        // Operation took too long on secondary, hit wtimeout ...
    }

Command failure to parse or authenticate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


  Request: ::

    { update: "coll",
      updates: [
        { q: {a:1}, x: {$set: {b: 2} } },
        { q: {a:2}, u: {$set: {c: 2} } }
      ]
    }

  Response: ::

    { ok: 0,
      code: <number>,
      errmsg: "Failed to parse batched update request, missing update expression 'u' field"
    }
    
    { ok: 0,
      code: <number>,
      errmsg: "Not authorized to perform update"
    }

Note that no information is given about command execution - if this was performed against a mongos, for example,
the batch may or may not have been partially applied - there is no programmatic way to determine this.

Write concern failure
~~~~~~~~~~~~~~~~~~~~~

  Request: ::

    { insert: "coll", documents: [ {a: 1}, {a:2} ], writeConcern: {w: 3, wtimeout: 100} }

  Response: ::

    { ok: 1,
      n: 2,
      writeConcernError: {
        code : <number>,
        errInfo: { wtimeout : true },
        errmsg: "Could not replicate operation within requested timeout"
      }
    }

Mixed failures
~~~~~~~~~~~~~~

  Request: ::

    db.coll.ensureIndex( {a:1}, {unique: true} )
    { insert: "coll",
      documents: [
        { a: 1 },
        { a: 1 },
        { a: 2 }
      ],
      ordered: false,
      writeConcern: { w: 3, wtimeout: 100 }
    }

  Response: ::

    { ok: 1,
      n: 2,
      writeErrors: [
        { index: 1,
          code: <number>,
          errmsg: "Attempt to insert duplicate key when unique index is present"
        }
      ],
      writeConcernError: {
        code: <number>,
        errInfo : { wtimeout : true },
        errmsg: "Could not replicate operation within requested timeout"
      }
    }

Note that the field ``n`` in the response came back with 2, even though there are three items
in the batch. This means that there must be an entry in ``writeErrors`` for the item that
failed.  Note also that the request turned off ``ordered``, so the write concern error
was hit when trying to replicate batch items 0 and 2.

Just to illustrate the support for ``{w:0}``, here's how the
response would look, had the request asked for that write concern.

  Response: ::

    { ok: 1 }

FAQ
---

Why are ``_id`` values generated client-side by default for new documents?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Though drivers may expose configuration options to prevent this behavior, by default a new ObjectId value will be created client-side before an insert operation.

This design decision primarily stems from the fact that MongoDB is a distributed database and the typical unique auto-incrementing scalar value most RDBMS' use for generating a primary key would not be robust enough, necessitating a more robust data type (ObjectId in this case). These _id values can be generated either on the client or the server; when generated client-side, a new document's _id value is immediately available for use without a network round trip.
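
A minimal sketch of this behavior (illustrative only; actual driver APIs differ): ::

    // if the user's document has no _id, generate an ObjectId client-side
    // before serializing the insert command
    var doc = { a: 1 };
    if (doc._id === undefined) {
        doc._id = new ObjectId();   // immediately available, no round trip needed
    }
    // the driver then sends { insert: "coll", documents: [ doc ] }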

Prior to MongoDB 3.6, an insert operation would use the OP_INSERT opcode of the wire protocol to send the operation, and subsequently retrieve the results with a getLastError command. If client-side _id values were omitted, this command response wouldn't contain the server-created _id values for new documents. Beginning with MongoDB 3.6, when all commands (insert included) are issued using the OP_MSG wire protocol opcode, the response to the command still doesn't contain the _id values for inserted documents.

Can a driver still use the OP_INSERT, OP_DELETE, and OP_UPDATE opcodes?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


The `legacy opcodes were removed in MongoDB 6.0 <https://www.mongodb.com/docs/manual/release-notes/6.0-compatibility/#legacy-opcodes-removed>`_. As of MongoDB 3.6 these opcodes were superseded by `OP_MSG <https://www.mongodb.com/docs/manual/reference/mongodb-wire-protocol/#op_msg>`_, however all server versions up until 6.0 continued to support the legacy opcodes.


Can an application still issue requests with write concerns {w: 0}?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes. The drivers are still required to serve a {w:0} write concern by returning the
control to the application as soon as possible.  But a driver should send the request to
the server via a write command and should, therefore, take the corresponding response off the
wire -- even if the caller is not interested in that result.


What happens if a driver receives a write request against an old server?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It must convert that request into legacy write operations followed by getLastError calls, using the old opcodes.

Are we discontinuing the use of getLastError?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes, but as of 2.6 the existing getLastError behavior is still supported for backward compatibility.

Changelog
---------

:2024-06-04: Add FAQ entry outlining client-side _id value generation
             Update FAQ to indicate legacy opcodes were removed
:2022-10-05: Revise spec front matter and reformat changelog.
:2022-07-25: Remove outdated value for ``maxWriteBatchSize``
:2021-04-22: Updated to use hello command
:2014-05-15: Removed text related to bulk operations; see the Bulk API spec for
             bulk details. Clarified some paragraphs; re-ordered the response
             field sections.
:2014-05-14: First public version

=============
Bulk API Spec
=============

:Status: Deprecated
:Minimum Server Version: 2.4

.. contents::

Bulk Operation Builder
----------------------

Starting a bulk operation can be done in two ways.

.. code:: javascript

initializeUnorderedBulkOp()     -> Bulk   - Initialize an unordered bulk
initializeOrderedBulkOp()       -> Bulk   - Initialize an ordered bulk

Operations Possible On Bulk Instance
------------------------------------

Available operations follow the fluent API for insert, update and remove.

.. code:: javascript

/**
 * Update one document matching the selector
 */
bulk.find({a : 1}).updateOne({$inc : { x : 1 }});

/**
 * Update all documents matching the selector
 */
bulk.find({a : 2}).update({$inc : { x : 2 }});

/**
 * Update all documents
 * Note that find() is prohibited; the query is required
 */
bulk.find({}).update({$inc : { x : 2 }});

/**
 * Replace entire document (update with whole doc replace)
 */
bulk.find({a : 3}).replaceOne({ x : 3 });

/**
 * Update one document matching the selector or upsert
 */
bulk.find({a : 1}).upsert().updateOne({$inc : { x : 1 }});

/**
 * Update all documents matching the selector or upsert
 */
bulk.find({a : 2}).upsert().update({$inc : { x : 2 }});

/**
 * Replaces a single document matching the selector or upsert
 */
bulk.find({a : 3}).upsert().replaceOne({ x : 3 });

/**
 * Remove a single document matching the selector
 */
bulk.find({a : 4}).removeOne();

/**
 * Remove all documents matching the selector
 */
bulk.find({a : 5}).remove();

/**
 * Remove all documents
 * Note that find() is prohibited; the query is required
 */
bulk.find({}).remove();

/**
 * Insert a document
 */
bulk.insert({ x : 4 });

/**
 * Execute the bulk operation, with an optional writeConcern
 * overwriting the default w:1. A descriptive error should be
 * raised if execute is called more than once or before any
 * operations have been added.
 */
writeConcern = { w: 1, wtimeout: 100 }
bulk.execute(writeConcern);

Current shell implementation
----------------------------

The shell implementation serves as a guide only. One main difference between the shell implementation and a proper driver implementation is that unordered bulk operations are not optimized by re-ordering the writes; only the execution semantics are kept correct. You can find it here:

https://github.com/mongodb/mongo/blob/master/src/mongo/shell/bulk_api.js

If you need more information about the actual write command you can find the specification at the following location

https://github.com/mongodb/specifications/blob/master/source/server_write_commands.rst

Modes of Execution
------------------

The write commands have a new option called ordered that is a boolean. ordered replaces continueOnError but with slightly different semantics.

ordered = true
~~~~~~~~~~~~~~

If the driver sets ordered = true all operations will be executed serially in the write command and the operation will abort on the first error. So given the following operations:

.. code:: javascript

bulk.insert({a:1})
bulk.insert({a:2})
bulk.find({a:2}).update({$set: {a:1}}) // Clashes with unique index
bulk.find({a:1}).remove()

With ordered = true the bulk operation will terminate after the update errors out, and the driver will receive only a single error.

ordered = false
~~~~~~~~~~~~~~~

If the driver sets ordered = false all operations might be applied in parallel by the server. The server will execute all the operations and return all errors created by the operations. So given the following operations:

.. code:: javascript

bulk.insert({a:1})
bulk.insert({a:2})
bulk.find({a:2}).update({$set: {a:1}}) // Might clash with unique index
bulk.find({a:3}).remove()
bulk.find({a:2}).update({$set: {a:1}}) // Might clash with unique index

Due to the write operations potentially happening concurrently there is no way to determine the final state of the bulk operation above. If insert({a:1}) happens before the two updates we will get 2 duplicate key index errors from the two update operations. If one of the updates happens first the insert will error out. By setting ordered to false we are trading off guaranteed order of execution for increased parallelization.

Ordered Bulk Operations
-----------------------

To start an ordered bulk operation, call:

.. code:: javascript

initializeOrderedBulkOp() -> bulk

The ordered bulk operation guarantees the order of writes for a mixed set of operations. This means the driver needs to ensure that all operations are performed in the order they were added.

Consider the following set of operations:

.. code:: javascript

var bulk = db.c.initializeOrderedBulkOp()
bulk.insert({a : 1})
bulk.insert({a : 2})
bulk.insert({a : 3})
bulk.find({a : 2}).upsert().updateOne({$set : { a : 4 }});
bulk.find({a : 1}).removeOne();
bulk.insert({a : 5})
bulk.execute({w : 1})

This will generate the following sequence of writes to the server.

  1. Insert write command
  2. Update write command
  3. Remove write command
  4. Insert write command
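
A sketch of the four wire-level commands this could produce (hypothetical and abridged; the
collection is assumed to be named ``c``):

.. code:: javascript

    { insert: "c", documents: [ {a : 1}, {a : 2}, {a : 3} ], ordered: true }
    { update: "c", updates: [ { q: {a : 2}, u: {$set : { a : 4 }}, upsert: true } ], ordered: true }
    { delete: "c", deletes: [ { q: {a : 1}, limit: 1 } ], ordered: true }
    { insert: "c", documents: [ {a : 5} ], ordered: true }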

One thing to note is that if a write command goes over the maximum number of documents or maxBSONMessageSize for an individual write command, it needs to be split into multiple write commands, just as for unordered bulks.

.. NOTE:: ContinueOnError Ordered operations are synonymous with continueOnError = false. There is no way to specify a different value for continueOnError.

Unordered Bulk Operations
-------------------------

To start the unordered bulk operation call:

.. code:: javascript

initializeUnorderedBulkOp() -> bulk

The unordered bulk operation does not guarantee order of execution of any added write operations. If you have the following code.

.. code:: javascript

var bulk = db.c.initializeUnorderedBulkOp()
bulk.insert({_id : 1})
bulk.find({_id : 2}).updateOne({$inc : { x : 1 }});
bulk.find({_id : 3}).removeOne();
bulk.insert({_id : 4})
bulk.find({_id : 5}).updateOne({$inc : { x : 1 }});
bulk.find({_id : 6}).removeOne();
bulk.execute({w:1})

Internally the driver will execute 3 write commands. One each for the inserts, updates and removes. It's important to note that the write commands could be executed in any order.
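
A sketch of how those write commands might be grouped (hypothetical; one command per operation
type, and the three commands may be sent and executed in any order):

.. code:: javascript

    { insert: "c", documents: [ {_id : 1}, {_id : 4} ], ordered: false }
    { update: "c", updates: [ { q: {_id : 2}, u: {$inc : { x : 1 }} },
                              { q: {_id : 5}, u: {$inc : { x : 1 }} } ], ordered: false }
    { delete: "c", deletes: [ { q: {_id : 3}, limit: 1 },
                              { q: {_id : 6}, limit: 1 } ], ordered: false }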

.. NOTE:: ContinueOnError Unordered operations are synonymous with continueOnError = true. There is no way to specify a different value for continueOnError.

Request Size Limits
-------------------

Supporting unlimited batch sizes poses two problems - the BSONObj internal size limit is 16 MiB + small overhead (for commands), and a small write operation may have a much larger response. In order to ensure a batch can be correctly processed, two limits must be respected.

Both of these limits can be found using hello or legacy hello:

  • maxBsonObjectSize : currently 16 MiB, this is the maximum size of writes (excepting command overhead) that should be sent to the server. Documents to be inserted, query documents for updates and deletes, and update expression documents must be <= this size.

    Batches containing more than one insert, update, or delete must be less than maxBsonObjectSize. Note that this means a single-item batch can exceed maxBsonObjectSize. The additional overhead of the command itself is guaranteed not to trigger an error from the server, except in the case of `SERVER-12305 <https://jira.mongodb.org/browse/SERVER-12305>`_.

  • maxWriteBatchSize : currently 100,000 (as of MongoDB 3.6 via `SERVER-13061 <https://jira.mongodb.org/browse/SERVER-13061>`__), this is the maximum number of inserts, updates, or deletes that can be included in a write batch. If more than this number of writes are included, the server cannot guarantee space in the response document to reply to the batch.

If the batch is too large in size or bytes, the command may fail. The bulk API should ensure that this does not happen by splitting a batch into multiple batches of the same type if any of the limits are hit.

On Bulk Execution
-----------------

A descriptive error should be raised if execute is called more than once.

A descriptive error should be raised if execute is called before any operations have been added.

Possible Implementation
-----------------------

One possible solution for serialization of bulk operations is to serialize some preamble and then to incrementally serialize and append each document to the collected message. The command wrapper has the array argument last so that the programmer can append to it. As you incrementally serialize, first check the collected size plus the incremental size.

If it is less than maxBsonObjectSize (incorporate additional headroom for your implementation), you can safely append and the message will be accepted by the server. If it exceeds maxBsonObjectSize, you should "finish" (without appending, updating the BSON array and document sizes) and execute the collected message. A new write command message will be serialized.

The un-appended incremental serialization from before will be appended on the new message. Continue incremental serialization, appending, and execution as above until the bulk operation is completed. Your implementation may need additional headroom for whatever is not already in the preamble, e.g., write concern, finish bytes, etc.

There are two maximum write command sizes a driver needs to take into account. The first one is maxBsonObjectSize that defines the maximum size of a single write command. The current tolerance is maxBsonObjectSize + 16K. If the driver sends a message that overflows this tolerance the server will respond with an error.

The second value is the maxWriteBatchSize value which specifies the maximum number of operations allowed in a single write command. In 2.6 this is currently set to 1000 operations. If the driver sends a write command with more than maxWriteBatchSize operations in it, the server will error out.

To avoid these errors the driver needs to split a write command that overflows either limit into one or more new write commands.
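
A minimal sketch of count- and size-based splitting for inserts, assuming hypothetical
serializedSize() and send() helpers and ignoring the extra headroom discussed above:

.. code:: javascript

    function executeInserts(coll, docs, maxBsonObjectSize, maxWriteBatchSize) {
      var batch = [];
      var batchBytes = 0;
      for (var i = 0; i < docs.length; i++) {
        var docBytes = serializedSize(docs[i]);   // assumed BSON-size helper
        // finish the current command if adding this document would overflow a limit
        if (batch.length == maxWriteBatchSize ||
            (batch.length > 0 && batchBytes + docBytes > maxBsonObjectSize)) {
          send({ insert: coll, documents: batch });
          batch = [];
          batchBytes = 0;
        }
        batch.push(docs[i]);
        batchBytes += docBytes;
      }
      if (batch.length > 0) {
        send({ insert: coll, documents: batch });
      }
    }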

On Errors
---------

To understand the possible error scenarios, it's important to understand how the server processes commands. Let's look at the processing pipeline.

Validate BSON/Auth -> Write Operations -> Apply Write Concern

An error during the first step aborts the operation completely, with no changes applied to MongoDB. An error at this stage is a top-level error and means no attempt was made to process the command. This is also the only case that sets ok to 0.

.. code:: javascript

{ "ok" : 0, "code" : 13, "errmsg": "Authentication error" }

If the first step passes with no errors there might be a command level error such as a duplicate key error. This is a write error and will return a result containing the writeErrors field.

.. code:: javascript

{ "ok" : 1, "n" : 0, "writeErrors" : [ { "index" : 0, "code" : 11000, "errmsg" : "DuplicateKey insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.test.$a_1 dup key: { : 1 }" } ], }

In the case of an ordered bulk operation you'll only ever get a single write error as the execution of the command will stop at the first error. In an unordered bulk operation you might have more than one.

The last step of applying the write concern can also cause an error. Given a write concern {w:5, wtimeout:1000} where there are only 3 servers in the replica set, the write concern can never be fulfilled and will return an error. An example server response might be:

.. code:: javascript

{ "ok" : 1, "n" : 1, "writeConcernError" : { "code" : 64, "errmsg" : "...." } }

Notice how write concern is just a single error for the whole command. Getting a writeConcernError does not mean the items were not applied, it means the write concern could not be fulfilled. In the example above n is still 1.

It's fair to consider the server response field writeErrors to be hard errors while writeConcernError is a soft error.

Merging Results
---------------

Handling errors

Handling the merging of errors is most easily expressed with some examples.

Handling Write Errors


Consider the following bulk write operation:

.. code:: javascript

  collection.ensureIndex({a:1}, {unique:true})
  var bulk = collection.initializeOrderedBulkOp()
  bulk.insert({a:1})
  bulk.insert({a:2})
  bulk.find({a:2}).upsert().update({$set:{a:1}})
  bulk.insert({a:3})
  bulk.execute()

This operation will only execute the first three operations (the first two inserts and an upsert)
before stopping due to a duplicate key error. The merged result would look something like this:

.. code:: javascript

  {
    "nInserted" : 2,
    "nUpserted" : 0,
    "nMatched" : 0,
    "nModified" : 0,
    "nRemoved" : 0,
    "upserted" : [ ],
    "writeErrors" : [
      {
        "index" : 2,
        "code" : 11000,
        "errmsg" : "DuplicateKey insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.test.$a_1  dup key: { : 1 }"
      }
    ],
    "writeConcernErrors": []
  }

In this situation the client should throw a single error and stop processing.

Handling Write Concern Errors

Write concern is applied after the server side execution of the write operations.

This means that replication failure or other forms of writeConcernErrors should not affect the execution of the batch but simply serve as an indication that the write concern has not been met.

If there is no write concern error the bulk result's "writeConcernErrors" array is empty.

When the bulk operation is implemented using legacy opcodes, no server error code is available. The server's getlasterror response is like:

.. code:: javascript

{ "ok" : 1, "wtimeout": true, "err": "timeout" }

In this case the driver must construct a writeConcernErrors array containing one error document with code 64, and the "err" field from the getlasterror response as the errmsg.
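
A minimal sketch of that translation (illustrative only):

.. code:: javascript

    // translate a legacy getlasterror wtimeout response into the
    // writeConcernErrors entry described above
    function gleToWriteConcernError(gle) {
      if (gle.wtimeout) {
        return { code: 64, errmsg: gle.err };
      }
      return null;
    }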

An example with "w: 5" and fewer than 5 replicas:

.. code:: javascript

var bulk = collection.initializeOrderedBulkOp()
bulk.insert({a:1})
bulk.execute({w:5, wtimeout:1})

The expected result from these operations is:

.. code:: javascript

{ "nInserted" : 1, "nUpserted" : 0, "nMatched" : 0, "nModified" : 0, "nRemoved" : 0, "upserted" : [ ] "writeErrors" : [ ], "writeConcernErrors": [{ "code": 64, "waiting for replication timed out", }] }

.. note:: The example output is from MongoDB 2.6. In MongoDB 2.4 the driver supplies the error code 64, and the error message is "timeout". Starting in MongoDB 3.0, the writeConcernError code is 100 and the message is "Not enough data-bearing nodes".

General rule of handling errors

  1. A top level error means the whole command failed and should cause a command failure error.
  2. For unordered bulk operations all write errors should be rewritten and merged together.
  3. For ordered bulk operations the returned write error should be rewritten and returned.
  4. Write concern errors should not halt the processing of ordered bulk operations.

Merging write errors

A bulk operation might involve multiple write commands. Each write command could potentially return write errors and/or a write concern error. Each error in the writeErrors array contains an index pointing to the original document position in the write command document that caused it.

Consider the following bulk operation

.. code:: javascript

collection.ensureIndex({a:1}, {unique:true})
var bulk = db.c.initializeOrderedBulkOp()
bulk.insert({a:1})
bulk.insert({a:2})
bulk.find({a:2}).updateOne({$set : { a : 1 }});
bulk.find({a:4}).removeOne();
bulk.execute({w:1})

The operation

.. code:: javascript

bulk.find({a:2}).updateOne({$set : { a : 1 }});

causes an error

.. code:: javascript

{ "ok" : 1, "nModified" : 0, "n" : 0, "writeErrors" : [ { "index" : 0, "code" : 11000, "errmsg" : "DuplicateKey insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.test.$a_1 dup key: { : 1 }" } ] }

In the returned result, the index field of the error points to document 0 of the update command where it failed during execution. However in the original chain of operations the update operation is the third (index 2). To correctly correlate the errors to the user-provided order we need to rewrite the error to point to the correct index so the user can identify which document caused the error. So in this case the error aspect of the final result will look like:

.. code:: javascript

{ "ok" : 1, "nModified" : 0, "n" : 2, "writeErrors" : [ { "index" : 2, "code" : 11000, "errmsg" : "DuplicateKey insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.test.$a_1 dup key: { : 1 }" } ] }

Notice the index: 2 correctly pointing to the original document.

To correctly handle the merging the driver needs to keep track of the original indexes and how they map to the errors returned by the write commands. There might be a need to keep an index in memory to be able to correctly handle the mapping.
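
A minimal sketch of such rewriting, assuming the driver kept a hypothetical indexMap where
indexMap[i] is the position in the user's bulk of the i-th operation in a given write command:

.. code:: javascript

    function rewriteWriteErrors(writeErrors, indexMap) {
      return writeErrors.map(function(err) {
        // keep the server's code/errmsg, but report the user's original index
        return { index: indexMap[err.index], code: err.code, errmsg: err.errmsg };
      });
    }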

Write concern errors

Each writeConcernError document received from a server operation (either a write command or legacy write) is appended to the bulk result's "writeConcernErrors" array:

.. code:: javascript

var bulk = db.c.initializeOrderedBulkOp()
bulk.insert({a:1})
bulk.insert({a:2})
bulk.find({a:1}).remove()
bulk.execute({w:5, wtimeout:100})

The outcome on MongoDB 2.6 with fewer than 5 replicas is similar to:

.. code:: javascript

{ "nInserted" : 2, "nUpserted" : 0, "nMatched" : 0, "nModified" : 0, "nRemoved" : 1, "upserted" : [ ] "writeErrors" : [ ], "writeConcernErrors": [{ "code": 64, "waiting for replication timed out", }, { "code": 64, "waiting for replication timed out", }] }

If there is no write concern error the bulk result's "writeConcernErrors" array is empty.

.. note:: Previous versions of this spec were ambiguous about reporting writeConcernErrors. Some clients include a singular field "writeConcernError" in bulk results; the singular form is now deprecated and an array called "writeConcernErrors" is required.

Handling upserts

When performing updates with upsert: true, the write command might return an upserted field. If a single-document update command causes an upsert, it will look like:

.. code:: javascript

{
    ok: 1
  , nModified: 0
  , n: 1      
  , upserted: {index:0, _id:ObjectId(".....")}
}

On the other hand, if we are upserting a series of documents, the upserted field will contain an array of the results. Given an update command that causes 2 upserts, the result will look like:

.. code:: javascript

{
    ok: 1
  , nModified: 0
  , n: 2
  , upserted: [
      {index:0, _id:ObjectId(".....")}
    , {index:1, _id:ObjectId(".....")}
  ]
}

As in the case of errors the driver needs to rewrite the indexes for the upserted values to merge the results together into the final result so they reflect the initial order of the updates in the user specified batch.
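
A corresponding sketch for upserts, again assuming a hypothetical indexMap from command position
back to the position in the user's batch:

.. code:: javascript

    function rewriteUpserted(upserted, indexMap) {
      // the server may return a single document for a single upsert; normalize to an array
      var entries = Array.isArray(upserted) ? upserted : [upserted];
      return entries.map(function(u) {
        return { index: indexMap[u.index], _id: u._id };
      });
    }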

Reporting Errors
----------------

Exceptions

Depending on the language and platform there are different semantics on how to raise errors. For languages that usually raise exceptions it's recommended that an exception be raised when an ordered bulk operation fails.

Given the following scenario

.. code:: javascript

collection.ensureIndex({a:1}, {unique:true})
var bulk = collection.initializeOrderedBulkOp()
bulk.insert({a:1})
bulk.insert({a:1})
bulk.execute({w:5, wtimeout:1})

In languages where the rule is to report errors by throwing an exception the duplicate insert should cause an exception to be raised when execute is called.

In the case of an unordered bulk operation the exception should be raised after the bulk has finished executing. If exceptions are used, it's important to differentiate between the hard error of a write error and the soft error caused by a write concern error.

Callbacks

Callback-based languages or platforms should return a results object containing the aggregated state of the bulk operations. Some platforms, like Node.js, support callbacks with the format function(err, object). In this case the result object should be passed as the err parameter if it contains any errors, and only as the object parameter if no write or write concern errors happened.
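
A sketch of that convention (the execute(callback) overload and the method names are
illustrative, not a mandated API):

.. code:: javascript

    bulk.execute(function(err, result) {
      if (err) {
        // err carries the merged bulk result, including writeErrors and
        // writeConcernErrors, so the caller can inspect what went wrong
        console.log(err.getWriteErrors());
      } else {
        console.log(result.nInserted);
      }
    });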

Results Object
--------------

The shell and Node.js implement the result as a custom object wrapping the results. This simplifies access to the internal state of the merged results. It serves mostly as an example, as different languages might implement the results differently depending on their chosen error mechanism. For example, it might make sense to throw an exception if the command fails at the authentication stage, versus a duplicate key error on one of the operations in a bulk operation.

It keeps track of several aggregated values:

========= =============================================================
field     description
========= =============================================================
nInserted Number of inserted documents
nUpserted Number of upserted documents
nMatched  Number of documents matched for update
nModified Number of documents actually changed by update
nRemoved  Number of documents removed
========= =============================================================

nMatched is equivalent to the "n" field in the getLastError response after a legacy update. nModified is quite different from "n". nModified is incremented only when an update operation actually changes a document.

For example, if a document has x: 1 and we update it with {$set: {x: 1}}, nModified is not incremented for that document.

The WriteErrors are wrapped in their own wrapper that also contains the operation that caused the error. Similarly, the WriteConcernError is a simple wrapper around the result to ensure it's read only.

A client may optionally provide a method to merge writeConcernErrors into one, analogous to how mongos does.

.. code:: javascript

var WRITE_CONCERN_ERROR = 64;

/**
 * Wraps the error
 */
var WriteError = function(err) {
if(!(this instanceof WriteError)) return new WriteError(err);
// Define properties
defineReadOnlyProperty(this, "code", err.code);
defineReadOnlyProperty(this, "index", err.index);
defineReadOnlyProperty(this, "errmsg", err.errmsg);

//
// Define access methods
this.getOperation = function() {
  return err.op;
}

}

/**
 * Wraps a write concern error
 */
var WriteConcernError = function(err) {
if(!(this instanceof WriteConcernError)) return new WriteConcernError(err);
// Define properties
defineReadOnlyProperty(this, "code", err.code);
defineReadOnlyProperty(this, "errmsg", err.errmsg);

}

/**
 * Wraps the result for the commands
 */
var BulkWriteResult = function(bulkResult) {
// Define properties
defineReadOnlyProperty(this, "ok", bulkResult.ok);
defineReadOnlyProperty(this, "nInserted", bulkResult.nInserted);
defineReadOnlyProperty(this, "nUpserted", bulkResult.nUpserted);
defineReadOnlyProperty(this, "nMatched", bulkResult.nMatched);
defineReadOnlyProperty(this, "nModified", bulkResult.nModified);
defineReadOnlyProperty(this, "nRemoved", bulkResult.nRemoved);
//
// Define access methods
this.getUpsertedIds = function() {
  return bulkResult.upserted;
}

this.getUpsertedIdAt = function(index) {
  return bulkResult.upserted[index];
}

this.getRawResponse = function() {
  return bulkResult;
}

this.hasWriteErrors = function() {
  return bulkResult.writeErrors.length > 0;
}

this.getWriteErrorCount = function() {
  return bulkResult.writeErrors.length;
}

this.getWriteErrorAt = function(index) {
  if(index < bulkResult.writeErrors.length) {
    return bulkResult.writeErrors[index];
  }
  return null;
}

this.hasWriteConcernError = function() {
  return bulkResult.writeConcernErrors.length > 0;
}

//
// Determine if we have any errors
this.hasErrors = function() {
  return this.hasWriteErrors() || this.hasWriteConcernError();
}

//
// Get all errors
this.getWriteErrors = function() {
  return bulkResult.writeErrors;
}

this.getWriteConcernError = function() {
  if(bulkResult.writeConcernErrors.length == 0) {
    return null;
  } else if(bulkResult.writeConcernErrors.length == 1) {
    // Return the error
    return bulkResult.writeConcernErrors[0];
  } else {

    // Combine the errors
    var errmsg = "";
    for(var i = 0; i < bulkResult.writeConcernErrors.length; i++) {
      var err = bulkResult.writeConcernErrors[i];
      if (i != 0) {
        errmsg = errmsg + " and ";
      }

      errmsg = errmsg + '"' + err.errmsg + '"';
    }

    return new WriteConcernError({ errmsg : errmsg, code : WRITE_CONCERN_ERROR });
  }
}

this.isOK = function() {
  return bulkResult.ok == 1;
}

}

Pre 2.6 Support
---------------

The batch API is required to work with pre-2.6 servers. This means the driver must detect whether the server supports the new write commands and downgrade to the existing OP_INSERT/OP_UPDATE/OP_REMOVE opcodes if it does not.

Legacy servers don't report nModified for updates, and it is impossible for the driver to simulate it: nModified must be equal to the number of documents that are actually different after an update, but legacy servers only report the number of documents matched. The driver must therefore set the result's nModified field to null, or omit the field, when it executes a bulk operation against a legacy server. In static languages where nModified is an integer-type property on bulk results, an exception must be thrown if a user accesses the nModified property after executing a bulk operation on a legacy server.

One important aspect to keep in mind is that the existing bulk insert operation cannot be used as you need to retrieve the getLastError results for each individual operation. Thus the driver must execute inserts one by one.
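
A minimal sketch of that legacy path, assuming a hypothetical sendOpInsert() wire helper:

.. code:: javascript

    function legacyInsert(db, coll, docs) {
      var results = [];
      for (var i = 0; i < docs.length; i++) {
        sendOpInsert(coll, docs[i]);                     // OP_INSERT, one document
        var gle = db.runCommand({ getLastError: 1 });    // per-document result
        results.push({ index: i, gle: gle });
      }
      return results;
    }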

Another important aspect to keep in mind is that a replication error can be signaled several ways by the getLastError result. The following error codes for the field code are an indicator of a replication error.

========= =============================================================
Code      Description
========= =============================================================
50        Operation exceeded time limit.
13475     Operation exceeded time limit.
16986     Operation exceeded time limit.
16712     Operation exceeded time limit.
========= =============================================================

There are also some errors only detectable by inspecting the errmsg field.

==================== =============================================================
ErrMsg               Description
==================== =============================================================
exceeded time limit  Operation exceeded time limit.
execution terminated Operation exceeded time limit.
==================== =============================================================

If an error does not return a code the driver can set the returned value to 8 (unknown error). A BSON serializing error should be marked with 22 (illegal BSON).

There are some codes that don't match up between the 2.6 and existing servers. The suggestion is to not attempt to rewrite these errors as it will make the code very brittle. Some slight differences in error codes and error messages between the write commands and the legacy operations are acceptable.

nModified

The 2.6 server includes "nModified" in its response to an "update" command. The server increments nModified only when an "update" command has actually changed a document. For example, if a document already has x: 1 and you update it with {$set: {x: 1}}, nModified is not incremented. nModified is impossible to simulate with OP_UPDATE, which returns only "n", the number of matched documents.

Legacy writes: The result of a bulk operation that uses legacy opcodes must set nModified to NULL, or omit the field. If your language is constrained such that you must include the field, then user code should get an exception when accessing the field if you're talking to a legacy server.

Mixed-version sharded cluster: When a client executes an "update" command on a 2.6 mongos, and mongos executes it against some 2.4 mongods, mongos omits nModified from the response, or sets nModified to NULL. (We don't yet know which: SERVER-13001_) If the client does a series of "update" commands within the same bulk operation against the same mongos, some responses could include nModified and some won't, depending on which mongods the mongos sent the operation to. The driver algorithm for merging results, when using write commands, in pseudocode:

.. code:: javascript

full_result = {
    "writeErrors": [],
    "writeConcernErrors": [],
    "nInserted": 0,
    "nUpserted": 0,
    "nMatched": 0,
    "nModified": 0,
    "nRemoved": 0,
    "upserted": [],
}

::

for each server response in all bulk operations' responses:
    if the operation is an update:
        if the response has a non-NULL nModified:
            if full_result has a non-NULL nModified:
                full_result['nModified'] += response['nModified']
        else:
            # If any call does not return nModified we can't report
            # a valid final count so omit the field completely.
            remove nModified from full_result, or set to NULL

.. _SERVER-13001: https://jira.mongodb.org/browse/SERVER-13001

Questions & Answers
-------------------

Question: I'm writing my own driver; should I support legacy downgrading?

Answer: Legacy downgrading is explained to help people support pre-2.6 servers but is not mandated for anything but the official drivers.

Question: My downgrading code breaks with 2.6

Answer: Changes were made to GLE in 2.6 that make the error reporting more consistent. Downgrading will only work correctly against 2.4.x or earlier.

Question: Will there be any way for a user to set the number of wire operations a bulk operation will take (for debugging purposes)?

Answer: No.

Question: Will there be support for .explain() with the bulk interface?

Answer: Not for 2.6. It may be added with a later release along with server support for mixed operations in a single bulk command

Question: The definition for unordered introduces indeterminism to the operation. For example, what is the state of the collection after:

.. code:: javascript

var bulk = db.c.initializeUnorderedBulkOp()
bulk.insert({_id : 1, x : 1})
bulk.find({_id : 1}).updateOne({$inc : { x : 1 }});
bulk.find({_id : 1}).removeOne();
bulk.execute({w:1})

You could end up with either {_id : 1, x : 1}, {_id : 1, x : 2}, or no document at all, depending on the order that the operations are performed in.

Answer: This is by design and definition. If the order matters then don't use an unordered bulk operation. No order will be defined or respected in an unordered operation.

Question: What should the driver do when an ordered bulk command is split into multiple write commands and an error happens?

Answer: If it's an ordered bulk command that is split into multiple write commands the driver should not send any remaining write commands after encountering the first error.

Question: What should the driver do when an unordered bulk command is split into multiple write commands and an error happens?

Answer: It's important to note that if the command is an unordered bulk command and it's split into multiple write command it should continue processing all the write commands even if there are errors.

Question: Does the driver need to merge errors from split write commands?

Answer: Yes

Question: Is find() with no argument allowed?

Answer: No, a selector is required for find() in the Bulk API.

Question: Is find({}) with an empty selector allowed?

Answer: Yes, updating or removing all documents using find({}) is allowed.

Question: My unordered bulk operation got split into multiple batches that all reported a write concern error. Should I report all of the write concern errors ?

Answer: Yes, combined into an array called "writeConcernErrors".

Detailed Test Cases
-------------------

The test cases in this section serve to help you validate the correctness of your Bulk API implementation.

INSERT

Test Case 1: Dynamic languages: raise error if wrong arg type initializeUnorderedBulkOp().insert('foo') throws a reasonable error

    initializeUnorderedBulkOp().insert([{}, {}]) throws a reasonable error: we can’t do a bulk insert with an array
    
    Same for initializeOrderedBulkOp().

Test Case 2: Insert not allowed with find({}): initializeUnorderedBulkOp().find({}).insert({}) is a type error.

    Same for initializeOrderedBulkOp().    

Test Case 3: Removed

Test Case 4: Inserting a document succeeds and returns 'nInserted’ of 1: Empty collection.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.insert({_id: 1})
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 1,
            "nUpserted" : 0,
            "nMatched" : 0,
            "nModified" : 0,
            "nRemoved" : 0,
            "upserted" : [ ]
        }

    Collection contains only {_id: 1}.
    
    Same for initializeOrderedBulkOp().

Test Case 5: The driver generates _id client-side for inserted documents: Empty collection.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.insert({})
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 1,
            "nUpserted" : 0,
            "nMatched" : 0,
            "nModified" : 0,
            "nRemoved" : 0,
            "upserted" : [ ]
        }

        _id = collection.findOne()._id
        // pid = bytes 7 and 8 (counting from zero) of _id, as big-endian unsigned short
        pid == my PID

    Alternatively, just watch the server log or mongosniff and manually verify the _id was sent to the server.

    Same for initializeOrderedBulkOp().

Test Case 6: Insert doesn’t accept an array of documents: initializeUnorderedBulkOp().insert([{}, {}]) throws

    Same for initializeOrderedBulkOp().  

FIND

Test Case 1: Dynamic languages: find() with no args is prohibited: .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find() raises error immediately

    Same for initializeOrderedBulkOp().

UPDATE

Test Case 1: Dynamic languages: raise error if wrong arg type .. code:: javascript

        initializeUnorderedBulkOp().find({}).update('foo') throws a reasonable error

    Same for initializeOrderedBulkOp().

Test Case 2: Dynamic languages: Update requires find() first: .. code:: javascript

        initializeUnorderedBulkOp().update({$set: {x: 1}}) is a type error

    Same for initializeOrderedBulkOp().

Test Case 3: Key validation, all top-level keys must be $-operators: These throw errors, even without calling execute():

    .. code:: javascript

        initializeUnorderedBulkOp().find({}).update({key: 1})
        initializeUnorderedBulkOp().find({}).update({key: 1, $key: 1})

    Same for initializeOrderedBulkOp().

Test Case 4: update() updates all matching documents, and reports nMatched correctly: Collection has {key: 1}, {key: 2}.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({}).update({$set: {x: 3}})
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 0,
            "nMatched" : 2,
            "nModified" : 2,
            "nRemoved" : 0,
            "upserted" : [ ]
        }

    nModified is NULL or omitted if legacy server.
    
    Collection has:
        .. code:: javascript

            {key: 1, x: 3}
            {key: 2, x: 3}

    Same for initializeOrderedBulkOp().

Test Case 5: update() only affects documents that match the preceding find(): Collection has {key: 1}, {key: 2}.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({key: 1}).update({$set: {x: 1}})
        batch.find({key: 2}).update({$set: {x: 2}})
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 0,
            "nMatched" : 2,
            "nModified" : 2,
            "nRemoved" : 0,
            "upserted" : [ ]
        }

    nModified is NULL or omitted if legacy server.

    Collection has:
        .. code:: javascript

            {key: 1, x: 1}
            {key: 2, x: 2}

UPDATE_ONE

Test Case 1: Dynamic languages: raise error if wrong arg type initializeUnorderedBulkOp().find({}).updateOne('foo') throws a reasonable error

    Same for initializeOrderedBulkOp().

Test Case 2: Dynamic languages: Update requires find() first: initializeUnorderedBulkOp().updateOne({$set: {x: 1}}) is a type error

    Same for initializeOrderedBulkOp().

Test Case 3: Key validation: These throw errors; all top-level keys must be $-operators:

    .. code:: javascript

        initializeUnorderedBulkOp().find({}).updateOne({key: 1})
        initializeUnorderedBulkOp().find({}).updateOne({key: 1, $key: 1})

    Same for initializeOrderedBulkOp().

Test Case 4: Basic: Collection has {key: 1}, {key: 2}.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({}).updateOne({}, {$set: {key: 3}})
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 0,
            "nMatched" : 1,
            "nModified" : 1,
            "nRemoved" : 0,
            "upserted" : [ ]
        }

    nModified is NULL or omitted if legacy server.

    .. code:: javascript

        collection.find({key: 3}).count() == 1.

    Same for initializeOrderedBulkOp().

REPLACE

Test Case 1: Dynamic languages: There is no replace. initializeUnorderedBulkOp().find({}).replace() is a type error

    Same for initializeOrderedBulkOp().

REPLACE_ONE

Test Case 1: Dynamic languages: raise error if wrong arg type initializeUnorderedBulkOp().find({}).replaceOne('foo') throws a reasonable error

    Same for initializeOrderedBulkOp().

Test Case 2: Dynamic languages: replaceOne requires find() first: initializeUnorderedBulkOp().replaceOne({key: 1}) is a type error

    Same for initializeOrderedBulkOp().

Test Case 3: Key validation: These throw errors; no top-level keys can be $-operators:

    .. code:: javascript

        initializeUnorderedBulkOp().find({}).replaceOne({$key: 1})
        initializeUnorderedBulkOp().find({}).replaceOne({$key: 1, key: 1})

    Same for initializeOrderedBulkOp().

Test Case 4: If find() matches multiple documents, replaceOne() replaces exactly one of them: Collection has {key: 1}, {key: 1}.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({key: 1}).replaceOne({key: 3})
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 0,
            "nMatched" : 1,
            "nModified" : 1,
            "nRemoved" : 0,
            "upserted" : [ ]
        }

    nModified is NULL or omitted if legacy server.

    .. code:: javascript

        collection.distinct('key') == [1, 3].

    Same for initializeOrderedBulkOp().

UPSERT-UPDATE

Test Case 1: upsert() requires find() first: initializeOrderedBulkOp().upsert() is a type error

    upsert().update() upserts a document, and doesn’t affect non-upsert updates in the same bulk operation. 'nUpserted’ is set:

    Empty collection.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({key: 1}).update({$set: {x: 1}})  // not an upsert
        batch.find({key: 2}).upsert().update({$set: {x: 2}})
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 1,
            "nMatched" : 0,
            "nModified" : 0,
            "nRemoved" : 0,
            "upserted" : [{ "index" : 1, "_id" : ObjectId(...)}]
        }

    nModified is NULL or omitted if legacy server.
    
    collection has only {_id: ObjectId(...), key: 2, x: 2}.

    Repeat the whole batch. Now nMatched == 1, nUpserted == 0.

    Same for initializeOrderedBulkOp().

Test Case 2: upsert().update() updates all matching documents: Collection starts with {key: 1}, {key: 1}.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({key: 1}).upsert().update({$set: {x: 1}})
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 0,
            "nMatched" : 2,
            "nModified" : 2,
            "nRemoved" : 0,
            "upserted" : [ ]
        }

    nModified is NULL or omitted if legacy server.
    
    collection has only {key: 1, x: 1}, {key: 1, x: 1}.

    Same for initializeOrderedBulkOp().

    We can upsert() a 16 MiB document—the driver can make a command document slightly larger than the max document size.

    Empty collection.
    
    .. code:: javascript

        var bigstring = "string of length 16 MiB - 30 bytes"
        batch = initializeUnorderedBulkOp()
        batch.find({key: 1}).upsert().update({$set: {x: bigstring}})
        batch.execute() succeeds.

    Same for initializeOrderedBulkOp().

UPSERT-UPDATE_ONE

Test Case 1: upsert().updateOne() upserts a document, and doesn’t affect non-upsert updateOnes in the same bulk operation. 'nUpserted’ is set: Empty collection.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({key: 1}).updateOne({$set: {x: 1}})  // not an upsert
        batch.find({key: 2}).upsert().updateOne({$set: {x: 2}})
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 1,
            "nMatched" : 0,
            "nModified" : 0,
            "nRemoved" : 0,
            "upserted" : [{ "index" : 1, "_id" : ObjectId(...)} ]
        }

    nModified is NULL or omitted if legacy server.

    collection contains only {key: 2, x: 2}.

    Same for initializeOrderedBulkOp().

Test Case 2: upsert().updateOne() only updates one matching document: Collection starts with {key: 1}, {key: 1}.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({key: 1}).upsert().updateOne({$set: {x: 1}})
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 0,
            "nMatched" : 1,
            "nModified" : 1,
            "nRemoved" : 0,
            "upserted" : [ ]
        }

    nModified is NULL or omitted if legacy server.

    collection has only {key: 1, x: 1}, {key: 1}.

UPSERT-REPLACE_ONE

Test Case 1: upsert().replaceOne() upserts a document, and doesn’t affect non-upsert replaceOnes in the same bulk operation. 'nUpserted’ is set: Empty collection.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({key: 1}).replaceOne({x: 1})  // not an upsert
        batch.find({key: 2}).upsert().replaceOne({x: 2})

        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 1,
            "nMatched" : 0,
            "nModified" : 0,
            "nRemoved" : 0,
            "upserted" : [{ "index" : 1, "_id" : ObjectId(...)}  ]
        }

    nModified is NULL or omitted if legacy server.
    
    collection contains {x: 2}.

    Same for initializeOrderedBulkOp().

Test Case 2: upsert().replaceOne() only replaces one matching document: Collection starts with {key: 1}, {key: 1}.

    .. code:: javascript

      batch = initializeUnorderedBulkOp()
      batch.find({key: 1}).upsert().replaceOne({x: 1})
      batch.execute() == {
          "writeErrors" : [ ],
          "writeConcernErrors" : [ ],
          "nInserted" : 0,
          "nUpserted" : 0,
          "nMatched" : 1,
          "nModified" : 1,
          "nRemoved" : 0,
          "upserted" : []
      }

    nModified is NULL or omitted if legacy server.

    collection has only {x: 1}, {key: 1}.

REMOVE

Test Case 1: remove() requires find() first: initializeUnorderedBulkOp().remove() is a type error

    Same for initializeOrderedBulkOp().

Test Case 2: Remove() with empty selector removes all documents: Collection starts with {key: 1}, {key: 1}.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({}).remove()
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 0,
            "nMatched" : 0,
            "nModified" : 0,
            "nRemoved" : 2,
            "upserted" : [ ]
        }

    nModified is NULL or omitted if legacy server.

    Collection is now empty.

    Same for initializeOrderedBulkOp().

Test Case 3: Remove() removes only matching documents: Collection starts with {key: 1}, {key: 2}.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({key: 1}).remove()
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 0,
            "nMatched" : 0,
            "nModified" : 0,
            "nRemoved" : 1,
            "upserted" : [ ]
        }

    nModified is NULL or omitted if legacy server.

    Collection contains only {key: 2}.

    Same for initializeOrderedBulkOp().

REMOVE_ONE

Test Case 1: removeOne() requires find() first: initializeUnorderedBulkOp().removeOne() is a type error

    Same for initializeOrderedBulkOp().

Test Case 2: If several documents match find(), removeOne() removes one: Collection has {key: 1}, {key: 1}.

    .. code:: javascript

        batch = initializeUnorderedBulkOp()
        batch.find({}).removeOne()
        batch.execute() == {
            "writeErrors" : [ ],
            "writeConcernErrors" : [ ],
            "nInserted" : 0,
            "nUpserted" : 0,
            "nMatched" : 0,
            "nModified" : 0,
            "nRemoved" : 1,
            "upserted" : [ ]
        }

    nModified is NULL or omitted if legacy server.

    collection.count() == 1.

    Same for initializeOrderedBulkOp().

MIXED OPERATIONS, UNORDERED

nMatched, nModified, nUpserted, nInserted, nRemoved are properly counted with an unordered bulk operation. The list of upserted documents is returned, with upserts’ indexes correctly rewritten. Collection contains {a: 1}, {a: 2}.

.. code:: javascript

batch = initializeUnorderedBulkOp()
batch.find({a: 1}).update({$set: {b: 1}})
batch.find({a: 2}).remove()
batch.insert({a: 3})
batch.find({a: 4}).upsert().updateOne({$set: {b: 4}})
result = batch.execute()

result['nMatched'] == 1
result['nModified'] == 1 // (nModified is NULL or omitted if legacy server.)
result['nUpserted'] == 1
result['nInserted'] == 1
result['nRemoved'] == 1

result['upserted'].length == 1 and result['upserted'][0]['index'] == 3.
result['upserted'][0]['_id'] is an ObjectId.

MIXED OPERATIONS, ORDERED

nMatched, nModified, nUpserted, nInserted, nRemoved are properly counted with an ordered bulk operation. The list of upserted documents is returned, with upserts’ indexes correctly rewritten.

Empty collection.

.. code:: javascript

batch = initializeOrderedBulkOp()
batch.insert({a: 1})
batch.find({a: 1}).updateOne({$set: {b: 1}})
batch.find({a: 2}).upsert().updateOne({$set: {b: 2}})
batch.insert({a: 3})
batch.find({a: 3}).remove()
result = batch.execute()

result['nInserted'] == 2
result['nUpserted'] == 1
result['nMatched'] == 1
result['nModified'] == 1 // (nModified is NULL or omitted if legacy server.)
result['nRemoved'] == 1

result['upserted'].length == 1 and result['upserted'][0]['index'] == 2.
result['upserted'][0]['_id'] is an ObjectId.

MIXED OPERATIONS, AUTH

Verify that auth failures are handled gracefully, especially in conjunction with other errors, such as write concern or normal write errors.

Example: Using user defined roles (UDR) create a user who can do insert but not remove and run an ordered batch performing both of these operations.

An ordered batch is expected to stop executing when the error is encountered, then raise the appropriate authentication error. If there have been write concern errors they may be lost. The behavior of an unordered batch is unspecified in the face of auth failure.

UNORDERED BATCH WITH ERRORS

nMatched, nModified, nUpserted, nInserted, nRemoved are properly counted with an unordered bulk operation that includes a write error. The list of upserted documents is returned, with upserts’ indexes correctly rewritten.

Empty collection, unique index on 'a'.

.. code:: javascript

batch = initializeUnorderedBulkOp()
batch.insert({b: 1, a: 1})
// one or two of these upserts fails:
batch.find({b: 2}).upsert().updateOne({$set: {a: 1}})
batch.find({b: 3}).upsert().updateOne({$set: {a: 2}})
batch.find({b: 2}).upsert().updateOne({$set: {a: 1}})
batch.insert({b: 4, a: 3})
// this and / or the first insert fails:
batch.insert({b: 5, a: 1})

batch.execute() should raise an error with some details:
error_details['nInserted'] == 2
error_details['nUpserted'] == 1
nMatched, nModified, nRemoved are 0.
(nModified is NULL or omitted if legacy server.)

error_details['upserted'].length == 1
error_details['upserted'][0]['index'] == 2
error_details['upserted'][0]['_id'] is an ObjectId
error_details['writeErrors'].length == 3
collection.distinct('a') == [1, 2, 3]

ORDERED BATCH WITH ERRORS

nMatched, nModified, nUpserted, nInserted, nRemoved are properly counted with an ordered bulk operation that includes a write error. The list of upserted documents is returned, with upserts’ indexes correctly rewritten.

Empty collection, unique index on 'a'.

.. code:: javascript

batch = initializeOrderedBulkOp()
batch.insert({b: 1, a: 1})
batch.find({b: 2}).upsert().updateOne({$set: {a: 1}})
batch.find({b: 3}).upsert().updateOne({$set: {a: 2}})
batch.find({b: 2}).upsert().updateOne({$set: {a: 1}})  // will fail
batch.insert({b: 4, a: 3})
batch.insert({b: 5, a: 1})

batch.execute() should raise an error with some details:
nUpserted, nMatched, nModified, nRemoved are 0
error_details['nInserted'] == 1
error_details['writeErrors'].length == 1
error = error_details['writeErrors'][0]
error['code'] == 11000
error['errmsg'] is a string.
error['index'] == 1
error['op'] == {q: {b: 2}, u: {$set: {a: 1}}, upsert: true}
collection.count() == 1  // subsequent inserts weren’t attempted

BATCH SPLITTING: maxBsonObjectSize

More than 16 MiB worth of inserts are split into multiple messages, and error indexes are rewritten. An unordered batch continues on error and returns the error after all messages are sent.

Empty collection.

.. code:: javascript

// Verify that the driver splits inserts into 16-MiB messages:
batch = initializeOrderedBulkOp()
for (i = 0; i < 6; i++) {
    batch.insert({_id: i, a: '4 MiB STRING'});
}

batch.insert({_id: 0})  // will fail
batch.insert({_id: 100})

batch.execute() fails with error details

error_details['nInserted'] == 6
error_details['writeErrors'].length == 1
error = error_details['writeErrors'][0]
error['code'] == 11000  // duplicate key
error['errmsg'] is a string.
error['index'] == 6  // properly rewritten error index

collection.count() == 6

Same for initializeUnorderedBulkOp, except:

.. code:: javascript

error_details['nInserted'] == 7
collection.count() == 7

BATCH SPLITTING: maxWriteBatchSize

More than 1000 documents to be inserted, updated, or removed are split into multiple messages, and error indexes are rewritten. An unordered batch continues on error and returns the error after all messages are sent. Similar test to the maxBsonObjectSize test. Note the server doesn’t yet enforce the maxWriteBatchSize limit, so incorrect code will appear to succeed.
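
A sketch of such a test, in the same pseudo-test style as the cases above, assuming the historical 1000-operation maxWriteBatchSize; the exact document counts are illustrative:

.. code:: javascript

// Hypothetical sketch: more than 1000 inserts plus a duplicate key near the end,
// so the rewritten error index can be verified.
batch = initializeOrderedBulkOp()
for (i = 0; i < 1001; i++) {
    batch.insert({_id: i});
}
batch.insert({_id: 0})  // will fail: duplicate key

batch.execute() fails with error details

error_details['nInserted'] == 1001
error_details['writeErrors'].length == 1
error_details['writeErrors'][0]['index'] == 1001  // properly rewritten error index
collection.count() == 1001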

RE-RUNNING A BATCH

A batch can only be executed once.

.. code:: javascript

batch = initializeOrderedBulkOp()
batch.insert({})
batch.execute()
batch.execute() a second time raises reasonable error.

Same for initializeUnorderedBulkOp().

EMPTY BATCH

execute() throws if the batch is empty.

.. code:: javascript

batch = initializeOrderedBulkOp()
batch.execute() with no operations raises a reasonable error.

Same for initializeUnorderedBulkOp().

NO JOURNAL

Attempting the 'j' write concern with a write command on mongod 2.6 is an error if mongod is started with --nojournal. This applies to bulk operations with mongod 2.4 as well, even though it returns {ok: 1, jnote: "journaling not enabled on this server"}; the driver must detect this message and turn it into an error.

mongod started with --nojournal.

.. code:: javascript

batch = initializeOrderedBulkOp()
batch.insert({})
batch.execute({j: 1}) raises error.

Same for initializeUnorderedBulkOp().

W > 1 AGAINST STANDALONE

On both 2.4 and 2.6, attempting write concern w > 1 against a non-replica-set mongod is an error.

Standalone mongod.

.. code:: javascript

batch = initializeOrderedBulkOp()
batch.insert({})
batch.execute({w: 2}) raises error.

Same for initializeUnorderedBulkOp().

WTIMEOUT PLUS DUPLICATE KEY ERROR

A single unordered batch can report both writeErrors and writeConcernErrors.

2-node replica set.

Empty collection.

.. code:: javascript

batch = initializeUnorderedBulkOp()   
batch.insert({_id: 1})
batch.insert({_id: 1})
batch.execute({w: 3, wtimeout: 1}) raises error with details.
error_details['nInserted'] == 1
error_details['writeErrors'].length == 1
error_details['writeErrors'][0]['index'] == 1
// code 64, "timed out" in 2.4
// code 64, "waiting for replication timed out" in 2.6
// code 100, "Not enough data-bearing nodes" in 3.0
error_details['writeConcernErrors'][0]['code'] either 64 or 100
error_details['writeConcernErrors'][0]['errmsg'] not empty

WTIMEOUT WITH MULTIPLE OPERATIONS

Multiple write concern errors are all reported.

2-node replica set.

Empty collection.

.. code:: javascript

batch = initializeOrderedBulkOp()
batch.insert({_id: 1})
batch.find({}).remove()
batch.execute({w: 3, wtimeout: 1}) raises error with details.
error_details['nInserted'] == 1
error_details['nRemoved'] == 1
error_details['writeErrors'].length == 0
// code 64, "timed out" in 2.4
// code 64, "waiting for replication timed out" in 2.6
// code 100, "Not enough data-bearing nodes" in 3.0

wc_errors = error_details['writeConcernErrors']
wc_errors.length == 2
for (i = 0; i < 2; i++) {
  wc_errors[i]['code'] either 64 or 100
  wc_errors[i]['errmsg'] is not empty
}

W = 0

A batch with w: 0 doesn’t report write errors.

Empty collection.

.. code:: javascript

batch = initializeOrderedBulkOp()
batch.insert({_id: 1})
batch.insert({_id: 1})
batch.execute({w: 0}) raises no error.
collection.count() == 1.

Same for initializeUnorderedBulkOp(), except collection.count() == 2.

FAILOVER WITH MIXED VERSIONS

The driver detects when the primary’s max wire protocol version increases or decreases, and the driver correctly switches between using write commands and using legacy write operations.

2-node replica set. One node runs 2.4, the other runs 2.6.

.. code:: javascript

client = MongoReplicaSetClient()
// Switch primary:
client.admin.command({replSetStepDown: 5})  // can’t be primary for 5 seconds

batch = client.db.collection.initializeOrderedBulkOp()
batch.insert({_id: 1})
batch.execute() should succeed

sleep 6 seconds

.. code:: javascript

// Switch back to original primary
client.admin.command({replSetStepDown: 5})
batch = client.db.collection.initializeOrderedBulkOp()
batch.insert({_id: 2}).execute() should succeed

Changelog

  • 2023-10-17: Updated maxWriteBatchSize default to 100,000 from 1000.

  • 2022-10-05: Remove spec front matter and reformat changelog. Consolidated changelog entries prior to the first published version of this document, since exact dates were unavailable.

  • 2021-05-27: Removed "Test Case 3: Key validation, no $-prefixed keys allowed" for insert.

  • 2015-10-23: Clarify that the "writeConcernErrors" field is plural.

  • 2015-05-22: First public version of the specification. Deprecated this specification in favor of the CRUD API. Merged in test cases from QA tickets. Specification cleanup and increased precision. Suggested error handling for languages commonly raising exceptions. Narrowed the writeConcern reporting requirement. Renamed nUpdated to nMatched to reflect that it is the number of matched documents, not the number of modified documents.

Bulk Write

  • Status: Accepted
  • Minimum Server Version: 8.0

Abstract

This specification defines the driver API for the bulkWrite server command introduced in MongoDB 8.0. The API defined in this specification allows users to perform insert, update, and delete operations against mixed namespaces in a minimized number of round trips, and to receive detailed results for each operation performed. This API is distinct from the collection-level bulkWrite method defined in the CRUD specification and the deprecated bulk write specification.

Specification

[!NOTE] The BulkWriteOptions, BulkWriteResult, and BulkWriteException types defined in this specification are similar to those used for the MongoCollection.bulkWrite method. Statically typed drivers MUST NOT reuse their existing definitions for these types for the MongoClient.bulkWrite API and MUST introduce new types. If naming conflicts arise, drivers SHOULD prepend "Client" to the new type names (e.g. ClientBulkWriteOptions).

MongoClient.bulkWrite Interface

interface MongoClient {
    /**
     * Executes a list of mixed write operations.
     *
     * @throws BulkWriteException
     */
    bulkWrite(models: NamespaceWriteModelPair[], options: Optional<BulkWriteOptions>): BulkWriteResult;
}

Write Models

A WriteModel defines a single write operation to be performed as part of a bulk write.

/**
 * Unifying interface for the various write model types. Drivers may also use an enum with
 * variants for each write model for this type.
 */
interface WriteModel {}

class InsertOneModel implements WriteModel {
    /**
     * The document to insert.
     */
    document: Document;
}

class UpdateOneModel implements WriteModel {
    /**
     * The filter to apply.
     */
    filter: Document;

    /**
     * The update document or pipeline to apply to the selected document.
     */
    update: (Document | Document[]);

    /**
     * A set of filters specifying to which array elements an update should apply.
     *
     * This option is sent only if the caller explicitly provides a value.
     */
    arrayFilters: Optional<Document[]>;

    /**
     * Specifies a collation.
     *
     * This option is sent only if the caller explicitly provides a value.
     */
    collation: Optional<Document>;

    /**
     * The index to use. Specify either the index name as a string or the index key pattern. If
     * specified, then the query system will only consider plans using the hinted index.
     *
     * This option is only sent if the caller explicitly provides a value.
     */
    hint: Optional<Document | String>;

    /**
     * When true, creates a new document if no document matches the query.
     *
     * This option is only sent if the caller explicitly provides a value. The server's default
     * value is false.
     */
    upsert: Optional<Boolean>;
}

class UpdateManyModel implements WriteModel {
    /**
     * The filter to apply.
     */
    filter: Document;

    /**
     * The update document or pipeline to apply to the selected documents.
     */
    update: (Document | Document[]);

    /**
     * A set of filters specifying to which array elements an update should apply.
     *
     * This option is sent only if the caller explicitly provides a value.
     */
    arrayFilters: Optional<Document[]>;

    /**
     * Specifies a collation.
     *
     * This option is sent only if the caller explicitly provides a value.
     */
    collation: Optional<Document>;

    /**
     * The index to use. Specify either the index name as a string or the index key pattern. If
     * specified, then the query system will only consider plans using the hinted index.
     *
     * This option is only sent if the caller explicitly provides a value.
     */
    hint: Optional<Document | String>;

    /**
     * When true, creates a new document if no document matches the query.
     *
     * This option is only sent if the caller explicitly provides a value. The server's default
     * value is false.
     */
    upsert: Optional<Boolean>;
}

class ReplaceOneModel implements WriteModel {
    /**
     * The filter to apply.
     */
    filter: Document;

    /**
     * The replacement document.
     */
    replacement: Document;

    /**
     * Specifies a collation.
     *
     * This option is sent only if the caller explicitly provides a value.
     */
    collation: Optional<Document>;

    /**
     * The index to use. Specify either the index name as a string or the index key pattern. If
     * specified, then the query system will only consider plans using the hinted index.
     *
     * This option is only sent if the caller explicitly provides a value.
     */
    hint: Optional<Document | String>;

    /**
     * When true, creates a new document if no document matches the query.
     *
     * This option is only sent if the caller explicitly provides a value. The server's default
     * value is false.
     */
    upsert: Optional<Boolean>;
}

class DeleteOneModel implements WriteModel {
    /**
     * The filter to apply.
     */
    filter: Document;

    /**
     * Specifies a collation.
     *
     * This option is sent only if the caller explicitly provides a value.
     */
    collation: Optional<Document>;

    /**
     * The index to use. Specify either the index name as a string or the index key pattern. If
     * specified, then the query system will only consider plans using the hinted index.
     *
     * This option is only sent if the caller explicitly provides a value.
     */
    hint: Optional<Document | String>;
}

class DeleteManyModel implements WriteModel {
    /**
     * The filter to apply.
     */
    filter: Document;

    /**
     * Specifies a collation.
     *
     * This option is sent only if the caller explicitly provides a value.
     */
    collation: Optional<Document>;

    /**
     * The index to use. Specify either the index name as a string or the index key pattern. If
     * specified, then the query system will only consider plans using the hinted index.
     *
     * This option is only sent if the caller explicitly provides a value.
     */
    hint: Optional<Document | String>;
}

Each write model provided to MongoClient.bulkWrite in the models parameter MUST have a corresponding namespace that defines the collection on which the operation should be performed. Drivers SHOULD design this pairing in whichever way is most idiomatic for its language. For example, drivers may:

  • Include a required namespace field on each WriteModel variant and accept a list of WriteModel objects for the models parameter.
  • Accept a list of (Namespace, WriteModel) tuples for models.
  • Define the following pair class:
class NamespaceWriteModelPair {
    /**
     * The namespace on which to perform the write.
     */
    namespace: Namespace;

    /**
     * The write to perform.
     */
    model: WriteModel;
}

Drivers MUST throw an exception if the list provided for models is empty.
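
As an illustration, a driver using the pair class above might be invoked as follows. This is a sketch only: the Namespace and write-model constructors, the collection names, and the option values are assumptions, not requirements of this specification.

const models: NamespaceWriteModelPair[] = [
    { namespace: new Namespace("db", "authors"), model: new InsertOneModel({ name: "King" }) },
    { namespace: new Namespace("db", "books"), model: new DeleteOneModel({ title: "It" }) },
];

// An empty models list must be rejected before anything is sent to the server.
const result = client.bulkWrite(models, { verboseResults: true });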

Update vs. replace document validation

Update documents provided in UpdateOne and UpdateMany write models are required to contain only atomic modifiers (i.e. keys that start with "$"). Drivers MUST throw an error if an update document is empty or if the document's first key does not start with "$". Drivers MUST rely on the server to return an error if any other entries in the update document are not atomic modifiers. Drivers are not required to perform validation on update pipelines.

Replacement documents provided in ReplaceOne write models are required not to contain atomic modifiers. Drivers MUST throw an error if a replacement document is nonempty and its first key starts with "$". Drivers MUST rely on the server to return an error if any other entries in the replacement document are atomic modifiers.
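
A minimal sketch of these first-key checks, assuming a plain-object Document type; the helper names are illustrative, not part of this specification:

type Document = Record<string, unknown>;

function validateUpdateDocument(update: Document): void {
    const keys = Object.keys(update);
    // Must be non-empty and must begin with an atomic modifier ("$"-prefixed key).
    if (keys.length === 0 || !keys[0].startsWith("$")) {
        throw new Error("update document must be non-empty and begin with an atomic modifier");
    }
    // Any remaining non-modifier entries are left for the server to reject.
}

function validateReplacementDocument(replacement: Document): void {
    const keys = Object.keys(replacement);
    // Must not begin with an atomic modifier; an empty replacement is allowed.
    if (keys.length > 0 && keys[0].startsWith("$")) {
        throw new Error("replacement document must not contain atomic modifiers");
    }
    // Any remaining modifier entries are left for the server to reject.
}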

Options

class BulkWriteOptions {
    /**
     * Whether the operations in this bulk write should be executed in the order in which they were
     * specified. If false, writes will continue to be executed if an individual write fails. If
     * true, writes will stop executing if an individual write fails.
     *
     * Defaults to true.
     */
    ordered: Optional<Boolean>;

    /**
     * If true, allows the writes to opt out of document-level validation.
     *
     * This option is only sent if the caller explicitly provides a value. The server's default
     * value is false.
     */
    bypassDocumentValidation: Optional<Boolean>;

    /**
     * A map of parameter names and values to apply to all operations within the bulk write. Values
     * must be constant or closed expressions that do not reference document fields. Parameters can
     * then be accessed as variables in an aggregate expression context (e.g. "$$var").
     *
     * This option is only sent if the caller explicitly provides a value.
     */
    let: Optional<Document>;

    /**
     * The write concern to use for this bulk write.
     */
    writeConcern: Optional<WriteConcern>;

    /**
     * Enables users to specify an arbitrary comment to help trace the operation through
     * the database profiler, currentOp and logs.
     *
     * This option is only sent if the caller explicitly provides a value.
     */
    comment: Optional<BSON value>;

    /**
     * Whether detailed results for each successful operation should be included in the returned
     * BulkWriteResult.
     *
     * Defaults to false. This value corresponds inversely to the errorsOnly field in the bulkWrite
     * command.
     */
    verboseResults: Optional<Boolean>;
}

Result

class BulkWriteResult {
    /**
     * Indicates whether this write result was acknowledged. If not, then all other members of this
     * result will be undefined.
     *
     * NOT REQUIRED TO IMPLEMENT. See below for more guidance on modeling unacknowledged results.
     */
    acknowledged: Boolean;

    /**
     * Indicates whether the results are verbose. If false, the insertResults, updateResults, and
     * deleteResults fields in this result will be undefined.
     *
     * NOT REQUIRED TO IMPLEMENT. See below for other ways to differentiate summary results from
     * verbose results.
     */
    hasVerboseResults: Boolean;

    /**
     * The total number of documents inserted across all insert operations.
     */
    insertedCount: Int64;

    /**
     * The total number of documents upserted across all update operations.
     */
    upsertedCount: Int64;

    /**
     * The total number of documents matched across all update operations.
     */
    matchedCount: Int64;

    /**
     * The total number of documents modified across all update operations.
     */
    modifiedCount: Int64;

    /**
     * The total number of documents deleted across all delete operations.
     */
    deletedCount: Int64;

    /**
     * The results of each individual insert operation that was successfully performed.
     */
    insertResults: Map<Int64, InsertOneResult>;

    /**
     * The results of each individual update operation that was successfully performed.
     */
    updateResults: Map<Int64, UpdateResult>;

    /**
     * The results of each individual delete operation that was successfully performed.
     */
    deleteResults: Map<Int64, DeleteResult>;
}

class InsertOneResult {
    /**
     * The _id of the inserted document.
     */
    insertedId: Any;
}

class UpdateResult {
    /**
     * The number of documents that matched the filter.
     */
    matchedCount: Int64;

    /**
     * The number of documents that were modified.
     */
    modifiedCount: Int64;

    /**
     * The _id field of the upserted document if an upsert occurred.
     *
     * It MUST be possible to discern between a BSON Null upserted ID value and this field being
     * unset. If necessary, drivers MAY add a didUpsert boolean field to differentiate between
     * these two cases.
     */
    upsertedId: Optional<BSON value>;
}

class DeleteResult {
    /**
     * The number of documents that were deleted.
     */
    deletedCount: Int64;
}

Unacknowledged results

BulkWriteResult has an optional acknowledged field to indicate whether the result was acknowledged. This is not required to implement. Drivers should follow the guidance in the CRUD specification to determine how to model unacknowledged results.

Summary vs. verbose results

Users MUST be able to discern whether a BulkWriteResult contains summary or verbose results without inspecting the value provided for verboseResults in BulkWriteOptions. Drivers MUST implement this in one of the following ways:

  • Expose the hasVerboseResults field in BulkWriteResult as defined above. Document that insertResults, updateResults, and deleteResults will be undefined when hasVerboseResults is false. Raise an error if a user tries to access one of these fields when hasVerboseResults is false.
  • Implement the insertResults, updateResults, and deleteResults fields as optional types and document that they will be unset when verboseResults is false.
  • Introduce separate SummaryBulkWriteResult and VerboseBulkWriteResult types. VerboseBulkWriteResult MUST have all of the required fields defined on BulkWriteResult above. SummaryBulkWriteResult MUST have all of the required fields defined on BulkWriteResult above except insertResults, updateResults, and deleteResults.

Individual results

The InsertOneResult, UpdateResult, and DeleteResult classes are the same as or similar to types of the same name defined in the CRUD specification. Drivers MUST redefine these classes if their existing result classes deviate from the definitions in this specification (e.g. if they contain acknowledgement information, which is not applicable for individual bulk write operations). Drivers MAY reuse their existing types for these classes if they match the ones defined here exactly.

Exception

class BulkWriteException {
    /**
     * A top-level error that occurred when attempting to communicate with the server or execute
     * the bulk write. This value may not be populated if the exception was thrown due to errors
     * occurring on individual writes.
     */
    error: Optional<Error>;

    /**
     * Write concern errors that occurred while executing the bulk write. This list may have
     * multiple items if more than one server command was required to execute the bulk write.
     */
    writeConcernErrors: WriteConcernError[];

    /**
     * Errors that occurred during the execution of individual write operations. This map will
     * contain at most one entry if the bulk write was ordered.
     */
    writeErrors: Map<Int64, WriteError>;

    /**
     * The results of any successful operations that were performed before the error was
     * encountered.
     */
    partialResult: Optional<BulkWriteResult>;
}

Building a bulkWrite Command

The bulkWrite server command has the following format:

{
    "bulkWrite": 1,
    "ops": <Array>,
    "nsInfo": <Array>,
    "errorsOnly": Optional<Boolean>,
    "ordered": Optional<Boolean>,
    "bypassDocumentValidation": Optional<Boolean>,
    "comment": Optional<BSON value>,
    "let": Optional<Document>,
    ...additional operation-agnostic fields
}

Drivers MUST use document sequences (OP_MSG payload type 1) for the ops and nsInfo fields.

The bulkWrite command is executed on the "admin" database.

Operations

The ops field is a list of write operation documents. The first entry in each document has the name of the operation (i.e. "insert", "update", or "delete") as its key and the index in the nsInfo array of the namespace on which the operation should be performed as its value. The documents have the following format:

Insert

{
    "insert": <Int32>,
    "document": <Document>
}

If the document to be inserted does not contain an _id field, drivers MUST generate a new ObjectId and add it as the _id field at the beginning of the document.
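
A sketch of this driver-side step; prependField and the ObjectId construction stand in for however a given driver builds BSON documents with a specific key order:

function prepareInsertDocument(doc: Document): Document {
    if ("_id" in doc) {
        return doc;
    }
    // The generated _id must become the first field of the document sent to the server.
    return prependField(doc, "_id", new ObjectId());
}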

Update

{
    "update": <Int32>,
    "filter": <Document>,
    "updateMods": <Document | Array>,
    "multi": Optional<Boolean>,
    "upsert": Optional<Boolean>,
    "arrayFilters": Optional<Array>,
    "hint": Optional<Document | String>
}

Delete

{
    "delete": <Int32>,
    "filter": <Document>,
    "multi": Optional<Boolean>,
    "hint": Optional<Document | String>,
    "collation": Optional<Document>
}

Namespace Information

The nsInfo field is an array containing the namespaces on which the write operations should be performed. Drivers MUST NOT include duplicate namespaces in this list. The documents in the nsInfo array have the following format:

{
    "ns": <String>
}

errorsOnly and verboseResults

The errorsOnly field indicates whether the results cursor returned in the bulkWrite response should contain only errors and omit individual results. If false, both individual results for successful operations and errors will be returned. This field is optional and defaults to false on the server.

errorsOnly corresponds inversely to the verboseResults option defined on BulkWriteOptions. If the user specified a value for verboseResults, drivers MUST define errorsOnly as the opposite of verboseResults. If the user did not specify a value for verboseResults, drivers MUST define errorsOnly as true.

ordered

The ordered field defines whether writes should be executed in the order in which they were specified, and, if an error occurs, whether the server should halt execution of further writes. It is optional and defaults to true on the server. Drivers MUST explicitly define ordered as true in the bulkWrite command if a value is not specified in BulkWriteOptions. This is required to avoid inconsistencies between server and driver behavior if the server default changes in the future.
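
A sketch of how both fields might be derived from BulkWriteOptions when building the command document; the options variable and the Document type are illustrative:

const command: Document = {
    bulkWrite: 1,
    // errorsOnly is the inverse of verboseResults; when verboseResults is unspecified,
    // errorsOnly defaults to true.
    errorsOnly: options.verboseResults === undefined ? true : !options.verboseResults,
    // ordered is always sent explicitly, defaulting to true when unspecified.
    ordered: options.ordered === undefined ? true : options.ordered,
};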

Size Limits

The server reports a maxBsonObjectSize in its hello response. This value defines the maximum size for documents that are inserted into the database. Documents that are sent to the server but are not intended to be inserted into the database (e.g. command documents) have a size limit of maxBsonObjectSize + 16KiB. When an acknowledged write concern is used, drivers MUST NOT perform any checks related to these size limits and MUST rely on the server to raise an error if a limit is exceeded. However, when an unacknowledged write concern is used, drivers MUST raise an error if one of the following limits is exceeded:

  • The size of a document to be inserted MUST NOT exceed maxBsonObjectSize. This applies to the document field of an InsertOneModel and the replacement field of a ReplaceOneModel.
  • The size of an entry in the ops array MUST NOT exceed maxBsonObjectSize + 16KiB.
  • The size of the bulkWrite command document MUST NOT exceed maxBsonObjectSize + 16KiB.

See SERVER-10643 for more details on these size limits.
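
For unacknowledged write concerns, the checks might look like the following sketch; numBytes is a placeholder for a driver's BSON size calculation:

const EXTRA_COMMAND_DOCUMENT_BYTES = 16 * 1024; // allowance for documents not inserted into the database

function checkUnacknowledgedSizeLimits(insertOrReplacementDoc: Document, opsEntry: Document,
                                       commandDoc: Document, maxBsonObjectSize: number): void {
    if (numBytes(insertOrReplacementDoc) > maxBsonObjectSize) {
        throw new Error("document to insert exceeds maxBsonObjectSize");
    }
    if (numBytes(opsEntry) > maxBsonObjectSize + EXTRA_COMMAND_DOCUMENT_BYTES) {
        throw new Error("ops entry exceeds maxBsonObjectSize + 16KiB");
    }
    if (numBytes(commandDoc) > maxBsonObjectSize + EXTRA_COMMAND_DOCUMENT_BYTES) {
        throw new Error("bulkWrite command document exceeds maxBsonObjectSize + 16KiB");
    }
}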

Auto-Encryption

If MongoClient.bulkWrite is called on a MongoClient configured with AutoEncryptionOpts, drivers MUST return an error with the message: "bulkWrite does not currently support automatic encryption".

This is expected to be removed once DRIVERS-2888 is implemented.

Command Batching

Drivers MUST accept an arbitrary number of operations as input to the MongoClient.bulkWrite method. Because the server imposes restrictions on the size of write operations, this means that a single call to MongoClient.bulkWrite may require multiple bulkWrite commands to be sent to the server. Drivers MUST split bulk writes into separate commands when the user's list of operations exceeds one or more of these maximums: maxWriteBatchSize, maxBsonObjectSize (for OP_MSG payload type 0), and maxMessageSizeBytes (for OP_MSG payload type 1). Each of these values can be retrieved from the selected server's hello command response. Drivers MUST merge results from multiple batches into a single BulkWriteResult or BulkWriteException to return from MongoClient.bulkWrite.

When constructing the nsInfo array for a bulkWrite batch, drivers MUST only include the namespaces that are referenced in the ops array for that batch.

Number of Writes

maxWriteBatchSize defines the total number of writes allowed in one command. Drivers MUST split a bulk write into multiple commands if the user provides more than maxWriteBatchSize operations in the argument for models.
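
A sketch of splitting on operation count alone; size-based splitting is handled separately, as shown in the next section:

function splitByCount<T>(models: T[], maxWriteBatchSize: number): T[][] {
    const batches: T[][] = [];
    for (let i = 0; i < models.length; i += maxWriteBatchSize) {
        batches.push(models.slice(i, i + maxWriteBatchSize));
    }
    return batches;
}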

Total Message Size

Drivers MUST ensure that the total size of the OP_MSG built for each bulkWrite command does not exceed maxMessageSizeBytes.

The upper bound for the size of an OP_MSG includes opcode-related bytes (e.g. the OP_MSG header) and operation-agnostic command field bytes (e.g. txnNumber, lsid). Drivers MUST limit the combined size of the bulkWrite command document (excluding command-agnostic fields), ops document sequence, and nsInfo document sequence to maxMessageSizeBytes - 1000 to account for this overhead. The following pseudocode demonstrates how to apply this limit in batch-splitting logic:

MESSAGE_OVERHEAD_BYTES = 1000

bulkWriteCommand = Document { "bulkWrite": 1 }
bulkWriteCommand.appendOptions(bulkWriteOptions)

maxOpsNsInfoBytes = maxMessageSizeBytes - (MESSAGE_OVERHEAD_BYTES + bulkWriteCommand.numBytes())

while (writeModels.hasNext()) {
    ops = DocumentSequence {}
    nsInfo = DocumentSequence {}
    while (true) {
        if (!writeModels.hasNext()) {
            break
        }
        model = writeModels.next()

        modelDoc = model.toOpsDoc()
        bytesAdded = modelDoc.numBytes()

        nsInfoDoc = null
        if (!nsInfo.contains(model.namespace)) {
            nsInfoDoc = model.namespace.toNsInfoDoc()
            bytesAdded += nsInfoDoc.numBytes()
        }

        newSize = ops.numBytes() + nsInfo.numBytes() + bytesAdded
        if (newSize > maxOpsNsInfoBytes) {
            break
        } else {
            ops.push(modelDoc)
            if (nsInfoDoc != null) {
                nsInfo.push(nsInfoDoc)
            }
        }
    }

    // construct and send OP_MSG
}

See this Q&A entry for more details on how the overhead allowance was determined.

Drivers MUST return an error if there is not room to add at least one operation to ops.

Handling the bulkWrite Server Response

The server's response to bulkWrite has the following format:

{
    "ok": <0 | 1>,
    "cursor": {
        "id": <Int64>,
        "firstBatch": <Array>,
        "ns": <String>
    },
    "nErrors": <Int32>,
    "nInserted": <Int32>,
    "nUpserted": <Int32>,
    "nMatched": <Int32>,
    "nModified": <Int32>,
    "nDeleted": <Int32>,
    ...additional command-agnostic fields
}

If any operations were successful (i.e. nErrors is less than the number of operations that were sent), drivers MUST record the summary count fields in a BulkWriteResult to be returned to the user or embedded in a BulkWriteException. Drivers MUST NOT populate the partialResult field in BulkWriteException if no operations were successful.

Drivers MUST attempt to consume the contents of the cursor returned in the server's bulkWrite response before returning to the user. This is required regardless of whether the user requested verbose or summary results, as the results cursor always contains any write errors that occurred. If the cursor contains a nonzero cursor ID, drivers MUST perform getMore until the cursor has been exhausted. Drivers MUST use the same session used for the bulkWrite command for each getMore call. When connected to a load balancer, drivers MUST use the connection used for the bulkWrite command to create the cursor to ensure the same server is targeted.
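
A sketch of exhausting the cursor; runCommand, the session and connection handles, and the per-document handling are placeholders for driver internals:

// The cursor namespace from the response identifies the database and collection for getMore.
const ns = response.cursor.ns;
const database = ns.substring(0, ns.indexOf("."));
const collection = ns.substring(ns.indexOf(".") + 1);

let cursorId = response.cursor.id;
let documents = response.cursor.firstBatch;
while (true) {
    for (const doc of documents) {
        handleResultDocument(doc); // records write errors and, if requested, verbose results
    }
    if (cursorId === 0) {
        break;
    }
    // Reuse the bulkWrite session (and, behind a load balancer, the bulkWrite connection).
    const reply = runCommand(database, { getMore: cursorId, collection: collection },
                             { session: bulkWriteSession, connection: bulkWriteConnection });
    cursorId = reply.cursor.id;
    documents = reply.cursor.nextBatch;
}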

The documents in the results cursor have the following format:

{
    "ok": <0 | 1>,
    "idx": Int32,
    "code": Optional<Int32>,
    "errmsg": Optional<String>,
    "errInfo": Optional<Document>,
    "n": <Int32>,
    "nModified": Optional<Int32>,
    "upserted": Optional<Document with "_id" field>
}

If an error occurred (i.e. the value for ok is 0), the code, errmsg, and optionally errInfo fields will be populated with details about the failure.

If the write succeeded (i.e. the value for ok is 1), n, nModified, and upserted will be populated with the following values based on the type of write:

| Response Field | Insert | Update | Delete |
| -------------- | ------ | ------ | ------ |
| n | The number of documents that were inserted. | The number of documents that matched the filter. | The number of documents that were deleted. |
| nModified | Not present. | The number of documents that were modified. | Not present. |
| upserted | Not present. | A document containing the _id value for the upserted document. Only present if an upsert took place. | Not present. |

Note that the responses do not contain information about the type of operation that was performed. Drivers may need to maintain the user's list of write models to infer which type of result should be recorded based on the value of idx.

Handling Insert Results

Unlike the other result types, InsertOneResult contains an insertedId field that is generated driver-side, either by recording the _id field present in the user's insert document or creating and adding one. Drivers MUST only record these insertedId values in a BulkWriteResult when a successful response for the insert operation (i.e. { "ok": 1, "n": 1 }) is received in the results cursor. This ensures that drivers only report an insertedId when it is confirmed that the insert succeeded.
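
A sketch of recording a successful cursor document against the user's write models; the result maps, the batch-offset bookkeeping, and the loose typing of the cursor document are illustrative:

function recordSuccessfulResult(doc: any, writeModels: WriteModel[], result: BulkWriteResult): void {
    // doc.idx is relative to the ops array of the command that produced it; add the
    // batch's starting offset when the bulk write was split into multiple commands.
    const model = writeModels[doc.idx];
    if (model instanceof InsertOneModel) {
        // Only report an insertedId once the server confirms the insert succeeded.
        if (doc.n === 1) {
            result.insertResults.set(doc.idx, { insertedId: model.document._id });
        }
    } else if (model instanceof DeleteOneModel || model instanceof DeleteManyModel) {
        result.deleteResults.set(doc.idx, { deletedCount: doc.n });
    } else {
        result.updateResults.set(doc.idx, {
            matchedCount: doc.n,
            modifiedCount: doc.nModified,
            upsertedId: doc.upserted !== undefined ? doc.upserted._id : undefined,
        });
    }
}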

Handling Errors

Top-Level Errors

A top-level error is any error that occurs that is not the result of a single write operation failing or a write concern error. Examples include network errors that occur when communicating with the server, command errors ({ "ok": 0 }) returned from the server, client-side errors, and errors that occur when attempting to perform a getMore to retrieve results from the server.

When a top-level error is caused by a command error (i.e. an { "ok": 0 } server response), drivers MUST provide access to the raw server reply in the error returned to the user.

When a top-level error is encountered and individual results and/or errors have already been observed, drivers MUST embed the top-level error within a BulkWriteException as the error field to retain this information. Otherwise, drivers MAY throw an exception containing only the top-level error.

Encountering a top-level error MUST halt execution of a bulk write for both ordered and unordered bulk writes. This means that drivers MUST NOT attempt to retrieve more responses from the cursor or execute any further bulkWrite batches and MUST immediately throw an exception. If the results cursor has not been exhausted on the server when a top-level error occurs, drivers MUST send the killCursors command to attempt to close it. The result returned from the killCursors command MAY be ignored.

Write Concern Errors

Write concern errors are recorded in the writeConcernErrors field on BulkWriteException. When a write concern error is encountered, it should not terminate execution of the bulk write for either ordered or unordered bulk writes. However, drivers MUST throw an exception at the end of execution if any write concern errors were observed.

Individual Write Errors

Individual write errors retrieved from the cursor are recorded in the writeErrors field on BulkWriteException. If an individual write error is encountered during an ordered bulk write, drivers MUST record the error in writeErrors and immediately throw the exception. Otherwise, drivers MUST continue to iterate the results cursor and execute any further bulkWrite batches.

Test Plan

The majority of tests for MongoClient.bulkWrite are written in the Unified Test Format and reside in the CRUD unified tests directory.

Additional prose tests are specified here. These tests require constructing very large documents to test batch splitting, which is not feasible in the unified test format at the time of writing this specification.

Future Work

Retry bulkWrite when getMore fails with a retryable error

When a getMore fails with a retryable error when attempting to iterate the results cursor, drivers could retry the entire bulkWrite command to receive a fresh cursor and retry iteration. This work was omitted to minimize the scope of the initial implementation and testing of the new bulk write API, but may be revisited in the future.

Q&A

Why are we adding a new bulk write API rather than updating the MongoCollection.bulkWrite implementation?

The new bulkWrite command is only available in MongoDB 8.0+, so it cannot function as a drop-in replacement for the existing bulk write implementation that uses the insert, update, and delete commands. Additionally, because the new bulkWrite command allows operations against multiple collections and databases, MongoClient is a more appropriate place to expose its functionality.

Why can't drivers reuse existing bulk write types?

This specification introduces several types that are similar to existing types used in the MongoCollection.bulkWrite API. Although these types are similar now, they may diverge in the future with the introduction of new options and features to the bulkWrite command. Introducing new types also provides more clarity to users on the existing differences between the collection-level and client-level bulk write APIs. For example, the verboseResults option is only available for MongoClient.bulkWrite.

Why are bulk write operation results returned in a cursor?

Returning results via a cursor rather than an array in the bulkWrite response allows full individual results and errors to be returned without the risk of the response exceeding the maximum BSON object size. Using a cursor also leaves open the opportunity to add findAndModify to the list of supported write operations in the future.

Why was the verboseResults option introduced, and why is its default false?

The bulkWrite command returns top-level summary result counts and, optionally, individual results for each operation. Compiling the individual results server-side and consuming these results driver-side is less performant than only recording the summary counts. We expect that most users are not interested in the individual results of their operations and that most users will rely on defaults, so verboseResults defaults to false to improve performance in the common case.

Why should drivers send bypassDocumentValidation: false for bulkWrite?

DRIVERS-450 introduced a requirement that drivers only send a value for bypassDocumentValidation on write commands if it was specified as true. The original motivation for this change is not documented. This specification requires that drivers send bypassDocumentValidation in the bulkWrite command if it is set by the user in BulkWriteOptions, regardless of its value.

Explicitly defining bypassDocumentValidation: false aligns with the server's default to perform schema validation and thus has no effect. However, checking the value of an option that the user specified and omitting it from the command document if it matches the server's default creates unnecessary work for drivers. Always sending the user's specified value also safeguards against the unlikely event that the server changes the default value for bypassDocumentValidation in the future.

Why is providing access to the raw server response when a command error occurs required?

This allows users to access new error fields that the server may add in the future without needing to upgrade their driver version. See DRIVERS-2385 for more details.

Why are drivers required to send nsInfo as a document sequence?

nsInfo could exceed maxBsonObjectSize if a user is doing maxWriteBatchSize operations, each operation is on a unique namespace, and each namespace is near the maximum length allowed for namespaces given the values for these limits at the time of writing this specification. Providing nsInfo as a document sequence reduces the likelihood that a driver would need to batch split a user's bulk write in this scenario.

How was the OP_MSG overhead allowance determined?

The Command Batching Total Message Size section uses a 1000 byte overhead allowance to approximate the number of non-bulkWrite-specific bytes contained in an OP_MSG sent for a bulkWrite batch. This number was determined by constructing OP_MSG messages with various fields attached to the command, including startTransaction, autocommit, and apiVersion. Additional room was allocated to allow for future additions to the OP_MSG structure or the introduction of new command-agnostic fields.

Drivers are required to use this value even if they are capable of determining the exact size of the message prior to batch-splitting to standardize implementations across drivers and simplify batch-splitting testing.

Changelog

  • 2024-05-17: Update specification status to "Accepted".

  • 2024-05-10: Improve rendered format for JSON-like code blocks.

  • 2024-05-08: Bulk write specification created.

.. role:: javascript(code)
   :language: javascript

======================
Read and Write Concern
======================

:Status: Accepted
:Minimum Server Version: 2.4

.. contents::


Abstract

A driver must support configuring and sending read concern and write concerns to a server. This specification defines the API drivers must implement as well as how that API is translated into messages for communication with the server.

META

The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119 <https://www.ietf.org/rfc/rfc2119.txt>_.

Terminology

MaxWireVersion
    The ``maxWireVersion`` value reported by the ``hello`` command.

Server Selection
    The process of selecting a server to read or write from. See the `server selection specification <https://github.com/mongodb/specifications/tree/master/source/server-selection>`_.

Specification

This specification includes guidance for implementing `Read Concern`_ and `Write Concern`_ in a driver. It does not define how read and write concern behave or are implemented on the server.


Read Concern

For naming and deviation guidance, see the `CRUD specification <../crud/crud.md#naming>`__. Defined below are the constructs for drivers.

.. code:: typescript

enum ReadConcernLevel {
  /**
   * This is rendered as "local" (lower-case) on the wire.
   */
  local,

  /**
   * This is rendered as "majority" (lower-case) on the wire.
   */
  majority,

  /**
   * This is rendered as "linearizable" (lower-case) on the wire.
   */
  linearizable,

  /**
   * This is rendered as "available" (lower-case) on the wire.
   */
  available,

  /**
   * This is rendered as "snapshot" (lower-case) on the wire.
   */
  snapshot

}

class ReadConcern {
  /**
   * The level of the read concern.
   */
  level: Optional<ReadConcernLevel | String>
}

The read concern option is available for the following operations:

  • aggregate command
  • count command
  • distinct command
  • find command
  • mapReduce command where the out option is { inline: 1 }
  • parallelCollectionScan command
  • geoNear command
  • geoSearch command

Starting in MongoDB 4.2, an aggregate command with a write stage (e.g. $out, $merge) supports a readConcern; however, it does not support the "linearizable" level (attempting to do so will result in a server error).

Server versions before 4.2 do not support a readConcern at all for aggregate commands with a write stage.

The mapReduce command where the out option is anything other than { inline: 1 } does not support a readConcern.

Unknown Levels and Additional Options for String Based ReadConcerns

For forward compatibility, a driver MUST NOT raise an error when a user provides an unknown level or additional options. The driver relies on the server to validate levels and other contents of the read concern.

Server’s Default Read Concern

When a ReadConcern is created but no values are specified, it should be considered the server’s default ReadConcern.

:javascript:`readConcern: { }` is not the same as :javascript:`readConcern: { level: "local" }`. The former is the server’s default ReadConcern while the latter is the user explicitly specifying a ReadConcern with a level of "local".

Snapshot Read Concern

When ReadConcern level snapshot is used, atClusterTime may be specified to indicate the desired point in time for reading. find, aggregate and distinct operations executed with ReadConcern snapshot but without atClusterTime will return an atClusterTime timestamp in the server response. The returned atClusterTime timestamp can be used for subsequent read operations. ReadConcern level snapshot with atClusterTime is supported in find, aggregate and distinct operations.

On the Wire

Read Commands


Read commands that support ``ReadConcern`` take a named parameter spelled
(case-sensitively) ``readConcern``. See command documentation for further
examples.

If the ``Client``, ``Database``, or ``Collection`` being operated on either
has no ``ReadConcern`` set, or has the server default ``ReadConcern``
:javascript:`readConcern: { }`:

- If the  ``ReadConcern`` specified for the command is the server default
  :javascript:`readConcern: { }`, the driver MUST omit it when sending the command.
- If the ``ReadConcern`` specified for the command is any ``ReadConcern``
  besides the server default, including an explicitly specified ``ReadConcern``
  of :javascript:`readConcern: { level: "local" }`, the driver MUST include
  the ``ReadConcern`` when sending the command.

If the ``Client``, ``Database``, or ``Collection`` being operated on has
a non-default ``ReadConcern`` specified, then the driver MUST include the
command's ``ReadConcern`` when sending the command. This includes if the
command specifies the server default ``ReadConcern``, so that the command
can override the ``Client``, ``Database``, or ``Collection``'s ``ReadConcern``
to use the server default instead.
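
A sketch of the resulting decision; the helper names and the ``Optional`` alias are
illustrative, not part of this specification:

.. code:: typescript

  type Optional<T> = T | undefined;

  function readConcernToSend(commandReadConcern: ReadConcern,
                             inheritedReadConcern: ReadConcern): Optional<ReadConcern> {
    if (isServerDefault(inheritedReadConcern)) {
      // Nothing non-default is inherited: only send a non-default command-level read concern.
      return isServerDefault(commandReadConcern) ? undefined : commandReadConcern;
    }
    // A non-default read concern is inherited: always send the command's read concern,
    // even the server default, so it can override the inherited value.
    return commandReadConcern;
  }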


Generic Command Method

If your driver offers a generic RunCommand method on your database object, ReadConcern MUST NOT be applied automatically to any command. A user wishing to use a ReadConcern in a generic command must supply it manually.

Errors


``ReadConcern`` errors from a server MUST NOT be handled by a driver. There is
nothing a driver can do about them and any such errors will get propagated to
the user via normal error handling.


Location Specification
----------------------

Via Code

ReadConcern SHOULD be specifiable at the Client, Database, and Collection levels. Unless specified, the value MUST be inherited from its parent and SHOULD NOT be modifiable on an existing Client, Database or Collection. In addition, a driver MAY allow it to be specified on a per-operation basis in accordance with the CRUD specification.

For example:

.. code:: typescript

var client = new MongoClient({ readConcern: { level: "local" } });

// db1's readConcern level is "local".
var db1 = client.getDatabase("db1");

// col1's readConcern level is "local"
var col1 = db1.getCollection("col_name");

// db2's readConcern level is "majority".
var db2 = client.getDatabase("db_name", { readConcern: { level: "majority" } });

// col2's readConcern level is "majority"
var col2 = db2.getCollection("col_name");

// col3's readConcern level is the server’s default read concern
var col3 = db2.getCollection("col_name", { readConcern: { } });

Via Connection String


Options
    * ``readConcernLevel`` - defines the level for the read concern.

For example:

.. code::

    mongodb://server:27017/db?readConcernLevel=majority

Errors
------

MaxWireVersion < 4
    Only the server’s default ``ReadConcern`` is supported by ``MaxWireVersion``
    < 4. When using other ``readConcernLevels`` with clients reporting
    ``MaxWireVersion`` < 4, the driver MUST raise an error. This check MUST
    happen after server selection has occurred in the case of mixed version
    clusters. It is up to users to appropriately define a ``ReadPreference``
    such that intermittent errors do not occur.

.. note::

   ``ReadConcern`` is only supported for commands.

-------------
Write Concern
-------------

When a driver sends a write concern document to the server, the structure
of the write concern document MUST be as follows:

.. code:: typescript

  class WriteConcern {
    /**
     * If true, wait for the write operation to get committed to the
     * journal. When unspecified, a driver MUST NOT send "j".
     *
     * @see https://www.mongodb.com/docs/manual/core/write-concern/#journaled
     */
    j: Optional<Boolean>,

    /**
     * When an integer, specifies the number of nodes that should acknowledge
     * the write and MUST be greater than or equal to 0.
     * When a string, indicates tags. "majority" is defined, but users
     * could specify other custom error modes.
     * When not specified, a driver MUST NOT send "w".
     */
    w: Optional<Int32 | String>,

    /**
     * If provided, and the write concern is not satisfied within the
     * specified timeout (in milliseconds), the server will return an error
     * for the operation. When unspecified, a driver SHOULD NOT send "wtimeout".
     *
     * The value, if provided, MUST be greater than or equal to 0.
     *
     * @see https://www.mongodb.com/docs/manual/core/write-concern/#timeouts
     */
    wtimeout: Optional<Int64>
  }

When a driver provides a way for the application to specify the write concern,
the following data structure SHOULD be used. For acceptable naming and
deviation guidance, see the `CRUD specification
<../crud/crud.md#naming>`__.

.. code:: typescript

  class WriteConcern {
    /**
     * Corresponds to the "j" field in the WriteConcern document sent to
     * the server.
     */
    journal: Optional<Boolean>,

    /**
     * Corresponds to the "w" field in the WriteConcern document sent to
     * the server.
     */
    w: Optional<Int32 | String>,

    /**
     * Corresponds to the "wtimeout" field in the WriteConcern document sent to
     * the server.
     *
     * NOTE: This option is deprecated in favor of timeoutMS.
     */
    wtimeoutMS: Optional<Int64>
  }


FSync
-----

FSync SHOULD be considered deprecated.  Those drivers supporting the deprecated
``fsync`` option SHOULD treat ``fsync`` identically to ``journal`` in terms of
consistency with ``w`` and whether a ``WriteConcern`` that specifies ``fsync``
is acknowledged or unacknowledged.


wtimeoutMS
----------

``wtimeoutMS`` MUST be considered deprecated in favor of `timeoutMS
<../client-side-operations-timeout/client-side-operations-timeout.md#timeoutms>`__.


Server’s Default WriteConcern
-----------------------------

When a ``WriteConcern`` is created but no values are specified, it should be
considered the server’s default ``WriteConcern``.

The server has a settings field called ``getLastErrorDefaults`` which allows
a user to customize the default behavior of a ``WriteConcern``. Because of
this, :javascript:`writeConcern: { }` is not the same as
:javascript:`writeConcern: {w: 1}`. Sending :javascript:`{w:1}` overrides
that default. As another example, :javascript:`writeConcern: { }` is not the
same as :javascript:`writeConcern: {journal: false}`.


Inconsistent WriteConcern
-------------------------

Drivers MUST raise an error when an inconsistent ``WriteConcern`` is
specified. The following is an exhaustive list of inconsistent ``WriteConcerns``:

.. code:: typescript

   writeConcern = { w: 0, journal: true };


Unacknowledged WriteConcern
---------------------------

An ``Unacknowledged WriteConcern`` is when (``w`` equals 0) AND (``journal``
is not set or is ``false``).

These criteria indicate that the user does not care about errors from the server.

Examples:

.. code:: typescript

   writeConcern = { w: 0 }; // Unacknowledged
   writeConcern = { w: 0, journal: false }; // Unacknowledged
   writeConcern = { w: 0, wtimeoutMS: 100 }; // Unacknowledged
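
A minimal sketch of this check; the helper name is illustrative:

.. code:: typescript

   // Unacknowledged: w is 0 and journal is unset or false.
   function isUnacknowledged(wc: WriteConcern): boolean {
     return wc.w === 0 && (wc.journal === undefined || wc.journal === false);
   }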


On the Wire
-----------

OP_INSERT, OP_DELETE, OP_UPDATE

WriteConcern is implemented by sending the getLastError (GLE) command directly after the operation. Drivers SHOULD piggy-back the GLE onto the same buffer as the operation. Regardless, GLE MUST be sent on the same connection as the initial write operation.

When a user has not specified a WriteConcern or has specified the server’s default WriteConcern, drivers MUST send the GLE command without arguments. For example: :javascript:`{ getLastError: 1 }`

Drivers MUST NOT send a GLE for an Unacknowledged WriteConcern. In this instance, the server will not send a reply.

See the getLastError command documentation for other formatting.

Write Commands


The ``insert``, ``delete``, and ``update`` commands take a named parameter,
``writeConcern``. See the command documentation for further examples.

When a user has not specified a ``WriteConcern`` or has specified the server’s
default ``WriteConcern``, drivers MUST omit the ``writeConcern`` parameter from
the command.

All other ``WriteConcerns``, including the ``Unacknowledged WriteConcern``,
MUST be sent with the ``writeConcern`` parameter.

.. note::
    Drivers MAY use ``OP_INSERT``, ``OP_UPDATE``, and ``OP_DELETE`` when an
    ``Unacknowledged WriteConcern`` is used.

Generic Command Method

If your driver offers a generic RunCommand method on your database object, WriteConcern MUST NOT be applied automatically to any command. A user wishing to use a WriteConcern in a generic command must manually include it in the command document passed to the method.

The generic command method MUST NOT check the user's command document for a WriteConcern nor check whether the server is new enough to support a write concern for the command. The method simply sends the user's command to the server as-is.

Find And Modify


The ``findAndModify`` command takes a named parameter, ``writeConcern``. See
command documentation for further examples.

If writeConcern is specified for the Collection, ``writeConcern`` MUST be
omitted when sending ``findAndModify`` with MaxWireVersion < 4.

If the findAndModify helper accepts writeConcern as a parameter, the driver
MUST raise an error with MaxWireVersion < 4.

.. note::

    Driver documentation SHOULD include a warning in their server 3.2
    compatible releases that an elevated ``WriteConcern`` may cause
    performance degradation when using ``findAndModify``. This is because
    ``findAndModify`` will now be honoring a potentially high latency setting
    where it did not before.

Other commands that write

Command helper methods for commands that write, other than those discussed above, MAY accept a write concern or write concern options in their parameter list. If the helper accepts a write concern, the driver MUST error if the selected server's MaxWireVersion < 5 and a write concern has explicitly been specified.

Helper methods that apply the write concern inherited from the Collection or Database SHOULD check whether the selected server's MaxWireVersion >= 5 and, if so, include the inherited write concern in the command on the wire. If the selected server's MaxWireVersion < 5, these methods SHOULD silently omit the write concern from the command on the wire.
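
A sketch of that inherited-write-concern handling for such a helper; the function and
parameter names are illustrative:

.. code:: typescript

  // Applies an inherited write concern to a command document (e.g. createIndexes)
  // only when the selected server is new enough to accept it.
  function appendInheritedWriteConcern(command: Record<string, unknown>,
                                       inherited: WriteConcern,
                                       maxWireVersion: number): void {
    if (maxWireVersion >= 5) {
      command["writeConcern"] = inherited;
    }
    // For MaxWireVersion < 5, the inherited write concern is silently omitted.
  }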

These commands that write are:

  • aggregate with write stage (e.g. $out, $merge)
  • copydb
  • create
  • createIndexes
  • drop
  • dropDatabase
  • dropIndexes
  • mapReduce where the out option is not { inline: 1 }
  • clone
  • cloneCollection
  • cloneCollectionAsCapped
  • collMod
  • convertToCapped
  • renameCollection
  • createUser
  • updateUser
  • dropUser
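A non-normative sketch of the MaxWireVersion gating described above, for a helper that may receive an explicit write concern and otherwise applies the inherited one (names and types are illustrative):

.. code:: typescript

    type Document = Record<string, unknown>;

    function applyWriteConcern(
      command: Document,
      inheritedWriteConcern: Document | undefined, // from the Collection/Database; undefined if server default
      explicitWriteConcern: Document | undefined,  // passed by the caller, if the helper accepts one
      maxWireVersion: number
    ): Document {
      if (explicitWriteConcern !== undefined) {
        if (maxWireVersion < 5) {
          throw new Error("The selected server does not support a write concern for this command");
        }
        return { ...command, writeConcern: explicitWriteConcern };
      }
      if (inheritedWriteConcern !== undefined && maxWireVersion >= 5) {
        return { ...command, writeConcern: inheritedWriteConcern };
      }
      return command; // silently omit the write concern on older servers
    }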

Errors


In general, server errors associated with ``WriteConcern`` return successful (``"ok": 1``) responses
with a ``writeConcernError`` field indicating the issue. For example,

.. code:: typescript

    rs0:PRIMARY> db.runCommand({insert: "foo", documents: [{x:1}], writeConcern: { w: "blah"}})
    {
      n: 1,
      opTime: {
        ts: Timestamp(1583026145, 1),
        t: NumberLong(5)
      },
      electionId: ObjectId("7fffffff0000000000000005"),
      ok: 1,
      writeConcernError: {
        code: 79,
        codeName: "UnknownReplWriteConcern",
        errmsg: "No write concern mode named 'blah' found in replica set configuration",
        errInfo: {
          writeConcern: {
            w: "blah",
            wtimeout: 0,
            provenance: "clientSupplied"
          }
        }
      },
      $clusterTime: {
        clusterTime: Timestamp(1583026145, 1),
        signature: {
          hash: BinData(0, "AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
          keyId: NumberLong(0)
        }
      },
      operationTime: Timestamp(1583026145, 1)
    }

Drivers SHOULD parse server replies for a "writeConcernError" field and report
the error only in the command-specific helper methods for commands that write,
from the list above. For example, helper methods for "findAndModify" or
"aggregate" SHOULD parse the server reply for "writeConcernError".

Drivers SHOULD report writeConcernErrors however they report other server
errors: by raising an exception, returning "false", or another idiom that is
consistent with other server errors. Drivers SHOULD report writeConcernErrors
with a ``WriteConcernError`` defined in the
`CRUD specification <../crud/crud.md#error-handling>`__.

Drivers SHOULD NOT parse server replies for "writeConcernError" in generic
command methods.
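A minimal sketch of this check, performed only by the command-specific helper methods for commands that write (field access and error shape are illustrative):

.. code:: typescript

    function checkWriteConcernError(reply: Record<string, any>): void {
      if (reply.writeConcernError !== undefined) {
        const { code, codeName, errmsg, errInfo } = reply.writeConcernError;
        // Report it the same way other server errors are reported, e.g. by raising.
        throw Object.assign(new Error(errmsg), { code, codeName, errInfo });
      }
    }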

(Reporting of writeConcernErrors is more complex for bulk operations,
see the Bulk API Spec.)

writeConcernError Examples
^^^^^^^^^^^^^^^^^^^^^^^^^^

The set of possible writeConcernErrors is quite large because they can include
errors caused by shutdown, stepdown, interruption, maxTimeMS, and wtimeout.
This section attempts to list all known error codes that may appear
within a writeConcernError but may not be exhaustive. Note that some errors
have been abbreviated:

- ``{ok:1, writeConcernError: {code: 91, codeName: "ShutdownInProgress"}}``
- ``{ok:1, writeConcernError: {code: 189, codeName: "PrimarySteppedDown"}}``
- ``{ok:1, writeConcernError: {code: 11600, codeName: "InterruptedAtShutdown"}}``
- ``{ok:1, writeConcernError: {code: 11601, codeName: "Interrupted"}}``
- ``{ok:1, writeConcernError: {code: 11602, codeName: "InterruptedDueToReplStateChange"}}``
- ``{ok:1, writeConcernError: {code: 64, codeName: "WriteConcernFailed", errmsg: "waiting for replication timed out", errInfo: {wtimeout: True}}}``
- ``{ok:1, writeConcernError: {code: 64, codeName: "WriteConcernFailed", errmsg: "multiple errors reported : {...} at shardName1 :: and :: {...} at shardName2"}}`` [#]_
- ``{ok:1, writeConcernError: {code: 50, codeName: "MaxTimeMSExpired"}}``
- ``{ok:1, writeConcernError: {code: 100, codeName: "UnsatisfiableWriteConcern", errmsg: "Not enough data-bearing nodes"}}``
- ``{ok:1, writeConcernError: {code: 79, codeName: "UnknownReplWriteConcern"}}``

Note also that it is possible for a writeConcernError to be attached to a
command failure. For example:

- ``{ok:0, code: 251, codeName: "NoSuchTransaction", writeConcernError: {code: 91, codeName: "ShutdownInProgress"}}`` [#]_

.. [#] This is only possible in a sharded cluster. When a write is routed to
       multiple shards and more than one shard returns a writeConcernError,
       then mongos will construct a new writeConcernError with the
       "WriteConcernFailed" error code and an errmsg field contains the
       stringified writeConcernError from each shard. Note that each shard may
       return a different writeConcernError.

.. [#] See https://jira.mongodb.org/browse/SERVER-38850

Location Specification
----------------------

Via Code

WriteConcern SHOULD be specifiable at the Client, Database, and Collection levels. Unless specified, the value MUST be inherited from its parent and SHOULD NOT be modifiable on an existing Client, Database or Collection. In addition, a driver MAY allow it to be specified on a per-operation basis in accordance with the CRUD specification.

For example:

.. code:: typescript

    var client = new MongoClient({ writeConcern: { w: 2 } });

    // db1's writeConcern is {w: 2}.
    var db1 = client.getDatabase("db1");

    // col1's writeConcern is {w: 2}.
    var col1 = db1.getCollection("col_name");

    // db2's writeConcern is {journal: true}.
    var db2 = client.getDatabase("db_name", { writeConcern: { journal: true } });

    // col2's writeConcern is {journal: true}.
    var col2 = db2.getCollection("col_name");

    // col3's writeConcern is the server’s default write concern.
    var col3 = db2.getCollection("col_name", { writeConcern: { } });

    // Override col3's writeConcern.
    col3.drop({ writeConcern: { w: 3 } });

Via Connection String


Options
    * ``w`` - corresponds to ``w`` in the class definition.
    * ``journal`` - corresponds to ``journal`` in the class definition.
    * ``wtimeoutMS`` - corresponds to ``wtimeoutMS`` in the class definition.

For example:

.. code::

    mongodb://server:27017/db?w=3

    mongodb://server:27017/db?journal=true

    mongodb://server:27017/db?wtimeoutMS=1000

    mongodb://server:27017/db?w=majority&wtimeoutMS=1000
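As a hypothetical illustration, a client constructed from the last URI above behaves as if it had been configured in code with the equivalent write concern:

.. code:: typescript

    // Equivalent to new MongoClient({ writeConcern: { w: "majority", wtimeoutMS: 1000 } }).
    var client = new MongoClient("mongodb://server:27017/db?w=majority&wtimeoutMS=1000");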



Backwards Compatibility
=======================

There should be no backwards compatibility concerns. This specification merely
deals with how to specify read and write concerns.

Test Plan
=========

Yaml tests are located here: https://github.com/mongodb/specifications/tree/master/source/read-write-concern/tests

Below are English descriptions of other items that should be tested:

-----------
ReadConcern
-----------

1. Commands supporting a read concern MUST raise an error when MaxWireVersion
   is less than 4 and a non-default, non-local read concern is specified.
2. Commands supporting a read concern MUST NOT send the default read concern
   to the server.
3. Commands supporting a read concern MUST send any non-default read concern
   to the server.

------------
WriteConcern
------------

1. Commands supporting a write concern MUST NOT send the default write concern
   to the server.
2. Commands supporting a write concern MUST send any non-default acknowledged
   write concern to the server, either in the command or as a getLastError.
3. On ServerVersion less than 2.6, drivers MUST NOT send a getLastError command
   for an Unacknowledged write concern.
4. FindAndModify helper methods MUST NOT send a write concern when the
   MaxWireVersion is less than 4.
5. Helper methods for other commands that write MUST NOT send a write concern
   when the MaxWireVersion is less than 5.

Reference Implementation
========================

These are currently under construction.


Q & A
=====

Q: Why is specifying a non-default ``ReadConcern`` for servers < 3.2 an error while a non-default write concern gets ignored in ``findAndModify``?
  ``findAndModify`` is an existing command and since ``WriteConcern`` may be
  defined globally, anyone using ``findAndModify`` in their applications with
  a non-default ``WriteConcern`` defined globally would have all their
  ``findAndModify`` operations fail.

Q: Why does a driver send :javascript:`{ readConcern: { level: "local" } }` to the server when that is the server’s default?
  First, to mirror how ``WriteConcern`` already works, ``ReadConcern() does not
  equal ReadConcern(level=local)`` in the same way that ``WriteConcern() does
  not equal WriteConcern(w=1)``. This is true for ``WriteConcern`` because
  the server’s default could be set differently. While this setting does not
  currently exist for ``ReadConcern``, it is a possible eventuality and it
  costs a driver nothing to be prepared for it. Second, it makes sense that
  if a user doesn’t specify a ``ReadConcern``, we don’t send one and if a
  user does specify a ``ReadConcern``, we do send one. If the user specifies
  level="local", for instance, we send it.

Changelog
=========

:2015-10-16: ReadConcern of local is no longer allowed to be used when talking
             with MaxWireVersion < 4.
:2016-05-20: Added note about helpers for commands that write accepting a
             writeConcern parameter.
:2016-06-17: Added "linearizable" to ReadConcern levels.
:2016-07-15: Command-specific helper methods for commands that write SHOULD
             check the server's MaxWireVersion and decide whether to send
             writeConcern. Advise drivers to parse server replies for
             writeConcernError and raise an exception if found, only in
             command-specific helper methods that take a writeConcern parameter,
             not in generic command methods. Don't mention obscure commands with
             no helpers.
:2016-08-06: Further clarify that command-specific helper methods for commands
             that write take write concern options in their parameter lists, and
             relax from SHOULD to MAY.
:2017-03-13: reIndex silently ignores writeConcern in MongoDB 3.4 and returns an
             error if writeConcern is included with MongoDB 3.5+. See
             `SERVER-27891 <https://jira.mongodb.org/browse/SERVER-27891>`_.
:2017-11-17: Added list of commands that support readConcern
:2017-12-18: Added "available" to ReadConcern level.
:2017-05-29: Added user management commands to list of commands that write
:2019-01-29: Added section listing all known examples of writeConcernError.
:2019-06-07: Clarify language for aggregate and mapReduce commands that write.
:2019-10-31: Explicitly define write concern option mappings.
:2020-02-13: Inconsistent write concern must be considered an error.
:2021-04-07: Updated to use hello command.
:2021-06-15: Added "snapshot" to ReadConcern level
:2021-07-12: Add missing commas after ReadConcernLevel enum values
:2022-01-19: Deprecate wTimeoutMS in favor of timeoutMS.
:2022-10-05: Remove spec front matter and reformat changelog.

Change Streams

  • Status: Accepted
  • Minimum Server Version: 3.6

Abstract

As of version 3.6 of the MongoDB server a new $changeStream pipeline stage is supported in the aggregation framework. Specifying this stage first in an aggregation pipeline allows users to request that notifications are sent for all changes to a particular collection. This specification defines the means for creating change streams in drivers, as well as behavior during failure scenarios.

Specification

Definitions

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Terms

Resumable Error

An error is considered resumable if it meets any of the criteria listed below. For any criteria with wire version constraints, the driver MUST use the wire version of the connection used to do the initial aggregate or the getMore that resulted in the error. Drivers MUST NOT check the wire version of the server after the command has been executed when checking these constraints.

  • Any error encountered which is not a server error (e.g. a timeout error or network error)

  • A server error with code 43 (CursorNotFound)

  • For servers with wire version 9 or higher (server version 4.4 or higher), any server error with the ResumableChangeStreamError error label.

  • For servers with wire version less than 9, a server error with one of the following codes:

    | Error Name                      | Error Code |
    | ------------------------------- | ---------- |
    | HostUnreachable                 | 6          |
    | HostNotFound                    | 7          |
    | NetworkTimeout                  | 89         |
    | ShutdownInProgress              | 91         |
    | PrimarySteppedDown              | 189        |
    | ExceededTimeLimit               | 262        |
    | SocketException                 | 9001       |
    | NotWritablePrimary              | 10107      |
    | InterruptedAtShutdown           | 11600      |
    | InterruptedDueToReplStateChange | 11602      |
    | NotPrimaryNoSecondaryOk         | 13435      |
    | NotPrimaryOrSecondary           | 13436      |
    | StaleShardVersion               | 63         |
    | StaleEpoch                      | 150        |
    | StaleConfig                     | 13388      |
    | RetryChangeStream               | 234        |
    | FailedToSatisfyReadPreference   | 133        |

An error on an aggregate command is not a resumable error. Only errors on a getMore command may be considered resumable errors.
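A non-normative sketch of these criteria (the error accessors `isServerError`, `code`, and `hasErrorLabel` are illustrative names, not a prescribed API):

const RESUMABLE_CODES = new Set([
  6, 7, 89, 91, 189, 262, 9001, 10107, 11600, 11602, 13435, 13436, 63, 150, 13388, 234, 133,
]);

function isResumableError(error: any, maxWireVersion: number): boolean {
  if (!error.isServerError) return true;                // e.g. timeout or network error
  if (error.code === 43) return true;                   // CursorNotFound
  if (maxWireVersion >= 9) return error.hasErrorLabel("ResumableChangeStreamError");
  return RESUMABLE_CODES.has(error.code);
}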

Guidance

For naming and deviation guidance, see the CRUD specification.

Server Specification

Response Format

NOTE: The examples in this section are provided for illustrative purposes, and are subject to change without warning. Drivers that provide a static type to represent ChangeStreamDocument MAY include additional fields in their API.

If an aggregate command with a $changeStream stage completes successfully, the response contains documents with the following structure:

class ChangeStreamDocument {
  /**
   * The id functions as an opaque token for use when resuming an interrupted
   * change stream.
   */
  _id: Document;

  /**
   * Describes the type of operation represented in this change notification.
   *
   * @note: Whether a change is reported as an event of the operation type
   * `update` or `replace` is a server implementation detail.
   *
   * @note: The server will add new `operationType` values in the future and drivers
   * MUST NOT err when they encounter a new `operationType`. Unknown `operationType`
   * values may be represented by "unknown" or the literal string value.
   */
  operationType: "insert" 
                | "update" 
                | "replace" 
                | "delete" 
                | "invalidate" 
                | "drop" 
                | "dropDatabase" 
                | "rename" 
                | "createIndexes"
                | "dropIndexes" 
                | "modify"
                | "create" 
                | "shardCollection" 
                | "refineCollectionShardKey" 
                | "reshardCollection";

  /**
   * Contains two fields: "db" and "coll" containing the database and
   * collection name in which the change happened.
   *
   * @note: Drivers MUST NOT err when extra fields are encountered in the `ns` document
   * as the server may add new fields in the future such as `viewOn`.
   */
  ns: Document;

  /**
   * Only present for ops of type 'rename'.
   *
   * The namespace, in the same format as `ns`, that a collection has been renamed to.
   */
  to: Optional<Document>;

  /**
   * Only present for ops of type 'rename', 'create', 'modify', 'createIndexes', 'dropIndexes', 'shardCollection', 'reshardCollection', 'refineCollectionShardKey'.
   * Only present when the `showExpandedEvents` change stream option is enabled.
   *
   * A description of the operation.
   * 
   * @since 6.0.0
   */
  operationDescription: Optional<Document>

  /**
   * Only present for ops of type 'insert', 'update', 'replace', and
   * 'delete'.
   *
   * For unsharded collections this contains a single field, _id, with the
   * value of the _id of the document updated.  For sharded collections,
   * this will contain all the components of the shard key in order,
   * followed by the _id if the _id isn't part of the shard key.
   */
  documentKey: Optional<Document>;

  /**
   * Only present for ops of type 'update'.
   */
  updateDescription: Optional<UpdateDescription>;

  /**
   * Always present for operations of type 'insert' and 'replace'. Also
   * present for operations of type 'update' if the user has specified
   * 'updateLookup' for the 'fullDocument' option when creating the change
   * stream.
   *
   * For operations of type 'insert' and 'replace', this key will contain the
   * document being inserted or the new version of the document that is
   * replacing the existing document, respectively.
   *
   * For operations of type 'update', this key will contain a copy of the full
   * version of the document from some point after the update occurred. If the
   * document was deleted since the update happened, it will be null.
   *
   * Contains the point-in-time post-image of the modified document if the
   * post-image is available and either 'required' or 'whenAvailable' was
   * specified for the 'fullDocument' option when creating the change stream.
   * A post-image is always available for 'insert' and 'replace' events.
   */
  fullDocument: Document | null;

  /**
   * Contains the pre-image of the modified or deleted document if the
   * pre-image is available for the change event and either 'required' or
   * 'whenAvailable' was specified for the 'fullDocumentBeforeChange' option
   * when creating the change stream. If 'whenAvailable' was specified but the
   * pre-image is unavailable, this will be explicitly set to null.
   */
  fullDocumentBeforeChange: Document | null;

  /**
   * The wall time from the mongod that the change event originated from.
   * Populated for server versions 6.0 and above.
   */
  wallTime: Optional<DateTime>;

  /**
   * The `ui` field from the oplog entry corresponding to the change event.
   * 
   * Only present when the `showExpandedEvents` change stream option is enabled and for the following events
   *  - 'insert'
   *  - 'update'
   *  - 'delete'
   *  - 'createIndexes'
   *  - 'dropIndexes'
   *  - 'modify'
   *  - 'drop'
   *  - 'create'
   *  - 'shardCollection'
   *  - 'reshardCollection'
   *  - 'refineCollectionShardKey'
   *  
   * This field is a value of binary subtype 4 (UUID).
   *  
   * @since 6.0.0
   */
  collectionUUID: Optional<Binary>;

  /**
   * The cluster time at which the change occurred.
   */
  clusterTime: Timestamp;

}

class UpdateDescription {
  /**
   * A document containing key:value pairs of names of the fields that were
   * changed (excluding the fields reported via `truncatedArrays`), and the new value for those fields.
   *
   * Despite array fields reported via `truncatedArrays` being excluded from this field,
   * changes to fields of the elements of the array values may be reported via this field.
   * Example:
   *   original field:
   *     "arrayField": ["foo", {"a": "bar"}, 1, 2, 3]
   *   updated field:
   *     "arrayField": ["foo", {"a": "bar", "b": 3}]
   *   a potential corresponding UpdateDescription:
   *     {
   *       "updatedFields": {
   *         "arrayField.1.b": 3
   *       },
   *       "removedFields": [],
   *       "truncatedArrays": [
   *         {
   *           "field": "arrayField",
   *           "newSize": 2
   *         }
   *       ]
   *     }
   *
   * Modifications to array elements are expressed via the dot notation (https://www.mongodb.com/docs/manual/core/document/#document-dot-notation).
   * Example: an `update` which sets the element with index 0 in the array field named arrayField to 7 is reported as
   *   "updatedFields": {"arrayField.0": 7}
   */
  updatedFields: Document;

  /**
   * An array of field names that were removed from the document.
   */
  removedFields: Array<String>;

  /**
   * Truncations of arrays may be reported using one of the following methods:
   * either via this field or via the 'updatedFields' field. In the latter case the entire array is considered to be replaced.
   *
   * The structure of documents in this field is
   *   {
   *      "field": <string>,
   *      "newSize": <int>
   *   }
   * Example: an `update` which shrinks the array arrayField.0.nestedArrayField from size 8 to 5 may be reported as
   *   "truncatedArrays": [{"field": "arrayField.0.nestedArrayField", "newSize": 5}]
   *
   * @note The method used to report a truncation is a server implementation detail.
   * @since 4.7.0
   */
  truncatedArrays: Array<Document>;

  /**
   * A document containing a map that associates an update path to an array containing the path components used in the update document. This data
   * can be used in combination with the other fields in an `UpdateDescription` to determine the actual path in the document that was updated. This is 
   * necessary in cases where a key contains dot-separated strings (i.e., `{ "a.b": "c" }`) or a document contains a numeric literal string key
   * (i.e., `{ "a": { "0": "a" } }`. Note that in this scenario, the numeric key can't be the top level key, because `{ "0": "a" }` is not ambiguous - 
   * update paths would simply be `'0'` which is unambiguous because BSON documents cannot have arrays at the top level.).
   * 
   * Each entry in the document maps an update path to an array which contains the actual path used when the document was updated.  
   * For example, given a document with the following shape `{ "a": { "0": 0 } }` and an update of `{ $inc: { "a.0": 1 } }`, `disambiguatedPaths` would
   * look like the following:
   *   {
   *      "a.0": ["a", "0"]
   *   }
   * 
   * In each array, all elements will be returned as strings with the exception of array indices, which will be returned as 32 bit integers.
   * 
   * @since 6.1.0
   */
  disambiguatedPaths: Optional<Document>
}

The responses to a change stream aggregate or getMore have the following structures:

/**
 * Response to a successful aggregate.
 */
{
    ok: 1,
    cursor: {
       ns: String,
       id: Int64,
       firstBatch: Array<ChangeStreamDocument>,
       /**
        * postBatchResumeToken is returned in MongoDB 4.0.7 and later.
        */
       postBatchResumeToken: Document
    },
    operationTime: Timestamp,
    $clusterTime: Document,
}

/**
 * Response to a successful getMore.
 */
{
    ok: 1,
    cursor: {
       ns: String,
       id: Int64,
       nextBatch: Array<ChangeStreamDocument>
       /**
        * postBatchResumeToken is returned in MongoDB 4.0.7 and later.
        */
       postBatchResumeToken: Document
    },
    operationTime: Timestamp,
    $clusterTime: Document,
}

Driver API

interface ChangeStream extends Iterable<Document> {
  /**
   * The cached resume token
   */
  private resumeToken: Document;

  /**
   * The pipeline of stages to append to an initial ``$changeStream`` stage
   */
  private pipeline: Array<Document>;

  /**
   * The options provided to the initial ``$changeStream`` stage
   */
  private options: ChangeStreamOptions;

  /**
   * The read preference for the initial change stream aggregation, used
   * for server selection during an automatic resume.
   */
  private readPreference: ReadPreference;
}

interface Collection {
  /**
   * @returns a change stream on a specific collection.
   */
  watch(pipeline: Document[], options: Optional<ChangeStreamOptions>): ChangeStream;
}

interface Database {
  /**
   * Allows a client to observe all changes in a database.
   * Excludes system collections.
   * @returns a change stream on all collections in a database
   * @since 4.0
   * @see https://www.mongodb.com/docs/manual/reference/system-collections/
   */
  watch(pipeline: Document[], options: Optional<ChangeStreamOptions>): ChangeStream;
}

interface MongoClient {
  /**
   * Allows a client to observe all changes in a cluster.
   * Excludes system collections.
   * Excludes the "config", "local", and "admin" databases.
   * @since 4.0
   * @returns a change stream on all collections in all databases in a cluster
   * @see https://www.mongodb.com/docs/manual/reference/system-collections/
   */
  watch(pipeline: Document[], options: Optional<ChangeStreamOptions>): ChangeStream;
}

class ChangeStreamOptions {
  /**
   * Allowed values: 'default', 'updateLookup', 'whenAvailable', 'required'.
   *
   * The default is to not send a value, which is equivalent to 'default'. By
   * default, the change notification for partial updates will include a delta
   * describing the changes to the document.
   *
   * When set to 'updateLookup', the change notification for partial updates
   * will include both a delta describing the changes to the document as well
   * as a copy of the entire document that was changed from some time after
   * the change occurred.
   *
   * When set to 'whenAvailable', configures the change stream to return the
   * post-image of the modified document for replace and update change events
   * if the post-image for this event is available.
   *
   * When set to 'required', the same behavior as 'whenAvailable' except that
   * an error is raised if the post-image is not available.
   *
   * For forward compatibility, a driver MUST NOT raise an error when a user
   * provides an unknown value. The driver relies on the server to validate
   * this option.
   *
   * @note this is an option of the `$changeStream` pipeline stage.
   */
  fullDocument: Optional<String>;

  /**
   * Allowed values: 'whenAvailable', 'required', 'off'.
   *
   * The default is to not send a value, which is equivalent to 'off'.
   *
   * When set to 'whenAvailable', configures the change stream to return the
   * pre-image of the modified document for replace, update, and delete change
   * events if it is available.
   *
   * When set to 'required', the same behavior as 'whenAvailable' except that
   * an error is raised if the pre-image is not available.
   *
   * For forward compatibility, a driver MUST NOT raise an error when a user
   * provides an unknown value. The driver relies on the server to validate
   * this option.
   *
   * @note this is an option of the `$changeStream` pipeline stage.
   */
  fullDocumentBeforeChange: Optional<String>;

  /**
   * Specifies the logical starting point for the new change stream.
   *
   * @note this is an option of the `$changeStream` pipeline stage.
   */
  resumeAfter: Optional<Document>;

  /**
   * The maximum amount of time for the server to wait on new documents to satisfy
   * a change stream query.
   *
   * This is the same field described in FindOptions in the CRUD spec.
   *
   * @see https://github.com/mongodb/specifications/blob/master/source/crud/crud.md#read
   * @note this option is an alias for `maxTimeMS`, used on `getMore` commands
   * @note this option is not set on the `aggregate` command nor `$changeStream` pipeline stage
   */
  maxAwaitTimeMS: Optional<Int64>;

  /**
   * The number of documents to return per batch.
   *
   * This option is sent only if the caller explicitly provides a value. The
   * default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate
   * @note this is an aggregation command option
   */
  batchSize: Optional<Int32>;

  /**
   * Specifies a collation.
   *
   * This option is sent only if the caller explicitly provides a value. The
   * default is to not send a value.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate
   * @note this is an aggregation command option
   */
  collation: Optional<Document>;

  /**
   * The change stream will only provide changes that occurred at or after the
   * specified timestamp. Any command run against the server will return
   * an operation time that can be used here.
   *
   * @since 4.0
   * @see https://www.mongodb.com/docs/manual/reference/method/db.runCommand/
   * @note this is an option of the `$changeStream` pipeline stage.
   */
  startAtOperationTime: Optional<Timestamp>;

  /**
   * Similar to `resumeAfter`, this option takes a resume token and starts a
   * new change stream returning the first notification after the token.
   * This will allow users to watch collections that have been dropped and recreated
   * or newly renamed collections without missing any notifications.
   *
   * The server will report an error if `startAfter` and `resumeAfter` are both specified.
   *
   * @since 4.1.1
   * @see https://www.mongodb.com/docs/manual/changeStreams/#change-stream-start-after
   * @note this is an option of the `$changeStream` pipeline stage.
   */
   startAfter: Optional<Document>;

  /**
   * Enables users to specify an arbitrary comment to help trace the operation through
   * the database profiler, currentOp and logs. The default is to not send a value.
   *
   * The comment can be any valid BSON type for server versions 4.4 and above.
   * Server versions prior to 4.4 only support string as comment,
   * and providing a non-string type will result in a server-side error.
   *
   * If a comment is provided, drivers MUST attach this comment to all
   * subsequent getMore commands run on the same cursor for server
   * versions 4.4 and above. For server versions below 4.4 drivers MUST NOT
   * attach a comment to getMore commands.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/aggregate
   * @note this is an aggregation command option
   */
  comment: Optional<any>

  /**
   * Enables the server to send the 'expanded' list of change stream events.
   * The list of additional events included with this flag set are
   * - createIndexes
   * - dropIndexes
   * - modify
   * - create
   * - shardCollection
   * - reshardCollection
   * - refineCollectionShardKey
   * 
   * This flag is available in server versions greater than 6.0.0. `reshardCollection` and
   * `refineCollectionShardKey` events are not available until server version 6.1.0.
   * 
   * @note this is an option of the change stream pipeline stage
   */
  showExpandedEvents: Optional<Boolean>
}

NOTE: The set of ChangeStreamOptions may grow over time.

Helper Method

The driver API consists of a ChangeStream type, as well as three helper methods. All helpers MUST return a ChangeStream instance. Implementers MUST document that helper methods are preferred to running a raw aggregation with a $changeStream stage, for the purpose of supporting resumability.

The helper methods must construct an aggregation command with a REQUIRED initial $changeStream stage. A driver MUST NOT throw a custom exception if multiple $changeStream stages are present (e.g. if a user also passed $changeStream in the pipeline supplied to the helper), as the server will return an error.

The helper methods MUST determine a read concern for the operation in accordance with the Read and Write Concern specification. The initial implementation of change streams on the server requires a 'majority' read concern or no read concern. Drivers MUST document this requirement. Drivers SHALL NOT throw an exception if any other read concern is specified, but instead should depend on the server to return an error.

The stage has the following shape:

{ $changeStream: ChangeStreamOptions }

The first parameter of the helpers specifies an array of aggregation pipeline stages which MUST be appended to the initial stage. Drivers MUST support an empty pipeline. Languages which support default parameters MAY specify an empty array as the default value for this parameter. Drivers SHOULD otherwise make specification of a pipeline as similar as possible to the aggregate CRUD method.

Additionally, implementers MAY provide a form of these methods which require no parameters, assuming no options and no additional stages beyond the initial $changeStream stage:

for change in db.collection.watch():
    print(change)
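When stages are supplied, they are appended to the required initial stage. For illustration, a user pipeline of [{ $match: { operationType: "insert" } }] would produce an aggregation pipeline like:

[
  { $changeStream: {} },
  { $match: { operationType: "insert" } }
]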

Presently change streams support only a subset of available aggregation stages:

  • $match
  • $project
  • $addFields
  • $replaceRoot
  • $redact

A driver MUST NOT throw an exception if any unsupported stage is provided, but instead depend on the server to return an error.

A driver MUST NOT throw an exception if a user adds, removes, or modifies fields using $project. The server will produce an error if _id is projected out, but a user should otherwise be able to modify the shape of the change stream event as desired. This may require the result to be deserialized to a BsonDocument or custom-defined type rather than a ChangeStreamDocument. It is the responsibility of the user to ensure that the deserialized type is compatible with the specified $project stage.

The aggregate helper methods MUST have no new logic related to the $changeStream stage. Drivers MUST be capable of handling TAILABLE_AWAIT cursors from the aggregate command in the same way they handle such cursors from find.

Collection.watch helper

Returns a ChangeStream on a specific collection

Command syntax:

{
  aggregate: 'collectionName',
  pipeline: [{$changeStream: {...}}, ...],
  ...
}

Database.watch helper
  • Since: 4.0

Returns a ChangeStream on all collections in a database.

Command syntax:

{
  aggregate: 1,
  pipeline: [{$changeStream: {...}}, ...],
  ...
}

Drivers MUST use the ns returned in the aggregate command to set the collection option in subsequent getMore commands.

MongoClient.watch helper
  • Since: 4.0

Returns a ChangeStream on all collections in all databases in a cluster

Command syntax:

{
  aggregate: 1,
  pipeline: [{$changeStream: {allChangesForCluster: true, ...}}, ...],
  ...
}

The helper MUST run the command against the admin database.

Drivers MUST use the ns returned in the aggregate command to set the collection option in subsequent getMore commands.

ChangeStream

A ChangeStream is an abstraction of a TAILABLE_AWAIT cursor, with support for resumability. Implementers MAY choose to implement a ChangeStream as an extension of an existing tailable cursor implementation. If the ChangeStream is implemented as a type which owns a tailable cursor, then the implementer MUST provide a manner of closing the change stream, as well as satisfy the requirements of extending Iterable<Document>. If your language has an idiomatic way of disposing of resources you MAY choose to implement that in addition to, or instead of, an explicit close method.

A change stream MUST track the last resume token, per Updating the Cached Resume Token.

Drivers MUST raise an error on the first document received without a resume token (e.g. the user has removed _id with a pipeline stage), and close the change stream. The error message SHOULD resemble "Cannot provide resume functionality when the resume token is missing".

A change stream MUST attempt to resume a single time if it encounters any resumable error per Resumable Error. A change stream MUST NOT attempt to resume on any other type of error.

In addition to tracking a resume token, change streams MUST also track the read preference specified when the change stream was created. In the event of a resumable error, a change stream MUST perform server selection with the original read preference using the rules for server selection before attempting to resume.

Single Server Topologies

Presently, change streams cannot be initiated on single server topologies as they do not have an oplog. Drivers MUST NOT throw an exception in this scenario, but instead rely on an error returned from the server. This allows the server to seamlessly introduce support in the future without requiring changes to driver code.

startAtOperationTime
  • Since: 4.0

startAtOperationTime specifies that a change stream will only return changes that occurred at or after the specified Timestamp.

The server expects startAtOperationTime as a BSON Timestamp. Drivers MUST allow users to specify a startAtOperationTime option in the watch helpers. They MUST allow users to specify this value as a raw Timestamp.
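For example (illustrative API usage), a user might supply the operationTime returned by a previous command:

// `operationTime` is a Timestamp obtained from a prior command's response.
const stream = db.collection.watch([], { startAtOperationTime: operationTime });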

startAtOperationTime, resumeAfter, and startAfter are all mutually exclusive; if any two are set, the server will return an error. Drivers MUST NOT throw a custom error, and MUST defer to the server error.

The ChangeStream MUST save the operationTime from the initial aggregate response when the following criteria are met:

  • None of startAtOperationTime, resumeAfter, startAfter were specified in the ChangeStreamOptions.
  • The max wire version is >= 7.
  • The initial aggregate response had no results.
  • The initial aggregate response did not include a postBatchResumeToken.

resumeAfter

resumeAfter is used to resume a ChangeStream that has been stopped to ensure that only changes starting with the log entry immediately after the provided token will be returned. If the resume token specified does not exist, the server will return an error.

Resume Process

Once a ChangeStream has encountered a resumable error, it MUST attempt to resume one time. The process for resuming MUST follow these steps:

  • Perform server selection.
  • Connect to selected server.
  • If there is a cached resumeToken:
    • If the ChangeStream was started with startAfter and has yet to return a result document:
      • The driver MUST set startAfter to the cached resumeToken.
      • The driver MUST NOT set resumeAfter.
      • The driver MUST NOT set startAtOperationTime. If startAtOperationTime was in the original aggregation command, the driver MUST remove it.
    • Else:
      • The driver MUST set resumeAfter to the cached resumeToken.
      • The driver MUST NOT set startAfter. If startAfter was in the original aggregation command, the driver MUST remove it.
      • The driver MUST NOT set startAtOperationTime. If startAtOperationTime was in the original aggregation command, the driver MUST remove it.
  • Else if there is no cached resumeToken and the ChangeStream has a saved operation time (either from an originally specified startAtOperationTime or saved from the original aggregation) and the max wire version is >= 7:
    • The driver MUST NOT set resumeAfter.
    • The driver MUST NOT set startAfter.
    • The driver MUST set startAtOperationTime to the value of the originally used startAtOperationTime or the one saved from the original aggregation.
  • Else:
    • The driver MUST NOT set resumeAfter, startAfter, or startAtOperationTime.
    • The driver MUST use the original aggregation command to resume.
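A non-normative sketch of this selection logic, assuming illustrative state fields on the change stream (cachedResumeToken, startedWithStartAfter, hasReturnedDocuments, savedOperationTime):

function resumeChangeStreamOptions(cs: any, maxWireVersion: number): any {
  // Start from the original $changeStream options, minus any resume-related fields.
  const options = { ...cs.originalOptions };
  delete options.resumeAfter;
  delete options.startAfter;
  delete options.startAtOperationTime;

  if (cs.cachedResumeToken != null) {
    if (cs.startedWithStartAfter && !cs.hasReturnedDocuments) {
      options.startAfter = cs.cachedResumeToken;
    } else {
      options.resumeAfter = cs.cachedResumeToken;
    }
  } else if (cs.savedOperationTime != null && maxWireVersion >= 7) {
    options.startAtOperationTime = cs.savedOperationTime;
  }
  // Otherwise the original aggregation command is reused as-is.
  return options;
}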

When resumeAfter is specified the ChangeStream will return notifications starting with the oplog entry immediately after the provided token.

If the server supports sessions, the resume attempt MUST use the same session as the previous attempt's command.

A driver MUST ensure that consecutive resume attempts can succeed, even in the absence of any changes received by the cursor between resume attempts.

A driver SHOULD attempt to kill the cursor on the server on which the cursor is opened during the resume process, and MUST NOT attempt to kill the cursor on any other server. Any exceptions or errors that occur during the process of killing the cursor should be suppressed, including both errors returned by the killCursor command and exceptions thrown by opening, writing to, or reading from the socket.

Exposing All Resume Tokens
  • Since: 4.0.7

Users can inspect the _id on each ChangeDocument to use as a resume token. But since MongoDB 4.0.7, aggregate and getMore responses also include a postBatchResumeToken. Drivers use one or the other when automatically resuming, as described in Resume Process.

Drivers MUST expose a mechanism to retrieve the same resume token that would be used to automatically resume. It MUST be possible to use this mechanism after iterating every document. It MUST be possible for users to use this mechanism periodically even when no documents are getting returned (i.e. getMore has returned empty batches). Drivers have two options to implement this.

Option 1: ChangeStream::getResumeToken()

interface ChangeStream extends Iterable<Document> {
  /**
   * Returns the cached resume token that will be used to resume
   * after the most recently returned change.
   */
  public getResumeToken() Optional<Document>;
}

This MUST be implemented in synchronous drivers. This MAY be implemented in asynchronous drivers.

Option 2: Event Emitted for Resume Token

Allow users to set a callback to listen for new resume tokens. The exact interface is up to the driver, but it MUST meet the following criteria:

  • The callback is set in the same manner as a callback used for receiving change documents.
  • The callback accepts a resume token as an argument.
  • The callback (or event) MAY include an optional ChangeDocument, which is unset when called with resume tokens sourced from postBatchResumeToken.

A possible interface for this callback MAY look like:

interface ChangeStream extends Iterable<Document> {
  /**
   * Returns a resume token that should be used to resume after the most
   * recently returned change.
   */
  public onResumeTokenChanged(ResumeTokenCallback:(Document resumeToken) => void);
}

This MUST NOT be implemented in synchronous drivers. This MAY be implemented in asynchronous drivers.

Updating the Cached Resume Token

The following rules describe how to update the cached resumeToken:

  • When the ChangeStream is started:
    • If startAfter is set, cache it.
    • Else if resumeAfter is set, cache it.
    • Else, resumeToken remains unset.
  • When aggregate or getMore returns:
    • If an empty batch was returned and a postBatchResumeToken was included, cache it.
  • When returning a document to the user:
    • If it's the last document in the batch and a postBatchResumeToken is included, cache it.
    • Else, cache the _id of the document.
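A non-normative sketch of these caching rules (field and helper names are illustrative):

function onResponseReceived(cs: any, batch: any[], postBatchResumeToken?: any): void {
  if (batch.length === 0 && postBatchResumeToken != null) {
    cs.cachedResumeToken = postBatchResumeToken;
  }
}

function onDocumentReturned(cs: any, doc: any, isLastInBatch: boolean, postBatchResumeToken?: any): void {
  if (isLastInBatch && postBatchResumeToken != null) {
    cs.cachedResumeToken = postBatchResumeToken;
  } else {
    cs.cachedResumeToken = doc._id;
  }
}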

Not Blocking on Iteration

Synchronous drivers MUST provide a way to iterate a change stream without blocking until a change document is returned. This MUST give the user an opportunity to get the most up-to-date resume token, even when the change stream continues to receive empty batches in getMore responses. This allows users to call ChangeStream::getResumeToken() after iterating every document and periodically when no documents are getting returned.

Although the implementation of tailable awaitData cursors is not specified, this MAY be implemented with a tryNext method on the change stream cursor.

All drivers MUST document how users can iterate a change stream and receive all resume token updates. Why do we allow access to the resume token to users shows an example. The documentation MUST state that users intending to store the resume token should use this method to get the most up to date resume token.

Timeouts

Drivers MUST apply timeouts to change stream establishment, iteration, and resume attempts per Client Side Operations Timeout: Change Streams.

Notes and Restrictions

1. fullDocument: updateLookup can result in change documents larger than 16 MiB

There is a risk that if there is a large change to a large document, the full document and delta might result in a document larger than the 16 MiB limitation on BSON documents. If that happens the cursor will be closed, and a server error will be returned.

2. Users can remove the resume token with aggregation stages

It is possible for a user to specify the following stage:

{ $project: { _id: 0 } }

Similar removal of the resume token is possible with the $redact and $replaceRoot stages. While this is not technically illegal, it makes it impossible for drivers to support resumability. Users may explicitly opt out of resumability by issuing a raw aggregation with a $changeStream stage.

Rationale

Why Helper Methods?

Change streams are a first class concept similar to CRUD or aggregation; the fact that they are initiated via an aggregation pipeline stage is merely an implementation detail. By requiring drivers to support top-level helper methods for this feature we not only signal this intent, but also solve a number of other potential problems:

  • Disambiguation of the result type of this special-case aggregation pipeline (ChangeStream), and an ability to control the behaviors of the resultant cursor

  • More accurate support for the concept of a maximum time the user is willing to wait for subsequent queries to complete on the resultant cursor (maxAwaitTimeMS)

  • Finer control over the options pertaining specifically to this type of operation, without polluting the already well-defined AggregateOptions

  • Flexibility for future potentially breaking changes for this feature on the server

Why are ChangeStreams required to retry once on a resumable error?

User experience is of the utmost importance. Errors not originating from the server are generally network errors, and network errors can be transient. Attempting to resume an interrupted change stream after the initial error allows for a seamless experience for the user, while subsequent network errors are likely to be an outage which can then be exposed to the user with greater confidence.

Why do we allow access to the resume token to users

Imagine a scenario in which a user wants to process each change to a collection at least once, but the application crashes during processing. In order to overcome this failure, a user might use the following approach:

localChange = getChangeFromLocalStorage()
resumeToken = getResumeTokenFromLocalStorage()

if localChange:
    processChange(localChange)

try:
    change_stream = db.collection.watch([...], resumeAfter=resumeToken)
    while True:
        change = change_stream.try_next()
        persistResumeTokenToLocalStorage(change_stream.get_resume_token())
        if change:
            persistChangeToLocalStorage(change)
            processChange(change)
except Exception:
    log.error("...")

In this case the current change is always persisted locally, including the resume token, such that on restart the application can still process the change while ensuring that the change stream continues from the right logical time in the oplog. It is the application's responsibility to ensure that processChange is idempotent; this design merely makes a reasonable effort to process each change at least once.

Why is there no example of the desired user experience?

The specification used to include this overspecified example of the "desired user experience":

try:
    for change in db.collection.watch(...):
        print(change)
except Exception:
    # We know for sure it's unrecoverable:
    log.error("...")

It was decided to remove this example from the specification for the following reasons:

  • Tailable + awaitData cursors behave differently in existing supported drivers.
  • There are considerations to be made for languages that do not permit interruptible I/O (such as Java), where a change stream which blocks forever in a separate thread would necessitate killing the thread.
  • There is something to be said for an API that allows cooperation by default. The model in which a call to next only blocks until any response is returned (even an empty batch), allows for interruption and cooperation (e.g. interaction with other event loops).

Why is an allow list of error codes preferable to a deny list?

Change streams originally used a deny list of error codes to determine which errors were not resumable. However, this allowed for the possibility of infinite resume loops if an error was not correctly deny listed. Due to the fact that all errors aside from transient issues such as failovers are not resumable, the resume behavior was changed to use an allow list. Part of this change was to introduce the ResumableChangeStreamError label so the server can add new error codes to the allow list without requiring changes to drivers.

Why is CursorNotFound special-cased when determining resumability?

With the exception of CursorNotFound, a server error on version 4.4 or higher is considered resumable if and only if it contains the ResumableChangeStreamError label. However, this label is only added by the server if the cursor being created or iterated is a change stream. CursorNotFound is returned when a getMore is done with a cursor ID that the server isn't aware of and therefore can't determine if the cursor is a change stream. Marking all CursorNotFound errors resumable in the server regardless of cursor type could be confusing as a user could see the ResumableChangeStreamError label when iterating a non-change stream cursor. To work around this, drivers always treat this error as resumable despite it not having the proper error label.

Why do we need to send a default startAtOperationTime when resuming a ChangeStream?

startAtOperationTime allows a user to create a resumable change stream even when a result (and corresponding resumeToken) is not available until a later point in time.

For example:

  • A client creates a ChangeStream, and calls watch
  • The ChangeStream sends out the initial aggregate call, and receives a response with no initial values. Because there are no initial values, there is no latest resumeToken.
  • The client's network is partitioned from the server, causing the client's getMore to time out
  • Changes occur on the server.
  • The network is unpartitioned
  • The client attempts to resume the ChangeStream

In the above example, not sending startAtOperationTime will result in the change stream missing the changes that occurred while the server and client are partitioned. By sending startAtOperationTime, the server will know to include changes from that previous point in time.

Why do we need to expose the postBatchResumeToken?

Resume tokens refer to an oplog entry. The resume token from the _id of a document corresponds to the oplog entry of the change. The postBatchResumeToken represents the oplog entry the change stream has scanned up to on the server (not necessarily a matching change). This can be a much more recent oplog entry, and should be used to resume when possible.

Attempting to resume with an old resume token may degrade server performance since the server needs to scan through more oplog entries. Worse, if the resume token is older than the last oplog entry stored on the server, then resuming is impossible.

Imagine the change stream matches a very small percentage of events. On a getMore the server scans the oplog for the duration of maxAwaitTimeMS but finds no matching entries and returns an empty response (still containing a postBatchResumeToken). There may be a long sequence of empty responses. Then due to a network error, the change stream tries resuming. If we tried resuming with the most recent _id, this throws out the oplog scanning the server had done for the long sequence of getMores with empty responses. But resuming with the last postBatchResumeToken skips the unnecessary scanning of unmatched oplog entries.

Test Plan

See tests/README.md

Backwards Compatibility

There should be no backwards compatibility concerns.

Reference Implementations

  • NODE (NODE-1055)
  • PYTHON (PYTHON-1338)
  • RUBY (RUBY-1228)

Changelog

  • 2024-02-09: Migrated from reStructuredText to Markdown.

  • 2023-08-11: Update server versions for $changeStreamSplitLargeEvent test.

  • 2023-05-22: Add spec test for $changeStreamSplitLargeEvent.

  • 2022-10-20: Reformat changelog.

  • 2022-10-05: Remove spec front matter.

  • 2022-08-22: Add clusterTime to ChangeStreamDocument.

  • 2022-08-17: Support disambiguatedPaths in UpdateDescription.

  • 2022-05-19: Support new change stream events with showExpandedEvents.

  • 2022-05-17: Add wallTime to ChangeStreamDocument.

  • 2022-04-13: Support returning point-in-time pre and post-images with
    fullDocumentBeforeChange and fullDocument.

  • 2022-03-25: Do not error when parsing change stream event documents.

  • 2022-02-28: Add to to ChangeStreamDocument.

  • 2022-02-10: Specify that getMore command must explicitly send inherited
    comment.

  • 2022-02-01: Add comment to ChangeStreamOptions.

  • 2022-01-19: Require that timeouts be applied per the client-side operations
    timeout specification.

  • 2021-09-01: Clarify that server selection during resumption should respect
    normal server selection rules.

  • 2021-04-29: Add load-balanced to test topology requirements.

  • 2021-04-23: Update to use modern terminology.

  • 2021-02-08: Add the UpdateDescription.truncatedArrays field.

  • 2020-02-10: Change error handling approach to use an allow list.

  • 2019-07-15: Clarify resume process for change streams started with the
    startAfter option.

  • 2019-07-09: Change fullDocument to be an optional string.

  • 2019-07-02: Fix server version for startAfter.

  • 2019-07-01: Clarify that close may be implemented with more idiomatic
    patterns instead of a method.

  • 2019-06-20: Fix server version for addition of postBatchResumeToken.

  • 2019-04-12: Clarify caching process for resume token.

  • 2019-04-03: Update the lowest server version that supports
    postBatchResumeToken.

  • 2019-01-10: Clarify error handling for killing the cursor.

  • 2018-11-06: Add handling of postBatchResumeToken.

  • 2018-12-14: Add startAfter to change stream options.

  • 2018-09-09: Add dropDatabase to change stream operationType.

  • 2018-07-30: Remove redundant error message checks for resumable errors.

  • 2018-07-27: Add drop to change stream operationType.

  • 2018-06-14: Clarify how to calculate startAtOperationTime.

  • 2018-05-24: Change startAtClusterTime to startAtOperationTime.

  • 2018-04-18: Add helpers for Database and MongoClient, and add
    startAtClusterTime option.

  • 2018-04-17: Clarify that the initial aggregate should not be retried.

  • 2017-12-13: Default read concern is also accepted, not just "majority".

  • 2017-11-06: Defer to Read and Write concern spec for determining a read
    concern for the helper method.

  • 2017-09-26: Clarify that change stream options may be added later.

  • 2017-09-21: Clarify that we need to close the cursor on missing token.

  • 2017-09-06: Remove desired user experience example.

  • 2017-08-22: Clarify killing cursors during resume process.

  • 2017-08-16: Fix formatting of resume process.

  • 2017-08-16: Add clarification regarding Resumable errors.

  • 2017-08-07: Fix typo in command format.

  • 2017-08-03: Initial commit.

.. role:: javascript(code)
   :language: javascript

=======================================
Find, getMore and killCursors commands
=======================================

:Status: Accepted
:Minimum Server Version: 3.2

.. contents::

Abstract

The Find, GetMore and KillCursors commands in MongoDB 3.2 or later replace the use of the legacy OP_QUERY, OP_GET_MORE and OP_KILL_CURSORS wire protocol messages. This specification lays out how drivers should interact with the new commands when compared to the legacy wire protocol level messages.

Definitions

Meta

The keywords "MUST", “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119_.

.. _RFC 2119: https://www.ietf.org/rfc/rfc2119.txt

Terms

Document
^^^^^^^^

The term Document refers to the implementation in the driver's language of a BSON document.

Command
^^^^^^^

A BSON document containing the fields making up a MongoDB server command.

Wire Protocol
^^^^^^^^^^^^^

The binary protocol used to talk with MongoDB over a socket. It is made up of the OP_QUERY, OP_GET_MORE, OP_KILL_CURSORS, OP_INSERT, OP_UPDATE and OP_DELETE messages.

Guidance

Documentation

The documentation provided in code below is merely for driver authors and SHOULD NOT be taken as required documentation for the driver.

Implementation

If the hello command returns maxWireVersion >= 4, drivers:

  • MUST implement queries with the find command instead of OP_QUERY.

  • MUST implement cursor operations with the getMore and killCursors commands instead of OP_GET_MORE and OP_KILL_CURSORS, respectively.

  • MUST NOT use OP_QUERY except to execute commands.

Commands

find

The find_ command replaces the query functionality of the OP_QUERY wire protocol message but cannot execute queries against special collections. Unlike the legacy OP_QUERY wire protocol message, the find command cannot be used to execute other commands.

.. _find: https://www.mongodb.com/docs/manual/reference/command/find/

For a successful command, the document returned from the server has the following format:

.. code:: javascript

    {
      "cursor": {
        "id": <int64>,
        "ns": <string>,
        "firstBatch": [
          ...
        ]
      },
      "ok": 1
    }
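For reference, a corresponding find command request might look like the following (illustrative field values; see the find command documentation for all options):

.. code:: typescript

    { find: "coll", filter: { x: 1 }, sort: { x: 1 }, batchSize: 3, limit: 4 }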

Special Collection names
^^^^^^^^^^^^^^^^^^^^^^^^

The find command does not support querying on system collections, so if drivers are using any system collections instead of the inprog, killop, unlock, etc. commands, they SHOULD default to using the old-style OP_QUERY.

Any driver that provides helpers for any of the special collections below SHOULD use the replacement commands if hello.maxWireVersion is >= 4.

.. list-table:: Special Collection Names
   :widths: 15 30
   :header-rows: 1

   * - Special collection name
     - Replacement Command
   * - $cmd.sys.inprog
     - currentOp
   * - $cmd.sys.unlock
     - fsyncUnlock
   * - .system.indexes
     - listIndexes
   * - .system.namespaces
     - listCollections

Exhaust
^^^^^^^

This section only applies to drivers that support exhaust cursors.

The exhaust protocol differs based on the server version:

================  ================================================================================================================
Server version    Server behavior
================  ================================================================================================================
4.0 and earlier   Only supports exhaust over legacy OP_QUERY. The find command does not support the exhaust flag from OP_QUERY.
4.2 to 5.0        Supports exhaust both over legacy OP_QUERY and OP_MSG.
5.1 and later     Supports exhaust over OP_MSG.
================  ================================================================================================================

Therefore drivers that implement exhaust cursors:

  • 4.0 and earlier: Drivers MUST use legacy OP_QUERY.
  • 4.2 to 5.0: Drivers SHOULD use OP_MSG but MAY use legacy OP_QUERY.
  • 5.1 and later: Drivers MUST only use OP_MSG.

Alternatively, drivers MAY fall back to a non-exhaust cursor when an exhaust cursor is requested.

Interactions with OP_QUERY

When sending a find operation as a find command rather than a legacy OP_QUERY find, only the secondaryOk flag is honored of the flags available in the flags field of the wire protocol message.

For the find, getMore and killCursors commands the numberToReturn field SHOULD be -1. To execute find commands against a secondary, the driver MUST set the secondaryOk bit for the find command to execute successfully.

The secondaryOk flag SHOULD not be set on follow-up getMore and killCursors commands. The cursor on the server keeps the original secondaryOk value first set on the find command.

More detailed information about the interaction of secondaryOk with OP_QUERY can be found in the Server Selection Spec's section on Passing a Read Preference_.

.. _Passing a Read Preference: ./server-selection/server-selection.md#passing-read-preference-to-mongos

Behavior of limit, skip and batchSize

The find command has different semantics from the existing 3.0 and earlier OP_QUERY wire protocol message. The limit field is a hard limit on the total number of documents returned by the cursor no matter what batchSize is provided. This includes other limiting operations, such as the $limit aggregation pipeline stage. This differs from existing OP_QUERY behavior where there is no server-side concept of limit and where the driver MUST keep track of the limit on the client side and MUST send an OP_KILL_CURSORS wire protocol message when the limit is reached.

When setting the batchSize on the find and getMore commands the value of batchSize MUST be based on the cursor limit calculations specified in the CRUD_ specification.

Note that with 5.0, the server-side handling of cursors with a limit has changed. Before 5.0, some cursors were automatically closed when the limit was reached (e.g. when running find with limit), and the reply document did not include a cursor ID (i.e. cursor.id was 0). Starting with 5.0, all cursor-producing operations will return a cursor ID if the end of the batch being returned lines up with the limit on the cursor. In this case, drivers MUST ensure the cursor is closed on the server, either by exhausting the cursor or by using killCursors to kill it.

In the following example the limit is set to 4 and the batchSize is set to 3, so the following commands are executed. The last command is either killCursors or getMore, depending on how a driver ensures the cursor is closed on 5.0:

.. code:: javascript

{find: ..., batchSize:3, limit:4}
{getMore: ..., batchSize:1} // Returns remaining items but leaves cursor open on 5.0+
{...}          // Kills server-side cursor. Necessary on 5.0+

.. _CRUD: ./crud/crud.md#find

If there are not enough documents in the cursor to fulfill the limit defined, the cursor runs to exhaustion and is closed, returning a cursorId of 0 to the client.

Below are some examples of using limit, skip and batchSize.

We have 100 documents in the collection t. We execute the following find command in the shell.

.. code:: javascript

var b = db.runCommand({find:"t", limit:20, batchSize:10});

db.runCommand({getMore:b.cursor.id, collection:"t", batchSize:20});

The find command executes and returns the first 10 results. The getMore command returns the final 10 results reaching the limit of 20 documents.

The skip option works in the same way as in OP_QUERY, starting the cursor after skipping the requested number of documents of the query.

.. code:: javascript

var b = db.runCommand({find:"t", limit:20, batchSize:10, skip:85});

db.runCommand({getMore:b.cursor.id, collection:"t", batchSize:20});

The find command returns the documents 86-95 and the getMore returns the last 5 documents.

Mapping OP_QUERY behavior to the find command limit and batchSize fields

The way that limit, batchSize and singleBatch are defined for the find command differs from how these were specified in OP_QUERY and the CRUD spec.

Specifically, negative values for limit and batchSize are no longer allowed and the singleBatch option is used instead of negative values.

In order to have consistency between old and new applications, the following transformations MUST be performed before adding options to the find command:

.. code::

singleBatch = (limit < 0) || (batchSize < 0)
limit       = abs(limit)
if singleBatch:
    batchSize = (limit == 0) ? abs(batchSize) : limit
else:
    batchSize = abs(batchSize)

Further, after these transformations:

  • If limit is zero, it MUST be omitted from find options
  • If batchSize is zero, it MUST be omitted from find options
  • If singleBatch is false, it MUST be omitted from find options
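As a worked example of these rules, suppose an application passes limit = -5 and batchSize = 10 (the collection name and empty filter below are illustrative):

.. code:: javascript

// limit = -5, batchSize = 10
// singleBatch = (-5 < 0) || (10 < 0)    => true
// limit       = abs(-5)                 => 5
// batchSize   = (5 == 0) ? abs(10) : 5  => 5
// Resulting find command options (all values are nonzero/true, so none are omitted):
{find: "t", filter: {}, limit: 5, batchSize: 5, singleBatch: true}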

BatchSize of 1

In 3.2 a batchSize of 1 means the find command returns a single document, and it will not destroy the cursor after the first batch of documents is returned. Given a query returning 4 documents, the commands issued will be:

  1. find command with batchSize=1
  2. getMore command with batchSize=1
  3. getMore command with batchSize=1
  4. getMore command with batchSize=1

The driver SHOULD NOT attempt to emulate the behavior seen in 3.0 or earlier, as the new find command enables the user-expected behavior of allowing the first result to contain a single document when specifying batchSize=1.

Tailable cursors

By default, most cursors are non-tailable. For example, a typical find cursor is exhausted when all results for the filter have been returned. MongoDB also supports creating cursors that "tail" or follow the target namespace for new data. This is done via the find command's tailable option. Querying a capped collection is one use case for a tailable cursor. A tailable cursor can receive getMore responses with an empty nextBatch array, which does not indicate that the cursor has been exhausted.

Additionally, tailable cursors may "await data" from the server, which means that the server will wait to respond to the client's getMore command until new data is available or some period of time has elapsed, whichever comes first. A tailable find cursor can be configured using the awaitData option. The amount of time the server will wait for new data is based on the maxTimeMS field of the getMore (or the server default if unspecified). If the time does expire, an empty batch will be returned. A maxTimeMS field cannot be sent if the cursor was not configured with awaitData=true.

To create a tailable find cursor you execute the following command:

.. code:: typescript

interface FindCommand {
  /** The namespace to run the find command on */
  find: string;
  /** The filter to control what documents are returned */
  filter: BSONDocument;
  /** Informs the server whether to keep the cursor open even when there are no results to satisfy the query */
  tailable?: boolean;
  /**
   * Informs the server whether to block on a getMore's maxTimeMS before returning an empty nextBatch.
   * This must be set if getMores are to include maxTimeMS values.
   */
  awaitData?: boolean;
  /** Controls the amount of milliseconds the server will allow the operations to run for */
  maxTimeMS?: PositiveIntegerNumber;
  // ... Note: additional options unrelated to tailable cursors omitted
}

If maxTimeMS is not set in FindOptions, the driver SHOULD refrain from setting maxTimeMS on the find or getMore commands issued by the driver and allow the server to use its internal default value for maxTimeMS.

Semantics of maxTimeMS for a Driver

In the case of a non-tailable cursor query OR a tailable cursor query with awaitData == false, the driver MUST set maxTimeMS on the find command and MUST NOT set maxTimeMS on the getMore command.

In the case of a tailable cursor with awaitData == true the driver MUST provide a Cursor level option named maxAwaitTimeMS (See CRUD specification for details). The maxTimeMS option on the getMore command MUST be set to the value of the option maxAwaitTimeMS. If no maxAwaitTimeMS is specified, the driver SHOULD not set maxTimeMS on the getMore command.
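For illustration, a sketch of the commands a driver might issue for a tailable, awaitData cursor where the application set maxAwaitTimeMS to 1000. The namespace and cursor id are placeholders:

.. code:: javascript

// Initial command: no maxTimeMS, since the application did not set one in FindOptions
{find: "capped_coll", filter: {}, tailable: true, awaitData: true}
// Each follow-up getMore carries the maxAwaitTimeMS value as its maxTimeMS
{getMore: <cursor id>, collection: "capped_coll", maxTimeMS: 1000}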

getMore

The getMore_ command replaces the OP_GET_MORE wire protocol message. The query flags passed to OP_QUERY for a getMore command MUST be secondaryOk=true when sent to a secondary. The OP_QUERY namespace MUST be the same as for the find and killCursors commands.

.. _getMore: https://www.mongodb.com/docs/manual/reference/command/getMore/

.. code:: typescript

interface GetMoreCommand {
  /** Set to the nonzero cursor id */
  getMore: int64;
  /** Set to the namespace returned by the initial command response */
  collection: string;
  /**
   * User configurable document count for the batch returned for this getMore.
   * Only attached to command document if nonzero.
   */
  batchSize?: PositiveIntegerNumber;
  /**
   * User configurable time limit enforced by the server.
   */
  maxTimeMS?: PositiveIntegerNumber;
  /**
   * User configurable comment that can be used to identify the operation in logs.
   * This can be any BSON value.
   */
  comment?: BSONValue;
}

On success, the getMore command will return the following:

.. code:: typescript

interface GetMoreResponse {
  ok: 1;
  cursor: {
    /** The cursor id, this may be equal to zero indicating the cursor is exhausted or closed */
    id: int64;
    /**
     * The namespace the cursor is running on.
     * This value may be different than the namespace the driver started the cursor on, for example, database level aggregations.
     */
    ns: string;
    /**
     * The next batch of documents.
     * This array may be empty, in the case of a tailable cursor, which DOES NOT indicate the cursor is exhausted.
     */
    nextBatch: BSONArray<BSONDocument>;
  }
  // ... Note: additional non-relevant fields omitted
}

The driver's local cursor MUST update its id and ns, as well as store the nextBatch from every getMore response.
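A minimal sketch of how a driver might apply a getMore response to its cursor state (the function and property names are illustrative, not prescribed by this spec):

.. code:: javascript

function applyGetMoreResponse(cursor, response) {
  cursor.id = response.cursor.id;           // 0 indicates the server-side cursor is closed
  cursor.ns = response.cursor.ns;           // may differ from the namespace the cursor started on
  cursor.batch = response.cursor.nextBatch; // buffer the next batch of documents for iteration
}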

killCursors

The killCursors_ command replaces the OP_KILL_CURSORS wire protocol message. The OP_QUERY namespace MUST be the same as for the find and getMore commands. The killCursors command is optional to implement in MongoDB 3.2.

.. _killCursors: https://www.mongodb.com/docs/manual/reference/command/killCursors/
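For reference, the killCursors command document names the target collection and lists the cursor ids to kill (see the linked server manual), for example:

.. code:: javascript

{killCursors: "t", cursors: [<cursor id 1>, <cursor id 2>]}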

The command response will be as follows:

.. code:: javascript

{
  "cursorsKilled": [
    <cursor id 1>,
    <cursor id 2>,
    ...
    <cursor id n>
  ],
  "cursorsNotFound": [
    <cursor id 1>,
    <cursor id 2>,
    ...
    <cursor id n>
  ],
  "cursorsAlive": [
    <cursor id 1>,
    <cursor id 2>,
    ...
    <cursor id n>
  ],
  ok: 1
}

The cursorsAlive array contains cursors that it was not possible to kill. This information SHOULD be ignored by the driver.

Difference from 3.0 OP_KILL_CURSORS

One of the differences of the new killCursors command compared to the OP_KILL_CURSORS wire protocol message is that the killCursors command returns a response, while the OP_KILL_CURSORS wire protocol message does not.

The OP_REPLY message has the following general structure.

.. code:: javascript

struct {
    int32     messageLength;  // total message size, including
                              // this

    int32     requestID;      // identifier for this message

    int32     responseTo;     // requestID from the original
                              // request(used in responses from db)

    int32     opCode;         // request type - see table below

    int32     responseFlags;  // bit vector - see details below

    int64     cursorID;       // cursor id if client needs to do
                              // get more's

    int32     startingFrom;   // where in the cursor this reply is
                              // starting

    int32     numberReturned; // number of documents in the reply

    document* documents;      // documents
}

For the find, getMore and killCursors commands MongoDB returns a single document, meaning numberReturned is set to 1. This is in contrast to MongoDB 3.0 and earlier, where an OP_QUERY query will set numberReturned to >= 0.

A driver MUST deserialize the command result and extract the firstBatch and nextBatch arrays for the find and getMore commands to access the returned documents.

The result from the killCursors command MAY be safely ignored.

If the driver supports returning raw BSON buffers instead of deserialized documents, it might need to partially deserialize documents in order to provide this behavior as efficiently as with existing OP_QUERY query results.

Errors

The find and getMore commands will report errors using the standard mechanism: an "ok: 0" field paired with “errmsg” and “code” fields. See below for example error responses:

.. code:: shell

> db.runCommand({find: "t", sort: {padding: -1}})

.. code:: javascript

{
  "errmsg" : "exception: Executor error: Overflow sort stage buffered data usage of 41630570 bytes exceeds internal limit of 33554432 bytes",
  "code" : 28616,
  "ok" : 0
}

.. code:: shell

> db.runCommand({find: "t", foo: "bar"})

.. code:: javascript

{
  "ok" : 0,
  "errmsg" : "Failed to parse: { find: \"t\", foo: \"bar\" }. Unrecognized field 'foo'.",
  "code" : 2
}

Like other commands, the find and getMore commands will not use the OP_REPLY response flags. See the OP_REPLY Documentation_ for details.

.. _OP_REPLY Documentation: https://www.mongodb.com/docs/meta-driver/latest/legacy/mongodb-wire-protocol/#op-reply

FAQ

Changes in error handling for 3.2 tailable cursor

Tailable cursors pointing to documents in a capped collection that get overwritten will return a zero document result in MongoDB 3.0 or earlier, but will return an error in MongoDB 3.2.

Explain command

There is no equivalent of the $explain modifier in the find command. The driver SHOULD use the explain command. Information about the command can be found in the Explain command reference_.

.. _Explain command reference: https://www.mongodb.com/docs/manual/reference/command/explain/

ReadPreference and Mongos

The find command does not include a readPreference field. To pass a readPreference to a mongos, use the $readPreference field and format your command as follows:

.. code:: javascript

{$query: {find: ...}, $readPreference: {}}

This format is general for all commands when executing against a mongos proxy.

More in-depth information about passing read preferences to mongos can be found in the Server Selection Specification_.

.. _Server Selection Specification: ./server-selection/server-selection.md#passing-read-preference-to-mongos

Changelog

:2023-05-10: Improve tailable cursor description and update the getMore section code blocks.
:2022-10-05: Remove spec front matter and reformat changelog.
:2022-02-01: Replace examples/tables for find, getMore, and killCursors with server manual links.
:2021-12-14: Exhaust cursors may fallback to non-exhaust cursors on 5.1+ servers. Relax requirement of OP_MSG for exhaust cursors.
:2021-08-27: Exhaust cursors must use OP_MSG on 3.6+ servers.
:2021-04-06: Updated to use hello and secondaryOk.
:2015-10-21: If no maxAwaitTimeMS is specified, the driver SHOULD not set maxTimeMS on the getMore command.
:2015-10-13: Added guidance on batchSize values as related to the getMore command. Legacy secondaryOk flag SHOULD not be set on getMore and killCursors commands. Introduced maxAwaitTimeMS option for setting maxTimeMS on getMore commands when the cursor is a tailable cursor with awaitData set.
:2015-09-30: Legacy secondaryOk flag must be set to true on getMore and killCursors commands to make drivers have same behavior as for OP_GET_MORE and OP_KILL_CURSORS.

GridFS Spec

  • Status: Accepted
  • Minimum Server Version: 2.2

Abstract

GridFS is a convention drivers use to store and retrieve BSON binary data (type "\x05") that exceeds MongoDB's BSON-document size limit of 16 MiB. When this data, called a user file, is written to the system, GridFS divides the file into chunks that are stored as distinct documents in a chunks collection. To retrieve a stored file, GridFS locates and returns all of its component chunks. Internally, GridFS creates a files collection document for each stored file. Files collection documents hold information about stored files, and they are stored in a files collection.

This spec defines a basic API for GridFS. This spec also outlines advanced GridFS features that drivers can choose to support in their implementations. Additionally, this document attempts to clarify the meaning and purpose of all fields in the GridFS data model, disambiguate GridFS terminology, and document configuration options that were previously unspecified.

Definitions

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Terms

Bucket name
A prefix under which a GridFS system's collections are stored. Collection names for the files and chunks collections are prefixed with the bucket name. The bucket name MUST be configurable by the user. Multiple buckets may exist within a single database. The default bucket name is "fs".

Chunk
A section of a user file, stored as a single document in the "chunks" collection of a GridFS bucket. The default size for the data field in chunks is 255 KiB. Chunk documents have the following form:

{
  "_id" : <ObjectId>,
  "files_id" : <TFileId>,
  "n" : <Int32>,
  "data" : <binary data>
}

_id
a unique ID for this document of type BSON ObjectId

files_id
the id for this file (the _id from the files collection document). This field takes the type of the corresponding _id in the files collection.

n
the index number of this chunk, zero-based.

data
a chunk of data from the user file

Chunks collection
A collection in which chunks of a user file are stored. The name for this collection is the word 'chunks' prefixed by the bucket name. The default is "fs.chunks".

Empty chunk
A chunk with a zero length "data" field.

Files collection
A collection in which information about stored files is stored. There will be one files collection document per stored file. The name for this collection is the word "files" prefixed by the bucket name. The default is "fs.files".

Files collection document
A document stored in the files collection that contains information about a single stored file. Files collection documents have the following form:

{
  "_id" : <TFileId>,
  "length" : <Int64>,
  "chunkSize" : <Int32>,
  "uploadDate" : <BSON datetime, ms since Unix epoch in UTC>,
  "md5" : <hex string>,
  "filename" : <string>,
  "contentType" : <string>,
  "aliases" : <string array>,
  "metadata" : <Document>
}

_id
a unique ID for this document. Usually this will be of type ObjectId, but a custom _id value provided by the application may be of any type.

length
the length of this stored file, in bytes

chunkSize
the size, in bytes, of each data chunk of this file. This value is configurable by file. The default is 255 KiB.

uploadDate
the date and time this file was added to GridFS, stored as a BSON datetime value. The value of this field MUST be the datetime when the upload completed, not the datetime when it was begun.

md5
DEPRECATED, a hash of the contents of the stored file

filename
the name of this stored file; this does not need to be unique

contentType
DEPRECATED, any MIME type, for application use only

aliases
DEPRECATED, for application use only

metadata
any additional application data the user wishes to store

Note: some older versions of GridFS implementations allowed applications to add arbitrary fields to the files collection document at the root level. New implementations of GridFS will not allow this, but must be prepared to handle existing files collection documents that might have additional fields.

Note: drivers SHOULD store length as Int64 and chunkSize as Int32 when creating new GridFS files. However, drivers MUST be able to handle existing GridFS files where the length and chunkSize fields might have been stored using a different numeric data type.

Orphaned chunk
A document in the chunks collection for which the "files_id" does not match any _id in the files collection. Orphaned chunks may be created if write or delete operations on GridFS fail part-way through.

Stored File
A user file that has been stored in GridFS, consisting of a files collection document in the files collection and zero or more documents in the chunks collection.

Stream
An abstraction that represents streamed I/O. In some languages a different word is used to represent this abstraction.

TFileId
While GridFS file id values are ObjectIds by default, an application may choose to use custom file id values, which may be of any type. In this spec the term TFileId refers to whatever data type is appropriate in the driver's programming language to represent a file id. This would be something like object, BsonValue or a generic <TFileId> type parameter.

User File
Data added by a user to GridFS. This data may map to an actual file on disk, a stream of input, a large data object, or any other large amount of consecutive data.

Specification

Guidance

Documentation

The documentation provided in code below is merely for driver authors and SHOULD NOT be taken as required documentation for the driver.

Operations

All drivers MUST offer the Basic API operations defined in the following sections and MAY offer the Advanced API operations. This does not preclude a driver from offering more.

Operation Parameters

All drivers MUST offer the same options for each operation as defined in the following sections. This does not preclude a driver from offering more. The options parameter is optional. A driver SHOULD NOT require a user to specify optional parameters.

Deviations

A non-exhaustive list of acceptable deviations are as follows:

  • Using named parameters instead of an options hash. For instance,

    id = bucket.upload_from_stream(filename, source, chunkSizeBytes: 16 * 1024);
    
  • Using a fluent style for constructing a GridFSBucket instance:

    bucket = new GridFSBucket(database)
      .withReadPreference(ReadPreference.Secondary);
    

When using a fluent-style builder, all options should be named rather than inventing a new word to include in the pipeline (like options). Required parameters are still required to be on the initiating constructor.

Timeouts

Drivers MUST enforce timeouts for all operations per Client Side Operations Timeout: GridFS API.

Naming

All drivers MUST name operations, objects, and parameters as defined in the following sections.

Deviations are permitted as outlined below.

Deviations

When deviating from a defined name, an author should consider if the altered name is recognizable and discoverable to the user of another driver.

A non-exhaustive list of acceptable naming deviations are as follows:

  • Using "bucketName" as an example, Java would use "bucketName" while Python would use "bucket_name". However, calling it "bucketPrefix" would not be acceptable.
  • Using "maxTimeMS" as an example, .NET would use "MaxTime" where its type is a TimeSpan structure that includes units. However, calling it "MaximumTime" would not be acceptable.
  • Using "GridFSUploadOptions" as an example, Javascript wouldn't need to name it while other drivers might prefer to call it "GridFSUploadArgs" or "GridFSUploadParams". However, calling it "UploadOptions" would not be acceptable.
  • Languages that use a different word than "Stream" to represent a streamed I/O abstraction may replace the word "Stream" with their language's equivalent word. For example, open_upload_stream might be called open_upload_file or open_upload_writer if appropriate.
  • Languages that support overloading MAY shorten the name of some methods as appropriate. For example, download_to_stream and download_to_stream_by_name MAY be overloaded download_to_stream methods with different parameter types. Implementers are encouraged not to shorten method names unnecessarily, because even if the shorter names are not ambiguous today they might become ambiguous in the future as new features are added.

API

This section presents two groups of features, a basic API that a driver MUST implement, and a more advanced API that drivers MAY choose to implement additionally.

Basic API

Configurable GridFSBucket class

class GridFSBucketOptions {

  /**
   * The bucket name. Defaults to 'fs'.
   */
  bucketName : String optional;

  /**
   * The chunk size in bytes. Defaults to 255 KiB.
   */
  chunkSizeBytes : Int32 optional;

  /**
   * The write concern. Defaults to the write concern of the database.
   */
  writeConcern : WriteConcern optional;

  /**
   * The read concern. Defaults to the read concern of the database.
   */
  readConcern : ReadConcern optional;

  /**
   * The read preference. Defaults to the read preference of the database.
   */
  readPreference : ReadPreference optional;

  /**
   * TRANSITIONAL: This option is provided for backwards compatibility.
   * It MUST be supported while a driver supports MD5 and MUST be removed
   * (or made into a no-op) when a driver removes MD5 support entirely.
   * When true, the GridFS implementation will not compute MD5 checksums
   * of uploaded files. Defaults to false.
   */
  disableMD5: Boolean
}

class GridFSBucket {

  /**
   * Create a new GridFSBucket object on @db with the given @options.
   */
  GridFSBucket new(Database db, GridFSBucketOptions options=null);

}

Creates a new GridFSBucket object, managing a GridFS bucket within the given database.

GridFSBucket objects MUST allow the following options to be configurable:

  • bucketName: the name of this GridFS bucket. The files and chunks collection for this GridFS bucket are prefixed by this name followed by a dot. Defaults to "fs". This allows multiple GridFS buckets, each with a unique name, to exist within the same database.
  • chunkSizeBytes: the number of bytes stored in chunks for new user files added through this GridFSBucket object. This will not reformat existing files in the system that use a different chunk size. Defaults to 255 KiB.

If a driver supports configuring readConcern, readPreference or writeConcern at the database or collection level, then GridFSBucket objects MUST also allow the following options to be configurable:

  • readConcern: defaults to the read concern on the parent database (or client object if the parent database has no read concern).
  • readPreference: defaults to the read preference on the parent database (or client object if the parent database has no read preference).
  • writeConcern: defaults to the write concern on the parent database (or client object if the parent database has no write concern).

The following option is transitional:

  • disableMD5: this allows users to disable MD5 when operating under FIPS restrictions. It is provided to allow a transition period as drivers remove MD5 support. Until a driver removes MD5 support, drivers MUST support this option. Following a driver's normal feature removal cycle, when MD5 support is removed, this option MUST be removed or otherwise made into a no-op option.

GridFSBucket instances are immutable. Their properties MUST NOT be changed after the instance has been created. If your driver provides a fluent way to provide new values for properties, these fluent methods MUST return new instances of GridFSBucket.

Indexes

For efficient execution of various GridFS operations the following indexes MUST exist:

  • an index on { filename : 1, uploadDate : 1 } on the files collection
  • a unique index on { files_id : 1, n : 1 } on the chunks collection

Normally we leave it up to the user to create whatever indexes they see fit, but because GridFS is likely to be looked at as a black box we should create these indexes automatically in a way that involves the least amount of overhead possible.

Before read operations

For read operations, drivers MUST assume that the proper indexes exist.

Before write operations

Immediately before the first write operation on an instance of a GridFSBucket class is attempted (and not earlier), drivers MUST:

  • determine if the files collection is empty using the primary read preference mode.
  • and if so, create the indexes described above if they do not already exist

To determine whether the files collection is empty drivers SHOULD execute the equivalent of the following shell command:

db.fs.files.findOne({}, { _id : 1 })

If no document is returned the files collection is empty.

This method of determining whether the files collection is empty should perform better than checking the count in the case where the files collection is sharded.

Drivers MUST check whether the indexes already exist before attempting to create them. This supports the scenario where an application is running with read-only authorizations.

When checking whether an index exists drivers MUST compare numeric values by value even when they are of different types, because the actual type will depend on how the index was created (for example, indexes created using the shell will have double as the type for numeric values).

For example, the following index specifications should all be treated as equivalent:

  • { filename : 1, uploadDate : 1 } // where 1 is either a 32-bit or 64-bit integer
  • { filename : 1, uploadDate : 1.0 }
  • { filename : 1.0, uploadDate : 1 }
  • { filename : 1.0, uploadDate : 1.0 }
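A minimal sketch of such a by-value comparison, assuming the driver represents index key specifications as ordered plain objects (the function name is illustrative):

.. code:: javascript

// Returns true when two index key specifications have the same keys, in the same
// order, with numerically equal values regardless of the BSON numeric type used.
function indexKeysEqual(a, b) {
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every((key, i) => key === keysB[i] && Number(a[key]) === Number(b[key]));
}

indexKeysEqual({filename: 1, uploadDate: 1}, {filename: 1.0, uploadDate: 1.0}); // true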

If a driver determines that it should create the indexes, it MUST raise an error if the attempt to create the indexes fails.

Drivers MUST create the indexes in foreground mode.

File Upload

class GridFSUploadOptions {

  /**
   * The number of bytes per chunk of this file. Defaults to the
   * chunkSizeBytes in the GridFSBucketOptions.
   */
  chunkSizeBytes : Int32 optional;

  /**
   * User data for the 'metadata' field of the files collection document.
   * If not provided the driver MUST omit the metadata field from the
   * files collection document.
   */
  metadata : Document optional;

  /**
   * DEPRECATED: A valid MIME type. If not provided the driver MUST omit the
   * contentType field from the files collection document.
   *
   * Applications wishing to store a contentType should add a contentType field
   * to the metadata document instead.
   */
  contentType : String optional;

  /**
   * DEPRECATED: An array of aliases. If not provided the driver MUST omit the
   * aliases field from the files collection document.
   *
   * Applications wishing to store aliases should add an aliases field to the
   * metadata document instead.
   */
  aliases: String[] optional;

}

class GridFSBucket {

  /**
   * Opens a Stream that the application can write the contents of the file to.
   * The driver generates the file id.
   *
   * Returns a Stream to which the application will write the contents.
   *
   * Note: this method is provided for backward compatibility. In languages
   * that use generic type parameters, this method may be omitted since
   * the TFileId type might not be an ObjectId.
   */
  Stream open_upload_stream(string filename, GridFSUploadOptions options=null);

  /**
   * Opens a Stream that the application can write the contents of the file to.
   * The application provides a custom file id.
   *
   * Returns a Stream to which the application will write the contents.
   */
  Stream open_upload_stream_with_id(TFileId id, string filename, GridFSUploadOptions options=null);

  /**
   * Uploads a user file to a GridFS bucket. The driver generates the file id.
   *
   * Reads the contents of the user file from the @source Stream and uploads it
   * as chunks in the chunks collection. After all the chunks have been uploaded,
   * it creates a files collection document for @filename in the files collection.
   *
   * Returns the id of the uploaded file.
   *
   * Note: this method is provided for backward compatibility. In languages
   * that use generic type parameters, this method may be omitted since
   * the TFileId type might not be an ObjectId.
   */
  ObjectId upload_from_stream(string filename, Stream source, GridFSUploadOptions options=null);

  /**
   * Uploads a user file to a GridFS bucket. The application supplies a custom file id.
   *
   * Reads the contents of the user file from the @source Stream and uploads it
   * as chunks in the chunks collection. After all the chunks have been uploaded,
   * it creates a files collection document for @filename in the files collection.
   *
   * Note: there is no need to return the id of the uploaded file because the application
   * already supplied it as a parameter.
   */
  void upload_from_stream_with_id(TFileId id, string filename, Stream source, GridFSUploadOptions options=null);
}

Uploads a user file to a GridFS bucket. For languages that have a Stream abstraction, drivers SHOULD use that Stream abstraction. For languages that do not have a Stream abstraction, drivers MUST create an abstraction that supports streaming.

In the case of open_upload_stream, the driver returns a Stream to which the application will write the contents of the file. As the application writes the contents to the returned Stream, the contents are uploaded as chunks in the chunks collection. When the application signals it is done writing the contents of the file by calling close (or its equivalent) on the returned Stream, a files collection document is created in the files collection. Once the Stream has been closed (and the files collection document has been created) a driver MUST NOT allow further writes to the upload Stream.

The driver MUST make the Id of the new file available to the caller. Typically a driver SHOULD make the Id available as a property named Id on the Stream that is returned. In languages where that is not idiomatic, a driver MUST make the Id available in a way that is appropriate for that language.

In the case of upload_from_stream, the driver reads the contents of the user file by consuming the source Stream until end of file is reached. The driver does NOT close the source Stream.

Drivers MUST take an "options" document with configurable parameters. Drivers for dynamic languages MUST ignore any unrecognized fields in the options for this method (this does not apply to drivers for static languages which define an Options class that by definition only contains valid fields).

Note that in GridFS, "filename" is not a unique identifier. There may be many stored files with the same filename stored in a GridFS bucket under different ids. Multiple stored files with the same filename are called 'revisions', and the 'uploadDate' is used to distinguish newer revisions from older ones.

Implementation details:

If "chunkSizeBytes" is set through the options, that value MUST be used as the chunk size for this stored file. If this parameter is not specified, the default chunkSizeBytes setting for this GridFSBucket object MUST be used instead.

To store a user file, the file must have a unique id. In some cases the driver can generate a unique ObjectId to serve as the id for the file being uploaded. Otherwise the application provides the value. Drivers store the contents of the user file in the chunks collection by breaking up the contents into chunks of size "chunkSizeBytes". For a non-empty user file, for each nth section of the file, drivers create a chunk document and set its fields as follows:

  • files_id: the id generated for this stored file.
  • n: this is the nth section of the stored file, zero based.
  • data: a section of file data, stored as BSON binary data with subtype 0x00. All chunks except the last one must be exactly 'chunkSizeBytes' long. The last chunk can be smaller, and should only be as large as necessary.
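A minimal sketch of this chunking step, assuming the file contents are available as a byte buffer (the function and variable names are illustrative):

.. code:: javascript

// Split a user file's contents into chunk documents of at most chunkSizeBytes each.
// An empty buffer produces no chunks, matching the rule for zero-length files below.
function makeChunks(filesId, buffer, chunkSizeBytes) {
  const chunks = [];
  for (let n = 0, offset = 0; offset < buffer.length; n++, offset += chunkSizeBytes) {
    chunks.push({
      files_id: filesId,
      n: n,
      data: buffer.slice(offset, offset + chunkSizeBytes) // last chunk may be shorter
    });
  }
  return chunks;
}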

Historically, while streaming the user file, drivers computed an MD5 digest for the (now deprecated) 'md5' field of the files collection document. If drivers preserve this behavior for backwards compatibility, they MUST provide the 'disableMD5' member of GridFSBucketOptions. When 'disableMD5' is true, drivers MUST NOT compute an MD5 digest or include it in the files collection document. If drivers no longer support the deprecated 'md5' field, they MUST NOT provide the 'disableMD5' member (or it MUST be a no-op) and MUST NOT compute MD5.

After storing all chunk documents generated for the user file in the "chunks" collection, drivers create a files collection document for the file and store it in the files collection. The fields in the files collection document are set as follows:

length
the length of this stored file, in bytes.

chunkSize
the chunk size in bytes used to break the user file into chunks. While the configuration option is named "chunkSizeBytes" for clarity, for legacy reasons, the files collection document uses only "chunkSize".

uploadDate
a BSON datetime object for the current time, in UTC, when the files collection document was created.

md5
MD5 checksum for this user file, computed from the file's data, stored as a hex string, if computed, otherwise omitted.

filename
the filename passed to this function, UTF-8 encoded.

contentType
the "contentType" passed in the options, if provided; otherwise omitted.

aliases
the array passed in the options, if provided; otherwise omitted.

metadata
the "metadata" document passed in the options, if provided; otherwise omitted.

If a user file contains no data, drivers MUST still create a files collection document for it with length set to zero. Drivers MUST NOT create any empty chunks for this file.

Drivers MUST NOT run the 'filemd5' database command to confirm that all chunks were successfully uploaded. We assume that if none of the inserts failed then the chunks must have been successfully inserted, and running the 'filemd5' command would just be unnecessary overhead and doesn't work for sharded chunk keys anyway.

Operation Failure

If any of the above operations fail against the server, drivers MUST raise an error. If some inserts succeeded before the failed operation, these become orphaned chunks. Drivers MUST NOT attempt to clean up these orphaned chunks. The rationale is that whatever failure caused the orphan chunks will most likely also prevent cleaning up the orphaned chunks, and any attempts to clean up the orphaned chunks will simply cause long delays before reporting the original failure to the application.

Aborting an upload

Drivers SHOULD provide a mechanism to abort an upload. When using open_upload_stream, the returned Stream SHOULD have an Abort method. When using upload_from_stream, the upload will be aborted if the source stream raises an error.

When an upload is aborted any chunks already uploaded MUST be deleted. Note that this differs from the case where an attempt to insert a chunk fails, in which case drivers immediately report the failure without attempting to delete any chunks already uploaded.

Abort MUST raise an error if it is unable to successfully abort the upload (for example, if an error occurs while deleting any chunks already uploaded). However, if the upload is being aborted because the source stream provided to upload_from_stream raised an error then the original error should be re-raised.

Abort MUST also close the Stream, or at least place it in an aborted state, so any further attempts to write additional content to the Stream after Abort has been called fail immediately.

File Download

class GridFSBucket {

  /** Opens a Stream from which the application can read the contents of the stored file
   * specified by @id.
   *
   * Returns a Stream.
   */
  Stream open_download_stream(TFileId id);

  /**
   * Downloads the contents of the stored file specified by @id and writes
   * the contents to the @destination Stream.
   */
  void download_to_stream(TFileId id, Stream destination);

}

Downloads a stored file from a GridFS bucket. For languages that have a Stream abstraction, drivers SHOULD use that Stream abstraction. For languages that do not have a Stream abstraction, drivers MUST create an abstraction that supports streaming.

In the case of open_download_stream, the application reads the contents of the stored file by reading from the returned Stream until end of file is reached. The application MUST call close (or its equivalent) on the returned Stream when it is done reading the contents.

In the case of download_to_stream the driver writes the contents of the stored file to the provided Stream. The driver does NOT call close (or its equivalent) on the Stream.

Note: By default a file id is of type ObjectId. If an application uses custom file ids it may be of any type.

Implementation details:

Drivers must first retrieve the files collection document for this file. If there is no files collection document, the file either never existed, is in the process of being deleted, or has been corrupted, and the driver MUST raise an error.

Then, implementers retrieve all chunks with files_id equal to id, sorted in ascending order on "n".
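In shell terms, and assuming the default bucket name, this retrieval is equivalent to:

.. code:: javascript

db.fs.chunks.find({files_id: id}).sort({n: 1})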

However, when downloading a zero length stored file the driver MUST NOT issue a query against the chunks collection, since that query is not necessary. For a zero length file, drivers return either an empty stream or send nothing to the provided stream (depending on the download method).

If a networking error or server error occurs, drivers MUST raise an error.

As drivers stream the stored file they MUST check that each chunk received is the next expected chunk (i.e. it has the expected "n" value) and that the data field is of the expected length. In the case of open_download_stream, if the application stops reading from the stream before reaching the end of the stored file, any errors that might exist beyond the point at which the application stopped reading won't be detected by the driver.

File Deletion

class GridFSBucket {

  /**
   * Given an @id, delete this stored file's files collection document and
   * associated chunks from a GridFS bucket.
   */
  void delete(TFileId id);

}

Deletes the stored file's files collection document and associated chunks from the underlying database.

As noted for download(), drivers that previously used ids of a different type MAY implement a delete() method that accepts that type, but MUST mark that method as deprecated.

Implementation details:

There is an inherent race condition between the chunks and files collections. Without some transaction-like behavior between these two collections, it is always possible for one client to delete a stored file while another client is attempting a read of the stored file. For example, imagine client A retrieves a stored file's files collection document, client B deletes the stored file, then client A attempts to read the stored file's chunks. Client A wouldn't find any chunks for the given stored file. To minimize the window of vulnerability of reading a stored file that is in the process of being deleted, drivers MUST first delete the files collection document for a stored file, then delete its associated chunks.

If there is no such file listed in the files collection, drivers MUST raise an error. Drivers MAY attempt to delete any orphaned chunks with files_id equal to id before raising the error.

If a networking or server error occurs, drivers MUST raise an error.

Generic Find on Files Collection

class GridFSFindOptions {

  /**
   * Enables writing to temporary files on the server. When set to true, the server
   * can write temporary data to disk while executing the find operation on the files collection.
   *
   * This option is sent only if the caller explicitly provides a value. The default
   * is to not send a value. For servers < 3.2, this option is ignored and not sent
   * as allowDiskUse does not exist in the OP_QUERY wire protocol.
   *
   * @see https://www.mongodb.com/docs/manual/reference/command/find/
   */
  allowDiskUse: Optional<Boolean>;

  /**
   * The number of documents to return per batch.
   */
  batchSize : Int32 optional;

  /**
   * The maximum number of documents to return.
   */
  limit : Int32 optional;

  /**
   * The maximum amount of time to allow the query to run.
   */
  maxTimeMS: Int64 optional;

  /**
   * The server normally times out idle cursors after an inactivity period
   * to prevent excess memory use. Set this option to prevent that.
   */
  noCursorTimeout : Boolean optional;

  /**
   * The number of documents to skip before returning.
   */
  skip : Int32 optional;

  /**
   * The order by which to sort results. Defaults to not sorting.
   */
  sort : Document optional;

}

class GridFSBucket {

  /**
   * Find and return the files collection documents that match @filter.
   */
  Iterable find(Document filter, GridFSFindOptions options=null);

}

This call will trigger a find() operation on the files collection using the given filter. Drivers return a sequence of documents that can be iterated over. Drivers return an empty or null set when there are no matching files collection documents. As the number of files could be large, drivers SHOULD return a cursor-like iterable type and SHOULD NOT return a fixed-size array type.

Implementation details:

Drivers SHOULD NOT perform any validation on the filter. If the filter contains fields that do not exist within files collection documents, then an empty result set will be returned.

Drivers MUST propagate the GridFSFindOptions to the FindOptions passed to the underlying Find operation.

Drivers MUST document how users query files collection documents, including how to query metadata, e.g. using a filter like { metadata.fieldname : "some_criteria" }.

Advanced API

File Download by Filename

class GridFSDownloadByNameOptions {

  /**
   * Which revision (documents with the same filename and different uploadDate)
   * of the file to retrieve. Defaults to -1 (the most recent revision).
   *
   * Revision numbers are defined as follows:
   * 0 = the original stored file
   * 1 = the first revision
   * 2 = the second revision
   * etc…
   * -2 = the second most recent revision
   * -1 = the most recent revision
   */
  revision : Int32 optional;

}

class GridFSBucket {

  /** Opens a Stream from which the application can read the contents of the stored file
   * specified by @filename and the revision in @options.
   *
   * Returns a Stream.
   */
  Stream open_download_stream_by_name(string filename, GridFSDownloadByNameOptions options=null);

  /**
   * Downloads the contents of the stored file specified by @filename and by the
   * revision in @options and writes the contents to the @destination Stream.
   */
  void download_to_stream_by_name(string filename, Stream destination,
    GridFSDownloadByNameOptions options=null);

}

Retrieves a stored file from a GridFS bucket. For languages that have a Stream abstraction, drivers SHOULD use that Stream abstraction. For languages that do not have a Stream abstraction, drivers MUST create an abstraction that supports streaming.

Implementation details:

If there is no file with the given filename, or if the requested revision does not exist, drivers MUST raise an error with a distinct message for each case.

Drivers MUST select the files collection document of the file to-be-returned by running a query on the files collection for the given filename, sorted by uploadDate (either ascending or descending, depending on the revision requested) and skipping the appropriate number of documents. For negative revision numbers, the sort is descending and the number of documents to skip equals (-revision - 1). For non-negative revision numbers, the sort is ascending and the number of documents to skip equals the revision number.
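A sketch of translating a revision number into the corresponding sort direction and skip count (the function name is illustrative):

.. code:: javascript

function revisionQueryOptions(revision) {
  return revision >= 0
    ? {sort: {uploadDate: 1}, skip: revision}        // 0 = original stored file
    : {sort: {uploadDate: -1}, skip: -revision - 1}; // -1 = most recent revision
}

revisionQueryOptions(-1); // {sort: {uploadDate: -1}, skip: 0}
revisionQueryOptions(2);  // {sort: {uploadDate: 1}, skip: 2}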

If a networking error or server error occurs, drivers MUST raise an error.

Partial File Retrieval

In the case of open_download_stream, drivers SHOULD support partial file retrieval by allowing the application to read only part of the stream. If a driver does support reading only part of the stream, it MUST do so using the standard stream methods of its language for seeking to a position in a stream and reading the desired amount of data from that position. This is the preferred method of supporting partial file retrieval.

In the case of download_to_stream, drivers are not required to support partial file retrieval. If they choose to do so, drivers can support this operation by adding "start" and "end" to their supported options for download_to_stream. These values represent non-negative byte offsets from the beginning of the file. When "start" and "end" are specified, drivers return the bytes of the file in [start, end). If "start" and "end" are equal no data is returned.

If either "start" or "end" is invalid, drivers MUST raise an error. These values are considered invalid if they are negative, greater than the file length, or if "start" is greater than "end".

When performing partial reads, drivers SHOULD use the file's "chunkSize" to calculate which chunks contain the desired section and avoid reading unneeded documents from the "chunks" collection.
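A sketch of computing the range of chunk indexes that covers a requested byte range [start, end), using the file's chunkSize (the function name is illustrative; it assumes start < end):

.. code:: javascript

function chunkRange(start, end, chunkSize) {
  const firstChunk = Math.floor(start / chunkSize);  // first chunk containing requested bytes
  const lastChunk = Math.ceil(end / chunkSize) - 1;  // last chunk containing requested bytes (inclusive)
  return {firstChunk: firstChunk, lastChunk: lastChunk};
}

chunkRange(300 * 1024, 700 * 1024, 255 * 1024); // {firstChunk: 1, lastChunk: 2}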

Renaming stored files

class GridFSBucket {

  /**
   * Renames the stored file with the specified @id.
   */
  void rename(TFileId id, string new_filename);

}

Sets the filename field in the stored file's files collection document to the new filename.

Implementation details:

Drivers construct and execute an update_one command on the files collection using { _id: @id } as the filter and { $set : { filename : "new_filename" } } as the update parameter.
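In shell terms, and assuming the default bucket name, this is equivalent to:

.. code:: javascript

db.fs.files.updateOne({_id: id}, {$set: {filename: "new_filename"}})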

To rename multiple revisions of the same filename, users must retrieve the full list of files collection documents for a given filename and execute "rename" on each corresponding _id.

If there is no file with the given id, drivers MUST raise an error.

Dropping an entire GridFS bucket

class GridFSBucket {

  /**
   * Drops the files and chunks collections associated with
   * this bucket.
   */
  void drop();

}

This method drops the files and chunks collections associated with this GridFS bucket.

Drivers should drop the files collection first, and then the chunks collection.

Test Plan

TBD

Motivation for Change

The existing GridFS documentation is only concerned with the underlying data model for this feature, and does not specify what basic set of features an implementation of GridFS should or should not provide. As a result, GridFS is currently implemented across drivers, but with varying APIs, features, and behavior guarantees. Current implementations also may not conform to the existing documentation.

This spec documents minimal operations required by all drivers offering GridFS support, along with optional features that drivers may choose to support. This spec is also explicit about what features/behaviors of GridFS are not specified and should not be supported. Additionally, this spec validates and clarifies the existing data model, deprecating fields that are undesirable or incorrect.

Design Rationale

Why is the default chunk size 255 KiB?
On MMAPv1, the server provides documents with extra padding to allow for in-place updates. When the "data" field of a chunk is limited to 255 KiB, it ensures that the whole chunk document (the chunk data along with an _id and other information) will fit into a 256 KiB section of memory, making the best use of the provided padding. Users setting custom chunk sizes are advised not to use round power-of-two values, as the whole chunk document is likely to exceed that space and demand extra padding from the system. WiredTiger handles its memory differently, and this optimization does not apply. However, because application code generally won't know what storage engine will be used in the database, always avoiding round power-of-two chunk sizes is recommended.

Why can"t I alter documents once they are in the system?
GridFS works with documents stored in multiple collections within MongoDB. Because there is currently no way to atomically perform operations across collections in MongoDB, there is no way to alter stored files in a way that prevents race conditions between GridFS clients. Updating GridFS stored files without that server functionality would involve a data model that could support this type of concurrency, and changing the GridFS data model is outside of the scope of this spec.

Why provide a "rename" method?
By providing users with a reasonable alternative for renaming a file, we can discourage users from writing directly to the files collections under GridFS. With this approach we can prevent critical files collection documents fields from being mistakenly altered.

Why is there no way to perform arbitrary updates on the files collection?
The rename helper defined in this spec allows users to easily rename a stored file. While updating files collection documents in other, more granular ways might be helpful for some users, validating such updates to ensure that other files collection document fields remain protected is a complicated task. We leave the decision of how best to provide this functionality to a future spec.

What is the "md5" field of a files collection document and how was it used?
"md5" holds an MD5 checksum that is computed from the original contents of a user file. Historically, GridFS did not use acknowledged writes, so this checksum was necessary to ensure that writes went through properly. With acknowledged writes, the MD5 checksum is still useful to ensure that files in GridFS have not been corrupted. A third party directly accessing the 'files' and "chunks" collections under GridFS could, inadvertently or maliciously, make changes to documents that would make them unusable by GridFS. Comparing the MD5 in the files collection document to a re-computed MD5 allows detecting such errors and corruption. However, drivers now assume that the stored file is not corrupted, and applications that want to use the MD5 value to check for corruption must do so themselves.

Why store the MD5 checksum instead of creating the hash as-needed?
The MD5 checksum must be computed when a file is initially uploaded to GridFS, as this is the only time we are guaranteed to have the entire uncorrupted file. Computing it on-the-fly as a file is read from GridFS would ensure that our reads were successful, but guarantees nothing about the state of the file in the system. A successful check against the stored MD5 checksum guarantees that the stored file matches the original and no corruption has occurred.

Why are MD5 checksums now deprecated? What should users do instead?
MD5 is prohibited by FIPS 140-2. Operating systems and libraries operating in FIPS mode do not provide the MD5 algorithm. To avoid a broken GridFS feature on such systems, the use of MD5 with GridFS is deprecated, should not be added to new implementations, and should be removed from existing implementations according to the deprecation policy of individual drivers. Applications that desire a file digest should implement it outside of GridFS and store it with other file metadata.

Why do drivers no longer need to call the filemd5 command on upload?
When a chunk is inserted and no error occurs the application can assume that the chunk was correctly inserted. No other operations that insert or modify data require the driver to double check that the operation succeeded. It can be assumed that any errors would have been detected by use of the appropriate write concern. Using filemd5 also prevents users from sharding chunk keys.

What about write concern?
This spec leaves the choice of how to set write concern to driver authors. Implementers may choose to accept write concern through options on the given methods, to set a configurable write concern on the GridFS object, to enforce a single write concern for all GridFS operations, or to do something different.

If a user has given GridFS a write concern of 0, should we perform MD5 calculations? (If supported for backwards compatibility)
Yes, because the checksum is used for detecting future corruption or misuse of GridFS collections.

Is GridFS limited by sharded systems?
For best performance, clients using GridFS on a sharded system should use a shard key that ensures all chunks for a given stored file are routed to the same shard. Therefore, if the chunks collection is sharded, you should shard on the files_id. Normally only the chunks collection benefits from sharding, since the files collection is usually small. Otherwise, there are no limitations to GridFS on sharded systems.

Why is contentType deprecated?
Most fields in the files collection document are directly used by the driver, with the exception of: metadata, contentType and aliases. All information that is purely for use of the application should be embedded in the "metadata" document. Users of GridFS who would like to store a contentType for use in their applications are encouraged to add a "contentType" field to the "metadata" document instead of using the deprecated top-level "contentType" field.

Why are aliases deprecated?
The "aliases" field of the files collection documents was misleading. It implies that a file in GridFS could be accessed by alternate names when, in fact, none of the existing implementations offer this functionality. For GridFS implementations that retrieve stored files by filename or support specifying specific revisions of a stored file, it is unclear how "aliases" should be interpreted. Users of GridFS who would like to store alternate filenames for use in their applications are encouraged to add an "aliases" field to the "metadata" document instead of using the deprecated top-level "aliases" field.

What happened to the put and get methods from earlier drafts?
Upload and download are more idiomatic names that more clearly indicate their purpose. Get and put are often associated with getting and setting properties of a class, and using them instead of download and upload was confusing.

Why aren't there methods to upload and download byte arrays?
We assume that GridFS files are usually quite large and therefore that the GridFS API must support streaming. Most languages have easy ways to wrap a stream around a byte array. Drivers are free to add helper methods that directly support uploading and downloading GridFS files as byte arrays.
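
For example, an application can wrap a byte array in an in-memory stream and reuse the streaming helpers. A minimal sketch assuming PyMongo's GridFSBucket API:

import io
from gridfs import GridFSBucket
from pymongo import MongoClient

bucket = GridFSBucket(MongoClient().test)

payload = b"small file held entirely in memory"
# Wrap the byte array in a stream so the streaming upload helper can be reused.
file_id = bucket.upload_from_stream("in-memory.bin", io.BytesIO(payload))

# Download back into a byte array via an in-memory stream.
out = io.BytesIO()
bucket.download_to_stream(file_id, out)
assert out.getvalue() == payload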

Should drivers report an error if a stored file has extra chunks?
The length and the chunkSize fields of the files collection document together imply exactly how many chunks a stored file should have. If the chunks collection has any extra chunks the stored file is in an inconsistent state. Ideally we would like to report that as an error, but this is an extremely unlikely state and we don't want to pay a performance penalty checking for an error that is almost never there. Therefore, drivers MAY ignore extra chunks.

Why have we changed our mind about requiring the file id to be an ObjectId?
This spec originally required the file id for all new GridFS files to be an ObjectId and specified that the driver itself would be the one to generate the ObjectId when a new file was uploaded. While this sounded like a good idea, it has since become evident that there are valid use cases for an application to want to generate its own file id, and that an application wouldn't necessarily want to use ObjectId as the type of the file id. The most common case where an application would want to use a custom file id is when the chunks collection is to be sharded and the application wants to use a custom file id that is suitable for sharding. Accordingly, we have relaxed this spec to allow an application to supply a custom file id (of any type) when uploading a new file.

How can we maintain backward compatibility while supporting custom file ids?
For most methods supporting custom file ids is as simple as relaxing the type of the id parameter from ObjectId to something more general like object or BSON value (or to a type parameter like <TFileId> in languages that support generic methods). In a few cases new methods were added to support custom file ids. The original upload_from_stream method returned an ObjectId, and support for custom file ids is implemented by adding a new method that takes the custom file id as an additional parameter. Drivers should continue to support the original method if possible to maintain backward compatibility. This spec does not attempt to completely mandate how each driver should maintain backward compatibility, as different languages have different approaches and capabilities for maintaining backward compatibility.
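
As an illustration of one such approach, a driver can keep the original ObjectId-returning helper and add an id-accepting variant; the sketch below assumes PyMongo-style naming for that variant:

from bson import ObjectId
from gridfs import GridFSBucket
from pymongo import MongoClient

bucket = GridFSBucket(MongoClient().test)

# Original helper: the driver generates and returns an ObjectId.
auto_id = bucket.upload_from_stream("a.txt", b"data")
assert isinstance(auto_id, ObjectId)

# Newer helper: the application supplies its own file id (any BSON type),
# for example one chosen to be suitable as a shard key.
bucket.upload_from_stream_with_id("customer-42", "a.txt", b"data")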

Backwards Compatibility

This spec presents a new API for GridFS systems, which may break existing functionality for some drivers. The following are suggestions for ways to mitigate these incompatibilities.

File revisions
This document presents a basic API that does not support specifying specific revisions of a stored file, and an advanced API that does. Drivers MAY choose to implement whichever API is closest to the functionality they now support. Note that the methods for file insertion are the same whether specifying specific revisions is supported or not.

Method names
If drivers provide methods that conform to the functionality outlined in this document, drivers MAY continue to provide those methods under their existing names. In this case, drivers SHOULD make it clear in their documentation that these methods have equivalents defined in the spec under a different name.

ContentType field
Drivers MAY continue to create a "contentType" field within files collection documents, so that applications depending on this field continue to work. However, drivers SHOULD make it clear in their documentation that this field is deprecated, and is not used at all in driver code. Documentation SHOULD encourage users to store contentType in the "metadata" document instead.

Aliases field
Drivers MAY continue to create an "aliases" field within files collection documents, so that applications depending on this field continue to work. However, drivers SHOULD make it clear in their documentation that this field is deprecated, and is not used at all in driver code. Documentation SHOULD encourage users to store aliases in the "metadata" document instead.

Reference Implementation

TBD

Future work

Changes to the GridFS data model are out-of-scope for this spec, but may be considered for the future.

The ability to alter or append to existing GridFS files has been cited as something that would greatly improve the system. While this functionality is not in-scope for this spec (see "Why can't I alter documents once they are in the system?") it is a potential area of growth for the future.

Changelog

  • 2024-02-27: Migrated from reStructuredText to Markdown.
  • 2022-10-05: Remove spec front matter and reformat changelog.
  • 2022-01-19: Require that timeouts be applied per the client-side operations timeout spec.
  • 2020-01-17: Added allowDiskUse to GridFSFindOptions.
  • 2018-07-05: Must not use 'filemd5'.
  • 2018-01-31: Deprecated MD5, and specified an option to disable MD5 until removed.
  • 2016-10-07: Drivers SHOULD handle any numeric type of length and chunkSize.
  • 2016-10-07: Added ReadConcern to the GridFS spec.
  • 2016-10-07: Modified a JSON test that was testing optional behavior.
  • 2016-05-10: Support custom file ids.

Stable API For Drivers

  • Status: Accepted
  • Minimum Server Version: N/A



Abstract

As MongoDB moves toward more frequent releases (a.k.a. continuous delivery), we want to enable users to take advantage of our rapidly released features, without exposing applications to incompatible server changes due to automatic server upgrades. A stable API will help accomplish that goal.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 (https://www.ietf.org/rfc/rfc2119.txt).

This document tends to use "SHOULD" more frequently than other specifications, but mainly in the context of providing guidance on writing test files. This is discussed in more detail in the Design Rationale.

Specification

Background

When applications interact with MongoDB, both the driver and the server participate in executing operations. Therefore, when determining application compatibility with MongoDB, both the driver and the server behavior must be taken into account.

An application can specify the server API version when creating MongoClient. When this is done:

  • The client sends the specified API version to the server, causing the server to behave in a manner compatible with that API version.
  • The driver will behave in a manner compatible with a server configured with that API version, regardless of the server's actual release version.

Presently there is no specification for how a driver must behave when a particular server API version is requested, or what driver operations are subject to API compatibility guarantees. Such requirements may be stipulated in subsequent specifications.

This specification requires MongoClient to validate that it supports the specified server API version, if any, but does not define what such support means.

MongoClient changes

MongoClient instances accept a new serverApi option to allow the user to declare an API version:

class MongoClient {
   MongoClient(... serverApi: ServerApi);
}

enum ServerApiVersion {
   v1 = "1",
}

class ServerApi {
   version: string | ServerApiVersion;
   strict: Optional<Boolean>; // Default false
   deprecationErrors: Optional<Boolean>; // Default false
}

Drivers SHOULD group the serverApi option with other similar client options like autoEncryptionOpts. Drivers MUST NOT allow specification of any stable API options via the connection string. See the Design Rationale for more details.
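
For illustration, a minimal sketch of declaring an API version at client construction time, using PyMongo-style naming (the exact option spelling is driver-specific, as noted above):

from pymongo import MongoClient
from pymongo.server_api import ServerApi

# Declare Stable API version "1"; strict and deprecationErrors default to false.
client = MongoClient(
    "mongodb://localhost:27017",
    server_api=ServerApi("1", strict=True, deprecation_errors=False),
)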

ServerApiVersion enumeration


This specification and subsequent specifications will define the known API
versions. Drivers SHOULD define an enumeration containing the known API
versions, using the version identifiers given. Drivers MAY deviate from the
version identifiers used in this and subsequent specifications if doing so is
necessary given the driver's programming language's constraints. Drivers MUST
ensure that adding new API versions to this enumeration does not result in
backward compatibility breaks in non-major releases. This can be the case in
languages that allow exhaustive ``switch`` statements (e.g. Swift).

Drivers for languages that don't have enums (e.g. PHP) MUST expose the version
as a string, but SHOULD offer constants to allow for IDE features such as code
completion. In these cases, the driver MUST validate (e.g. when the application
provides a version string to the ``ServerApi`` class) that the version string is
valid and trigger a client-side error if an unknown API version was used.


ServerApi class

The ``ServerApi`` class stores an API version, along with flags that decide
whether or not unknown or deprecated commands in the specified API version
trigger a server-side error. A ``version`` MUST be specified when declaring an
API version, while the ``strict`` and ``deprecationErrors`` options are both
optional. The ``ServerApi`` class is considered immutable; changes to the
declared API version MUST be prohibited.


Declared Version Inheritance

Drivers MUST ensure that users cannot override the API version declared in the MongoClient instance. This includes the MongoDatabase and MongoCollection classes, as well as any operations in these classes. See the rationale for more details.

Sending Declared API Version to the Server

The declared API version MUST be sent to the server as part of every command request, with the exception of the cases listed below. Drivers MUST NOT use a server's reported maxWireVersion to decide whether it supports the stable API. The server will reply with an error if the declared API version is not supported, or if the command does not support API versioning options. If the user does not declare an API version, the driver MUST NOT send any API versioning options to the server.

If an API version is declared then the driver MUST use OP_MSG for all messages, including the initial handshake.

Command Syntax


The options from the declared API version are mapped to the following command
options:

===================== ========================
**ServerApi field**   **Command option**
``version``           ``apiVersion``
``strict``            ``apiStrict``
``deprecationErrors`` ``apiDeprecationErrors``
===================== ========================

If an API version was declared, drivers MUST add the ``apiVersion`` option to
every command that is sent to a server. Drivers MUST add the ``apiStrict`` and
``apiDeprecationErrors`` options if they were specified by the user, even when
the specified value is equal to the server default. Drivers MUST NOT add any
API versioning options if the user did not specify them.
This includes the ``getMore`` command as well as all commands that are part of a
transaction. A previous version of this specification excluded those commands,
but that has since changed in the server.
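
As an illustration only, a find command sent by a client that declared version "1" with strict enabled (and deprecationErrors unspecified) might be assembled like this before BSON serialization:

# Sketch of the command document a driver could build; field values are examples.
command = {
    "find": "coll",
    "filter": {"status": "active"},
    "$db": "test",
    "apiVersion": "1",   # always appended once an API version is declared
    "apiStrict": True,   # appended because the user explicitly set strict
    # "apiDeprecationErrors" is omitted because the user did not specify it
}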


Handshake behavior

If an API version was declared, drivers MUST NOT use the legacy hello command during the initial handshake or afterwards. Instead, drivers MUST use the hello command exclusively and use the OP_MSG protocol. If the server does not support hello, the server description MUST reflect this with an Unknown server type.

Cursors


For ``getMore``, drivers MUST submit API parameters. If the values given do not
match the API parameters given in the cursor-initiating command, the server will
reply with an error.


Transactions

When running commands as part of a transaction, drivers MUST send API parameters with all commands that are part of a transaction, including commitTransaction and abortTransaction. If the API parameters for a command in a transaction do not match those of the transaction-starting command, the server will reply with an error.

Generic command helper


Drivers that offer a generic command helper (e.g. ``command()`` or
``runCommand()``) MUST NOT inspect the command document to detect API versioning
options. As with all other commands, drivers MUST inherit the API version from
the client. Specifying API versioning options in the command document and
declaring an API version on the client is not supported. Drivers MUST document
that the behaviour of the command helper is undefined in this case.


Design Rationale

No URI Options

Since changing the API version can cause the application to behave differently,
drivers MUST NOT allow users to change the declared API version without
deploying code changes. This ensures that users don't copy a connection string
with a declared API version that may be different from what their application
expects. A URI option can be added later if we realise our users need it, while
the opposite is not easily accomplished.


Don't Allow Overriding the Declared API Version

While users are used to overriding options like read preference, read concern,
and write concern in ``MongoDatabase`` and ``MongoCollection`` objects, or on an
operation level, we explicitly decided against this for the declared API
version. With a single API version available to start, we can't anticipate what
use cases users may have to override the API version. Not including this feature
at the beginning allows us to gather feedback on use cases and add the features
users are looking for. On the other hand, adding the ability to override the
declared API version can't be undone until a future major release, which is
almost impossible to accomplish across all drivers.


Generic Command Helper Behaviour

The runCommand helper is a way for the user to run a native command with the
driver doing little to no inspection in the command. This allows users to run
arbitrary commands that may not have helpers in the driver, or to pass options
that are not supported by the driver version they are currently using. Commands
run using this helper do not inherit any ``readConcern`` or ``writeConcern``
options that may have been set on the ``MongoClient`` or ``MongoDatabase``
objects.

However, the declared API version is a different case. We are introducing this
feature to give users a certain peace of mind when upgrading driver or server
versions, by ensuring that their code will continue to show the same behaviour
they've gotten used to. This includes all commands run using the generic command
helper. Thus, the helper will inherit the API version declared on the client.


Hardcode supported versions in drivers

Since a new API version might require driver changes (e.g. to account for
removed commands), we don't yet know what changes drivers must make for a future
version. Until we do, we must prevent users from choosing any unknown API
version.


Backward Compatibility

Driver changes are fully backward compatible. Not declaring an API version when
creating a client may cause an error if the server was started with the
``requireApiVersion`` option enabled, but this is outside of driver control.


Future Work

Overriding the Declared API Version

In the future, we may want to allow users to override the declared API version
on a ``MongoDatabase``, ``MongoCollection``, or individual operation level.
However, this is not necessary until there is a different API version and we
have data on why and how users would want to override the declared API version.


Stable CRUD API

Drivers may also want to provide specialized ``MongoClient``, ``MongoDatabase``,
and ``MongoCollection`` classes to only include features that are part of the
stable API. This is not covered in this specification.


Changelog

  • 2022-10-05: Remove spec front matter and reformat changelog.
  • 2022-02-24: Rename Versioned API to Stable API.
  • 2022-01-14: Require ``OP_MSG`` for all messages including the initial step of the handshake when using stable API.
  • 2021-05-05: Require sending stable API parameters with ``getMore`` and transaction-continuing commands.
  • 2021-04-20: Require using ``hello`` when using the stable API.
  • 2021-04-10: Replaced usages of ``acceptAPIVersion2`` with ``acceptApiVersion2``.

Client Side Encryption

  • Status: Accepted
  • Minimum Server Version: 4.2 (CSFLE), 6.0 (Queryable Encryption)
  • Version: 1.13.0

Abstract

MongoDB 4.2 introduced support for client side encryption, guaranteeing that sensitive data can only be encrypted and decrypted with access to both MongoDB and a separate key management provider (supporting AWS, Azure, GCP, a local provider, and KMIP). Once enabled, data can be seamlessly encrypted and decrypted with minimal application code changes. 6.0 introduced the next generation of client side encryption based on a Structured Encryption framework which allows expressive encrypted search operations. This spec covers both capabilities - the first generation, "Client Side Field Level Encryption", and the second generation, "Queryable Encryption" - since the associated core cryptographic library and supporting drivers share a common codebase.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Terms

encrypted MongoClient
A MongoClient with client side encryption enabled.

data key
A key used to encrypt and decrypt BSON values. Data keys are encrypted with a key management service (e.g. AWS KMS) and stored within a document in the MongoDB key vault collection (see Key vault collection schema for data keys for a description of the data key document). Therefore, a client needs access to both MongoDB and the external KMS service to utilize a data key.

MongoDB key vault collection
A MongoDB collection designated to contain data keys. This can either be co-located with the data-bearing cluster, or in a separate external MongoDB cluster.

Key Management Service (KMS)
An external service providing fixed-size encryption/decryption. Only data keys are encrypted and decrypted with KMS.

KMS providers
A map of KMS providers to credentials. Configured client-side. Example:

kms_providers = {
   "aws": {
      "accessKeyId": AWS_KEYID,
      "secretAccessKey": AWS_SECRET,
   },
   "local": {
      "key": LOCAL_KEK
   },
}

KMS provider
A configured KMS. Identified by a key in the KMS providers map. The key has the form "<KMS provider type>" or "<KMS provider type>:<KMS provider name>". Examples: "aws" or "aws:myname". In libmongocrypt, the key is referred to as the KMS ID.

KMS provider type
The type of backing KMS. Identified by the string: "aws", "azure", "gcp", "kmip", or "local".

KMS provider name
An optional name to identify a KMS provider. Enables configuring multiple KMS providers with the same KMS provider type (e.g. "aws:name1" and "aws:name2" can refer to different AWS accounts).

Customer Master Key (CMK)
The underlying key AWS KMS uses to encrypt and decrypt. See AWS Key Management Service Concepts.

schema
A MongoDB JSON Schema (either supplied by the server or client-side) which may include metadata about encrypted fields. This is a JSON Schema based on draft 4 of the JSON Schema specification, as documented in the MongoDB manual.

libmongocrypt
A library, written in C, that coordinates communication, does encryption/decryption, and caches keys and schemas. Located here.

mongocryptd
A local process the driver communicates with to determine how to encrypt values in a command.

crypt_shared
This term, spelled in all-lowercase with an underscore, refers to the client-side field-level-encryption dynamic library provided as part of a MongoDB Enterprise distribution. It replaces mongocryptd as the method of marking up a database command for encryption.

ciphertext
One of the data formats of BSON binary subtype 6, representing an encoded BSON document containing encrypted ciphertext and metadata.

FLE
FLE is the first version of Client-Side Field Level Encryption. FLE is almost entirely client-side with the exception of server-side JSON schema.

Queryable Encryption
Queryable Encryption is the second version of Client-Side Field Level Encryption. Data is encrypted client-side. Queryable Encryption supports indexed encrypted fields, which are further processed server-side.

In-Use Encryption
An umbrella term describing both FLE and Queryable Encryption.

encryptedFields
A BSON document describing the Queryable Encryption encrypted fields. This is analogous to the JSON Schema in FLE. The following is an example encryptedFields in extended canonical JSON:

{
    "escCollection": "enxcol_.CollectionName.esc",
    "ecocCollection": "enxcol_.CollectionName.ecoc",
    "fields": [
        {
            "path": "firstName",
            "keyId": { "$binary": { "subType": "04", "base64": "AAAAAAAAAAAAAAAAAAAAAA==" }},
            "bsonType": "string",
            "queries": {"queryType": "equality"}
        },
        {
            "path": "ssn",
            "keyId": { "$binary": { "subType": "04", "base64": "BBBBBBBBBBBBBBBBBBBBBB==" }},
            "bsonType": "string"
        }
    ]
}

The acronyms within encryptedFields are defined as follows:

  • ECOC: Encrypted Compaction Collection
  • ESC: Encrypted State Collection

Introduction

Client side encryption enables users to specify what fields in a collection must be encrypted, and the driver automatically encrypts commands and decrypts results. Automatic encryption is enterprise only. But users can manually encrypt and decrypt with a new ClientEncryption object.

Client side encryption requires MongoDB 4.2 compatible drivers, and is only supported against 4.2 or higher servers. See Why is a 4.2 server required?

The following shows basic usage of the new API.

# The schema map identifies fields on collections that must undergo encryption.

schema_map = open("./schemas.json", "r").read()

# AWS KMS is used to decrypt data keys stored in the key vault collection.

aws_creds = open("./aws_credentials.json", "r").read()

# A client is configured for automatic encryption and decryption by passing
# AutoEncryptionOpts. Automatic encryption is an enterprise only feature.

opts = AutoEncryptionOpts(
    kms_providers={"aws": aws_creds},
    key_vault_namespace="db.datakeys",
    schema_map=schema_map)

client = MongoClient(auto_encryption_opts=opts)
db = client.db

# Commands are encrypted, as determined by the JSON Schema from the schema_map.
db.coll.insert_one({"ssn": "457-55-5462"})

# Replies are decrypted.
print(db.coll.find_one()) # { "ssn": "457-55-5462" } but stored and transported as ciphertext.

# A ClientEncryption object is used for explicit encryption, decryption, and creating data keys.
opts = ClientEncryptionOpts(kms_providers={"aws": aws_creds}, key_vault_namespace="db.datakeys")
clientencryption = ClientEncryption(client, opts)

# Use a ClientEncryption to create new data keys.
# The master key identifies the CMK on AWS KMS to use for encrypting the data key.
master_key = open("./aws_masterkey.json", "r").read()
opts = DataKeyOpts(master_key=master_key)
created_key_id = clientencryption.create_data_key("aws", opts)

# Use a ClientEncryption to explicitly encrypt and decrypt.
opts = EncryptOpts(key_id=created_key_id,
    algorithm="AEAD_AES_256_CBC_HMAC_SHA_512-Random")
encrypted = clientencryption.encrypt("secret text", opts)
decrypted = clientencryption.decrypt(encrypted)

There are many moving parts to client side encryption with lots of similar sounding terms. Before proceeding to implement the specification, the following background should provide some context.

The driver interacts with multiple components to implement client side encryption.

image

The driver communicates with…

  • MongoDB cluster to get remote JSON Schemas.
  • MongoDB key vault collection to get encrypted data keys and create new data keys.
  • A KMS Provider to decrypt fetched data keys and encrypt new data keys.
  • mongocryptd to ask what values in BSON commands must be encrypted (unless crypt_shared is in use).

The MongoDB key vault may be the same as the MongoDB cluster. Users may choose to have data keys stored on a separate MongoDB cluster, or co-located with their data.

MongoDB Key Vault collection

The key vault collection is a special MongoDB collection containing key documents. See the appendix section Key vault collection schema for data keys for a description of the documents.

The key material in the key vault collection is encrypted with a separate KMS service. Therefore, encryption and decryption requires access to a MongoDB cluster and the KMS service.

KMS Provider

A KMS provider (AWS KMS, Azure Key Vault, GCP KMS, the local provider, or KMIP) is used to decrypt data keys after fetching from the MongoDB Key Vault, and encrypt newly created data keys. Refer to KMSProviders for the shape of the KMS provider options.

mongocryptd

mongocryptd is a singleton local process needed for auto-encryption when no crypt_shared library is used. It speaks the MongoDB wire protocol and the driver uses mongocryptd by connecting with a MongoClient. By default, if crypt_shared is unavailable, the driver should attempt to automatically spawn mongocryptd. If the MongoClient is configured with extraOptions.mongocryptdBypassSpawn set to true, OR bypassAutoEncryption is set to true, OR bypassQueryAnalysis is set to true then the driver will not attempt to spawn mongocryptd.

The mongocryptd process is responsible for self terminating after idling for a time period. If extraOptions.cryptSharedLibRequired is set to true, the driver will not connect to mongocryptd and instead rely on crypt_shared being available.
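
A rough sketch of the spawn decision described above, in Python-style pseudocode (option names follow AutoEncryptionOpts and extraOptions; the crypt_shared availability flag is assumed to come from libmongocrypt after initialization):

def should_spawn_mongocryptd(auto_encryption_opts, crypt_shared_available):
    extra = auto_encryption_opts.extra_options or {}
    if crypt_shared_available:
        return False
    if extra.get("mongocryptdBypassSpawn", False):
        return False
    if auto_encryption_opts.bypass_auto_encryption:
        return False
    if auto_encryption_opts.bypass_query_analysis:
        return False
    return True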

crypt_shared

crypt_shared is a dynamically-loaded C++ library providing query analysis for auto-encryption. It replaces mongocryptd for performing query analysis to mark up sensitive fields within a command.

Drivers are not required to load and interact with crypt_shared directly. Instead, they inform libmongocrypt where to find crypt_shared and libmongocrypt will handle crypt_shared communication automatically.

See also: Enabling crypt_shared for information on enabling the crypt_shared library.

libmongocrypt

libmongocrypt is a C library providing crypto and coordination with external components. Located here.

libmongocrypt is responsible for…

  • orchestrating an internal state machine.
  • asking the driver to perform I/O, then handling the responses.
    • includes constructing KMS HTTP requests and parsing KMS responses.
  • doing encryption and decryption.
  • caching data keys.
  • caching results of listCollections.
  • creating key material.

The driver is responsible for…

  • performing all I/O needed at every state:
    • speaking to mongocryptd to mark commands (unless crypt_shared is used).
    • fetching encrypted data keys from key vault collection (mongod).
    • running listCollections on mongod.
    • decrypting encrypted data keys with KMS over TLS.
  • doing I/O asynchronously as needed.

See Why require including a C library?

User facing API

Drivers MUST NOT preclude future options from being added to any of the new interfaces.

Drivers MAY represent the options types in a way that is idiomatic to the driver or language. E.g. options MAY be a BSON document or dictionary type. The driver MAY forego validating options and instead defer validation to the underlying implementation.

Drivers MAY deviate from the spelling of option names to conform to their language's naming conventions and implement options in an idiomatic way (e.g. keyword arguments, builder classes, etc.).

Drivers MAY use a native UUID type in place of a parameter or return type specified as a BSON binary with subtype 0x04 as described in Handling of Native UUID Types.

MongoClient Changes

class MongoClient {
   MongoClient(... autoEncryptionOpts: AutoEncryptionOpts);

   // Implementation details.
   private mongocrypt_t libmongocrypt_handle; // Handle to libmongocrypt.
   private Optional<MongoClient> mongocryptd_client; // Client to mongocryptd.
   private MongoClient keyvault_client; // Client used to run find on the key vault collection. This is either an external MongoClient, the parent MongoClient, or internal_client.
   private MongoClient metadata_client; // Client used to run listCollections. This is either the parent MongoClient or internal_client.
   private Optional<MongoClient> internal_client; // An internal MongoClient. Created if no external keyVaultClient was set, or if a metadataClient is needed

   // Exposition-only, used for caching automatic Azure credentials. The cached credentials may be globally cached.
   private cachedAzureAccessToken?: AzureAccessToken;
   private azureAccessTokenExpireTime?: PointInTime;
}
class AutoEncryptionOpts {
   keyVaultClient: Optional<MongoClient>;
   keyVaultNamespace: String;
   kmsProviders: KMSProviders;
   schemaMap: Optional<Map<String, Document>>; // Maps namespace to a local schema
   bypassAutoEncryption: Optional<Boolean>; // Default false.
   extraOptions: Optional<Map<String, Value>>;
   tlsOptions?: KMSProvidersTLSOptions; // Maps KMS provider to TLS options.
   encryptedFieldsMap: Optional<Map<String, Document>>; // Maps namespace to encryptedFields.
   // bypassQueryAnalysis disables automatic analysis of outgoing commands.
   // Set bypassQueryAnalysis to true to use explicit encryption on indexed fields
   // without the MongoDB Enterprise Advanced licensed crypt_shared library.
   bypassQueryAnalysis: Optional<Boolean>; // Default false.
}

A MongoClient can be configured to automatically encrypt collection commands and decrypt results.

Drivers MUST document that auto encryption is an enterprise-only feature and that auto encryption only occurs on collection level operations by including the following in the driver documentation for AutoEncryptionOpts:

Automatic encryption is an enterprise only feature that only applies to operations on a collection. Automatic encryption is not supported for operations on a database or view, and operations that are not bypassed will result in error (see libmongocrypt: Auto Encryption Allow-List). To bypass automatic encryption for all operations, set bypassAutoEncryption=true in AutoEncryptionOpts.

Explicit encryption/decryption and automatic decryption is a community feature. A MongoClient configured with bypassAutoEncryption=true will still automatically decrypt.

Drivers MUST document that auto encryption requires the authenticated user to have the listCollections privilege action by including the following in the driver documentation for MongoClient:

Automatic encryption requires the authenticated user to have the listCollections privilege action.

See Why is client side encryption configured on a MongoClient?

keyVaultNamespace

The key vault collection namespace refers to a collection that contains all data keys used for encryption and decryption (aka the key vault collection). Data keys are stored as documents in a special MongoDB collection. Data keys are protected with encryption by a KMS provider (AWS KMS, Azure key vault, GCP KMS, a local master key, or KMIP).

keyVaultClient

The key vault collection is assumed to reside on the same MongoDB cluster as indicated by the connecting URI. But the optional keyVaultClient can be used to route data key queries to a separate MongoDB cluster.

If a keyVaultClient is not passed, and the parent MongoClient is configured with a limited maxPoolSize, the keyVaultClient is set to an internal MongoClient. See keyVaultClient, metadataClient, and the internal MongoClient for configuration behavior.

See What's the deal with metadataClient, keyVaultClient, and the internal client?

keyVaultClient, metadataClient, and the internal MongoClient

The following pseudo-code describes the configuration behavior for the three MongoClient objects:

def getOrCreateInternalClient (client, clientOpts):
   if client.internalClient != None:
      return client.internalClient
   internalClientOpts = copy(clientOpts)
   internalClientOpts.autoEncryptionOpts = None
   internalClientOpts.minPoolSize = 0
   client.internalClient = MongoClient (internalClientOpts)
   return client.internalClient

def configureAutoEncryptionClients (client, clientOpts):
   if clientOpts.autoEncryptionOpts.keyVaultClient != None:
      client.keyVaultClient = clientOpts.autoEncryptionOpts.keyVaultClient
   elif clientOpts.maxPoolSize == 0:
      client.keyVaultClient = client
   else:
      client.keyVaultClient = getOrCreateInternalClient (client, clientOpts)

   if clientOpts.autoEncryptionOpts.bypassAutomaticEncryption:
      client.metadataClient = None
   elif clientOpts.maxPoolSize == 0:
      client.metadataClient = client
   else:
      client.metadataClient = getOrCreateInternalClient (client, clientOpts)

Configuring the internal MongoClient MUST match the parent MongoClient, except minPoolSize is set to 0 and AutoEncryptionOpts is omitted. This includes copying the options and host information from the URI, and other non-URI configuration (monitoring callbacks, stable API, etc.).

Drivers MUST document that an additional MongoClient may be created, using the following as a template:

If a MongoClient with a limited connection pool size (i.e. a non-zero maxPoolSize) is configured with AutoEncryptionOpts, a separate internal MongoClient is created if any of the following are true:

  • AutoEncryptionOpts.keyVaultClient is not passed.
  • AutoEncryptionOpts.bypassAutomaticEncryption is false.

If an internal MongoClient is created, it is configured with the same options as the parent MongoClient except minPoolSize is set to 0 and AutoEncryptionOpts is omitted.

See What's the deal with metadataClient, keyVaultClient, and the internal client?

kmsProviders

The kmsProviders property may be specified on ClientEncryptionOpts or AutoEncryptionOpts. Multiple KMS providers may be specified, each using a specific property on the KMSProviders object. The options differ for each KMS provider type. The "local" KMS provider type is configured with master key material. The external providers are configured with credentials to authenticate.

Throughout this document, the KMS provider is annotated as KMSProvider, but this name is for exposition only: drivers MUST accept arbitrary strings at runtime for forward-compatibility.

interface KMSProviders {
   aws?: AWSKMSOptions | { /* Empty (See "Automatic Credentials") */ };
   azure?: AzureKMSOptions | { /* Empty (See "Automatic Credentials") */ };
   gcp?: GCPKMSOptions | { /* Empty (See "Automatic Credentials") */ };
   local?: LocalKMSOptions;
   kmip?: KMIPKMSOptions;

   // KMS providers may be specified with an optional name suffix separated by a colon.
   // Named KMS providers do not support "Automatic Credentials".
   // Note: the named KMS providers strings below are not valid Typescript regexes. They are intended for exposition only.
   "^aws:[a-zA-Z0-9_]+$"?: AWSKMSOptions;
   "^azure:[a-zA-Z0-9_]+$"?: AzureKMSOptions;
   "^gcp:[a-zA-Z0-9_]+$"?: GCPKMSOptions;
   "^local:[a-zA-Z0-9_]+$"?: LocalKMSOptions;
   "^kmip:[a-zA-Z0-9_]+$"?: KMIPKMSOptions;
};

// KMSProvider is a string identifying a KMS provider. Note: For forward
// compatibility, any string should be accepted.
type KMSProvider = string;

interface AWSKMSOptions {
   accessKeyId: string;
   secretAccessKey: string;
   sessionToken?: string; // Required for temporary AWS credentials.
};

type AzureKMSOptions = AzureKMSCredentials | AzureAccessToken;

interface AzureKMSCredentials {
   tenantId: string;
   clientId: string;
   clientSecret: string;
   identityPlatformEndpoint?: string; // Defaults to login.microsoftonline.com
};

interface AzureAccessToken {
   accessToken: string;
};

type GCPKMSOptions = GCPKMSCredentials | GCPKMSAccessToken

interface GCPKMSCredentials {
   email: string;
   privateKey: byte[] | string; // May be passed as a base64 encoded string.
   endpoint?: string; // Defaults to oauth2.googleapis.com
};

interface GCPKMSAccessToken {
   accessToken: string;
}

interface LocalKMSOptions {
   key: byte[96] | string; // The master key used to encrypt/decrypt data keys. May be passed as a base64 encoded string.
};

interface KMIPKMSOptions {
   endpoint: string;
};

The following shows an example object of KMSProviders:

{
   # Pass credentials for AWS:
   "aws": { "accessKeyId": "foo", "secretAccessKey": "bar" },
   # Use an empty document to enable "Automatic Credentials" for Azure:
   "azure": {},
   # Pass an access token for GCP:
   "gcp": { "accessToken": "foo" },
   # Pass a 96 byte base64 encoded string for the local KMS provider.
   "local": { "key": "Mng0NCt4ZHVUYUJCa1kxNkVyNUR1QURhZ2h2UzR2d2RrZzh0cFBwM3R6NmdWMDFBMUN3YkQ5aXRRMkhGRGdQV09wOGVNYUMxT2k3NjZKelhaQmRCZGJkTXVyZG9uSjFk" }
   # Pass the endpoint for KMIP:
   "kmip": { "endpoint": "localhost:5698" }
   # Pass credentials for a different AWS account by appending a name.
   # Note: credentials with a name do not support "Automatic Credentials".
   "aws:name2": { "accessKeyId": "foo2", "secretAccessKey": "bar2" }
}

Automatic Credentials

Certain values of KMSProviders indicate a request by the user that the associated KMS providers should be populated lazily on-demand. The driver MUST be able to populate the respective options object on-demand if-and-only-if such respective credentials are needed. The request for KMS credentials will be indicated by libmongocrypt only once they are needed.

When such a state is detected, libmongocrypt will call back to the driver by entering the MONGOCRYPT_CTX_NEED_KMS_CREDENTIALS state, upon which the driver should fill in the KMS options automatically.

Automatic credentials are only supported for the KMS provider types aws, gcp, and azure. KMS providers containing a name (e.g. aws:myname) do not support automatic credentials. Attempting to configure a KMS provider with a name for automatic credentials results in a runtime error from libmongocrypt. See Why do on-demand KMS credentials not support named KMS providers?

[!NOTE] Drivers MUST NOT eagerly fill an empty KMS options property.

Once requested, drivers MUST create a new KMSProviders $P$ according to the following process:

  1. Let $K$ be the kmsProviders value provided by the user as part of the original ClientEncryptionOpts or AutoEncryptionOpts.
  2. Initialize $P$ to an empty KMSProviders object.
  3. If $K$ contains an aws property, and that property is an empty map:
    1. Attempt to obtain credentials $C$ from the environment using similar logic as is detailed in the obtaining-AWS-credentials section from the Driver Authentication specification, but ignoring the case of loading the credentials from a URI.
    2. If credentials $C$ were successfully loaded, create a new AWSKMSOptions map from $C$ and insert that map onto $P$ as the aws property.
  4. If $K$ contains a gcp property, and that property is an empty map:
    1. Attempt to obtain credentials $C$ from the environment using the logic detailed in Obtaining GCP Credentials.
    2. If credentials $C$ were successfully loaded, create a new GCPKMSOptions map from $C$ and insert that map onto $P$ as the gcp property.
  5. If $K$ contains an azure property, and that property is an empty map:
    1. If there is a cachedAzureAccessToken AND the duration until azureAccessTokenExpireTime is greater than one minute, insert cachedAzureAccessToken as the azure property on $P$.
    2. Otherwise:
      1. Let $t_0$ be the current time.
      2. Attempt to obtain an Azure VM Managed Identity Access Token $T$ as detailed in Obtaining an Access Token for Azure Key Vault.
      3. If a token $T$ with expire duration $d_{exp}$ was obtained successfully, create a new AzureAccessToken object with $T$ as the accessToken property. Insert that AzureAccessToken object into $P$ as the azure property. Record the generated AzureAccessToken in cachedAzureAccessToken. Record the azureAccessTokenExpireTime as $t_0 + d_{exp}$.
  6. Return $P$ as the additional KMS providers to libmongocrypt.
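
A rough sketch of this process in Python-style pseudocode (the load_* helpers are hypothetical stand-ins for the procedures referenced above, and caching details are simplified):

def build_on_demand_kms_providers(K):
    P = {}
    if K.get("aws") == {}:
        aws = load_aws_credentials_from_environment()   # per the Driver Authentication spec
        if aws is not None:
            P["aws"] = aws
    if K.get("gcp") == {}:
        token = load_gcp_access_token()                 # see "Obtaining GCP Credentials"
        if token is not None:
            P["gcp"] = {"accessToken": token}
    if K.get("azure") == {}:
        # Reuse cachedAzureAccessToken if it is valid for more than one minute,
        # otherwise fetch and cache a new token (see the Azure section below).
        P["azure"] = {"accessToken": load_azure_access_token()}
    return P
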
Obtaining GCP Credentials

Set HOST to metadata.google.internal.

Send an HTTP request to the URL http://<HOST>/computeMetadata/v1/instance/service-accounts/default/token with the header Metadata-Flavor: Google.

If the HTTP response code is not 200, return an error including the body of the HTTP response in the error message.

Parse the HTTP response body as JSON. If parsing fails, return an error including the body of the HTTP response in the error message.

Check the parsed JSON for the field "access_token". If "access_token" is not present, return an error including the body of the HTTP response in the error message.

Return "access_token" as the credential.

Obtaining an Access Token for Azure Key Vault

Virtual machines running on the Azure platform have one or more Managed Identities associated with them. From within the VM, an identity can be used by obtaining an access token via HTTP from the Azure Instance Metadata Service (IMDS). See this documentation for more information.

[!NOTE] To optimize for testability, it is recommended to implement an isolated abstraction for communication with IMDS. This will aid in the implementation of the prose tests of the communication with an IMDS server.

The below steps should be taken:

  1. Let $U$ be a new URL, initialized from the URL string "http://169.254.169.254/metadata/identity/oauth2/token"

  2. Add a query parameter api-version=2018-02-01 to $U$.

  3. Add a query parameter resource=https://vault.azure.net/ to $U$.

  4. Prepare an HTTP GET request $Req$ based on $U$.

    [!NOTE] All query parameters on $U$ should be appropriately percent-encoded

  5. Add HTTP headers Metadata: true and Accept: application/json to $Req$.

  6. Issue $Req$ to the Azure IMDS server 169.254.169.254:80. Let $Resp$ be the response from the server. If the HTTP response is not completely received within ten seconds, consider the request to have timed out, and return an error instead of an access token.

  7. If $Resp_{status} ≠ 200$, obtaining the access token has failed, and the HTTP response body of $Resp$ encodes information about the error that occurred. Return an error including the HTTP response body instead of an access token.

  8. Otherwise, let $J$ be the JSON document encoded in the HTTP response body of $Resp$.

  9. The result access token $T$ is given as the access_token string property of $J$. Return $T$ as the resulting access token.

  10. The resulting "expires in" duration $d_{exp}$ is a count of seconds given as an ASCII-encoded integer string expires_in property of $J$.

[!NOTE] If JSON decoding of $Resp$ fails, or the access_token property is absent from $J$, this is a protocol error from IMDS. Indicate this error to the requester of the access token.

[!NOTE] If an Azure VM has more than one managed identity, requesting an access token requires additional query parameters to disambiguate the request. For simplicity, these parameters are omitted, and only VMs that have a single managed identity are supported.
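
A condensed sketch of the request described above, using Python's standard library (timeout handling is approximate and error handling is omitted for brevity):

import json
import urllib.parse
import urllib.request

def azure_imds_access_token():
    # Steps 1-4: build the IMDS token URL with percent-encoded query parameters.
    params = urllib.parse.urlencode({
        "api-version": "2018-02-01",
        "resource": "https://vault.azure.net/",
    })
    url = "http://169.254.169.254/metadata/identity/oauth2/token?" + params
    # Step 5: required headers.
    req = urllib.request.Request(url, headers={"Metadata": "true", "Accept": "application/json"})
    # Step 6: issue the request with a ten-second limit.
    with urllib.request.urlopen(req, timeout=10) as resp:
        doc = json.loads(resp.read().decode("utf-8"))
    # Steps 9-10: extract the token and its "expires in" duration.
    return doc["access_token"], int(doc["expires_in"])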

KMS provider TLS options

Drivers MUST provide TLS options to configure TLS connections to KMS providers.

The TLS options SHOULD be consistent with the existing TLS options for MongoDB server TLS connections. The TLS options MUST enable setting a custom client certificate, equivalent to the tlsCertificateKeyFile URI option.

Drivers SHOULD provide API that is consistent with configuring TLS options for MongoDB server TLS connections. New API to support the TLS options MUST support future addition of KMS providers without requiring a driver API change. The following is an example:

class AutoEncryptionOpts {
   // setTLSOptions accepts a map of KMS providers to TLSOptions.
   // The TLSOptions apply to any TLS socket required to communicate
   // with the KMS provider.
   setTLSOptions (opts: KMSProvidersTLSOptions)
}

class ClientEncryptionOpts {
   // setTLSOptions accepts a map of KMS providers to TLSOptions.
   // The TLSOptions apply to any TLS socket required to communicate
   // with the KMS provider.
   setTLSOptions (opts: KMSProvidersTLSOptions)
}

Drivers MUST raise an error if the TLS options are set to disable TLS. The error MUST contain the message "TLS is required".

Drivers SHOULD raise an error if insecure TLS options are set. The error MUST contain the message "Insecure TLS options prohibited". This includes options equivalent to the following URI options:

  • tlsInsecure
  • tlsAllowInvalidCertificates
  • tlsAllowInvalidHostnames
  • tlsDisableCertificateRevocationCheck

Drivers MUST NOT raise an error if tlsDisableOCSPEndpointCheck is set. Setting tlsDisableOCSPEndpointCheck may prevent operation errors when OCSP responders are unresponsive.

See the OCSP specification for a description of the default values of tlsDisableOCSPEndpointCheck and tlsDisableCertificateRevocationCheck. Drivers MUST NOT modify the default value of tlsDisableOCSPEndpointCheck and tlsDisableCertificateRevocationCheck for KMS TLS connections.

See Why do KMS providers require TLS options?

schemaMap

Automatic encryption is configured with an "encrypt" field in a collection's JSONSchema. By default, a collection's JSONSchema is periodically polled with the listCollections command. But a JSONSchema may be specified locally with the schemaMap option. Drivers MUST document that a local schema is more secure and MUST include the following in the driver documentation for MongoClient:

Supplying a schemaMap provides more security than relying on JSON Schemas obtained from the server. It protects against a malicious server advertising a false JSON Schema, which could trick the client into sending unencrypted data that should be encrypted.

Drivers MUST document that a local schema only applies to client side encryption, and specifying JSON Schema features unrelated to encryption will result in error. Drivers MUST include the following in the driver documentation for MongoClient:

Schemas supplied in the schemaMap only apply to configuring automatic encryption for client side encryption. Other validation rules in the JSON schema will not be enforced by the driver and will result in an error.

bypassAutoEncryption

Drivers MUST disable auto encryption when the 'bypassAutoEncryption' option is true and not try to spawn mongocryptd. Automatic encryption may be disabled with the bypassAutoEncryption option. See Why is there a bypassAutoEncryption?.

extraOptions

The extraOptions relate to the mongocryptd process, an implementation detail described in the Implementation section:

{
   // Defaults to "mongodb://localhost:27020".
   mongocryptdURI: Optional<String>,

   // Defaults to false.
   mongocryptdBypassSpawn: Optional<Boolean>,

   // Used for spawning. Defaults to empty string and spawns mongocryptd from system path.
   mongocryptdSpawnPath: Optional<String>,

   // Passed when spawning mongocryptd. If omitted, this defaults to ["--idleShutdownTimeoutSecs=60"]
   mongocryptdSpawnArgs: Optional<Array[String]>

   // Override the path used to load the crypt_shared library
   cryptSharedLibPath: Optional<string>;

   // If 'true', refuse to continue encryption without a crypt_shared library
   cryptSharedLibRequired: boolean;
}

Drivers MUST implement extraOptions in a way that allows deprecating/removing options in the future without an API break, such as with a BSON document or map type instead of a struct type with fixed fields. See Why are extraOptions and kmsProviders maps?.

extraOptions.cryptSharedLibPath
  • Type: undefined | string
  • Default: undefined

Allow the user to specify an absolute path to a crypt_shared dynamic library to load. Refer:

extraOptions.cryptSharedLibRequired
  • Type: boolean
  • Default: false

If true, the driver MUST refuse to continue unless crypt_shared was loaded successfully.

If, after initializing a libmongocrypt_handle, crypt_shared is detected to be unavailable AND extraOptions.cryptSharedLibRequired is true, the driver MUST consider the libmongocrypt_handle to be invalid and return an error to the user. Refer:

encryptedFieldsMap

encryptedFieldsMap maps a collection namespace to an encryptedFields.

encryptedFieldsMap only applies to Queryable Encryption.

If a collection is present on both the encryptedFieldsMap and schemaMap, libmongocrypt will error on initialization. See fle2-and-fle1-error.

If a collection has a set of encrypted fields, the behavior of CreateCollection() and Collection.Drop() is altered. An additional helper, CreateEncryptedCollection() has been defined as a convenience wrapper around CreateCollection(). See fle2-createcollection-drop.

Automatic encryption in Queryable Encryption is configured with the encryptedFields.

If a collection is not present in the encryptedFieldsMap, a server-side collection encryptedFields may be used by libmongocrypt. Drivers MUST include the following in the documentation for MongoClient:

Supplying an encryptedFieldsMap provides more security than relying on an encryptedFields obtained from the server. It protects against a malicious server advertising a false encryptedFields.

bypassQueryAnalysis

See Why is bypassQueryAnalysis needed?.

Queryable Encryption Create and Drop Collection Helpers

A collection supporting Queryable Encryption requires an index and three additional collections.

Collection encryptedFields Lookup (GetEncryptedFields)

The convenience methods support the following lookup process for finding the encryptedFields associated with a collection.

Assume an exposition-only function $GetEncryptedFields(opts, collName, dbName, askDb)$, where $opts$ is a set of options, $collName$ is the name of the collection, $dbName$ is the name of the database associated with that collection, and $askDb$ is a boolean value. The resulting encryptedFields $EF$ is found by:

  1. Let $QualName$ be the string formed by joining $dbName$ and $collName$ with an ASCII dot ".".
  2. If $opts$ contains an "encryptedFields" property, then $EF$ is the value of that property.
  3. Otherwise, if AutoEncryptionOptions.encryptedFieldsMap contains an element named by $QualName$, then $EF$ is the value of that element.
  4. Otherwise, if $askDb$ is true:
    1. Issue a listCollections command against the database named by $dbName$, filtered by {name: <collName>}. Let the result be the document $L$.
    2. If $L$ contains an options document element, and that element contains an encryptedFields document element, $EF$ is $L$ ["options"]["encryptedFields"].
    3. Otherwise, $EF$ is not-found
  5. Otherwise, $EF$ is considered not-found.
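
A sketch of this lookup in Python-style pseudocode (the auto_encryption_opts and client parameters, and the helper name, are exposition-only additions):

def get_encrypted_fields(opts, coll_name, db_name, ask_db, auto_encryption_opts, client):
    qual_name = db_name + "." + coll_name
    if "encryptedFields" in opts:
        return opts["encryptedFields"]
    fields_map = auto_encryption_opts.encrypted_fields_map or {}
    if qual_name in fields_map:
        return fields_map[qual_name]
    if ask_db:
        reply = client[db_name].command({"listCollections": 1, "filter": {"name": coll_name}})
        for info in reply["cursor"]["firstBatch"]:
            return info.get("options", {}).get("encryptedFields")
    return None  # not-found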

Create Collection Helper

Drivers MUST support a BSON document option named encryptedFields for any create command helpers (e.g. Database.createCollection()). This option will be interpreted by the helper method and MUST be passed to the create command.

[!NOTE] Users are not expected to set the escCollection and ecocCollection options in encryptedFields. SERVER-74069 added server-side validation for those fields and no longer allows names to deviate from the following:

  • enxcol_.<collectionName>.esc
  • enxcol_.<collectionName>.ecoc

Drivers MUST NOT document the escCollection and ecocCollection options.

For a helper function, CreateCollection(collectionName, collectionOptions) with the name of the database associated as $dbName$, look up the encrypted fields encryptedFields for the collection as $GetEncryptedFields(collectionOptions, collectionName, dbName, false)$ (See here).

If a set of encryptedFields was found, then do the following operations. If any of the following operations error, the remaining operations are not attempted:

  • Check the wire version of the writable server. If the wire version is less than 21 (for server 7.0.0), return an error containing the error message: "Driver support of Queryable Encryption is incompatible with server. Upgrade server to use Queryable Encryption."
  • Create the collection with name encryptedFields["escCollection"] as a clustered collection using the options {clusteredIndex: {key: {_id: 1}, unique: true}}. If encryptedFields["escCollection"] is not set, use the collection name enxcol_.<collectionName>.esc. Creating this collection MUST NOT check if the collection namespace is in the AutoEncryptionOpts.encryptedFieldsMap.
  • Create the collection with name encryptedFields["ecocCollection"] as a clustered collection using the options {clusteredIndex: {key: {_id: 1}, unique: true}}. If encryptedFields["ecocCollection"] is not set, use the collection name enxcol_.<collectionName>.ecoc. Creating this collection MUST NOT check if the collection namespace is in the AutoEncryptionOpts.encryptedFieldsMap.
  • Create the collection collectionName with collectionOptions and the option encryptedFields set to the encryptedFields.
  • Create the index {"__safeContent__": 1} on collection collectionName.
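
Putting these operations together, a condensed Python-style sketch of the helper (PyMongo-like calls; the wire version check from the first bullet is omitted and the helper name is illustrative):

def create_collection_with_encrypted_fields(db, coll_name, coll_options, encrypted_fields):
    esc = encrypted_fields.get("escCollection", f"enxcol_.{coll_name}.esc")
    ecoc = encrypted_fields.get("ecocCollection", f"enxcol_.{coll_name}.ecoc")
    clustered = {"clusteredIndex": {"key": {"_id": 1}, "unique": True}}
    db.create_collection(esc, **clustered)
    db.create_collection(ecoc, **clustered)
    db.create_collection(coll_name, encryptedFields=encrypted_fields, **coll_options)
    db[coll_name].create_index([("__safeContent__", 1)])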

Create Encrypted Collection Helper

To support automatic generation of encryption data keys, a helper $CreateEncryptedCollection(CE, database, collName, collOpts, kmsProvider, masterKey)$ is defined, where $CE$ is a ClientEncryption object, $kmsProvider$ is a KMSProvider and $masterKey$ is equivalent to the $masterKey$ defined in DataKeyOpts. It has the following behavior:

  • If $collOpts$ contains an "encryptedFields" property, then $EF$ is the value of that property. Otherwise, report an error that there are no encryptedFields defined for the collection.
  • Let $EF'$ be a copy of $EF$. Update $EF'$ in the following manner:
    • Let $Fields$ be the "fields" element within $EF'$.
    • If $Fields$ is present and is an array value, then for each element $F$ of Fields:
      • If $F$ is not a document element, skip it.
      • Otherwise, if $F$ has a "keyId" named element $K$ and $K$ is a null value:
        • Create a DataKeyOpts named $dkOpts$ with the $masterKey$ argument.
        • Let $D$ be the result of CE.createDataKey(kmsProvider, dkOpts).
        • If generating $D$ resulted in an error $E$, the entire $CreateEncryptedCollection$ must now fail with error $E$. Return the partially-formed $EF'$ with the error so that the caller may know what datakeys have already been created by the helper.
        • Replace $K$ in $F$ with $D$.
  • Create a new set of options $collOpts'$ duplicating $collOpts$. Set the "encryptedFields" named element of $collOpts'$ to $EF'$.
  • Invoke the CreateCollection helper as $CreateCollection(database, collName, collOpts')$. Return the resulting collection and the generated $EF'$. If an error occurred, return the resulting $EF$ with the error so that the caller may know what datakeys have already been created by the helper.

Drivers MUST document that $createEncryptedCollection$ does not affect any auto encryption settings on existing MongoClients that are already configured with auto encryption. Users must configure auto encryption after creating the encrypted collection with the $createEncryptedCollection$ helper.
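
A non-normative sketch of the data-key allocation performed by this helper is shown below in Python-style pseudocode. The DataKeyOpts constructor, the CreateEncryptedCollectionError type, and the create_collection helper from the previous section are assumptions; real driver APIs differ.

import copy

# Sketch of the CreateEncryptedCollection helper described above.
def create_encrypted_collection(ce, database, coll_name, coll_opts, kms_provider, master_key):
    ef = coll_opts.get("encryptedFields")
    if ef is None:
        raise ValueError("no encryptedFields defined for the collection")
    ef_prime = copy.deepcopy(ef)
    for field in ef_prime.get("fields", []):
        if isinstance(field, dict) and field.get("keyId") is None:
            try:
                field["keyId"] = ce.create_data_key(kms_provider, DataKeyOpts(master_key=master_key))
            except Exception as exc:
                # Surface the partially-formed encryptedFields with the error so the
                # caller knows which data keys were already created.
                raise CreateEncryptedCollectionError(ef_prime) from exc
    coll_opts_prime = dict(coll_opts, encryptedFields=ef_prime)
    return create_collection(database, coll_name, **coll_opts_prime), ef_prime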

Drop Collection Helper

Drivers MUST support a BSON document option named encryptedFields for any drop command helpers (e.g. Database.dropCollection(), Collection.drop()). This option will only be interpreted by the helper method and MUST NOT be passed to the drop command.

[!NOTE] Users are not expected to set the escCollection and ecocCollection options in encryptedFields. SERVER-74069 added server-side validation for those fields and no longer allows names to deviate from the following:

  • enxcol_.<collectionName>.esc
  • enxcol_.<collectionName>.ecoc

Drivers SHOULD NOT document the escCollection and ecocCollection options.

For a helper function DropCollection(dropOptions) with associated collection named $collName$ and database named $dbName$, look up the encrypted fields encryptedFields as $GetEncryptedFields(dropOptions, collName, dbName, true)$ (See here).

If a set of encryptedFields was found, then perform the following operations. If any of the following operations error, the remaining operations are not attempted. A namespace not found error returned from the server (server error code 26) MUST be ignored:

  • Drop the collection with name encryptedFields["escCollection"]. If encryptedFields["escCollection"] is not set, use the collection name enxcol_.<collectionName>.esc.
  • Drop the collection with name encryptedFields["ecocCollection"]. If encryptedFields["ecocCollection"] is not set, use the collection name enxcol_.<collectionName>.ecoc.
  • Drop the collection collectionName.
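
A non-normative sketch of this flow in Python-style pseudocode follows; get_encrypted_fields and the OperationFailure error type are assumptions standing in for real driver APIs.

# Sketch of the DropCollection helper steps above.
def drop_collection(database, coll_name, **drop_opts):
    ef = get_encrypted_fields(drop_opts, coll_name, database.name, check_server=True)
    if ef is not None:
        for aux_name in (ef.get("escCollection", f"enxcol_.{coll_name}.esc"),
                         ef.get("ecocCollection", f"enxcol_.{coll_name}.ecoc")):
            try:
                database.drop_collection(aux_name)
            except OperationFailure as exc:
                if exc.code != 26:  # ignore "namespace not found"
                    raise
    database.drop_collection(coll_name)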

ClientEncryption

class ClientEncryption {
   ClientEncryption(opts: ClientEncryptionOpts);

   // The "Create Encrypted Collection" helper is a convenience function wrapping CreateCollection. It will
   // create a collection with encrypted fields, automatically allocating and assigning new data encryption
   // keys. It returns a handle to the new collection, as well as a document consisting of the generated
   // "encryptedFields" options. Refer to "Create Encrypted Collection Helper"
   createEncryptedCollection(database: Database, collName: string, collOpts, kmsProvider: KMSProvider, masterKey: Optional<Document>): [Collection, Document];

   // Creates a new key document and inserts it into the key vault collection.
   // Returns the _id of the created document as a UUID (BSON binary subtype 0x04).
   createDataKey(kmsProvider: KMSProvider, opts: DataKeyOpts | null): Binary;

   // Decrypts multiple data keys and (re-)encrypts them with a new masterKey, or with their current masterKey if a new one is not given.
   // The updated fields of each rewrapped data key are updated in the key vault collection as part of a single bulk write operation.
   // If no data key matches the given filter, no bulk write operation is executed.
   // Returns a RewrapManyDataKeyResult.
   rewrapManyDataKey(filter: Document, opts: RewrapManyDataKeyOpts | null): RewrapManyDataKeyResult;

   // Removes the key document with the given UUID (BSON binary subtype 0x04) from the key vault collection.
   // Returns the result of the internal deleteOne() operation on the key vault collection.
   deleteKey(id: Binary): DeleteResult;

   // Finds a single key document with the given UUID (BSON binary subtype 0x04).
   // Returns the result of the internal find() operation on the key vault collection.
   getKey(id: Binary): Optional<Document>;

   // Finds all documents in the key vault collection.
   // Returns the result of the internal find() operation on the key vault collection.
   getKeys(): Iterable<Document>;

   // Adds a keyAltName to the keyAltNames array of the key document in the key vault collection with the given UUID (BSON binary subtype 0x04).
   // Returns the previous version of the key document.
   addKeyAltName(id: Binary, keyAltName: String): Optional<Document>;

   // Removes a keyAltName from the keyAltNames array of the key document in the key vault collection with the given UUID (BSON binary subtype 0x04).
   // Returns the previous version of the key document.
   removeKeyAltName(id: Binary, keyAltName: String): Optional<Document>;

   // Returns a key document in the key vault collection with the given keyAltName.
   getKeyByAltName(keyAltName: String): Optional<Document>;

   // Encrypts a BsonValue with a given key and algorithm.
   // Returns an encrypted value (BSON binary of subtype 6). The underlying implementation MAY return an error for prohibited BSON values.
   encrypt(value: BsonValue, opts: EncryptOpts): Binary;

   // encryptExpression encrypts a Match Expression or Aggregate Expression to query a range index.
   // `expr` is expected to be a BSON document of one of the following forms:
   // 1. A Match Expression of this form:
   //   {$and: [{<field>: {$gt: <value1>}}, {<field>: {$lt: <value2> }}]}
   // 2. An Aggregate Expression of this form:
   //   {$and: [{$gt: [<fieldpath>, <value1>]}, {$lt: [<fieldpath>, <value2>]}]}
   // $gt may also be $gte. $lt may also be $lte.
   // Only supported when queryType is "range" and algorithm is "Range".
   encryptExpression(expr: Document, opts: EncryptOpts): Document;

   // Decrypts an encrypted value (BSON binary of subtype 6).
   // Returns the original BSON value.
   decrypt(value: Binary): BsonValue;

   // Implementation details.
   private mongocrypt_t libmongocrypt_handle;
   private MongoClient keyvault_client;
}
interface ClientEncryptionOpts {
   keyVaultClient: MongoClient;
   keyVaultNamespace: String;
   kmsProviders: KMSProviders;
   tlsOptions?: KMSProvidersTLSOptions; // Maps KMS provider to TLS options.
};

interface KMSProvidersTLSOptions {
   // Map the KMS providers to a set of TLS options
   [provider: KMSProvider]: TLSOptionsMap;
};

The ClientEncryption encapsulates explicit operations on a key vault collection that cannot be done directly on a MongoClient. Similar to configuring auto encryption on a MongoClient, it is constructed with a MongoClient (to a MongoDB cluster containing the key vault collection), KMS provider configuration, keyVaultNamespace, and tlsOptions. It provides an API for explicitly encrypting and decrypting values, and managing data keys in the key vault collection.

See Why do we have a separate top level type for ClientEncryption? and Why do we need to pass a client to create a ClientEncryption?

When implementing behavior and error handling for key vault functions, drivers SHOULD assume the presence of a unique index in the key vault collection on the keyAltNames field with a partial index filter for only documents where keyAltNames exists. Drivers MAY choose to not validate or enforce the existence of the unique index, but MUST still be capable of handling errors that such a unique index may yield.
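
For illustration only, such an index could be created as follows (Python-style pseudocode; the collection handle and index-creation keyword are assumptions and drivers are not required to create the index):

# Illustrative partial unique index on keyAltNames assumed by key management functions.
key_vault_collection.create_index(
    [("keyAltNames", 1)],
    unique=True,
    partialFilterExpression={"keyAltNames": {"$exists": True}},
)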

See Why aren't we creating a unique index in the key vault collection?

DataKeyOpts

class DataKeyOpts {
   masterKey: Optional<Document>
   keyAltNames: Optional<Array[String]> // An alternative to _id to reference a key.
   keyMaterial: Optional<BinData>
}

masterKey

The masterKey document identifies a KMS-specific key used to encrypt the new data key. If kmsProvider has KMS provider type "aws", the masterKey is required and has the following fields:

{
   region: String,
   key: String, // The Amazon Resource Name (ARN) to the AWS customer master key (CMK).
   endpoint: Optional<String> // An alternate host identifier to send KMS requests to. May include port number. Defaults to "kms.<region>.amazonaws.com"
}

If the kmsProvider has KMS provider type "azure", the masterKey is required and has the following fields:

{
   keyVaultEndpoint: String, // Host with optional port. Example: "example.vault.azure.net".
   keyName: String,
   keyVersion: Optional<String> // A specific version of the named key, defaults to using the key's primary version.
}

If the kmsProvider has KMS provider type "gcp", the masterKey is required and has the following fields:

{
   projectId: String,
   location: String,
   keyRing: String,
   keyName: String,
   keyVersion: Optional<String>, // A specific version of the named key, defaults to using the key's primary version.
   endpoint: Optional<String> // Host with optional port. Defaults to "cloudkms.googleapis.com".
}

If the kmsProvider has KMS provider type "local", the masterKey is not applicable.

If the kmsProvider has KMS provider type "kmip", the masterKey is required and has the following fields:

{
   keyId: Optional<String>, // keyId is the KMIP Unique Identifier to a 96 byte KMIP Secret Data managed object.
                            // If keyId is omitted, the driver creates a random 96 byte KMIP Secret Data managed object.
   endpoint: Optional<String>, // Host with optional port.
   delegated: Optional<Boolean> // If true, this key should be decrypted by the KMIP server.
}

Drivers MUST document the expected fields in the masterKey document for the "aws", "azure", "gcp", and "kmip" KMS provider types. Additionally, they MUST document that masterKey is required for these KMS provider types and is not optional.

The value of endpoint or keyVaultEndpoint is a host name with optional port number separated by a colon. E.g. "kms.us-east-1.amazonaws.com" or "kms.us-east-1.amazonaws.com:443". It is assumed that the host name is not an IP address or IP literal. Though drivers MUST NOT inspect the value of "endpoint" that a user sets when creating a data key, a driver will inspect it when connecting to KMS to determine a port number if present.
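
For illustration, a data key backed by an AWS CMK might be created as follows (Python-style pseudocode; the region and ARN are placeholders):

# Illustrative only: the masterKey fields follow the "aws" document above.
opts = DataKeyOpts(master_key={
   "region": "us-east-1",
   "key": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
})
key_id = client_encryption.create_data_key("aws", opts)  # returns a UUID (binary subtype 4)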

keyAltNames

An optional list of string alternate names used to reference a key. If a key is created with alternate names, then encryption may refer to the key by the unique alternate name instead of by _id. The following example shows creating and referring to a data key by alternate name:

opts = DataKeyOpts(keyAltNames=["name1"])
clientencryption.create_data_key ("local", opts)
# reference the key with the alternate name
opts = EncryptOpts(keyAltName="name1", algorithm="AEAD_AES_256_CBC_HMAC_SHA_512-Random")
clientencryption.encrypt("457-55-5462", opts)

keyMaterial

An optional BinData of 96 bytes to use as custom key material for the data key being created. If keyMaterial is given, the custom key material is used for encrypting and decrypting data. Otherwise, the key material for the new data key is generated from a cryptographically secure random device.

rewrapManyDataKey

If applicable, drivers MUST document that users must upgrade dependencies if necessary to avoid being impacted by MONGOCRYPT-464.

If applicable, drivers MUST return an error if rewrapManyDataKey is called with libmongocrypt 1.5.1 or 1.5.0.

RewrapManyDataKeyOpts

class RewrapManyDataKeyOpts {
   provider: String
   masterKey: Optional<Document>
}

The masterKey document MUST have the fields corresponding to the given provider as specified in masterKey. masterKey MUST NOT be given if it is not applicable for the given provider.

RewrapManyDataKeyResult

class RewrapManyDataKeyResult {
   bulkWriteResult: Optional<BulkWriteResult>;
}

bulkWriteResult is the result of the bulk write operation used to update the key vault collection with one or more rewrapped data keys. If rewrapManyDataKey() does not find any matching keys to rewrap, no bulk write operation will be executed and this field will be unset. This field may also be unset if the bulk write operation is unacknowledged as permitted by the CRUD API Spec.
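
For illustration, rewrapping all data keys currently wrapped by an "aws" master key with a new "azure" master key might look like the following (Python-style pseudocode; field names follow the masterKey documents above and the endpoint and key names are placeholders):

opts = RewrapManyDataKeyOpts(
   provider="azure",
   master_key={"keyVaultEndpoint": "example.vault.azure.net", "keyName": "example-key"},
)
result = client_encryption.rewrap_many_data_key({"masterKey.provider": "aws"}, opts)
if result.bulk_write_result is None:
   # No matching data keys were found, so no bulk write was executed.
   pass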

See Why does rewrapManyDataKey return RewrapManyDataKeyResult instead of BulkWriteResult?

EncryptOpts

class EncryptOpts {
   keyId : Optional<Binary>
   keyAltName: Optional<String>
   algorithm: String,
   contentionFactor: Optional<Int64>,
   queryType: Optional<String>
   rangeOpts: Optional<RangeOpts>
}

// RangeOpts specifies index options for a Queryable Encryption field supporting "range" queries.
// min, max, trimFactor, sparsity, and precision must match the values set in the encryptedFields of the destination collection.
// For double and decimal128, min/max/precision must all be set, or all be unset.
class RangeOpts {
   // min is required if precision is set.
   min: Optional<BSONValue>,
   // max is required if precision is set.
   max: Optional<BSONValue>,
   trimFactor: Int32,
   sparsity: Int64,
   // precision may only be set for double or decimal128.
   precision: Optional<Int32>
}

Explicit encryption requires a key and algorithm. Keys are either identified by _id or by alternate name. Exactly one is required.

keyId

Identifies a data key by _id. The value is a UUID (binary subtype 4).

keyAltName

Identifies a key vault collection document by 'keyAltName'.

algorithm

One of the strings:

  • "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic"
  • "AEAD_AES_256_CBC_HMAC_SHA_512-Random"
  • "Indexed"
  • "Unindexed"
  • "Range"

The result of explicit encryption with the "Indexed" or "Range" algorithm must be processed by the server to insert or query. Drivers MUST document the following behavior:

To insert or query with an "Indexed" or "Range" encrypted payload, use a MongoClient configured with AutoEncryptionOpts. AutoEncryptionOpts.bypassQueryAnalysis may be true. AutoEncryptionOpts.bypassAutoEncryption must be false.

contentionFactor

contentionFactor only applies when algorithm is "Indexed" or "Range". It is an error to set contentionFactor when algorithm is not "Indexed" or "Range".

queryType

One of the strings:

  • "equality"
  • "range"

queryType only applies when algorithm is "Indexed" or "Range". It is an error to set queryType when algorithm is not "Indexed" or "Range".

rangeOpts

rangeOpts only applies when algorithm is "Range". It is an error to set rangeOpts when algorithm is not "Range".
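
For illustration, encrypting a range Match Expression with encryptExpression might look like the following (Python-style pseudocode; the snake_case option names are assumptions, and the RangeOpts values must mirror the encryptedFields of the target collection):

expr = {"$and": [{"age": {"$gt": 23}}, {"age": {"$lt": 35}}]}
opts = EncryptOpts(
   key_id=key_id,
   algorithm="Range",
   query_type="range",
   contention_factor=0,
   range_opts=RangeOpts(min=0, max=200, sparsity=1, trim_factor=4),
)
encrypted_expr = client_encryption.encrypt_expression(expr, opts)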

User facing API: When Auto Encryption Fails

Auto encryption requires parsing the MongoDB query language client side (with the mongocryptd process or crypt_shared library). For unsupported operations, an exception will propagate to prevent the possibility of the client sending unencrypted data that should be encrypted. Drivers MUST include the following in the documentation for MongoClient:

If automatic encryption fails on an operation, use a MongoClient configured with bypassAutoEncryption=true and use ClientEncryption.encrypt() to manually encrypt values.

For example, currently an aggregate with $lookup into a foreign collection is unsupported (mongocryptd and crypt_shared return errors):

opts = AutoEncryptionOpts (
   key_vault_namespace="keyvault.datakeys",
   kms_providers=kms)
client = MongoClient(auto_encryption_opts=opts)
accounts = client.db.accounts
results = accounts.aggregate([
   {
      "$lookup": {
         "from": "people",
         "pipeline": [
            {
               "$match": {
                  "ssn": "457-55-5462"
               }
            }
         ],
         "as": "person"
      }
   }
]) # Raises an error

print (next(results)["person"]["ssn"])

In this case, the user should use explicit encryption on a client configured to bypass auto encryption. (Note, automatic decryption still occurs).

opts = AutoEncryptionOpts (
   key_vault_namespace="keyvault.datakeys",
   kms_providers=kms,
   bypass_auto_encryption=True)
client = MongoClient(auto_encryption_opts=opts)

opts = ClientEncryptionOpts (
   key_vault_client=client,
   key_vault_namespace="keyvault.datakeys",
   kms_providers=kms)
client_encryption = ClientEncryption(opts)

accounts = client.db.accounts
results = accounts.aggregate([
   {
      "$lookup": {
         "from": "people",
         "pipeline": [
            {
               "$match": {
                  "ssn": client_encryption.encrypt("457-55-5462", EncryptOpts(key_alt_name="ssn", algorithm="AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic"))
               }
            }
         ],
         "as": "person"
      }
   }
]) # Succeeds, since the sensitive value was explicitly encrypted client side

print (next(results)["person"]["ssn"])

User facing API: View limitations

Users cannot use auto encryption with views. Attempting to do so results in an exception. Drivers do not need to validate when the user is attempting to enable auto encryption on a view, but may defer to the underlying implementation.

Although auto encryption does not work on views, users may still use explicit encrypt and decrypt functions on views on a MongoClient without auto encryption enabled.

See Why do operations on views fail?

Implementation

Drivers MUST integrate with libmongocrypt. libmongocrypt exposes a simple state machine to perform operations. Follow the guide to integrating libmongocrypt.

Drivers SHOULD take a best-effort approach to store sensitive data securely when interacting with KMS since responses may include decrypted data key material (e.g. use secure malloc if available).

All errors from the MongoClient to mongocryptd or from the crypt_shared library MUST be distinguished in some way (e.g. exception type) to make it easier for users to distinguish when a command fails due to auto encryption limitations.

All errors from the MongoClient interacting with the key vault collection MUST be distinguished in some way (e.g. exception type) to make it easier for users to distinguish when a command fails due to behind-the-scenes operations required for encryption, decryption, or key management.

Drivers MUST apply timeouts to operations executed as part of client-side encryption per Client Side Operations Timeout: Client Side Encryption.

Integrating with libmongocrypt

Each ClientEncryption instance MUST have one libmongocrypt_handle.

See the libmongocrypt C API documentation for information on how to initialize, encrypt, and decrypt with libmongocrypt.

See the Guide to Integrating libmongocrypt for information about integrating the libmongocrypt library in a driver.

libmongocrypt exposes logging capabilities. If a driver provides a logging mechanism, it MUST enable this logging and integrate it with the driver's logging mechanism. E.g. if your driver exposes a logging callback that a user can set, it SHOULD be possible to get log messages from libmongocrypt.

Drivers MUST propagate errors from libmongocrypt in whatever way is idiomatic to the driver (exception, error object, etc.). These errors MUST be distinguished in some way (e.g. exception type) to make it easier for users to distinguish when a command fails due to client side encryption.

Enabling Command Marking with the crypt_shared Library

The MongoDB Enterprise distribution includes a dynamic library named mongo_crypt_v1 (with the appropriate file extension or filename suffix for the host platform). This library will be loaded by libmongocrypt when the mongocrypt_init function is invoked (from the libmongocrypt C API) based on the search criteria that are provided by the driver.

libmongocrypt allows the driver to specify an arbitrary list of directory search paths in which to search for the crypt_shared dynamic library. The user-facing API does not expose this full search path functionality. This extended search path customization is intended to facilitate driver testing with crypt_shared (Refer: Search Paths for Testing and Path Resolution Behavior).

[!NOTE] The driver MUST NOT manipulate or do any validation on the crypt_shared path options provided in extraOptions. They should be passed through to libmongocrypt unchanged.

Setting Search Paths

For the user-facing API the driver MUST append the literal string "$SYSTEM" to the search paths for the libmongocrypt_handle if bypassAutoEncryption is not set to true, and MUST NOT append to the search path if it is set to true or if the libmongocrypt instance is used for explicit encryption only (i.e. on the ClientEncryption class). For purposes of testing, a driver may use a different set of search paths.
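
A minimal sketch of this rule follows in Python-style pseudocode; append_crypt_shared_lib_search_path is an assumed binding to the corresponding libmongocrypt option.

# "$SYSTEM" is appended only for automatic encryption; explicit-encryption-only
# handles (ClientEncryption) get no search path.
if not bypass_auto_encryption and not explicit_encryption_only:
    libmongocrypt_handle.append_crypt_shared_lib_search_path("$SYSTEM")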

Explanation

The search paths array in libmongocrypt allows the driver to customize the way that libmongocrypt searches and loads the crypt_shared library. For testing purposes, the driver may change the paths that it appends for crypt_shared searching to better isolate the test execution from the ambient state of the host system.

Refer to: Path Resolution Behavior and Search Paths for Testing

Overriding the crypt_shared Library Path

If extraOptions.cryptSharedLibPath was specified by the user, the driver MUST set the crypt_shared path override on the libmongocrypt_handle.

[!NOTE] If a path override is set on a libmongocrypt_handle and libmongocrypt fails to load crypt_shared from that filepath, then that will result in a hard-error when initializing libmongocrypt.

Path Resolution Behavior

Drivers should include and note the following information regarding the behavior of crypt_shared path options in extraOptions:

  • If used, the override path must be given as a path to the crypt_shared dynamic library file itself, and not simply the directory that contains it.

  • If the given override path is a relative path and the first path component is the literal string "$ORIGIN", the "$ORIGIN" component will be replaced by the absolute path to the directory containing the libmongocrypt library that is performing the crypt_shared search. This behavior mimics the $ORIGIN behavior of the RUNPATH/RPATH properties of ELF executable files. This permits bundling the crypt_shared library along with libmongocrypt for creating portable application distributions without relying on an externally/globally available crypt_shared library.

    [!NOTE] No other RPATH/RUNPATH-style substitutions are available.

  • If the override path is given as a relative path, that path will be resolved relative to the working directory of the operating system process.

  • If an override path was specified and libmongocrypt fails to load crypt_shared from that filepath, libmongocrypt will fail to initialize with a hard-error. libmongocrypt will not attempt to search for crypt_shared in any other locations.

  • If libmongocrypt fails to load the crypt_shared library after searching the system (and no override path is specified), libmongocrypt will proceed without error and presume that crypt_shared is unavailable.

Search Paths for Testing

Drivers can make use of different search paths settings for testing purposes. These search paths use the following behavior:

  • For crypt_shared search paths, if a search path string is "$SYSTEM", then — instead of libmongocrypt searching for crypt_shared in a directory named "$SYSTEM" — libmongocrypt will defer to the operating system's own dynamic-library resolution mechanism when processing that search-path. For this reason, "$SYSTEM" is the only search path appended when the driver is used via the user-facing API.

  • The search paths also support the $ORIGIN substitution string.

  • Like with the override path, if a search path is given as a relative path, that path will be resolved relative to the working directory of the operating system process.

  • If no search paths are appended to the libmongocrypt_handle, the resulting search paths will be an empty array, effectively disabling crypt_shared searching.

    In this case, unless an override path is specified, libmongocrypt is guaranteed not to load crypt_shared.

Detecting crypt_shared Availability

crypt_shared availability can only be reliably detected after initializing the libmongocrypt_handle.

After initializing the libmongocrypt_handle, the driver can detect whether crypt_shared was successfully loaded by asking libmongocrypt for the crypt_shared version string. If the result is an empty string, libmongocrypt did not load crypt_shared and the driver must rely on mongocryptd to mark command documents for encryption.
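
A minimal sketch of that check, assuming a hypothetical crypt_shared_lib_version_string binding to the corresponding libmongocrypt query:

version = libmongocrypt_handle.crypt_shared_lib_version_string()
crypt_shared_loaded = version != ""  # empty string: fall back to mongocryptd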

"Disabling" crypt_shared

For purposes of testing, a driver can "disable" crypt_shared searching to ensure that mongocryptd is used instead, even if a crypt_shared library would be available.

As noted in Path Resolution Behavior, crypt_shared can be "disabled" on a libmongocrypt_handle by omission:

  1. Do not specify any search paths,
  2. AND do not specify a crypt_shared library override path (extraOptions.cryptSharedLibPath).

This will have the effect that libmongocrypt will not attempt to search or load crypt_shared during initialization.

At the current time, no user-facing API is exposed that allows users to opt-out of crypt_shared.

Loading crypt_shared Multiple Times

Due to implementation restrictions, there must not be more than one crypt_shared dynamic library loaded simultaneously in a single operating system process. libmongocrypt will do its best to enforce this at the time that it loads crypt_shared while initializing a libmongocrypt_handle. libmongocrypt will keep track of the open crypt_shared library globally, and any subsequent attempt to use a crypt_shared library that does not exactly match the filepath of the already-loaded crypt_shared will result in an error.

If at least one libmongocrypt_handle exists in an operating system process that has an open handle to a crypt_shared library, subsequent attempts to initialize an additional libmongocrypt_handle will fail if:

  1. The new libmongocrypt_handle wants crypt_shared (i.e. at least one search path was specified or an override path was specified).
  2. AND the initialization of that libmongocrypt_handle does not successfully find and load the same crypt_shared library that was loaded by the existing libmongocrypt_handle that is already using crypt_shared.

Drivers MUST document this limitation for users along with the documentation on the cryptShared* options in extraOptions by including the following:

All MongoClient objects in the same process should use the same setting for extraOptions.cryptSharedLibPath, as it is an error to load more than one crypt_shared dynamic library simultaneously in a single operating system process.

Once all open handles to a crypt_shared library are closed, it is possible to load a different crypt_shared library than was previously loaded. The restriction only applies to simultaneous open handles within a single process.

Managing mongocryptd

If the following conditions are met:

  • The MongoClient is configured for Client Side Encryption (i.e. bypassAutoEncryption is false),
  • AND the user has not disabled mongocryptd spawning (i.e. by setting extraOptions.mongocryptdBypassSpawn to true),
  • AND the crypt_shared library is unavailable (Refer: Detecting crypt_shared Availability),
  • AND the extraOptions.cryptSharedLibRequired option is false,

then mongocryptd MUST be spawned by the driver.

If the extraOptions.cryptSharedLibRequired option is true then the driver MUST NOT attempt to spawn or connect to mongocryptd.

[!NOTE] Since spawning mongocryptd requires checking whether crypt_shared is loaded, and checking whether crypt_shared is available can only be done after having initialized the libmongocrypt_handle, drivers will need to defer spawning mongocryptd until after initializing libmongocrypt and checking for crypt_shared.

Spawning mongocryptd

If a MongoClient is configured for Client Side Encryption (e.g. bypassAutoEncryption=false), then by default (unless mongocryptdBypassSpawn=true), mongocryptd MUST be spawned by the driver. Spawning MUST include the command line argument --idleShutdownTimeoutSecs. If the user does not supply one through extraOptions.mongocryptdSpawnArgs (which may be either in the form --idleShutdownTimeoutSecs=60 or as two consecutive arguments ["--idleShutdownTimeoutSecs", 60]), then the driver MUST append --idleShutdownTimeoutSecs=60 to the arguments. This tells mongocryptd to automatically terminate after 60 seconds of non-use. The stdout and stderr of the spawned process MUST NOT be exposed in the driver (e.g. redirect to /dev/null). Users can pass the argument --logpath to extraOptions.mongocryptdSpawnArgs if they need to inspect mongocryptd logs.
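
A non-normative sketch of the spawn-argument handling in Python-style pseudocode; mongocryptd_path stands in for extraOptions.mongocryptdSpawnPath and the option names are assumptions:

import subprocess

# Append the default idle shutdown timeout only if the user did not supply one,
# and discard the process output.
args = list(extra_options.get("mongocryptdSpawnArgs", []))
if not any(str(a).startswith("--idleShutdownTimeoutSecs") for a in args):
    args.append("--idleShutdownTimeoutSecs=60")
subprocess.Popen([mongocryptd_path, *args],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)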

Upon construction, the MongoClient MUST create a MongoClient to mongocryptd configured with serverSelectionTimeoutMS=10000.

If spawning is necessary, the driver MUST spawn mongocryptd whenever server selection on the MongoClient to mongocryptd fails. If the MongoClient fails to connect after spawning, the server selection error is propagated to the user.

Connecting to mongocryptd

If the crypt_shared library is loaded, the driver MUST NOT attempt to connect to mongocryptd. (Refer: Detecting crypt_shared Availability).

Single-threaded drivers MUST connect with serverSelectionTryOnce=false, connectTimeoutMS=10000, and MUST bypass cooldownMS when connecting to mongocryptd. See Why are serverSelectionTryOnce and cooldownMS disabled for single-threaded drivers connecting to mongocryptd?

If the MongoClient is configured with extraOptions.mongocryptdBypassSpawn=true, then the driver is not responsible for spawning mongocryptd. If server selection ever fails when connecting to mongocryptd, the server selection error is propagated to the user.

[!NOTE] A correctly-behaving driver will never attempt to connect to mongocryptd when extraOptions.cryptSharedLibRequired is set to true or crypt_shared is loaded.

ClientEncryption

The ClientEncryption type uses libmongocrypt to perform its operations. See the libmongocrypt API documentation for more information.

The ClientEncryption contains a MongoClient connected to the MongoDB cluster containing the key vault collection. It does not contain a MongoClient to mongocryptd.

See Why does ClientEncryption have key management functions when Drivers can use existing CRUD operations instead?.

Key Vault collection

The key vault collection is the specially designated collection containing encrypted data keys. There is no default collection (user must specify). The key vault collection is used for automatic and explicit encryption/decryption as well as key management functions.

For key management functions that require creating, updating, or deleting key documents in the key vault collection, the corresponding operations MUST be done with write concern majority.

For encryption/decryption and key management functions that require reading key documents from the key vault collection, the corresponding operations MUST be done with read concern majority.
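
For illustration, the key vault collection handle used internally might be obtained as follows (Python-style pseudocode; constructor and method names are assumptions):

key_vault_collection = key_vault_client["keyvault"].get_collection(
   "datakeys",
   read_concern=ReadConcern("majority"),
   write_concern=WriteConcern("majority"),
)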

Some key management functions may require multiple commands to complete their operation. Key management functions currently assume there is no concurrent access of the key vault collection being operated on. Concurrent access to a key vault collection being operated on may result in unexpected or undefined behavior. Support for concurrent key management may be considered for future work.

See Support sessions in key management functions.

Auto encrypt and decrypt

An encrypted MongoClient automatically encrypts values for filtering and decrypts results.

The driver MUST use libmongocrypt to initiate auto encryption and decryption. Create the BSON command meant to be sent over the wire, then pass that through the libmongocrypt state machine and use the returned BSON command in its place. The state machine is created with the libmongocrypt function mongocrypt_ctx_new and initialized with a mongocrypt_ctx_encrypt_init or mongocrypt_ctx_decrypt_init. See the libmongocrypt API documentation for more information.

An encrypted MongoClient configured with bypassAutoEncryption MUST NOT attempt automatic encryption for any command.

Otherwise, an encrypted MongoClient MUST attempt to auto encrypt all commands. Note, the underlying implementation may determine no encryption is necessary, or bypass many checks if the command is deemed to not possibly contain any encrypted data (e.g. ping). See the appendix section: libmongocrypt: Auto Encryption Allow-List.

An encrypted MongoClient MUST attempt to auto decrypt the results of all commands.

Drivers MUST raise an error when attempting to auto encrypt a command if the maxWireVersion is less than 8. The error message MUST contain "Auto-encryption requires a minimum MongoDB version of 4.2".

Note, all client side features (including all of ClientEncryption) are only supported against 4.2 or higher servers. However, errors are only raised for automatic encryption/decryption against older servers. See Why is a 4.2 server required?

Interaction with Command Monitoring

Unencrypted data MUST NOT appear in the data of any command monitoring events. Encryption MUST occur before generating a CommandStartedEvent, and decryption MUST occur after generating a CommandSucceededEvent.

Size limits for Write Commands

Automatic encryption requires the driver to serialize write commands as a single BSON document before automatically encrypting with libmongocrypt (analogous to constructing OP_MSG payload type 0, not a document sequence). Automatic encryption returns a single (possibly modified) BSON document as the command to send.

Because automatic encryption increases the size of commands, the driver MUST split bulk writes at a reduced size limit before undergoing automatic encryption. The write payload MUST be split at 2MiB (2097152). Where batch splitting occurs relative to automatic encryption is implementation-dependent.

Drivers MUST NOT reduce the size limits for a single write before automatic encryption, i.e. if a single document is larger than 2MiB (but less than maxBsonObjectSize), proceed with automatic encryption.
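
A non-normative sketch of the reduced split threshold in Python-style pseudocode; encode_bson is an assumed helper returning a document's BSON encoding:

SPLIT_LIMIT = 2 * 1024 * 1024  # 2097152 bytes

def split_batches(documents):
    batch, size = [], 0
    for doc in documents:
        doc_size = len(encode_bson(doc))
        # A single oversized document (under maxBsonObjectSize) still proceeds
        # to automatic encryption as its own batch.
        if batch and size + doc_size > SPLIT_LIMIT:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch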

Drivers MUST document the performance limitation of enabling client side encryption by including the following documentation in MongoClient:

Enabling Client Side Encryption reduces the maximum write batch size and may have a negative performance impact.

Appendix

Appendix terms

intent-to-encrypt marking
One of the data formats of BSON binary subtype 6, representing an encoded BSON document containing plaintext and metadata.

Key vault collection schema for data keys

Data keys are stored in the MongoDB key vault collection with the following schema:

| Name | Type | Description |
| ---- | ---- | ----------- |
| _id | UUID | A unique identifier for the key. |
| version | Int64 | A numeric identifier for the schema version of this document. Implicitly 0 if unset. |
| keyAltNames | Array of strings | Alternate names to search for keys by. Used for a per-document key scenario in support of GDPR scenarios. |
| keyMaterial | BinData | Encrypted data key material, BinData type General |
| creationDate | Date | The datetime the wrapped data key material was imported into the Key Database. |
| updateDate | Date | The datetime the wrapped data key material was last modified. On initial import, this value will be set to creationDate. |
| status | Int | 0 = enabled, 1 = disabled |
| masterKey | Document | Per provider master key definition, see below |

masterKey contents

| Name | Type | Description |
| ---- | ---- | ----------- |
| provider | "aws" | |
| key | String | AWS ARN. Only applicable for "aws" provider. |
| region | String | AWS Region that contains AWS ARN. Only applicable for "aws" provider. |
| endpoint | String | Alternate AWS endpoint (needed for FIPS endpoints) |

| Name | Type | Description |
| ---- | ---- | ----------- |
| provider | "azure" | |
| keyVaultEndpoint | String | Required key vault endpoint. (e.g. "example.vault.azure.net") |
| keyName | String | Required key name. |
| keyVersion | String | Optional key version. |

| Name | Type | Description |
| ---- | ---- | ----------- |
| provider | "gcp" | |
| projectId | String | Required project ID. |
| location | String | Required location name (e.g. "global") |
| keyRing | String | Required key ring name. |
| keyName | String | Required key name. |
| keyVersion | String | Optional key version. |
| endpoint | String | Optional, KMS URL, defaults to https://cloudkms.googleapis.com |

| Name | Type | Description |
| ---- | ---- | ----------- |
| provider | "local" | |

| Name | Type | Description |
| ---- | ---- | ----------- |
| provider | "kmip" | |
| endpoint | String | Optional. Defaults to kmip.endpoint from KMS providers. |
| delegated | Boolean | Optional. Defaults to false. |
| keyId | String | Required. keyId is the Unique Identifier to a 96 byte KMIP Secret Data managed object. |

Data keys are needed for encryption and decryption. They are identified in the intent-to-encrypt marking and ciphertext. Data keys may be retrieved by querying the "_id" with a UUID or by querying the "keyAltName" with a string.

Note, "status" is unused and is purely informational.

Example data key document

{
   "_id" : UUID("00000000-0000-0000-0000-000000000000"),
   "status" : 1,
   "masterKey" : {
      "provider" : "aws",
      "key" : "arn:aws...",
      "region" : "us-east-1"
   },
   "updateDate" : ISODate("2019-03-18T22:53:50.483Z"),
   "keyMaterial" : BinData(0,"AQICAH..."),
   "creationDate" : ISODate("2019-03-18T22:53:50.483Z"),
   "keyAltNames" : [
      "altname",
      "another_altname"
   ]
}

BSON binary subtype 6

BSON Binary Subtype 6 has a one byte leading identifier. The following is a quick reference.

struct {
   uint8 subtype;
   [more data - see individual type definitions]
}

Type 0: Intent-to-encrypt marking

struct {
   uint8 subtype = 0;
   [ bson ];
}

Types 1 and 2: Ciphertext

struct {
   uint8 subtype = (1 or 2);
   uint8 key_uuid[16];
   uint8 original_bson_type;
   uint32 ciphertext_length;
   uint8 ciphertext[ciphertext_length];
}

See Driver Spec: BSON Binary Subtype 6 for more information.

JSONSchema "encrypt"

The additional support for encryption in JSONSchema will be documented in the MongoDB manual. But the following is an example:

encrypt : {
   bsonType: "int"
   algorithm: "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic"
   keyId: [UUID(...)]
}

Each field is briefly described as follows:

| Name | Type | Description |
| ---- | ---- | ----------- |
| bsonType | string | The bsonType of the underlying encrypted field. |
| algorithm | string | "AEAD_AES_256_CBC_HMAC_SHA_512-Random" or "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic" |
| keyId | string or array of UUID | If string, it is a JSON pointer to a field with a scalar value identifying a key by keyAltName. If array, an array of eligible keys. |

libmongocrypt: Prohibitions and warnings

libmongocrypt MUST validate options. The following noteworthy cases are prohibited:

  • Explicit encryption using the deterministic algorithm on any of the following types:
    • array
    • document
    • code with scope
    • single value types: undefined, MinKey, MaxKey, Null
    • decimal128
    • double
    • bool
  • Explicit encryption on a BSON binary subtype 6.

The following cases MUST warn:

  • A local schema that does not include encrypted fields.

libmongocrypt: Collection info caching

libmongocrypt will cache the collection infos so encryption with remote schemas need not run listCollections every time. Collection infos (or lack thereof) are cached for one minute. This is not configurable. After expiration, subsequent attempts to encrypt will result in libmongocrypt requesting a new collection info.

A collection info result indicates if the collection is really a view. If it is, libmongocrypt returns an error since it does not know the schema of the underlying collection.

A collection info whose validator contains anything other than a single top-level $jsonSchema is considered an error.

libmongocrypt: Data key caching

Data keys are cached in libmongocrypt for one minute. This is not configurable, and there is no maximum number of keys in the cache. The data key material is stored securely. It will not be paged to disk and the memory will be properly zero'ed out after freeing.

libmongocrypt: Crypto implementation

libmongocrypt uses AEAD_AES_256_CBC_HMAC_SHA_512 for both "randomized" and "deterministic" encryption algorithms. It is described in this IETF document draft. For "randomized", libmongocrypt securely creates a random IV. For "deterministic", libmongocrypt securely creates a random IV key and any given encryption operation will derive the IV from the IV key and the field plaintext data.

libmongocrypt: Auto Encryption Allow-List

libmongocrypt determines whether or not the command requires encryption (i.e. is sent to mongocryptd) based on the table below. Commands not listed in this table will result in an error returned by libmongocrypt.

| Command | Action |
| ------- | ------ |
| aggregate (collection) | AUTOENCRYPT |
| count | AUTOENCRYPT |
| distinct | AUTOENCRYPT |
| delete | AUTOENCRYPT |
| find | AUTOENCRYPT |
| findAndModify | AUTOENCRYPT |
| getMore | BYPASS |
| insert | AUTOENCRYPT |
| update | AUTOENCRYPT |
| authenticate | BYPASS |
| getnonce | BYPASS |
| logout | BYPASS |
| hello | BYPASS |
| legacy hello | BYPASS |
| abortTransaction | BYPASS |
| commitTransaction | BYPASS |
| endSessions | BYPASS |
| startSession | BYPASS |
| create | BYPASS |
| createIndexes | BYPASS |
| createSearchIndexes | BYPASS |
| drop | BYPASS |
| dropDatabase | BYPASS |
| dropIndexes | BYPASS |
| dropSearchIndex | BYPASS |
| killCursors | BYPASS |
| listCollections | BYPASS |
| listDatabases | BYPASS |
| listIndexes | BYPASS |
| renameCollection | BYPASS |
| explain | AUTOENCRYPT |
| ping | BYPASS |
| killAllSessions | BYPASS |
| killSessions | BYPASS |
| killAllSessionsByPattern | BYPASS |
| refreshSessions | BYPASS |
| updateSearchIndex | BYPASS |

All AUTOENCRYPT commands are sent to mongocryptd, even if there is no JSONSchema. This is to ensure that commands that reference other collections (e.g. aggregate with $lookup) are handled properly.

Test Plan

See the README.md in the test directory.

Rationale

Design Principles

In addition to the Driver Mantras there are design principles specific to this project.

1. Make encryption easy to enable

Users should be able to enable encryption with minimal application change.

2. Minimize risk of exposing sensitive data

Storing or querying with unencrypted data can have dire consequences, because users may not be made aware immediately. When in doubt, we should error. It should be clear to the user when an operation gets encrypted and when one doesn't.

3. Minimize API

The first version of Client Side Encryption is to get signal. If it becomes popular, further improvements will be made (removing mongocryptd process, support for more queries, better performance). But the public API we provide now will stick around for the long-term. So let's keep it minimal to accomplish our goals.

How did we arrive at this API?

The API for client side encryption underwent multiple iterations during the design process.

Why is client side encryption configured on a MongoClient?

There is state that must be shared among all auto encrypted collections: the MongoClient to mongocryptd and the handle to libmongocrypt (because key caching + JSONSchema caching occurs in libmongocrypt).

Why not make auto encryption "opt-in"?

Because auto encryption is specified with a collection JSONSchema, we cannot auto encrypt database or client operations. So we cannot know if the user is passing sensitive data as a filter to a database/client change stream or a currentOp command for example. We also must always fail on view operations. We considered making auto encryption opt-in for collections. But we decided against this. It is much simpler for users to enable auto encryption without enumerating all collections with encryption in the common case of using remote JSONSchemas.

Note, this takes the trade-off of a better user experience over less safety. If a user mistakenly assumes that auto encryption occurs on a database, or on a collection doing a $(graph)lookup on a collection with auto encryption, they may end up sending unencrypted data.

Why are auto encrypted collections configured at the level of the MongoClient?

In a previous iteration of the design, we proposed enabling auto encryption only in db.getCollection() for better usability. But configuring auto encryption on the MongoClient better aligns with our design principles.

  • Safer. Users won't forget to enable auto encryption on one call to db.getCollection()
  • Easier. It only requires changing MongoClient code instead of every db.getCollection()

Why do we have a separate top level type for ClientEncryption?

The encrypt/decrypt and createDataKey functions were originally placed on MongoClient. But then we'd have an API that depends on optional configuration. A new top level type seemed warranted.

Why not pass the ClientEncryption into db.getCollection() to enable auto encryption?

As it is now, a ClientEncryption and a MongoClient cannot share state (libmongocrypt handle or MongoClient to mongocryptd). Foreseeably, they could share state if auto encryption was enabled by passing a ClientEncryption object like:

db.getCollection("coll", { autoEncrypt: { clientEncryption: clientEncryption } })

But this would require a MongoCollection to peek into the internals of a ClientEncryption object. This is messy and language dependent to implement and makes mocking out the ClientEncryption difficult for tests.

Why do we need to pass a client to create a ClientEncryption?

We need to support an external key vault collection (i.e. on another MongoDB cluster).

Why are extraOptions and kmsProviders maps?

Because we don't want AWS as part of the public types, and we don't want to expose mongocryptd and crypt_shared options as types, since mongocryptd is an implementation detail we'd like to hide as much as possible.

Why is there a bypassAutoEncryption?

bypassAutoEncryption still supports auto decryption. In cases where mongocryptd or crypt_shared cannot analyze a query, it's still useful to provide auto decryption. Just like static program analysis cannot always prove that a runtime invariant holds, mongocryptd/crypt_shared cannot always prove that a query will be safe with respect to encryption at runtime.

Why not require compatibility between mongocryptd and the server?

It isn't necessary, and it isn't unsafe, if mongocryptd parses an old version of MQL. Consider what happens when we add a new operator, $newOperator. If it properly encrypts a value in the $newOperator expression and sends it to an old server that doesn't have $newOperator, that's a mistake but not a security hole. Also, if the app passes a query with $newOperator to mongocryptd, and mongocryptd doesn't know about $newOperator, then it will error with "Unrecognized operator $newOperator" or something similar. Also a mistake, not a security hole.

So long as mongocryptd errors on unrecognized expressions, we don't need version compatibility between the mongocryptd and server for the sake of security.

Why cache keys?

We can't re-fetch the key on each operation; the performance goals for this project require us to cache. We do need a revocation mechanism, based upon periodic checking from the client. Initially this window will not be configurable.

Why require including a C library?

  • libmongocrypt deduplicates a lot of the work: JSONSchema cache, KMS message construction/parsing, key caching, and encryption/decryption.
  • Our "best-effort" of storing decrypted key material securely is best accomplished with a C library.
  • Having crypto done in one centralized C library makes it much easier to audit the crypto code.

Why warn if a local schema does not have encrypted fields?

Because that is the only use of local schemas. No other JSONSchema validators have any function. It's likely the user misconfigured encryption.

Why limit to one top-level $jsonSchema?

  • If we allow siblings, we can run into cases where the user specifies a top-level $and/$or or any arbitrary match-expression that could have nested $jsonSchema's.
  • Furthermore, the initial versions of mongocryptd and crypt_shared are only implementing query analysis when the validator consists of a single $jsonSchema predicate. This helps to simplify the mongocryptd and crypt_shared logic, and unifies it with the case where users configure their schemas directly in the driver.

Why not allow schemas to be configured at runtime?

We could have something like Collection::setEncryptionSchema(), but users can simply recreate the client to set new local schemas.

Why not support other AWS auth mechanisms?

We could potentially authenticate against AWS in a more sophisticated way, like read credentials from ~/.aws/credentials or assuming a role with EC2 instance metadata. But we've decided to implement the simplest authentication mechanism for v1, and defer more sophisticated ones as future work.

Why not pass a URI for external key vault collections instead of a MongoClient?

Some configuration on a MongoClient can only be done programmatically. E.g. in Java TLS configuration can only be done at runtime since it is abstracted in an SSLContext which cannot be accessed or altered by the driver.

What happened to multiple key vault collections?

An earlier revision of this specification supported multiple active key vaults with the notion of a "key vault collection alias". The key vault collection alias identified one of possibly many key vault collections that stored the key to decrypt the ciphertext. However, enforcing one key vault collection is a reasonable restriction for users. There isn't clear value in having multiple key vault collections. And having multiple active key vault collections is not necessary to migrate key vault collections.

Why auto encrypt a command instead of a wire protocol message?

  • It is significantly easier to implement communication in drivers if libmongocrypt gives back a BSON object that can be passed to run command.
  • mongocryptd cannot return document sequences, so it will return an array of documents anyway.
  • Though it is foreseeable that a driver can take the final result of encryption and turn it into an OP_MSG document sequence, it does not seem worthwhile to impose extra complexity in this case.

Why is a failure to decrypt always an error?

In the original design we proposed not to error if decryption failed due to a missing key. But it's not clear this functionality is needed, it goes against our principle of "Minimize API", and there's a simple recourse for users: bypass mongocryptd and explicitly decrypt instead.

Why are there no APM events for mongocryptd?

Though it may be helpful for debugging to expose APM events for mongocryptd, mongocryptd is an implementation detail we'd like to have the freedom to remove in the future. So we want to expose mongocryptd as little as possible.

Why aren't we creating a unique index in the key vault collection?

There should be a unique index on the keyAltNames field with a partial index filter for only documents where keyAltNames exists. Although GridFS automatically creates indexes as a convenience upon first write, it has been problematic before due to requiring the createIndex privilege, which a user might not have if they are just querying the key vault collection with find and adding keys with insert.

See also https://www.mongodb.com/docs/manual/reference/privilege-actions/#mongodb-authaction-createIndex.

Why do operations on views fail?

Currently, the driver does not resolve the entire view pipeline, which would be necessary to know the schema of the underlying collection. But, the driver does know whether or not a namespace is a view based on the response to listCollections. And the driver will run listCollections on all namespaces omitted from the schemaMap.

Why is a 4.2 server required?

Limiting to 4.2 reduces testing complexity. Additionally, the encrypt subdocument in JSON schema is only supported on 4.2 or higher servers. Although not technically necessary for client side encryption, it does provide a fallback against accidentally sending unencrypted data from misconfigured clients.

Why are serverSelectionTryOnce and cooldownMS disabled for single-threaded drivers connecting to mongocryptd?

By default, single threaded clients set serverSelectionTryOnce to true, which means server selection fails if a topology scan fails the first time (i.e. it will not make repeat attempts until serverSelectionTimeoutMS expires). This behavior is overridden since there may be a small delay between spawning mongocryptd (which the driver may be responsible for) and for mongocryptd to listen on sockets. See the Server Selection spec description of serverSelectionTryOnce.

Similarly, single threaded clients will by default wait for a 5 second cooldown period after failing to connect to a server before making another attempt. This means that if the first attempt to connect to mongocryptd fails, the user would observe a 5 second delay. This is not configurable in the URI, so it must be overridden internally. Since mongocryptd is a local process, there should only be a very short delay after spawning mongocryptd for it to start listening on sockets. See the SDAM spec description of cooldownMS.

Because single threaded drivers may exceed serverSelectionTimeoutMS by the duration of the topology scan, connectTimeoutMS is also reduced.

What's the deal with metadataClient, keyVaultClient, and the internal client?

When automatically encrypting a command, the driver runs:

  • a listCollections command to determine if the target collection has a remote schema. This uses the metadataClient.
  • a find against the key vault collection to fetch keys. This uses the keyVaultClient.

Why not reuse the parent MongoClient when maxPoolSize is limited?

These operations MUST NOT reuse the same connection pool as the parent MongoClient configured with automatic encryption to avoid possible deadlock situations.

Drivers supporting a connection pool (see CMAP specification) support an option for limiting the connection pool size: maxPoolSize.

Drivers need to check out a connection before serializing the command. If the listCollections or find command during automatic encryption uses the same connection pool as the parent MongoClient, the application is susceptible to deadlocks.

Using the same connection pool causes automatic encryption to check out multiple connections from the pool when processing a single command. If maxPoolSize=1, this is an immediate deadlock. If maxPoolSize=2, and two threads check out the first connection, they will deadlock attempting to check out the second.

Why is keyVaultClient an exposed option, but metadataClient private?

The keyVaultClient supports the use case where the key vault collection is stored on a MongoDB cluster separate from the data-bearing cluster.

The metadataClient is only used for listCollections against the data-bearing cluster.

listCollections responses are cached by libmongocrypt for one minute.

The use pattern of the metadataClient will likely greatly differ from the parent MongoClient. So it is configured with minPoolSize=0.

The metadataClient is not an exposed option because a user could misconfigure it to point to another MongoDB cluster, which could be a security risk.

Why is the metadataClient not needed if bypassAutoEncryption=true?

JSON schema data is only needed for automatic encryption but not for automatic decryption. listCollections is not run when bypassAutoEncryption is true, making a metadataClient unnecessary.

Why are commands sent to mongocryptd on collections without encrypted fields?

If a MongoClient is configured with automatic encryption, all commands listed as AUTOENCRYPT in libmongocrypt: Auto Encryption Allow-List undergo the automatic encryption process. Even if the collection does not have an associated schema, the command is sent to mongocryptd as a safeguard. A collection may not have encrypted fields, but a command on the collection may still have sensitive data as part of the command arguments. For example:

db.publicData.aggregate([
   {$lookup: {from: "privateData", localField: "_id", foreignField: "_id", as: "privateData"}},
   {$match: {"privateData.ssn": "123-45-6789"}},
])

The publicData collection does not have encrypted fields, but the privateData collection does. mongocryptd rejects an aggregate with $lookup since there is no mechanism to determine encrypted fields of joined collections.

Why do KMS providers require TLS options?

Drivers authenticate to KMIP servers with the client certificate presented in TLS connections.

This specification assumes that TLS connections to KMIP servers may require different TLS options than TLS connections to MongoDB servers.

KMIP support in the MongoDB server is a precedent. The server supports --kmipServerCAFile and --kmipClientCertificateFile to configure the encrypted storage engine KMIP. See https://www.mongodb.com/docs/manual/tutorial/configure-encryption/.

TLS options may be useful for the AWS, Azure, and GCP KMS providers in a case where the default trust store does not include the needed CA certificates.

Why is it an error to have an FLE 1 and Queryable Encryption field in the same collection?

There is no technical limitation to having a separate FLE field and Queryable Encryption field in the same collection. Prohibiting FLE and Queryable Encryption in the same collection reduces complexity. From the product perspective, a random FLE field and a non-queryable Queryable Encryption field have the same behavior and similar security guarantees. A deterministic FLE field leaks more information than a deterministic Queryable Encryption field. There is not a compelling use case to use both FLE and Queryable Encryption in the same collection.

Is it an error to set schemaMap and encryptedFieldsMap?

No. FLE and Queryable Encryption fields can coexist in different collections. The same collection cannot be in the encryptedFieldsMap and schemaMap. libmongocrypt will error if the same collection is specified in a schemaMap and encryptedFieldsMap.

Why is bypassQueryAnalysis needed?

Commands containing payloads for encrypted indexed fields require a top-level "encryptionInformation" field for the server processing. bypassQueryAnalysis enables the use case of Explicit Encryption without the MongoDB Enterprise Advanced licensed crypt_shared library or mongocryptd process.

Here is an example:

// No MongoDB Enterprise Advanced licensed 'crypt_shared' shared library.
aeo := options.AutoEncryption().
   SetKeyVaultNamespace("keyvault.datakeys").
   SetEncryptedFieldsMap(efcMap).
   SetKmsProviders(kmsProviders).
   SetBypassQueryAnalysis(true)

co := options.Client().
   ApplyURI(uri).
   SetAutoEncryptionOptions(aeo)

encryptedClient, err := mongo.Connect(ctx, co)
if err != nil {
   log.Fatalf("error in Connect: %v", err)
}
defer encryptedClient.Disconnect(ctx)

coll := encryptedClient.Database("foo").Collection("bar")
// Explicitly encrypt a Queryable Encryption Indexed Field.
// (ce is a ClientEncryption handle and val the value to encrypt, both created elsewhere.)
eo := options.Encrypt().
   SetEncryptIndexType(options.EncryptIndexEquality)
ciphertext, err := ce.Encrypt(ctx, val, eo)
// In InsertOne, libmongocrypt appends "encryptionInformation" to the insert command.
_, err = coll.InsertOne(ctx, bson.D{{"encryptedIndexed", ciphertext}})
if err != nil {
   log.Fatalf("error in InsertOne: %v", err)
}

A rejected alternative to adding bypassQueryAnalysis is to change the behavior of bypassAutoEncryption. bypassQueryAnalysis is distinct from bypassAutoEncryption. bypassAutoEncryption bypasses all of libmongocrypt for commands. Changing the behavior of bypassAutoEncryption could harm performance (e.g. by serializing as smaller documents).

Why does rewrapManyDataKey return RewrapManyDataKeyResult instead of BulkWriteResult?

rewrapManyDataKey is risky as it performs both a find and an update on the key vault collection. Using RewrapManyDataKeyResult allows new fields to be added in the future and makes it easier to deprecate the wrapped BulkWriteResult if necessary.

Why does ClientEncryption have key management functions when Drivers can use existing CRUD operations instead?

The key management functions are added for parity with mongosh to reduce friction between conducting operations using mongosh and a Driver. Their inclusion assumes their value for users outweighs their cost of implementation and maintenance.

Why are the QueryType and Algorithm options a String?

Using an enum to represent QueryType was considered and rejected in DRIVERS-2352.

A string value helps with future compatibility. When new values of QueryType and IndexType are added in libmongocrypt, users would only need to upgrade libmongocrypt, and not the driver, to use the new values.

Why is there an encryptExpression helper?

Querying a range index requires encrypting a lower bound (value for $gt or $gte) and upper bound (value for $lt or $lte) payload. A rejected alternative API is to encrypt the lower and upper bound payloads separately. The lower and upper bound payloads are distinct, but must share a matching UUID that is unique to the query. This API requires handling the UUID and distinguishing the upper and lower bounds. Here are examples showing possible errors:

uuid = UUID()
lOpts = EncryptOpts(
   keyId=keyId, algorithm="range", queryType="range", uuid=uuid, bound="lower")
lower = clientEncryption.encrypt(value=30, opts=lOpts)
uOpts = EncryptOpts(
   keyId=keyId, algorithm="range", queryType="range", uuid=uuid, bound="upper")
upper = clientEncryption.encrypt(value=40, opts=uOpts)

# Both bounds match UUID ... OK
db.coll.find_one({"age": {"$gt": lower, "$lt": upper}})

# Upper bound is used as a lower bound ... ERROR!
db.coll.find_one({"age": {"$gt": upper}})

lower2 = clientEncryption.encrypt(value=35, opts=lOpts)

# UUID is re-used ... ERROR!
db.coll.find_one({"$or": [
   {"age": {"$gt": lower, "$lt": upper}},
   {"age": {"$gt": lower2}}
]})

# UUID does not match between lower and upper bound ... ERROR!
db.coll.find_one({"age": {"$gt": lower2, "$lt": upper}})

Requiring an Aggregate Expression or Match Expression hides the UUID and handles both bounds.

Returning an Aggregate Expression or Match Expression as a BSON document motivated adding a new ClientEncryption.encryptExpression() helper. ClientEncryption.encrypt() cannot be reused since it returns a Binary.

To limit scope, only $and is supported. Support for other operators ($eq, $in) can be added in the future if desired.
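
As a rough sketch of the helper's shape (hypothetical Go-style binding; the EncryptExpression method name and the rangeOpts variable are assumptions for illustration, and actual driver APIs may differ):

// The whole $and match expression is passed in; the helper encrypts both
// bounds and handles the matching UUID internally.
expr := bson.D{{"$and", bson.A{
   bson.D{{"age", bson.D{{"$gt", 30}}}},
   bson.D{{"age", bson.D{{"$lt", 40}}}},
}}}
// rangeOpts would specify algorithm "Range" and queryType "range".
encryptedExpr, err := clientEncryption.EncryptExpression(ctx, expr, rangeOpts)
if err != nil {
   log.Fatalf("error in EncryptExpression: %v", err)
}
// The returned BSON document is used directly as the filter.
cursor, err := coll.Find(ctx, encryptedExpr)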

Why do on-demand KMS credentials not support named KMS providers?

libmongocrypt supports supplying KMS providers credentials on-demand. This enables obtaining credentials from the host machine. Supporting on-demand credentials was added as part of these projects: DRIVERS-2280 ("aws"), DRIVERS-2377 ("gcp"), DRIVERS-2411 ("azure").

On-demand credentials are primarily intended to support passing credentials assigned in an environment. Supporting on-demand credentials for more than one KMS provider of the same type is not expected to be useful. Supporting on-demand KMS credentials for named KMS providers would require added work in drivers to inspect the KMS providers when obtaining credentials, as well as additional test coverage, and can be considered as future work if needed.

What is the KMIP delegated option?

By default, the key encryption key is retrieved from the KMIP server and used to encrypt the data key locally. If the delegated option is set to true (recommended), the KMIP server instead performs the encryption and decryption of the data key itself, ensuring that the key encryption key never leaves the KMIP server.
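
For example, a hedged Go sketch (fragment only; clientEncryption is an existing ClientEncryption handle configured with a "kmip" KMS provider, and the masterKey fields shown are the keyId/delegated options described in this specification):

// Request delegated encryption: the KMIP server wraps the new data key itself,
// so the key encryption key never leaves the KMIP server.
dkOpts := options.DataKey().SetMasterKey(bson.D{
   {"delegated", true},
   // {"keyId", "1"}, // optional; if omitted the server creates a new key encryption key
})
keyID, err := clientEncryption.CreateDataKey(ctx, "kmip", dkOpts)
if err != nil {
   log.Fatalf("error in CreateDataKey: %v", err)
}
_ = keyID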

Future work

Make libmongocrypt cache window configurable

There's a principle at MongoDB, "no knobs", that we should honor wherever possible. Configurability is bad; mandating one well-chosen value is good. But if our default caching behavior is discovered to be unsuitable for some use cases, we may add configuration as part of future work.

APM events for encryption or key service interaction

APM events include the encrypted data before it is sent to the server, or before it is decrypted in a reply. Therefore, app developers can determine whether or not encryption occurred by inspecting the values in their command monitoring events. However, it might be useful to offer separate "encryption" and "decryption" events, as well as interactions with the key service.

Remove mongocryptd

A future version plans to remove the mongocryptd process and fold the logic into libmongocrypt using the crypt_shared library. Therefore, this spec mandates that drivers use libmongocrypt to abstract encryption logic, deduplicate work, and to provide a simpler future path to removing mongocryptd.

Support external key vault collection discovery

The only way to configure an external key vault collection is by passing a MongoClient.

For apps like Compass, where it may not be possible for users to configure this on the application side, there should ideally be enough information in the database to decrypt data (excluding KMS credentials, which are still passed as MongoClient options).

We may want to store a URI to the external key vault collection somewhere in the data bearing cluster, so clients can connect to the external key vault collection without additional user supplied configuration.

Batch listCollections requests on expired schema cache entries

Currently libmongocrypt will refresh schemas one at a time.

Add a maximum size for the JSONSchema/key cache

These caches are currently unbounded.

Recalculate message size bounds dynamically

Instead of using one reduced maxMessageSizeBytes, libmongocrypt could hide the complexity of properly resplitting bulk writes after encryption. It could use a simple back-off algorithm: try marking a command with maxMessageSizeBytes=24MB. If after marking we determine that's too large, try again with maxMessageSizeBytes=12MB and so on. And in the end libmongocrypt would create multiple OP_MSGs to send.
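
A minimal sketch of that back-off idea (illustrative only; markCommand stands in for libmongocrypt's marking/encryption step and the constants are assumptions):

// Halve the assumed maxMessageSizeBytes until the marked command fits, then
// let the caller split the batch into multiple OP_MSGs of at most that size.
func resplitLimit(cmd []byte, markCommand func([]byte, int) ([]byte, error)) (int, error) {
   const minLimit = 1 * 1024 * 1024 // assumed floor for splitting
   limit := 24 * 1024 * 1024
   for {
      marked, err := markCommand(cmd, limit)
      if err != nil {
         return 0, err
      }
      if len(marked) <= limit || limit <= minLimit {
         return limit, nil
      }
      limit /= 2
   }
}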

Support sessions in key management functions

Key management functions currently assume there are no concurrent accesses of the key vault collection being operated on. To support concurrent access of the key vault collection, the key management functions may be overloaded to take an explicit session parameter as described in the Drivers Sessions Specification.

Changelog

  • 2024-05-31: Replace rangePreview with range.

  • 2024-03-20: Add delegated option to "kmip" KMS provider

  • 2024-02-27: Migrated from reStructuredText to Markdown.

  • 2024-01-10: Add named KMS providers

  • 2023-03-30: Remove ECC collection

  • 2023-02-01: Replace DataKeyOpts with masterKey in createEncryptedCollection.

  • 2023-01-31: createEncryptedCollection does not check AutoEncryptionOptions for the encryptedFieldsMap.

  • 2023-01-30: Return encryptedFields on CreateCollection error.

  • 2023-01-26: Use GetEncryptedFields in more helpers.

  • 2022-11-30: Add Range.

  • 2022-11-28: Permit tlsDisableOCSPEndpointCheck in KMS TLS options.

  • 2022-11-27: Fix typo for references to cryptSharedLibRequired option.

  • 2022-11-10: Defined a CreateEncryptedCollection helper for creating new encryption keys automatically for the queryable encrypted fields in a new collection.

  • 2022-11-07: Reformat changelog.

  • 2022-11-04: Permit global cache for Azure credentials.

  • 2022-10-26: Do not connect to mongocryptd if shared library is loaded.

  • 2022-10-11: Specify a timeout on Azure IMDS HTTP requests and fix the resource URL

  • 2022-10-05: Remove spec front matter and versionadded RST macros (since spec version was removed)

  • 2022-09-26: Add behavior for automatic Azure KeyVault credentials for kmsProviders.

  • 2022-09-09: Prohibit rewrapManyDataKey with libmongocrypt <= 1.5.1.

  • 2022-07-20: Add behavior for automatic GCP credential loading in kmsProviders.

  • 2022-06-30: Add behavior for automatic AWS credential loading in kmsProviders.

  • 2022-06-29: Clarify bulk write operation expectations for rewrapManyDataKey().

  • 2022-06-27: Remove createKey.

  • 2022-06-24: Clean up kmsProviders to use more TypeScript-like type definitions.

  • 2022-06-23: Make RewrapManyDataKeyResult.bulkWriteResult optional.

  • 2022-06-16: Change QueryType to a string.

  • 2022-06-15: Clarify description of date fields in key documents.

  • 2022-06-08: Add Queryable Encryption to abstract.

  • 2022-06-02: Rename FLE 2 to Queryable Encryption

  • 2022-05-31: Rename csfle to crypt_shared

  • 2022-05-27: Define ECC, ECOC, and ESC acronyms within encryptedFields

  • 2022-05-26: Clarify how encryptedFields interacts with create and drop commands

  • 2022-05-24: Add key management API functions

  • 2022-05-18: Add createKey and rewrapManyDataKey

  • 2022-05-11: Update create state collections to use clustered collections. Drop data collection after state collection.

  • 2022-05-03: Add queryType, contentionFactor, and "Indexed" and "Unindexed" to algorithm.

  • 2022-04-29: Add bypassQueryAnalysis option

  • 2022-04-11: Document the usage of the new csfle library (Note: Later renamed to crypt_shared)

  • 2022-02-24: Rename Versioned API to Stable API

  • 2022-01-19: Require that timeouts be applied per the CSOT spec

  • 2021-11-04: Add "kmip" KMS provider

  • 2021-04-08: Updated to use hello and legacy hello

  • 2021-01-22: Add sessionToken option to "aws" KMS provider

  • 2020-12-12: Add metadataClient option and internal client

  • 2020-10-19: Add "azure" and "gcp" KMS providers

  • 2019-12-17: Clarified bypassAutoEncryption and managing mongocryptd

  • 2019-10-11: Add "endpoint" to AWS masterkey

BSON Binary Subtype 6

  • Status: Accepted
  • Minimum Server Version: 4.2

Abstract

Client side encryption requires a new binary subtype to store (1) encrypted ciphertext with metadata, and (2) binary markings indicating what values must be encrypted in a document. (1) is stored in the server, but (2) is only used in the communication protocol between libmongocrypt and mongocryptd described in Driver Spec: Client Side Encryption Encryption.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

This spec introduces a new BSON binary subtype with value 6. The binary has multiple formats determined by the first byte, but all are related to client side encryption. The first byte indicates the type and layout of the remaining data.

All values are represented in little endian. The payload is generally optimized for storage size. The exception is the intent-to-encrypt markings which are only used between libmongocrypt and mongocryptd and never persisted.

struct {
   uint8 subtype;
   [more data - see individual type definitions]
}
| Name | Type | Description |
| ---- | ---- | ----------- |
| subtype | uint8 | Type of blob format. |

| Type | Name | Blob Description |
| ---- | ---- | ---------------- |
| 0 | Intent-to-encrypt marking. | Contains unencrypted data that will be encrypted (by libmongocrypt) along with metadata describing how to encrypt it. |
| 1 | AEAD_AES_CBC_HMAC_SHA512 deterministic ciphertext. | The metadata and encrypted data for deterministic encrypted data. |
| 2 | AEAD_AES_CBC_HMAC_SHA512 randomized ciphertext. | The metadata and encrypted data for random encrypted data. |

Type 0: Intent-to-encrypt marking

struct {
   uint8 subtype = 0;
   [ bson ];
}

bson is the raw bytes of the following BSON document:

| Name | Long Name | Type | Description |
| ---- | --------- | ---- | ----------- |
| v | value | any | Value to encrypt. |
| a | algorithm | int32 | Encryption algorithm to use. Same as fle_blob_subtype: 1 for deterministic, 2 for randomized. |
| ki | keyId | UUID | Optional. Used to query the key vault by _id. If omitted, then "ka" must be specified. |
| ka | keyAltName | string | Optional. Used to query the key vault by keyAltName. If omitted, then "ki" must be specified. |

Types 1 and 2: Ciphertext

struct {
   uint8 subtype = (1 or 2);
   uint8 key_uuid[16];
   uint8 original_bson_type;
   uint8 ciphertext[ciphertext_length];
}
| Name | Description |
| ---- | ----------- |
| subtype | Type of blob format and encryption algorithm used. |
| key_uuid[16] | The value of _id for the key used to decrypt the ciphertext. |
| original_bson_type | The byte representing the original BSON type of the encrypted data. |
| ciphertext[] | The encrypted ciphertext (includes IV prepended). |
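
A hedged Go sketch of reading this layout from the raw bytes of a subtype 6 binary value (illustrative only; the function name is not part of this specification):

// Parses the ciphertext layout above (fle_blob_subtype 1 or 2).
func parseCiphertext(payload []byte) (blobSubtype byte, keyUUID [16]byte, originalType byte, ciphertext []byte, err error) {
   if len(payload) < 1+16+1 {
      return 0, keyUUID, 0, nil, fmt.Errorf("payload too short: %d bytes", len(payload))
   }
   blobSubtype = payload[0]
   if blobSubtype != 1 && blobSubtype != 2 {
      return 0, keyUUID, 0, nil, fmt.Errorf("not a ciphertext blob subtype: %d", blobSubtype)
   }
   copy(keyUUID[:], payload[1:17]) // _id of the key used to decrypt
   originalType = payload[17]      // original BSON type byte
   ciphertext = payload[18:]       // encrypted bytes, IV prepended
   return blobSubtype, keyUUID, originalType, ciphertext, nil
}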

Test Plan

Covered in Driver Spec: Client Side Encryption Encryption.

Design Rationale

Why not use a new BSON type?

An alternative to using a new binary subtype would be introducing a new BSON type. This would be a needless backwards breaking change. Since FLE is largely a client side feature, it should be possible to store encrypted data in old servers.

Plus, encrypted ciphertext is inherently a binary blob. Packing metadata inside isolates all of the encryption related data into one BSON value that can be treated as an opaque blob in most contexts.

Why not use separate BSON binary subtypes instead of a nested subtype?

If we used separate subtypes, we'd need to reserve three (and possibly more in the future) of our 124 remaining subtypes.

Why are intent-to-encrypt markings needed?

Intent-to-encrypt markings provide a simple way for mongocryptd to communicate what values need to be encrypted to libmongocrypt. Alternatively, mongocryptd could respond with a list of field paths. But field paths are difficult to make unambiguous, and even the query language is not always consistent.

What happened to the "key vault alias"?

In an earlier revision of this specification there was the notion of a "key vault alias". The key vault alias identified one of possibly many key vaults that stored the key to decrypt the ciphertext. However, enforcing one key vault is a reasonable restriction for users. Users can migrate from one key vault to another without the ciphertext data needing to include a key vault alias. If we find a future need for multiple key vaults, we can easily introduce a new format with the fle_blob_subtype.

Why distinguish between "deterministic" and "randomized" when they contain the same fields?

Deterministic and randomized ciphertext support different behavior. Deterministic ciphertext supports exact match queries but randomized does not.

Why is the original BSON type not encrypted?

Exposing the underlying BSON type gives some validation of the data that is encrypted. A JSONSchema on the server can validate that the underlying encrypted BSON type is correct.

Reference Implementation

libmongocrypt and mongocryptd will be the reference implementation of how BSON binary subtype 6 is used.

Security Implication

It would be a very bad security flaw if intent-to-encrypt markings were confused with ciphertexts. This could lead to a marking inadvertently being stored on a server – meaning that plaintext is stored where ciphertext should have been.

Therefore, the leading byte of the BSON binary subtype distinguishes between marking and ciphertext.

Changelog

  • 2024-03-04: Migrated from reStructuredText to Markdown.
  • 2022-10-05: Remove spec front matter and create changelog.


Command Logging and Monitoring

  • Status: Accepted
  • Minimum Server Version: 2.4



Specification

The command logging and monitoring specification defines a set of behaviour in the drivers for providing runtime information about commands in log messages as well as in events which users can consume programmatically, either directly or by integrating with third-party APM libraries.

Definitions

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 (https://www.ietf.org/rfc/rfc2119.txt).

Terms

Document

The term Document refers to the implementation in the driver's language of a BSON document.

Guidance

Documentation

The documentation provided in the code below is merely for driver authors and SHOULD NOT be taken as required documentation for the driver.

Messages and Events

All drivers MUST implement the specified event types as well as log messages.

Implementation details are noted in the comments when a specific implementation is required. Within each event and log message, all properties are REQUIRED unless noted otherwise.

Naming

All drivers MUST name types, properties, and log message values as defined in the following sections. Exceptions to this rule are noted in the appropriate section. Class and interface names may vary according to the driver and language best practices.

Publishing & Subscribing

The driver SHOULD publish events in a manner that is standard to the driver's language publish/subscribe patterns and is not strictly mandated in this specification.

Similarly, as described in the logging specification, the driver SHOULD emit log messages in a manner that is standard for the language.

Guarantees

The driver MUST guarantee that every CommandStartedEvent has either a correlating CommandSucceededEvent or CommandFailedEvent, and that every "command started" log message has either a correlating "command succeeded" log message or "command failed" log message.

The driver MUST guarantee that the requestId of the CommandStartedEvent and the corresponding CommandSucceededEvent or CommandFailedEvent is the same, and that the requestId of the "command started" log message and the corresponding "command succeeded" or "command failed" log message is the same.

Unacknowledged/Acknowledged Writes

Unacknowledged writes must provide a CommandSucceededEvent and a "command succeeded" log message with a { ok: 1 } reply.

A non-default write concern MUST be included in the published command. The default write concern is not required to be included.

Succeeded or Failed

Commands that execute on the server and return a status of { ok: 1.0 } are considered successful commands and MUST generate a CommandSucceededEvent and "command succeeded" log message. Commands that have write errors are included since the actual command did succeed; only the writes failed.

Error Handling

If an exception occurs while sending the operation to the server, the driver MUST generate a CommandFailedEvent and "command failed" log message with the exception or message, and re-raise the exception.

Bulk Writes

This specification defines the monitoring and logging of individual commands and in that respect MUST generate events and log messages for each command a bulk write executes. Each of these commands, however, must be linked together via the same operationId.

Implementation Notes

When a driver sends an OP_MSG with a document sequence, it MUST include the document sequence as a BSON array in CommandStartedEvent.command. The array's field name MUST be the OP_MSG sequence identifier. For example, if the driver sends an update command using OP_MSG, and sends a document sequence as a separate section of payload type 1 with identifier updates, the driver MUST include the documents as a BSON array in CommandStartedEvent.command with field name updates.

See "Why are document sequences included as BSON arrays?" in the rationale_.

Rationale

1. Why are commands with { ok: 1 } treated as successful and { ok: 0 } as failed?

The specification is consistent with what the server deems as a successful or failed command and reports this as so. This also allows for server changes around this behaviour in the future to require no change in the drivers to continue to be compliant.

The command listener API is responsible only for receiving and handling events sent from the lowest level of the driver, and is only about informing listeners about what commands are sent and what replies are received. As such, it would be inappropriate at this level for a driver to execute custom logic around particular commands to determine what failure or success means for a particular command. Implementers of the API are free to handle these events as they see fit, which may include code that further interprets replies to specific commands based on the presence or absence of other fields in the reply beyond the ok field.

2. Why are document sequences included as BSON arrays?

The OP_MSG wire protocol was introduced in MongoDB 3.6, with document sequences as an optimization for bulk writes. We have chosen to represent these OP_MSGs as single command documents for now, until a need for a more accurate (and perhaps better-performing) command monitoring API for document sequences has been demonstrated.

3. Why is BSON serialization and deserialization optional to include in durations?

Different drivers will serialize and deserialize BSON at different levels of the driver architecture. For example, some parts of a command (e.g. inserted document structs) could be pre-encoded early into a "raw" BSON form, and the final command with late additions like a session ID could be encoded just before putting it on the wire.

Rather than specify a duration rule that would be hard to satisfy consistently, we allow duration to include BSON serialization/deserialization or not based on the architecture needs of each driver.

Security

Some commands and replies will contain sensitive data relating to authentication.

In order to not risk leaking this data to external sources or logs, for these commands:

  • The "command" field in CommandStartedEvent and "command started" log messages MUST be replaced with an empty BSON document.

  • The "reply" field in CommandSucceededEvent and "command succeeded" log messages MUST be replaced with an empty BSON document.

  • If the error is a server-side error, the "failure" field in CommandFailedEvent and "command failed" log messages MUST have all fields besides the following redacted:

    • code
    • codeName
    • errorLabels

    The exact implementation of redaction is flexible depending on the type the driver uses to represent a failure in these events and log messages. For example, a driver could choose to set all properties besides these on an error object to null. Alternatively, a driver that uses strings to represent failures could replace relevant portions of the string with "REDACTED".

The list of sensitive commands is as follows:

  • authenticate
  • saslStart
  • saslContinue
  • getnonce
  • createUser
  • updateUser
  • copydbgetnonce
  • copydbsaslstart
  • copydb
  • hello (or legacy hello) when speculativeAuthenticate is present

See the MongoDB Handshake spec (https://github.com/mongodb/specifications/blob/master/source/mongodb-handshake/handshake.rst) for more information on hello and legacy hello. Note that legacy hello has two different letter casings that must be taken into account. See the previously mentioned MongoDB Handshake spec for details.

Events API

See the Load Balancer Specification for details on the serviceId field.


interface CommandStartedEvent {

/**
 * Returns the command.
 */
command: Document;

/**
 * Returns the database name.
 */
databaseName: String;

/**
 * Returns the command name.
 */
commandName: String;

/**
 * Returns the driver generated request id.
 */
requestId: Int64;

/**
 * Returns the driver generated operation id. This is used to link events together such
 * as bulk write operations. OPTIONAL.
 */
operationId: Int64;

/**
 * Returns the connection id for the command. For languages that do not have this,
 * this MUST return the driver equivalent which MUST include the server address and port.
 * The name of this field is flexible to match the object that is returned from the driver.
 */
connectionId: ConnectionId;

/**
 * Returns the server connection id for the command. The server connection id is distinct from
 * the connection id and is returned by the hello or legacy hello response as "connectionId"
 * from the server on 4.2+. Drivers MUST use an Int64 to represent the server connection ID value.
 */
serverConnectionId: Optional<Int64>;

/**
 * Returns the service id for the command when the driver is in load balancer mode.
 * For drivers that wish to include this in their ConnectionId object, this field is
 * optional.
 */
serviceId: Optional<ObjectId>;

}

interface CommandSucceededEvent {

/**
 * Returns the execution time of the event in the highest possible resolution for the platform.
 * The calculated value MUST be the time to send the message and receive the reply from the server
 * and MAY include BSON serialization and/or deserialization. The name can imply the units in which the
 * value is returned, i.e. durationMS, durationNanos.
 */
duration: Int64;

/**
 * Returns the command reply.
 */
reply: Document;

/**
 * Returns the command name.
 */
commandName: String;

/**
 * Returns the database name.
 */
databaseName: String;

/**
 * Returns the driver generated request id.
 */
requestId: Int64;

/**
 * Returns the driver generated operation id. This is used to link events together such
 * as bulk write operations. OPTIONAL.
 */
operationId: Int64;

/**
 * Returns the connection id for the command. For languages that do not have this,
 * this MUST return the driver equivalent which MUST include the server address and port.
 * The name of this field is flexible to match the object that is returned from the driver.
 */
connectionId: ConnectionId;

/**
 * Returns the server connection id for the command. The server connection id is distinct from
 * the connection id and is returned by the hello or legacy hello response as "connectionId"
 * from the server on 4.2+. Drivers MUST use an Int64 to represent the server connection ID value.
 */
serverConnectionId: Optional<Int64>;

/**
 * Returns the service id for the command when the driver is in load balancer mode.
 * For drivers that wish to include this in their ConnectionId object, this field is
 * optional.
 */
serviceId: Optional<ObjectId>;

}

interface CommandFailedEvent {

/**
 * Returns the execution time of the event in the highest possible resolution for the platform.
 * The calculated value MUST be the time to send the message and receive the reply from the server
 * and MAY include BSON serialization and/or deserialization. The name can imply the units in which the
 * value is returned, i.e. durationMS, durationNanos.
 */
duration: Int64;

/**
 * Returns the command name.
 */
commandName: String;

/**
 * Returns the database name.
 */
databaseName: String;

/**
 * Returns the failure. Based on the language, this SHOULD be a message string, exception
 * object, or error document.
 */
failure: String,Exception,Document;

/**
 * Returns the client generated request id.
 */
requestId: Int64;

/**
 * Returns the driver generated operation id. This is used to link events together such
 * as bulk write operations. OPTIONAL.
 */
operationId: Int64;

/**
 * Returns the connection id for the command. For languages that do not have this,
 * this MUST return the driver equivalent which MUST include the server address and port.
 * The name of this field is flexible to match the object that is returned from the driver.
 */
connectionId: ConnectionId;

/**
 * Returns the server connection id for the command. The server connection id is distinct from
 * the connection id and is returned by the hello or legacy hello response as "connectionId"
 * from the server on 4.2+. Drivers MUST use an Int64 to represent the server connection ID value.
 */
serverConnectionId: Optional<Int64>;

/**
 * Returns the service id for the command when the driver is in load balancer mode.
 * For drivers that wish to include this in their ConnectionId object, this field is
 * optional.
 */
serviceId: Optional<ObjectId>;

}

Log Messages

Please refer to the logging specification for details on logging implementations in general, including log levels, log components, and structured versus unstructured logging.

Drivers MUST support logging of command information via the following types of log messages. These messages MUST be logged at Debug level and use the command log component.

The log messages are intended to match the information contained in the events above. Drivers MAY implement command logging support via an event subscriber if it is convenient to do so.
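
For example, a driver that exposes the events above could back its log messages with a subscriber; a minimal Go sketch (assuming the Go driver's event.CommandMonitor and the standard log package; the message text and logged fields here are illustrative, not normative):

// Emit "command started"/"succeeded"/"failed" log output from command events.
monitor := &event.CommandMonitor{
   Started: func(_ context.Context, e *event.CommandStartedEvent) {
      log.Printf("Command %q started on database %q. Request ID: %d. Command: %v",
         e.CommandName, e.DatabaseName, e.RequestID, e.Command)
   },
   Succeeded: func(_ context.Context, e *event.CommandSucceededEvent) {
      log.Printf("Command %q succeeded. Request ID: %d", e.CommandName, e.RequestID)
   },
   Failed: func(_ context.Context, e *event.CommandFailedEvent) {
      log.Printf("Command %q failed. Request ID: %d. Error: %v",
         e.CommandName, e.RequestID, e.Failure)
   },
}
client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri).SetMonitor(monitor))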

The types used in the structured message definitions below are demonstrative, and drivers MAY use similar types instead so long as the information is present (e.g. a double instead of an integer, or a string instead of an integer if the structured logging framework does not support numeric types.)

Drivers MUST NOT emit command log messages for commands issued as part of the handshake with the server, or heartbeat commands issued by server monitors.

Common Fields

The following key-value pairs MUST be included in all command messages:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| commandName | String | The command name. |
| databaseName | String | The database name. |
| requestId | Int | The driver-generated request ID. |
| operationId | Int | The driver-generated operation ID. Optional; only present if the driver generated operation IDs and this command has one. |
| driverConnectionId | Int64 | The driver's ID for the connection used for the command. Note this is NOT the same as CommandStartedEvent.connectionId defined above, but refers to the connectionId defined in the connection monitoring and pooling specification. Unlike CommandStartedEvent.connectionId this field MUST NOT contain the host/port; that information MUST be in the following fields, serverHost and serverPort. This field is optional for drivers that do not implement CMAP if they do have an equivalent concept of a connection ID. |
| serverHost | String | The hostname or IP address for the server the command is being run on. |
| serverPort | Int | The port for the server the command is being run on. Optional; not present for Unix domain sockets. When the user does not specify a port and the default (27017) is used, the driver SHOULD include it here. |
| serverConnectionId | Int64 | The server's ID for the connection used for the command. Optional; only present for server versions 4.2+. NOTE: Existing drivers may represent this as an Int32 already. For strongly-typed languages, you may have to introduce a new Int64 field and deprecate the old Int32 field. The next major version should remove the old Int32 field. |
| serviceId | String | The hex string representation of the service ID for the command. Optional; only present when the driver is in load balancer mode. |

Command Started Message

In addition to the common fields, command started messages MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Command started" |
| command | String | Relaxed extJSON representation of the command. This document MUST be truncated appropriately according to rules defined in the logging specification, and MUST be replaced with an empty document "{ }" if the command is considered sensitive. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Command "{{commandName}}" started on database "{{databaseName}}" using a connection with driver-generated ID {{driverConnectionId}} and server-generated ID {{serverConnectionId}} to {{serverHost}}:{{serverPort}} with service ID {{serviceId}}. The requestID is {{requestId}} and the operation ID is {{operationId}}. Command: {{command}}

Command Succeeded Message

In addition to the common fields, command succeeded messages MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Command succeeded" |
| durationMS | Int | The execution time for the command in milliseconds. The calculated value MUST be the time to send the message and receive the reply from the server and MAY include BSON serialization and/or deserialization. |
| reply | String | Relaxed extJSON representation of the reply. This document MUST be truncated appropriately according to rules defined in the logging specification, and MUST be replaced with an empty document "{ }" if the command is considered sensitive. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Command "{{commandName}}" succeeded on database "{{databaseName}}" in {{durationMS}} ms using a connection with driver-generated ID {{driverConnectionId}} and server-generated ID {{serverConnectionId}} to {{serverHost}}:{{serverPort}} with service ID {{serviceId}}. The requestID is {{requestId}} and the operation ID is {{operationId}}. Command reply: {{command}}

Command Failed Message

In addition to the common fields, command failed messages MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Command failed" |
| durationMS | Int | The execution time for the command in milliseconds. The calculated value MUST be the time to send the message and receive the reply from the server and MAY include BSON serialization and/or deserialization. |
| failure | Flexible | The error. The type and format of this value is flexible; see the logging specification for details on representing errors in log messages. If the command is considered sensitive, the error MUST be redacted and replaced with a language-appropriate alternative for a redacted error, e.g. an empty string, empty document, or null. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Command "{{commandName}}" failed on database "{{databaseName}}" in {{durationMS}} ms using a connection with driver-generated ID {{driverConnectionId}} and server-generated ID {{serverConnectionId}} to {{serverHost}}:{{serverPort}} with service ID {{serviceId}}. The requestID is {{requestId}} and the operation ID is {{operationId}}. Error: {{error}}

Testing

See the README in the test directory for requirements and guidance.

Q&A

Why is the command document only available in the CommandStartedEvent?

Some drivers may realize the command document as raw BSON, treating it as a component of the message transmitted to the server and stored in an internal buffer. By the time the server's response is received, this buffer may have been released. Requiring the retention of this buffer until command completion could result in unacceptable performance penalties, particularly when event listeners are introduced.

Changelog

  • 2015-09-16: Removed limit from find test with options to support 3.2. Changed find test read preference to primaryPreferred.
  • 2015-10-01: Changed find test with a kill cursors to not run on server versions greater than 3.0. Added a find test with no kill cursors command which only runs on 3.1 and higher. Added notes on which tests should run based on server versions.
  • 2015-10-19: Changed batchSize in the 3.2 find tests to expect the remaining value.
  • 2015-10-31: Changed find test on 3.1 and higher to ignore being run on sharded clusters.
  • 2015-11-22: Specify how to merge OP_MSG document sequences into command-started events.
  • 2016-03-29: Added note on guarantee of the request ids.
  • 2016-11-02: Added clause for not upconverting commands larger than maxBsonSize.
  • 2018-04-16: Made inclusion of BSON serialization/deserialization in command durations to be optional.
  • 2020-02-12: Added legacy hello speculativeAuthenticate to the list of values that should be redacted.
  • 2021-04-15: Added serviceId field to events.
  • 2021-05-05: Updated to use hello and legacy hello.
  • 2021-08-30: Added serverConnectionId field to CommandStartedEvent, CommandSucceededEvent and CommandFailedEvent.
  • 2022-05-18: Converted legacy tests to the unified test format.
  • 2022-09-02: Remove material that only applies to MongoDB versions < 3.6.
  • 2022-10-05: Remove spec front matter and reformat changelog.
  • 2022-10-11: Add command logging information and tests.
  • 2022-11-16: Update sensitive command tests to only run on server versions where the commands are supported.
  • 2022-12-13: Updated log message serverPort field description to clarify drivers should populate it with the default port 27017 when relevant. Updated suggested unstructured forms of log messages to more clearly label connection IDs and use more readable server address representations.
  • 2023-03-23: Updated serverConnectionId field to be Int64 as long-running servers can return Int64.
  • 2023-06-13: Added databaseName field to CommandFailedEvent and CommandSucceededEvent. Updated suggested unstructured forms of log messages reflecting the changes.
  • 2023-10-19: Add Q&A section

SDAM Logging and Monitoring Specification

  • Status: Accepted
  • Minimum Server Version: 2.4

Abstract

The SDAM logging and monitoring specification defines a set of behaviors in the driver for providing runtime information about server discovery and monitoring (SDAM) in log messages, as well as in events that users can consume programmatically, either directly or by integrating with third-party APM libraries.

Definitions

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Terms

ServerAddress

The term ServerAddress refers to the implementation in the driver's language of a server host/port pair. This may be an object or a string. The name of this object is NOT REQUIRED.

TopologyType

The term TopologyType refers to the implementation in the driver's language of a topology type (standalone, sharded, etc.). This may be a string or object. The name of the object is NOT REQUIRED.

Server

The term Server refers to the implementation in the driver's language of an abstraction of a mongod or mongos process, or a load balancer, as defined by the SDAM specification.

Specification

Guidance

Documentation

The documentation provided in the code below is merely for driver authors and SHOULD NOT be taken as required documentation for the driver.

Messages and Events

All drivers MUST implement the specified event types as well as log messages.

Implementation details are noted below when a specific implementation is required. Within each event and log message, all properties are REQUIRED unless noted otherwise.

Naming

All drivers MUST name types, properties, and log message values as defined in the following sections. Exceptions to this rule are noted in the appropriate section. Class and interface names may vary according to the driver and language best practices.

Publishing and Subscribing

The driver SHOULD publish events in a manner that is standard to the driver's language publish/subscribe patterns and is not strictly mandated in this specification.

Similarly, as described in the logging specification the driver SHOULD emit log messages in a manner that is standard for the language.

Guarantees

Event Order and Concurrency

Events and log messages MUST be published in the order that their corresponding changes are processed in the driver. Events MUST NOT be published concurrently for the same topology ID or server ID, but MAY be published concurrently for differing topology IDs and server IDs.

Heartbeats

The driver MUST guarantee that every ServerHeartbeatStartedEvent has either a correlating ServerHeartbeatSucceededEvent or ServerHeartbeatFailedEvent, and that every "server heartbeat started" log message has either a correlating "server heartbeat succeeded" or "server heartbeat failed" log message.

Drivers that use the streaming heartbeat protocol MUST publish a ServerHeartbeatStartedEvent and "server heartbeat started" log message before attempting to read the next hello or legacy hello exhaust response.

Error Handling

If an exception occurs while sending the hello or legacy hello operation to the server, the driver MUST generate a ServerHeartbeatFailedEvent and "server heartbeat failed" log message with the exception or message and re-raise the exception. The SDAM mandated retry of the hello or legacy hello call should be visible to consumers.

Topology IDs

These MUST be a unique value that is specific to the Topology for which the events and log messages are emitted. The language may decide how to generate the value and what type the value is, as long as it is unique to the Topology. The ID MUST be created once when the Topology is created and remain the same until the Topology is destroyed.

Initial Server Description

ServerDescription objects MUST be initialized with a default description in an "unknown" state, guaranteeing that the previous description in the events and log messages will never be null.

Initial Topology Description

The first TopologyDescriptionChangedEvent to be emitted from a monitored Topology MUST set its previousDescription property to be a TopologyDescription object in the "unknown" state.

Closing Topology Description

When a Topology object or equivalent is being shut-down or closed, the driver MUST change the TopologyDescription to an "unknown" state.


Events API

This specification defines 9 main events that MUST be published in the scenarios described. 6 of these events are the core behaviour within the cluster lifecycle, and the remaining 3 server heartbeat events are fired from the server monitor and follow the guidelines for publishing in the command monitoring specification.

Events that MUST be published (with their conditions) are as follows.

| Event Type | Condition |
| ---------- | --------- |
| TopologyOpeningEvent | When a topology description is initialized - this MUST be the first SDAM event fired. |
| ServerOpeningEvent | Published when the server description is instantiated with its defaults, and MUST be the first operation to happen after the defaults are set. This is before the Monitor is created and the Monitor socket connection is opened. |
| ServerDescriptionChangedEvent | When the old server description is not equal to the new server description. |
| TopologyDescriptionChangedEvent | When the old topology description is not equal to the new topology description. |
| ServerClosedEvent | Published when the server monitor's connection is closed and the server is shutdown. |
| TopologyClosedEvent | When a topology is shut down - this MUST be the last SDAM event fired. |
| ServerHeartbeatStartedEvent | Published when the server monitor sends its hello or legacy hello call to the server. When the monitor is creating a new connection, this event MUST be published just before the socket is created. |
| ServerHeartbeatSucceededEvent | Published on successful completion of the server monitor's hello or legacy hello call. |
| ServerHeartbeatFailedEvent | Published on failure of the server monitor's hello or legacy hello call, either with an ok: 0 result or a socket exception from the connection. |
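
As an illustration of consuming these events, here is a minimal Go sketch (assuming the Go driver's event.ServerMonitor; the chosen handlers and message text are illustrative, not normative):

// Subscribe to a subset of the SDAM events listed above.
sm := &event.ServerMonitor{
   TopologyOpening: func(e *event.TopologyOpeningEvent) {
      log.Printf("Starting monitoring for topology with ID %v", e.TopologyID)
   },
   TopologyDescriptionChanged: func(e *event.TopologyDescriptionChangedEvent) {
      log.Printf("Topology %v changed. Previous: %v. New: %v",
         e.TopologyID, e.PreviousDescription, e.NewDescription)
   },
   ServerHeartbeatFailed: func(e *event.ServerHeartbeatFailedEvent) {
      log.Printf("Heartbeat failed on connection %s: %v", e.ConnectionID, e.Failure)
   },
}
client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri).SetServerMonitor(sm))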
/**
 * Published when server description changes, but does NOT include changes to the RTT.
 */
interface ServerDescriptionChangedEvent {

  /**
   * Returns the address (host/port pair) of the server.
   */
  address: ServerAddress;

  /**
   * Returns a unique identifier for the topology.
   */
  topologyId: Object;

  /**
   * Returns the previous server description.
   */
  previousDescription: ServerDescription;

  /**
   * Returns the new server description.
   */
  newDescription: ServerDescription;
}

/**
 * Published when server is initialized.
 */
interface ServerOpeningEvent {

  /**
   * Returns the address (host/port pair) of the server.
   */
  address: ServerAddress;

  /**
   * Returns a unique identifier for the topology.
   */
  topologyId: Object;
}

/**
 * Published when server is closed.
 */
interface ServerClosedEvent {

  /**
   * Returns the address (host/port pair) of the server.
   */
  address: ServerAddress;

  /**
   * Returns a unique identifier for the topology.
   */
  topologyId: Object;
}

/**
 * Published when topology description changes.
 */
interface TopologyDescriptionChangedEvent {

  /**
   * Returns a unique identifier for the topology.
   */
  topologyId: Object;

  /**
   * Returns the old topology description.
   */
  previousDescription: TopologyDescription;

  /**
   * Returns the new topology description.
   */
  newDescription: TopologyDescription;
}

/**
 * Published when topology is initialized.
 */
interface TopologyOpeningEvent {

  /**
   * Returns a unique identifier for the topology.
   */
  topologyId: Object;
}

/**
 * Published when topology is closed.
 */
interface TopologyClosedEvent {

  /**
   * Returns a unique identifier for the topology.
   */
  topologyId: Object;
}

/**
 * Fired when the server monitor's ``hello`` or legacy hello command is started - immediately before
 * the ``hello`` or legacy hello command is serialized into raw BSON and written to the socket.
 * When the monitor is creating a new monitoring connection, this event is fired just before the
 * socket is opened.
 */
interface ServerHeartbeatStartedEvent {

 /**
   * Returns the connection id for the command. The connection id is the unique
   * identifier of the driver's Connection object that wraps the socket. For languages that
   * do not have this object, this MUST be a string of "hostname:port" or an object
   * that contains the hostname and port as attributes.
   *
   * The name of this field is flexible to match the object that is returned from the driver.
   * Examples include, but are not limited to, 'address', 'serverAddress', and 'connectionId'.
   */
  connectionId: ConnectionId;

 /**
   * Determines if this heartbeat event is for an awaitable ``hello`` or legacy hello.
   */
  awaited: Boolean;

}

/**
 * Fired when the server monitor's ``hello`` or legacy hello succeeds.
 */
interface ServerHeartbeatSucceededEvent {

 /**
   * Returns the execution time of the event in the highest possible resolution for the platform.
   * The calculated value MUST be the time to send the message and receive the reply from the server,
   * including BSON serialization and deserialization. The name can imply the units in which the
   * value is returned, i.e. durationMS, durationNanos.
   *
   * When the awaited field is false, the time measurement used MUST be the
   * same measurement used for the RTT calculation. When the awaited field is
   * true, the time measurement is not used for RTT calculation.
   */
  duration: Int64;

  /**
   * Returns the command reply.
   */
  reply: Document;

 /**
   * Returns the connection id for the command. For languages that do not have this,
   * this MUST return the driver equivalent which MUST include the server address and port.
   * The name of this field is flexible to match the object that is returned from the driver.
   */
  connectionId: ConnectionId;

 /**
   * Determines if this heartbeat event is for an awaitable ``hello`` or legacy hello. If
   * true, then the duration field cannot be used for RTT calculation
   * because the command blocks on the server.
   */
  awaited: Boolean;

}

/**
 * Fired when the server monitor's ``hello`` or legacy hello fails, either with an "ok: 0" or a socket exception.
 */
interface ServerHeartbeatFailedEvent {

 /**
   * Returns the execution time of the event in the highest possible resolution for the platform.
   * The calculated value MUST be the time to send the message and receive the reply from the server,
   * including BSON serialization and deserialization. The name can imply the units in which the
   * value is returned, i.e. durationMS, durationNanos.
   */
  duration: Int64;

 /**
   * Returns the failure. Based on the language, this SHOULD be a message string,
   * exception object, or error document.
   */
  failure: String,Exception,Document;

 /**
   * Returns the connection id for the command. For languages that do not have this,
   * this MUST return the driver equivalent which MUST include the server address and port.
   * The name of this field is flexible to match the object that is returned from the driver.
   */
  connectionId: ConnectionId;

 /**
   * Determines if this heartbeat event is for an awaitable ``hello`` or legacy hello. If
   * true, then the duration field cannot be used for RTT calculation
   * because the command blocks on the server.
   */
  awaited: Boolean;
}

The TopologyDescription object MUST expose the new methods defined in the API below, in order for subscribers to take action on certain conditions based on the driver options.

TopologyDescription objects MAY have additional methods and properties.

/**
 * Describes the current topology.
 */
interface TopologyDescription {

  /**
   * Determines if the topology has a readable server available. See the table in the
   * following section for behaviour rules.
   */
  hasReadableServer(readPreference: Optional<ReadPreference>): Boolean

  /**
   * Determines if the topology has a writable server available. See the table in the
   * following section for behaviour rules.
   */
  hasWritableServer(): Boolean
}

Determining If A Topology Has Readable/Writable Servers

The following table describes the rules for determining if a topology type has readable or writable servers. If no read preference is passed to hasReadableServer, the driver MUST default the value to the default read preference, primary, or treat the call as if primary was provided.

| Topology Type | hasReadableServer | hasWritableServer |
| ------------- | ----------------- | ----------------- |
| Unknown | false | false |
| Single | true if the server is available | true if the server is available |
| ReplicaSetNoPrimary | Called with primary: false. Called with any other option: uses the read preference to determine if any server in the cluster is suitable for reading. Called with no option: false | false |
| ReplicaSetWithPrimary | Called with any valid option: uses the read preference to determine if any server in the cluster is suitable for reading. Called with no option: true | true |
| Sharded | true if 1+ servers are available | true if 1+ servers are available |
| LoadBalanced | true | true |
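
As a rough sketch of how these rules might be encoded (illustrative types only; a real driver exposes this through the TopologyDescription methods above, and hasReadableServer additionally needs the read preference handling described in the table):

type TopologyType int

const (
   Unknown TopologyType = iota
   Single
   ReplicaSetNoPrimary
   ReplicaSetWithPrimary
   Sharded
   LoadBalanced
)

// hasWritableServer restates the right-hand column of the table above.
func hasWritableServer(t TopologyType, availableServers int) bool {
   switch t {
   case Single, Sharded:
      return availableServers > 0
   case ReplicaSetWithPrimary, LoadBalanced:
      return true
   default: // Unknown, ReplicaSetNoPrimary
      return false
   }
}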

Log Messages

Please refer to the logging specification for details on logging implementations in general, including log levels, log components, and structured versus unstructured logging.

Drivers MUST support logging of SDAM information via the following types of log messages. These messages MUST be logged at Debug level and use the topology log component.

A number of the log messages are intended to match the information contained in the events above. However, note that a log message regarding a server description change (which would correspond to ServerDescriptionChangedEvent) has been intentionally omitted since the information it would contain is redundant with TopologyDescriptionChangedEvent and the equivalent log message.

Drivers MAY implement SDAM logging support via an event subscriber if it is convenient to do so.

The types used in the structured message definitions below are demonstrative, and drivers MAY use similar types instead so long as the information is present (e.g. a double instead of an integer, or a string instead of an integer if the structured logging framework does not support numeric types.)

Common Fields

The following key-value pairs are common to all or several log messages and MUST be included in the "applicable messages":

| Key | Applicable Messages | Suggested Type | Value |
| --- | ------------------- | -------------- | ----- |
| topologyId | All messages | Flexible | The driver's unique ID for this topology as discussed in Topology IDs. The type is flexible depending on the driver's choice of type for topology ID. |
| serverHost | Log messages specific to a particular server, including heartbeat-related messages | String | The hostname, IP address, or Unix domain socket path for the endpoint the pool is for. |
| serverPort | Log messages specific to a particular server, including heartbeat-related messages | Int | (Only present for server-specific log messages) The port for the endpoint the pool is for. Optional; not present for Unix domain sockets. When the user does not specify a port and the default (27017) is used, the driver SHOULD include it here. |
| driverConnectionId | Heartbeat-related log messages | Int | The driver-generated ID for the monitoring connection as defined in the connection monitoring and pooling specification. Unlike connectionId in the above events, this field MUST NOT contain the host/port; that information MUST be in the above fields, serverHost and serverPort. This field is optional for drivers that do not implement CMAP if they do have an equivalent concept of a connection ID. |
| serverConnectionId | Heartbeat-related log messages | Int | The server's ID for the monitoring connection, if known. This value will be unknown and can be omitted in certain cases, e.g. the first "heartbeat started" message for a monitoring connection. Only present on server versions 4.2+. |

"Starting Topology Monitoring" Log Message

This message MUST be published under the same circumstances as a TopologyOpeningEvent as detailed in Events API.

In addition to the relevant common fields, these messages MUST contain the following key-value pair:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Starting topology monitoring" |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Starting monitoring for topology with ID {{topologyId}}

"Stopped Topology Monitoring" Log Message

This message MUST be published under the same circumstances as a TopologyClosedEvent as detailed in Events API.

In addition to the relevant common fields, these messages MUST contain the following key-value pair:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Stopped topology monitoring" |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Stopped monitoring for topology with ID {{topologyId}}

"Starting Server Monitoring" Log Message

This message MUST be published under the same circumstances as a ServerOpeningEvent as detailed in Events API.

In addition to the relevant common fields, these messages MUST contain the following key-value pair:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Starting server monitoring" |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Starting monitoring for server {{serverHost}}:{{serverPort}} in topology with ID {{topologyId}}

"Stopped Server Monitoring" Log Message

This message MUST be published under the same circumstances as a ServerClosedEvent as detailed in Events API.

In addition to the relevant common fields, these messages MUST contain the following key-value pair:

Key | Suggested Type | Value
--- | --- | ---
message | String | "Stopped server monitoring"

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Stopped monitoring for server {{serverHost}}:{{serverPort}} in topology with ID {{topologyId}}

"Topology Description Changed" Log Message

This message MUST be published under the same circumstances as a TopologyDescriptionChangedEvent as detailed in Events API.

In addition to the relevant common fields, these messages MUST contain the following key-value pairs:

Key | Suggested Type | Value
--- | --- | ---
message | String | "Topology description changed"
previousDescription | String | A string representation of the previous description of the topology. The format is flexible and could be e.g. the toString() implementation for a driver's topology description type, or an extended JSON representation of the topology object.
newDescription | String | A string representation of the new description of the topology. The format is flexible and could be e.g. the toString() implementation for a driver's topology description type, or an extended JSON representation of the topology object.

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Description changed for topology with ID {{topologyId}}. Previous description: {{previousDescription}}. New description: {{newDescription}}

"Server Heartbeat Started" Log Message

This message MUST be published under the same circumstances as a ServerHeartbeatStartedEvent as detailed in Events API.

In addition to the relevant common fields, these messages MUST contain the following key-value pairs:

Key | Suggested Type | Value
--- | --- | ---
message | String | "Server heartbeat started"
awaited | Boolean | Whether this log message is for an awaitable hello or legacy "hello".

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Heartbeat started for {{serverHost}}:{{serverPort}} on connection with driver-generated ID {{driverConnectionId}} and server-generated ID {{serverConnectionId}} in topology with ID {{topologyId}}. Awaited: {{awaited}}
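
As a non-normative illustration, the structured form of this message combines the common fields with the message-specific pairs above, and the unstructured form can be rendered from the same data. The values below are placeholders, and console.debug stands in for whatever logging framework the driver uses.

// Illustrative only: emitting a "Server heartbeat started" message.
const entry = {
  message: "Server heartbeat started",
  topologyId: "example-topology-id",
  serverHost: "localhost",
  serverPort: 27017,
  driverConnectionId: 1,
  serverConnectionId: 42,
  awaited: true,
};

// Structured output: hand the key-value pairs to the logging framework as-is.
console.debug(JSON.stringify(entry));

// Unstructured output: interpolate the same values into the template above.
console.debug(
  `Heartbeat started for ${entry.serverHost}:${entry.serverPort} on connection with ` +
  `driver-generated ID ${entry.driverConnectionId} and server-generated ID ` +
  `${entry.serverConnectionId} in topology with ID ${entry.topologyId}. Awaited: ${entry.awaited}`
);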

"Server Heartbeat Succeeded" Log Message

This message MUST be published under the same circumstances as a ServerHeartbeatSucceededEvent as detailed in Events API.

In addition to the relevant common fields, these messages MUST contain the following key-value pairs:

Key | Suggested Type | Value
--- | --- | ---
message | String | "Server heartbeat succeeded"
awaited | Boolean | Whether this log message is for an awaitable hello or legacy "hello".
durationMS | Int | The execution time for the heartbeat in milliseconds. See ServerHeartbeatSucceededEvent in Events API for details on calculating this value.
reply | String | Relaxed extended JSON representation of the reply to the heartbeat command.

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Heartbeat succeeded in {{durationMS}} ms for {{serverHost}}:{{serverPort}} on connection with driver-generated ID {{driverConnectionId}} and server-generated ID {{serverConnectionId}} in topology with ID {{topologyId}}. Awaited: {{awaited}}. Reply: {{reply}}

"Server Heartbeat Failed" Log Message

This message MUST be published under the same circumstances as a ServerHeartbeatFailedEvent as detailed in Events API.

In addition to the relevant common fields, these messages MUST contain the following key-value pairs:

Key | Suggested Type | Value
--- | --- | ---
message | String | "Server heartbeat failed"
awaited | Boolean | Whether this log message is for an awaitable hello or legacy "hello".
durationMS | Int | The execution time for the heartbeat in milliseconds. See ServerHeartbeatFailedEvent in Events API for details on calculating this value.
failure | Flexible | The error. The type and format of this value is flexible; see the logging specification for details on representing errors in log messages. If the command is considered sensitive, the error MUST be redacted and replaced with a language-appropriate alternative for a redacted error, e.g. an empty string, empty document, or null.

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Heartbeat failed in {{durationMS}} ms for {{serverHost}}:{{serverPort}} on connection with driver-generated ID {{driverConnectionId}} and server-generated ID {{serverConnectionId}} in topology with ID {{topologyId}}. Awaited: {{awaited}}. Failure: {{failure}}

Tests

See the README.

Changelog

  • 2024-05-02: Migrated from reStructuredText to Markdown.

  • 2024-03-29: Updated to clarify expected initial value of TopologyDescriptionChangedEvent's previousDescription field

  • 2024-01-17: Updated to require that TopologyDescriptionChangedEvent be emitted just before TopologyClosedEvent is emitted

  • 2024-01-04: Updated to clarify when ServerHeartbeatStartedEvent should be emitted

  • 2023-03-31: Renamed to include "logging" in the title. Reorganized contents and made consistent with CLAM spec, and added requirements for SDAM log messages.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2021-05-06: Updated to use modern terminology.

  • 2020-04-20: Add rules for streaming heartbeat protocol and add "awaited" field to heartbeat events.

  • 2018-12-12: Clarified table of rules for readable/writable servers

  • 2016-08-31: Added table of rules for determining if topology has readable/writable servers.

  • 2016-10-11: TopologyDescription objects MAY have additional methods and properties.


Logging

  • Title: Logging
  • Status: Accepted
  • Minimum Server Version: N/A

Abstract

This specification defines requirements for drivers' logging configuration and behavior.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

Terms

Structured logging
Structured logging refers to producing log messages in a structured format, i.e. a series of key-value pairs, which can be converted to external formats such as JSON.

Unstructured logging
Unstructured logging refers to producing string log messages which embed all attached information within that string.

Implementation requirements

Drivers SHOULD implement support for logging in a manner that is idiomatic for their language and ecosystem.

Minimizing Required Code Changes

If possible, drivers SHOULD support the following configuration requirements in a manner that does not require any changes to an application's source code, and in the case of compiled languages, does not require recompilation. However, if that is impossible to do without conflicting with idiomatic logging patterns for the language ecosystem, it is acceptable for a driver to require minimal changes to enable logging, such as recompiling with a feature flag specified, or passing a logging configuration to a MongoClient constructor. In that case, drivers SHOULD strive to minimize the amount of code change needed.

Drivers SHOULD take advantage of built-in support for logging configuration within logging frameworks or their language ecosystem if available. If unavailable, drivers MUST support configuration via the listed fallback implementation methods.

In addition to supporting configuration without code changes/recompilation, drivers MAY additionally provide support for enabling and configuring logging via their API in a manner that does require code changes, and MAY support additional configuration options beyond those defined in this specification.

For languages/ecosystems where libraries depend on only a logging interface (for example, the Java driver and SLF4J) and application developers must choose a log handler, the requirements MAY be considered satisfied if there is a log handler available in the ecosystem which users can select that satisfies the requirements. Similarly, the requirements MAY be considered satisfied if the driver provides an integration with a logging framework which satisfies the requirements.

Per-Component Configuration of Log Levels

Drivers MUST support enabling logging and specifying the minimum severity level for emitted messages on a per-component level, for each component defined below in the Components section.

Fallback implementation method: Support configuration via environment variables corresponding to each component, as defined in Components, as well as via the environment variable MONGODB_LOG_ALL.

Each of these variables may be set to any of the values defined below in Log Severity Levels to indicate the minimum severity level at which messages should be emitted for the corresponding component, or in the case of MONGODB_LOG_ALL, all components. Setting a value for MONGODB_LOG_ALL is equivalent to setting that value for all of the per-component variables.

If MONGODB_LOG_ALL is specified in addition to one or more component variables, the component variable(s) MUST take precedence.

The default is to not log anything.

If a variable is set to an invalid value, it MUST be treated as if it were not specified at all, and the driver MAY attempt to warn the user about the misconfiguration via a log message or otherwise but MUST NOT throw an exception.
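
A rough sketch of this fallback behavior is shown below; the function names are illustrative and not part of the specification.

// Illustrative sketch of resolving a component's log level from environment variables.
const LEVELS = ["off", "emergency", "alert", "critical", "error", "warn",
                "notice", "info", "debug", "trace"] as const;
type Level = (typeof LEVELS)[number];

function parseLevel(value: string | undefined): Level | undefined {
  if (value === undefined) return undefined;
  const normalized = value.toLowerCase();
  // An invalid value is treated as if the variable were not specified at all.
  return (LEVELS as readonly string[]).includes(normalized) ? (normalized as Level) : undefined;
}

function componentLevel(componentVariable: string): Level {
  // The component-specific variable takes precedence over MONGODB_LOG_ALL;
  // if neither is set (or both are invalid), nothing is logged.
  return (
    parseLevel(process.env[componentVariable]) ??
    parseLevel(process.env.MONGODB_LOG_ALL) ??
    "off"
  );
}

// e.g. componentLevel("MONGODB_LOG_TOPOLOGY")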

Configurable Log Destination

Drivers MUST support configuring where log messages should be output, including the options:

  • stdout

  • stderr

  • Output file (path MUST be configurable). For languages that are not relying on a logging interface or framework to handle file support, the driver can choose to either support this directly (i.e. the driver allows the user to specify a path and itself handles writing to that path), or to instead provide a straightforward, idiomatic way to programmatically consume the messages and in turn write them to a file, e.g. via a Node.js stream, along with a documentation example of how to do this.

    Fallback implementation method: If the environment variable MONGODB_LOG_PATH is provided:

    • If the value is "stdout" (case-insensitive), log to stdout.
    • If the value is "stderr" (case-insensitive), log to stderr.
    • Else, if direct logging to files is supported, log to a file at the specified path. If the file already exists, it MUST be appended to.

    If the variable is not provided or is set to an invalid value (which could be invalid for any reason, e.g. the path does not exist or is not writeable), the driver MUST log to stderr and the driver MAY attempt to warn the user about the misconfiguration via a log message or otherwise but MUST NOT throw an exception.
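
A sketch of the MONGODB_LOG_PATH fallback is shown below, assuming a Node.js-style environment; the helper name is illustrative, and the invalid-path handling is simplified (a real driver may also need to handle asynchronous file errors).

// Illustrative sketch of resolving the log destination from MONGODB_LOG_PATH.
import * as fs from "fs";

type LogSink = (line: string) => void;

function resolveLogSink(): LogSink {
  const path = process.env.MONGODB_LOG_PATH;
  if (!path || path.toLowerCase() === "stderr") {
    // Default, and explicit "stderr" (case-insensitive).
    return (line) => process.stderr.write(line + "\n");
  }
  if (path.toLowerCase() === "stdout") {
    return (line) => process.stdout.write(line + "\n");
  }
  try {
    // Append if the file already exists.
    const stream = fs.createWriteStream(path, { flags: "a" });
    return (line) => stream.write(line + "\n");
  } catch {
    // Invalid value: fall back to stderr; warn if possible, but never throw.
    return (line) => process.stderr.write(line + "\n");
  }
}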

Configurable Max Document Length

Drivers MUST support configuring the maximum logged length for extended JSON documents in log messages. The unit here is flexible and can be bytes, Unicode code points, code units, or graphemes, depending on what a driver is able to accomplish with its language's string APIs. The default max length is 1000 of whichever unit is selected. If the chosen unit is anything other than a Unicode code point, the driver MUST ensure that it gracefully handles cases where the truncation length falls mid code point, by either rounding the length up or down to the closest code point boundary or using the Unicode replacement character, to avoid producing invalid Unicode data. Drivers MUST implement truncation naively by simply truncating the output at the required length; i.e. do not attempt to implement truncation such that the output is still valid JSON. Truncated extended JSON MUST have a trailing ellipsis ... appended to indicate to the user that truncation occurred. The ellipsis MUST NOT count toward the max length.

Fallback implementation method: Environment variable MONGODB_LOG_MAX_DOCUMENT_LENGTH. When unspecified, any extended JSON representation of a document which is longer than the default max length MUST be truncated to that length. When set to an integer value, any extended JSON document longer than that value MUST be truncated to that length. If a variable is set to an invalid value, it MUST be treated as if it were not specified at all, and the driver MAY attempt to warn the user about the misconfiguration via a log message or otherwise but MUST NOT throw an exception.
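
A minimal sketch of such truncation, using Unicode code points as the unit (one of the permitted choices), is shown below. Truncating on code point boundaries avoids producing invalid Unicode, and the trailing ellipsis does not count toward the limit.

// Illustrative sketch: naive truncation of an extended JSON string to a maximum
// number of Unicode code points.
function truncateExtJson(json: string, maxLength = 1000): string {
  const codePoints = Array.from(json); // iterate by code point, not UTF-16 code unit
  if (codePoints.length <= maxLength) {
    return json;
  }
  // Naive truncation: the result is not guaranteed to be valid JSON.
  return codePoints.slice(0, maxLength).join("") + "...";
}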

Components

Drivers MUST support configuring minimum log severity levels on a per-component level. The below components currently exist and correspond to the listed specifications. This list is expected to grow over time as logging is added to more specifications.

Drivers SHOULD specify the component names in whatever way is idiomatic for their language. For example, the Java command component could be named org.mongodb.driver.protocol.command.

Drivers MAY define additional language-specific components in addition to these for any driver-specific messages they produce.

Component Name | Specification(s) | Environment Variable
--- | --- | ---
command | Command Logging and Monitoring | MONGODB_LOG_COMMAND
topology | Server Discovery and Monitoring | MONGODB_LOG_TOPOLOGY
serverSelection | Server Selection | MONGODB_LOG_SERVER_SELECTION
connection | Connection Monitoring and Pooling | MONGODB_LOG_CONNECTION

Log Severity Levels

Driver specifications defining log messages MUST use log levels from the following list, inspired by the Syslog Protocol as described in RFC 5424:

Code | Level Name | Meaning | Environment Variable Value (case-insensitive)
--- | --- | --- | ---
- | Off | Nothing is logged. | off
0 | Emergency | - | emergency
1 | Alert | - | alert
2 | Critical | - | critical
3 | Error | Any error that we are unable to report to the user via driver API. | error
4 | Warning | Indicates a situation where undesirable application behavior may occur. Example: The driver ignores an unrecognized option in a connection string. | warn
5 | Notice | Indicates an event that is unusual but not problematic. Example: a change stream is automatically resumed. | notice
6 | Informational | High-level information about normal driver behavior. Example: MongoClient creation or close. | info
7 | Debug | Detailed information that may be helpful when debugging an application. Example: A command starting. | debug
8 | Trace | Very fine-grained details related to logic flow. Example: entering and exiting function bodies. | trace

Note that the Emergency, Alert, and Critical levels have been intentionally left undefined. At the time of writing this specification, we do not expect any driver specifications to need to log at these levels, but we have included them in the list of permitted levels for consistency with Syslog.

The levels above are defined in order from most to least severe. Not all logging frameworks will necessarily support all of these levels. If an equivalent level is not available, drivers MUST emit messages for that level at the closest less severe level that is available, or otherwise at the closest more severe level.

For example:

  • If an Informational level is not available and Debug is, messages defined as Informational in a specification MUST be emitted at Debug level.
  • If a Trace level is not available, Trace messages MUST be emitted at Debug level.
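
The mapping rule might be implemented along the lines of the following sketch, where availableLevels describes whichever logging framework the driver happens to use; the function name is illustrative.

// Spec-defined levels ordered from most (index 0) to least severe.
const SPEC_LEVELS = ["emergency", "alert", "critical", "error", "warn",
                     "notice", "info", "debug", "trace"] as const;
type SpecLevel = (typeof SPEC_LEVELS)[number];

// Pick the level to emit at when the framework does not support the exact level.
function emitLevel(level: SpecLevel, availableLevels: Set<SpecLevel>): SpecLevel | undefined {
  const index = SPEC_LEVELS.indexOf(level);
  // Prefer the level itself, then the closest less severe level that is available...
  for (let i = index; i < SPEC_LEVELS.length; i++) {
    if (availableLevels.has(SPEC_LEVELS[i])) return SPEC_LEVELS[i];
  }
  // ...otherwise fall back to the closest more severe level.
  for (let i = index - 1; i >= 0; i--) {
    if (availableLevels.has(SPEC_LEVELS[i])) return SPEC_LEVELS[i];
  }
  return undefined;
}

// emitLevel("trace", new Set<SpecLevel>(["error", "warn", "info", "debug"])) === "debug"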

Structured versus Unstructured Logging

If structured logging is available in and idiomatic for the driver's language/ecosystem, the driver SHOULD produce structured log messages. Note that some ecosystems with structured logging support may also have support available to convert structured output to traditional unstructured messages for users who want it (for example, the log feature in Rust's tracing crate). If such support is available, drivers SHOULD utilize it to support both types of logging.

Note that drivers implementing unstructured logging MUST still support some internal way to intercept the data contained in messages in a structured form, as this is required to implement the unified tests for logging conformance. See the unified test format specification for details.

Representing Documents in Log Messages

BSON documents MUST be represented in relaxed extended JSON when they appear in log messages to improve readability.

Representing Errors in Log Messages

Drivers MAY represent errors in log messages in whatever format is idiomatic for their language and existing error types. For example, if a driver's error classes have existing toString() implementations, those MAY be used. Alternatively, if a driver emits structured log messages, a structured format containing error data could be used. Any information which a driver reports via its error classes MUST be included in the log representations. Note that if the driver includes full server responses in its errors these MUST be truncated in accordance with the max document length option.

Logging Failures

Specifications MAY define log messages that correspond to failures which also are reported via exceptions in the API, for example a "command failed" log message. Such messages MUST NOT use log levels more severe than debug.

While it might seem natural that such messages would be logged at error level, not all failures that the driver considers worthy of an error will be considered a true error by the application. For example, consider an application that unconditionally creates a collection on startup, and ignores any NamespaceExists errors received in response: it would be undesirable for those failures to show up in application logs at error level.

Additionally, some applications will already have infrastructure in place to log any unhandled exceptions at error level. If the driver were to log exception-related messages at error level, such applications would end up with duplicate, error-level messages for these exceptions. On the other hand, some applications may not log exceptions at all, or might not include all of the relevant information about an encountered exception in their custom log messages; for these applications, there is value in a driver emitting such log messages.

Given this, logging such messages at debug level strikes the best balance between making the diagnostic information available, but not overloading users with overly severe and/or duplicative messages. Users who do not log exceptions will have a way to see driver exceptions, by turning on debug logging, while users who do log exceptions can filter for the true exceptions by only looking at more severe log levels.

Omitting Null Values from Log Messages

Some log messages will include fields that are only present under particular circumstances, for example on certain server versions. When such a field is not present:

  • If the driver does structured logging, the field MUST be omitted from the message altogether, i.e. the field MUST NOT be present with an explicit null value.
  • If the driver does unstructured logging, the corresponding segment of the message string MUST be omitted altogether.

Performance Considerations

The computation required to generate certain log messages can be significant, e.g. if extended JSON serialization is required. If possible, drivers SHOULD check whether a log message would actually be emitted and consumed based on the users' configuration before doing such computation. For example, this can be checked in Rust via log::log_enabled.

Drivers SHOULD optimize extended JSON generation to avoid generating JSON strings longer than will be emitted, such that the complexity is O(N) where N = <max document length>, rather than N = <actual document length>.
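
For instance, a driver might guard message construction as in the sketch below; the logger shape (isEnabled, debug) is hypothetical, not spec-defined.

// Illustrative: check whether the message would be consumed before paying the
// serialization cost.
interface Logger {
  isEnabled(component: string, level: string): boolean;
  debug(component: string, fields: Record<string, unknown>): void;
}

function logHeartbeatReply(logger: Logger, reply: Record<string, unknown>): void {
  if (!logger.isEnabled("topology", "debug")) {
    return; // skip serialization entirely when the message would not be emitted
  }
  logger.debug("topology", {
    message: "Server heartbeat succeeded",
    // Plain JSON stands in for relaxed extended JSON in this sketch; a real driver
    // would also truncate to the configured max document length.
    reply: JSON.stringify(reply),
  });
}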

Standard Naming in Structured Log Messages

Driver specifications typically allow for language-appropriate naming variations, e.g. using snakecase or camelcase to name a property. However, for log messages, drivers doing structured logging MUST use the exact names and casing specified for the names of fields included in messages. This will be easier for our support team since the names will be consistent across languages, and will simplify writing language-agnostic tooling to search through and parse structured logs.

Including Timestamps in Log Messages

Drivers MAY add timestamps to their log messages if one will not be added automatically by the logging framework(s) they use.

Supporting Both Programmatic and Environment Variable Configuration

If a driver supports configuration via both environment variables and programmatically via API, programmatic configuration MUST take precedence over environment variables. Drivers supporting both forms of configuration MUST document this behavior and MUST provide an example of how users can implement custom logic to allow an environment variable to override a programmatic default, so that users who prefer the opposite behavior have a way to achieve it.
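
For example, an application that prefers the opposite precedence could consult the environment variable itself before handing a value to the driver, as in this hypothetical snippet:

// Hypothetical application code: let MONGODB_LOG_ALL override a programmatic default.
const programmaticDefault = "warn";
const effectiveLevel = process.env.MONGODB_LOG_ALL ?? programmaticDefault;
// effectiveLevel would then be passed to the driver through whatever logging API it exposes.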

Test Plan

Tests for logging behavior are defined in each corresponding specification. The unified test runner specification has support for specifying logging expectations in tests.

Motivation for Change

A common complaint from our support team is that they don't know how to easily get debugging information from drivers. Some drivers provide debug logging, but others do not. For drivers that do provide it, the log messages produced and the mechanisms for enabling debug logging are inconsistent.

Although users can implement their own debug logging support via existing driver events (SDAM, APM, etc), this requires code changes. It is often difficult to quickly implement and deploy such changes in production at the time they are needed, and to remove the changes afterward. Additionally, there are useful scenarios to log that do not correspond to existing events. Standardizing on debug log messages that drivers produce and how to enable/configure logging will provide TSEs, CEs, and MongoDB users an easier way to get debugging information out of our drivers, facilitate support of drivers for our internal teams, and improve our documentation around troubleshooting.

Design Rationale

Truncation of large documents

  1. Why have an option?
    We considered a number of approaches for dealing with documents of potentially very large size in log messages, e.g. command documents, including 1) always logging the full document, 2) only logging documents with the potential to be large when the user opts in, and 3) truncating large documents by default, but allowing the user to adjust the maximum length logged. We chose the third option as we felt it struck the best balance between concerns around readability and usability of log messages. In the case where data is sufficiently small, the default behavior will show the user the full data. In the case where data is large, the user will receive a readable message with truncated data, but have the option to see more or all of the data.

  2. Why are the units for max document length flexible?
    String APIs vary across languages, and not all drivers will be able to easily and efficiently truncate strings in the same exact manner. The important thing is that the option exists and that its default value is reasonable, and for all possible unit choices (byte, code point, code unit, or grapheme) we felt 1000 was a reasonable default. See here for a helpful primer on related Unicode concepts.

  3. Why do we implement naive truncation rather than truncating the JSON so it is still valid?
    Designing and implementing a truncation algorithm for JSON that outputs valid JSON, but fits in as much of the original JSON as possible, would be non-trivial. The server team wrote an entire separate truncation design document when they implemented this for their log messages. This is more of a necessity for the server where the entire log message is JSON, but we don't know if parsing the documents included in log messages is something that users will actually need to do. Furthermore, any users who want parseable documents have an escape hatch to do so: they can set the max document length to a very large value. If we hear of use cases in the future for parsing the documents in log messages, we could make an additive change to this specification to permit a smarter truncation algorithm.

Structured versus Unstructured Logging

The MongoDB server produces structured logs as of 4.4, so it seems natural that MongoDB drivers might too. However, structured logging is not idiomatic or common in some language ecosystems, so we have chosen not to require it.

Omitting Null Values from Log Messages

We considered alternatives such as allowing, or requiring, drivers to explicitly include null values in log messages. While this might make it easier to identify cases where a value is unexpectedly null, we decided against it because there are a number of values that will often be null, or even always be null for certain applications (e.g. serviceId when not connected to a load-balanced topology) and their inclusion may confuse users and lead them to think the null value is meaningful. Additionally, always including null values would increase the size of log messages. The omission of null values is left to the drivers' discretion for any driver-specific logs not covered by common specification components.

Invalid Values of Environment Variables

For drivers supporting configuration via environment variables, the spec requires that if an environment variable is set to an invalid value the driver behaves as if the value were not specified at all, and optionally warns the user but does not throw an error. We considered the following alternatives:

  1. Drivers could be required to throw an exception if a value is invalid: This was rejected because of concerns around the implications for environments/applications where multiple versions of the driver or multiple drivers may be present and where the validation logic may not match, meaning a value considered valid for one driver/version might not be by another. Additionally, there is no obvious place to throw an exception from about invalid environment variables; MongoClient constructors would be one possibility, but not all languages will support per-client configuration so throwing there regarding an environment variable might be surprising to users.

    Note that these same concerns do not apply to logging options that are specified via driver API: there is no risk of such options propagating to other drivers/driver versions present, and drivers can report exceptions at the point the options are specified, either globally or per-client. Therefore, drivers MUST validate programmatic logging options in a manner consistent with how they validate all other programmatic options, and if possible SHOULD prefer to throw exceptions for invalid configuration.

  2. Drivers could be required to log a warning if a value is invalid: While drivers MAY do this, requiring it was rejected because depending on the language/framework log messages may not be a viable way to communicate a warning: if a language's default behavior is to log nothing, or only log messages at a more severe level than warn, the user will not actually receive the message unless it is logged at a level and component they have successfully enabled.

Programmatic Configuration Taking Precedence

We chose to have programmatic configuration win out over environment variables because:

  1. This allows applications built atop drivers (e.g. mongosh) to fully control the driver's logging behavior by setting options for it programmatically.
  2. This is consistent with how many drivers treat options specified both in a connection string and programmatically: programmatic options win out.
  3. It is straightforward for users to override this behavior (by writing logic to read in environment variables and override programmatic defaults), but if we went with the opposite default, it would be more complicated for users to override: not all languages will necessarily have an easy way to override/unset an environment variable from within application code.

Backwards Compatibility

This specification takes the stance that the contents of log messages (both structured and unstructured) are not covered by semantic versioning, but that logging components are, since changing the name of a component or removing a component altogether has the potential to break user logging configuration and cause users to silently miss log messages.

As a result, any drivers that already support logging should be free to update the messages they log to match those defined in various specifications. However, drivers should take care to avoid removing or renaming existing logging components except in major version releases.

Since this specification defines no particular API, drivers are free to keep any existing programmatic APIs they have for configuring logging. If such APIs are incompatible with the logging specification requirements (for example, the driver defines its own set of log levels in a public type, which do not match the spec-defined levels), changes to match the specification should be staged in via semantic versioning.

Reference Implementation

Links to be added once Rust and C# implementations have been merged.

Security Implications

Drivers must take care to avoid exposing sensitive information (e.g. authentication credentials) in log messages. It will be up to each individual specification that defines log messages to define which information should be redacted and add tests confirming its redaction.

Future work

Additional Log Components

Additional log components may be added as logging is added to more specifications.

Preventing Recursion When Using MongoDB as a Log Sink

If a user chooses to store log messages produced by a driver in MongoDB, it may be possible for them to end up recursing infinitely if each write to store a log message generates additional log messages. This has historically not been an issue in drivers that already produce log messages, or with the command monitoring API, but if users start to run into this issue, we could try to address it at the specification level by e.g. requiring drivers to support disabling logging on individual clients or for particular namespaces.

Changelog

  • 2024-03-06: Migrated from reStructuredText to Markdown.

  • 2022-11-16: Add policy on severity level for log messages corresponding to driver exceptions.

  • 2022-11-11: Clarify that guidance around null values only strictly applies to spec-defined log messages.

  • 2022-10-26: Allow drivers to add timestamps to log messages.

  • 2022-11-10: Clarify driver-specific null omission.

  • 2022-12-29: Fix typo in trace log level example

  • 2023-01-04: Elaborate on treatment of invalid values of environment variables.
    Permit drivers to omit direct support for logging to file so long as they provide a straightforward way for users to consume the log messages programmatically and write to a file themselves. Require that programmatic configuration take precedence over environment variables.

Connection Monitoring and Pooling

  • Status: Accepted
  • Minimum Server Version: N/A

Abstract

Drivers currently support a variety of options that allow users to configure connection pooling behavior. Users are confused by drivers supporting different subsets of these options. Additionally, drivers implement their connection pools differently, making it difficult to design cross-driver pool functionality. By unifying and codifying pooling options and behavior across all drivers, we will increase user comprehension and code base maintainability.

This specification does not apply to drivers that do not support multitasking.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Definitions

Connection

A Connection (when linked) refers to the Connection type defined in the Connection Pool Members section of this specification. It does not refer to an actual TCP connection to an Endpoint. A Connection will attempt to create and wrap such a TCP connection over the course of its existence, but it is not equivalent to one nor does it wrap an active one at all times.

For the purposes of testing, a mocked Connection type could be used with the pool that never actually creates a TCP connection or performs any I/O.

Endpoint

For convenience, an Endpoint refers to either a mongod or mongos instance.

Thread

For convenience, a Thread refers to:

  • A shared-address-space process (a.k.a. a thread) in multi-threaded drivers
  • An Execution Frame / Continuation in asynchronous drivers
  • A goroutine in Go

Behavioral Description

Which Drivers this applies to

This specification is solely concerned with drivers that implement a connection pool. A driver SHOULD implement a connection pool, but is not required to.

Connection Pool Options

All drivers that implement a connection pool MUST implement and conform to the same MongoClient options. There can be slight deviation in naming to make the options idiomatic to the driver language.

Connection Pool Behaviors

All driver connection pools MUST provide an API that allows the driver to check out a connection, check in a connection back to the pool, and clear all connections in the pool. This API is for internal use only, and SHOULD NOT be documented as a public API.

Connection Pool Monitoring

All drivers that implement a connection pool MUST provide an API that allows users to subscribe to events emitted from the pool. Conceptually, event emission is instantaneous, i.e., one may talk about the instant an event is emitted, and it represents the start of the activity of delivering the event to a subscribed user.

Detailed Design

Connection Pool Options

Drivers that implement a Connection Pool MUST support the following ConnectionPoolOptions:

interface ConnectionPoolOptions {
  /**
   *  The maximum number of Connections that may be associated
   *  with a pool at a given time. This includes in use and
   *  available connections.
   *  If specified, MUST be an integer >= 0.
   *  A value of 0 means there is no limit.
   *  Defaults to 100.
   */
  maxPoolSize?: number;

  /**
   *  The minimum number of Connections that MUST exist at any moment
   *  in a single connection pool.
   *  If specified, MUST be an integer >= 0. If maxPoolSize is > 0
   *  then minPoolSize must be <= maxPoolSize
   *  Defaults to 0.
   */
  minPoolSize?: number;

  /**
   *  The maximum amount of time a Connection should remain idle
   *  in the connection pool before being marked idle.
   *  If specified, MUST be a number >= 0.
   *  A value of 0 means there is no limit.
   *  Defaults to 0.
   */
  maxIdleTimeMS?: number;

  /**
   *  The maximum number of Connections a Pool may be establishing concurrently.
   *  Establishment of a Connection is a part of its life cycle
   *  starting after a ConnectionCreatedEvent and ending before a ConnectionReadyEvent.
   *  If specified, MUST be a number > 0.
   *  Defaults to 2.
   */
  maxConnecting?: number;
}

Additionally, Drivers that implement a Connection Pool MUST support the following ConnectionPoolOptions UNLESS that driver meets ALL of the following conditions:

interface ConnectionPoolOptions {
  /**
   *  NOTE: This option has been deprecated in favor of timeoutMS.
   *
   *  The maximum amount of time a thread can wait for
   *  either an available non-perished connection (limited by `maxPoolSize`),
   *  or a pending connection (limited by `maxConnecting`).
   *  If specified, MUST be a number >= 0.
   *  A value of 0 means there is no limit.
   *  Defaults to 0.
   */
  waitQueueTimeoutMS?: number;
}

These options MUST be specified at the MongoClient level, and SHOULD be named in a manner idiomatic to the driver's language. All connection pools created by a MongoClient MUST use the same ConnectionPoolOptions.

When parsing a mongodb connection string, a user MUST be able to specify these options using the default names specified above.
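
For example, the following (illustrative) connection string sets each of the options above by its default name:

// Illustrative connection string using the default option names.
const uri =
  "mongodb://localhost:27017/?maxPoolSize=50&minPoolSize=5&maxIdleTimeMS=60000&maxConnecting=2";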

Deprecated Options

The following ConnectionPoolOptions are considered deprecated. They MUST NOT be implemented if they do not already exist in a driver, and they SHOULD be deprecated and removed from drivers that implement them as early as possible:

interface ConnectionPoolOptions {
  /**
   *  The maximum number of threads that can simultaneously wait
   *  for a Connection to become available.
   */
  waitQueueSize?: number;

  /**
   *  An alternative way of setting waitQueueSize, it specifies
   *  the maximum number of threads that can wait per connection.
   *  waitQueueSize === waitQueueMultiple * maxPoolSize
   */
  waitQueueMultiple?: number
}

Connection Pool Members

Connection

A driver-defined wrapper around a single TCP connection to an Endpoint. A Connection has the following properties:

  • Single Endpoint: A Connection MUST be associated with a single Endpoint. A Connection MUST NOT be associated with multiple Endpoints.
  • Single Lifetime: A Connection MUST NOT be used after it is closed.
  • Single Owner: A Connection MUST belong to exactly one Pool, and MUST NOT be shared across multiple pools
  • Single Track: A Connection MUST limit itself to one request / response at a time. A Connection MUST NOT multiplex/pipeline requests to an Endpoint.
  • Monotonically Increasing ID: A Connection MUST have an ID number associated with it. Connection IDs within a Pool MUST be assigned in order of creation, starting at 1 and increasing by 1 for each new Connection.
  • Valid Connection: A connection MUST NOT be checked out of the pool until it has successfully and fully completed a MongoDB Handshake and Authentication as specified in the Handshake, OP_COMPRESSED, and Authentication specifications.
  • Perishable: it is possible for a Connection to become Perished. A Connection is considered perished if any of the following are true:
    • Stale: The Connection 's generation does not match the generation of the parent pool
    • Idle: The Connection is currently "available" (as defined below) and has been for longer than maxIdleTimeMS.
    • Errored: The Connection has experienced an error that indicates it is no longer recommended for use. Examples include, but are not limited to:
      • Network Error
      • Network Timeout
      • Endpoint closing the connection
      • Driver-Side Timeout
      • Wire-Protocol Error
interface Connection {
  /**
   *  An id number associated with the Connection
   */
  id: number;

  /**
   *  The address of the pool that owns this Connection
   */
  address: string;

  /**
   *  An integer representing the "generation" of the pool
   *  when this Connection was created.
   */
  generation: number;

  /**
   * The current state of the Connection.
   *
   * Possible values are the following:
   *   - "pending":       The Connection has been created but has not yet been established. Contributes to
   *                      totalConnectionCount and pendingConnectionCount.
   *
   *   - "available":     The Connection has been established and is waiting in the pool to be checked
   *                      out. Contributes to both totalConnectionCount and availableConnectionCount.
   *
   *   - "in use":        The Connection has been established, checked out from the pool, and has yet
   *                      to be checked back in. Contributes to totalConnectionCount.
   *
   *   - "closed":        The Connection has had its socket closed and cannot be used for any future
   *                      operations. Does not contribute to any connection counts.
   *
   * Note: this field is mainly used for the purposes of describing state
   * in this specification. It is not required that drivers
   * actually include this field in their implementations of Connection.
   */
  state: "pending" | "available" | "in use" | "closed";
}

WaitQueue

A concept that represents pending requests for Connections. When a thread requests a Connection from a Pool, the thread enters the Pool's WaitQueue. A thread stays in the WaitQueue until it either receives a Connection or times out. A WaitQueue has the following traits:

  • Thread-Safe: When multiple threads attempt to enter or exit a WaitQueue, they do so in a thread-safe manner.
  • Ordered/Fair: When Connections are made available, they are issued out to threads in the order that the threads entered the WaitQueue.
  • Timeout aggressively: Members of a WaitQueue MUST timeout if they are enqueued for longer than the computed timeout and MUST leave the WaitQueue immediately in this case.

The implementation details of a WaitQueue are left to the driver. Example implementations include:

  • A fair Semaphore
  • A Queue of callbacks
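
As a rough illustration of the second option, a WaitQueue can be modeled as a FIFO of callbacks with a per-waiter timeout. This sketch is not part of the specification; names and error text are illustrative.

// Illustrative sketch: a fair (FIFO) wait queue implemented as a queue of callbacks.
type Waiter<T> = {
  resolve: (value: T) => void;
  reject: (err: Error) => void;
  timer: ReturnType<typeof setTimeout>;
};

class CallbackWaitQueue<T> {
  private waiters: Waiter<T>[] = [];

  // Enter the queue; the returned promise settles when a value is handed to this
  // waiter or when the timeout elapses, whichever comes first.
  enter(timeoutMS: number): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const waiter: Waiter<T> = {
        resolve,
        reject,
        timer: setTimeout(() => {
          // Timeout aggressively: leave the queue immediately.
          this.waiters = this.waiters.filter((w) => w !== waiter);
          reject(new Error("Timed out while waiting for a Connection"));
        }, timeoutMS),
      };
      this.waiters.push(waiter);
    });
  }

  // Hand a value to the waiter at the front of the queue (ordered/fair).
  // Returns false if no one is waiting.
  handOff(value: T): boolean {
    const waiter = this.waiters.shift();
    if (!waiter) return false;
    clearTimeout(waiter.timer);
    waiter.resolve(value);
    return true;
  }
}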

Connection Pool

A driver-defined entity that encapsulates all non-monitoring Connections associated with a single Endpoint. The pool has the following properties:

  • Thread Safe: All Pool behaviors MUST be thread safe.
  • Not Fork-Safe: A Pool is explicitly not fork-safe. If a Pool detects that it is being used by a forked process, it MUST immediately clear itself and update its pid.
  • Single Owner: A Pool MUST be associated with exactly one Endpoint, and MUST NOT be shared between Endpoints.
  • Emit Events and Log Messages: A Pool MUST emit pool events and log messages when dictated by this spec (see Connection Pool Monitoring). Users MUST be able to subscribe to emitted events and log messages in a manner idiomatic to their language and driver.
  • Closeable: A Pool MUST be able to be manually closed. When a Pool is closed, the following behaviors change:
    • Checking in a Connection to the closed Pool closes that Connection.
    • Attempting to check out a Connection from the Pool results in an error.
  • Clearable: A Pool MUST be able to be cleared. Clearing the pool marks all pooled and checked out Connections as stale and lazily closes them as they are checkedIn or encountered in checkOut. Additionally, all requests are evicted from the WaitQueue and return errors that are considered non-timeout network errors.
  • Pausable: A Pool MUST be able to be paused and resumed. A Pool is paused automatically when it is cleared, and it can be resumed by being marked as "ready". While the Pool is paused, it exhibits the following behaviors:
    • Attempting to check out a Connection from the Pool results in a non-timeout network error
    • Connections are not created in the background to satisfy minPoolSize
  • Capped: a pool is capped if maxPoolSize is set to a non-zero value. If a pool is capped, then its total number of Connections (including available and in use) MUST NOT exceed maxPoolSize
  • Rate-limited: A Pool MUST limit the number of Connections being established concurrently via the maxConnecting pool option.
interface ConnectionPool {
  /**
   *  The Queue of threads waiting for a Connection to be available
   */
  waitQueue: WaitQueue;

  /**
   *  A generation number representing the SDAM generation of the pool.
   */
  generation: number;

  /**
   * A map representing the various generation numbers for various services
   * when in load balancer mode.
   */
  serviceGenerations: Map<ObjectId, [number, number]>;

  /**
   * The state of the pool.
   *
   * Possible values are the following:
   *   - "paused":        The initial state of the pool. Connections may not be checked out nor can they
   *                      be established in the background to satisfy minPoolSize. Clearing a pool
   *                      transitions it to this state.
   *
   *   - "ready":         The healthy state of the pool. It can service checkOut requests and create
   *                      connections in the background. The pool can be set to this state via the
   *                      ready() method.
   *
   *   - "closed":        The pool is destroyed. No more Connections may ever be checked out nor any
   *                      created in the background. The pool can be set to this state via the close()
   *                      method. The pool cannot transition to any other state after being closed.
   */
  state: "paused" | "ready" | "closed";

  // Any of the following connection counts may be computed rather than
  // actually stored on the pool.

  /**
   *  An integer expressing how many total Connections
   *  ("pending" + "available" + "in use") the pool currently has
   */
  totalConnectionCount: number;

  /**
   *  An integer expressing how many Connections are currently
   *  available in the pool.
   */
  availableConnectionCount: number;

  /**
   *  An integer expressing how many Connections are currently
   *  being established.
   */
  pendingConnectionCount: number;

  /**
   *  Returns a Connection for use
   */
  checkOut(): Connection;

  /**
   *  Check in a Connection back to the Connection pool
   */
  checkIn(connection: Connection): void;

  /**
   *  Mark all current Connections as stale, clear the WaitQueue, and mark the pool as "paused".
   *  No connections may be checked out or created in this pool until ready() is called again.
   *  interruptInUseConnections specifies whether the pool will force interrupt "in use" connections as part of the clear. 
   *  Default false.
   */
  clear(interruptInUseConnections: Optional<Boolean>): void;

  /**
   *  Mark the pool as "ready", allowing checkOuts to resume and connections to be created in the background.
   *  A pool can only transition from "paused" to "ready". A "closed" pool
   *  cannot be marked as "ready" via this method.
   */
  ready(): void;

  /**
   *  Marks the pool as "closed", preventing the pool from creating and returning new Connections
   */
  close(): void;
}

Connection Pool Behaviors

Creating a Connection Pool

This specification does not define how a pool is to be created, leaving it up to the driver. Creation of a connection pool is generally an implementation detail of the driver, i.e., is not a part of the public API of the driver. The SDAM specification defines when the driver should create connection pools.

When a pool is created, its state MUST initially be set to "paused". Even if minPoolSize is set, the pool MUST NOT begin being populated with Connections until it has been marked as "ready". SDAM will mark the pool as "ready" on each successful check. See Connection Pool Management section in the SDAM specification for more information.

set generation to 0
set state to "paused"
emit PoolCreatedEvent and equivalent log message

Closing a Connection Pool

When a pool is closed, it MUST first close all available Connections in that pool. This results in the following behavior changes:

  • In use Connections MUST be closed when they are checked in to the closed pool.
  • Attempting to check out a Connection MUST result in an error.
mark pool as "closed"
for connection in availableConnections:
  close connection
emit PoolClosedEvent and equivalent log message

Marking a Connection Pool as Ready

Connection Pools start off as "paused", and they are marked as "ready" by monitors after they perform successful server checks. Once a pool is "ready", it can start checking out Connections and populating them in the background.

If the pool is already "ready" when this method is invoked, then this method MUST immediately return and MUST NOT emit a PoolReadyEvent.

mark pool as "ready"
emit PoolReadyEvent and equivalent log message
allow background thread to create connections

Note that the PoolReadyEvent MUST be emitted before the background thread is allowed to resume creating new connections, and it must be the case that no observer is able to observe actions of the background thread related to creating new connections before observing the PoolReadyEvent.

Creating a Connection (Internal Implementation)

When creating a Connection, the initial Connection is in a "pending" state. This only creates a "virtual" Connection, and performs no I/O.

connection = new Connection()
increment totalConnectionCount
increment pendingConnectionCount
set connection state to "pending"
tConnectionCreated = current instant (use a monotonic clock if possible)
emit ConnectionCreatedEvent and equivalent log message
return connection

Establishing a Connection (Internal Implementation)

Before a Connection can be marked as either "available" or "in use", it must be established. This process involves performing the initial handshake, handling OP_COMPRESSED, and performing authentication.

try:
  connect connection via TCP / TLS
  perform connection handshake
  handle OP_COMPRESSED
  perform connection authentication
  tConnectionReady = current instant (use a monotonic clock if possible)
  emit ConnectionReadyEvent(duration = tConnectionReady - tConnectionCreated) and equivalent log message
  return connection
except error:
  close connection
  throw error # Propagate error in manner idiomatic to language.

Closing a Connection (Internal Implementation)

When a Connection is closed, it MUST first be marked as "closed", removing it from being counted as "available" or "in use". Once that is complete, the Connection can perform whatever teardown is necessary to close its underlying socket. The Driver SHOULD perform this teardown in a non-blocking manner, such as via the use of a background thread or async I/O.

original state = connection state
set connection state to "closed"

if original state is "available":
  decrement availableConnectionCount
else if original state is "pending":
  decrement pendingConnectionCount

decrement totalConnectionCount
emit ConnectionClosedEvent and equivalent log message

# The following can happen at a later time (i.e. in background
# thread) or via non-blocking I/O.
connection.socket.close()

Marking a Connection as Available (Internal Implementation)

A Connection is "available" if it is able to be checked out. A Connection MUST NOT be marked as "available" until it has been established. The pool MUST keep track of the number of currently available Connections.

increment availableConnectionCount
set connection state to "available"
add connection to availableConnections

Populating the Pool with a Connection (Internal Implementation)

"Populating" the pool involves preemptively creating and establishing a Connection which is marked as "available" for use in future operations.

Populating the pool MUST NOT block any application threads. For example, it could be performed on a background thread or via the use of non-blocking/async I/O. Populating the pool MUST NOT be performed unless the pool is "ready".

If an error is encountered while populating a connection, it MUST be handled via the SDAM machinery according to the Application Errors section in the SDAM specification.

If minPoolSize is set, the Connection Pool MUST be populated until it has at least minPoolSize total Connections. This MUST occur only while the pool is "ready". If the pool implements a background thread, it can be used for this. If the pool does not implement a background thread, the checkOut method is responsible for ensuring this requirement is met.

When populating the Pool, pendingConnectionCount has to be decremented after establishing a Connection similarly to how it is done in Checking Out a Connection to signal that another Connection is allowed to be established. Such a signal MUST become observable to any Thread after the action that marks the established Connection as "available" becomes observable to the Thread. Informally, this order guarantees that no Thread tries to start establishing a Connection when there is an "available" Connection established as a result of populating the Pool.

wait until pendingConnectionCount < maxConnecting and pool is "ready"
create connection
try:
  establish connection
  mark connection as available
except error:
  # Defer error handling to SDAM.
  topology.handle_pre_handshake_error(error)

Checking Out a Connection

A Pool MUST have a method that allows the driver to check out a Connection. Checking out a Connection involves submitting a request to the WaitQueue and, once that request reaches the front of the queue, having the Pool find or create a Connection to fulfill that request. Requests MUST be subject to a timeout which is computed per the rules in Client Side Operations Timeout: Server Selection.

To service a request for a Connection, the Pool MUST first iterate over the list of available Connections, searching for a non-perished one to be returned. If a perished Connection is encountered, such a Connection MUST be closed (as described in Closing a Connection) and the iteration of available Connections MUST continue until either a non-perished available Connection is found or the list of available Connections is exhausted.

If the list is exhausted, the total number of Connections is less than maxPoolSize, and pendingConnectionCount < maxConnecting, the pool MUST create a Connection, establish it, mark it as "in use" and return it. If totalConnectionCount == maxPoolSize or pendingConnectionCount == maxConnecting, then the pool MUST wait to service the request until neither of those two conditions are met or until a Connection becomes available, re-entering the checkOut loop in either case. This waiting MUST NOT prevent Connections from being checked into the pool. Additionally, the Pool MUST NOT service any newer checkOut requests before fulfilling the original one which could not be fulfilled. For drivers that implement the WaitQueue via a fair semaphore, a condition variable may also be needed to meet this requirement. Waiting on the condition variable SHOULD also be limited by the WaitQueueTimeout, if the driver supports one and it was specified by the user.

If the pool is "closed" or "paused", any attempt to check out a Connection MUST throw an Error. The error thrown as a result of the pool being "paused" MUST be considered a retryable error and MUST NOT be an error that marks the SDAM state unknown.

If the pool does not implement a background thread, the checkOut method is responsible for ensuring that the pool is populated with at least minPoolSize Connections.

A Connection MUST NOT be checked out until it is established. In addition, the Pool MUST NOT prevent other threads from checking out Connections while establishing a Connection.

Before a given Connection is returned from checkOut, it must be marked as "in use", and the pool's availableConnectionCount MUST be decremented.

connection = Null
tConnectionCheckOutStarted = current instant (use a monotonic clock if possible)
emit ConnectionCheckOutStartedEvent and equivalent log message
try:
  enter WaitQueue
  wait until at top of wait queue
  # Note that a lock-based implementation of the wait queue would
  # only allow one thread in the following block at a time
  while connection is Null:
    if a connection is available:
      while connection is Null and a connection is available:
        connection = next available connection
        if connection is perished:
          close connection
          connection = Null
    else if totalConnectionCount < maxPoolSize:
      if pendingConnectionCount < maxConnecting:
        connection = create connection
      else:
        # this waiting MUST NOT prevent other threads from checking Connections
        # back in to the pool.
        wait until pendingConnectionCount < maxConnecting or a connection is available
        continue

except pool is "closed":
  tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
  emit ConnectionCheckOutFailedEvent(reason="poolClosed", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
  throw PoolClosedError
except pool is "paused":
  tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
  emit ConnectionCheckOutFailedEvent(reason="connectionError", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
  throw PoolClearedError
except timeout:
  tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
  emit ConnectionCheckOutFailedEvent(reason="timeout", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
  throw WaitQueueTimeoutError
finally:
  # This must be done in all drivers
  leave wait queue

# If the Connection has not been established yet (TCP, TLS,
# handshake, compression, and auth), it must be established
# before it is returned.
# This MUST NOT block other threads from acquiring connections.
if connection state is "pending":
  try:
    establish connection
  except connection establishment error:
    tConnectionCheckOutFailed = current instant (use a monotonic clock if possible)
    emit ConnectionCheckOutFailedEvent(reason="connectionError", duration = tConnectionCheckOutFailed - tConnectionCheckOutStarted) and equivalent log message
    decrement totalConnectionCount
    throw
  finally:
    decrement pendingConnectionCount
else:
  decrement availableConnectionCount
set connection state to "in use"

# If there is no background thread, the pool MUST ensure that
# there are at least minPoolSize total connections.
do asynchronously:
  while totalConnectionCount < minPoolSize:
    populate the pool with a connection

tConnectionCheckedOut = current instant (use a monotonic clock if possible)
emit ConnectionCheckedOutEvent(duration = tConnectionCheckedOut - tConnectionCheckOutStarted) and equivalent log message
return connection

Checking In a Connection

A Pool MUST have a method of allowing the driver to check in a Connection. The driver MUST NOT be allowed to check in a Connection to a Pool that did not create that Connection, and MUST throw an Error if this is attempted.

When the Connection is checked in, it MUST be closed if any of the following are true:

  • The Connection is perished.
  • The pool has been closed.

Otherwise, the Connection is marked as available.

emit ConnectionCheckedInEvent and equivalent log message
if connection is perished OR pool is closed:
  close connection
else:
  mark connection as available

Clearing a Connection Pool

Clearing the pool involves different steps depending on whether the pool is in load balanced mode or not. The traditional / non-load balanced clearing behavior MUST NOT be used by pools in load balanced mode, and the load balanced pool clearing behavior MUST NOT be used in non-load balanced pools.

Clearing a non-load balanced pool

A Pool MUST have a method of clearing all Connections when instructed. Rather than iterating through every Connection, this method should simply increment the generation of the Pool, implicitly marking all current Connections as stale. It should also transition the pool's state to "paused" to halt the creation of new connections until it is marked as "ready" again. The checkOut and checkIn algorithms will handle clearing out stale Connections. If a user is subscribed to Connection Monitoring events and/or connection log messages, a PoolClearedEvent and log message MUST be emitted after incrementing the generation / marking the pool as "paused". If the pool is already "paused" when it is cleared, then the pool MUST NOT emit a PoolCleared event or log message.

As part of clearing the pool, the WaitQueue MUST also be cleared, meaning all requests in the WaitQueue MUST fail with errors indicating that the pool was cleared while the checkOut was being performed. The error returned as a result of the pool being cleared MUST be considered a retryable error and MUST NOT be an error that marks the SDAM state unknown. Clearing the WaitQueue MUST happen eagerly so that any operations waiting on Connections can retry as soon as possible. The pool MUST NOT rely on WaitQueueTimeoutMS to clear requests from the WaitQueue.

The clearing method MUST provide the option to interrupt any in-use connections as part of the clearing (henceforth referred to as the interruptInUseConnections flag in this specification). "Interrupting a Connection" is defined as canceling whatever task the Connection is currently performing and marking the Connection as perished (e.g. by closing its underlying socket). The interrupting of these Connections MUST be performed as soon as possible but MUST NOT block the pool or prevent it from processing further requests. If the pool has a background thread, and it is responsible for interrupting in-use connections, its next run MUST be scheduled as soon as possible.

The pool MUST only interrupt in-use Connections whose generation is less than or equal to the generation of the pool at the moment of the clear (before the increment) that used the interruptInUseConnections flag. Any operations that have their Connections interrupted in this way MUST fail with a retryable error. If possible, the error SHOULD be a PoolClearedError with the following message: "Connection to <pool address> interrupted due to server monitor timeout".
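The following is a minimal sketch of the non-load-balanced clear algorithm described above; the field and method names (`_generation`, `_state`, `_wait_queue`, `_request_background_run`) are illustrative, not part of the specification:

```python
def clear(self, interrupt_in_use_connections=False, cause=None):
    with self._lock:
        was_paused = self._state == "paused"
        generation_at_clear = self._generation
        self._generation += 1      # implicitly marks all current Connections stale
        self._state = "paused"     # halt new Connection creation until marked "ready"
        if not was_paused:
            self._emit_pool_cleared(interrupt_in_use_connections)  # event + log message
        # Eagerly fail every waiting checkOut request with a retryable error.
        while self._wait_queue:
            self._wait_queue.popleft().fail(PoolClearedError(self._address, cause))
    if interrupt_in_use_connections:
        # Must not block; ask the background thread to close "in use" Connections
        # whose generation is <= generation_at_clear as soon as possible.
        self._request_background_run(interrupt_generation=generation_at_clear)
```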

Clearing a load balanced pool

A Pool MUST also have a method of clearing all Connections for a specific serviceId for use when in load balancer mode. This method increments the generation of the pool for that specific serviceId in the generation map. A PoolClearedEvent and log message MUST be emitted after incrementing the generation. Note that this method MUST NOT transition the pool to the "paused" state and MUST NOT clear the WaitQueue.

Load Balancer Mode

For load-balanced deployments, pools MUST maintain a map from serviceId to a tuple of (generation, connection count) where the connection count refers to the total number of connections that exist for a specific serviceId. The pool MUST remove the entry for a serviceId once the connection count reaches 0. Once the MongoDB handshake is done, the connection MUST get the generation number that applies to its serviceId from the map and update the map to increment the connection count for this serviceId.

See the Load Balancer Specification for details.
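A sketch of such a serviceId generation map is shown below; it is illustrative only and does not reflect any specific driver's data structures:

```python
import collections
import threading

class ServiceGenerationMap:
    """Maps serviceId -> [generation, connection count] for load balanced mode."""

    def __init__(self):
        self._entries = collections.defaultdict(lambda: [0, 0])
        self._lock = threading.Lock()

    def connection_established(self, service_id):
        """Called after the MongoDB handshake; returns the Connection's generation."""
        with self._lock:
            entry = self._entries[service_id]
            entry[1] += 1
            return entry[0]

    def connection_closed(self, service_id):
        with self._lock:
            entry = self._entries[service_id]
            entry[1] -= 1
            if entry[1] <= 0:
                del self._entries[service_id]   # remove the entry once the count reaches 0

    def clear(self, service_id):
        """Load balanced clear: increment the generation for one serviceId only."""
        with self._lock:
            self._entries[service_id][0] += 1
```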

Forking

A Connection is explicitly not fork-safe. The proper behavior in the case of a fork is to ResetAfterFork by:

  • clearing all Connection Pools in the child process
  • closing all Connections in the child process

Drivers that support forking MUST document that Connections to an Endpoint are not fork-safe, and document the proper way to ResetAfterFork in the driver.

Drivers MAY aggressively ResetAfterFork if the driver detects it has been forked.
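One common way to detect a fork and aggressively ResetAfterFork is to record the process ID at client construction time and compare it before using the pool. This is a hedged sketch; `connection_pools()` and `close_all_connections()` are hypothetical names:

```python
import os

class ForkAwareClient:
    def __init__(self):
        self._pid = os.getpid()

    def _check_for_fork(self):
        if os.getpid() != self._pid:
            # Running in a child process: sockets inherited from the parent
            # must not be reused.
            self._pid = os.getpid()
            for pool in self.connection_pools():   # hypothetical accessor
                pool.clear()                       # clear every Connection Pool ...
                pool.close_all_connections()       # ... and close inherited Connections
```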

Optional Behaviors

The following features of a Connection Pool SHOULD be implemented if they make sense in the driver and driver's language.

Background Thread

A Pool SHOULD have a background Thread that is responsible for monitoring the state of all available Connections. This background thread SHOULD:

  • Populate Connections to ensure that the pool always satisfies minPoolSize.
  • Remove and close perished available Connections, including "in use" Connections if the interruptInUseConnections option was specified on a clear.

A pool SHOULD allow immediate scheduling of the next background thread iteration after a clear is performed.

Conceptually, the aforementioned activities are organized into sequential Background Thread Runs. A Run MUST do as much work as readily available and then end instead of waiting for more work. For example, instead of waiting for pendingConnectionCount to become less than maxConnecting when satisfying minPoolSize, a Run MUST either proceed with the rest of its duties, e.g., closing available perished connections, or end.

The duration of intervals between the end of one Run and the beginning of the next Run is not specified, but the Test Format and Runner Specification may restrict this duration, or introduce other restrictions to facilitate testing.
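A single Run might look like the following sketch, which does whatever work is readily available and then ends rather than waiting; the field names are illustrative:

```python
def background_run(self):
    # 1. Remove and close perished available Connections.
    for connection in list(self._available):
        if connection.is_perished():
            self._available.remove(connection)
            connection.close()   # emits ConnectionClosedEvent with the appropriate reason
    # 2. Top up toward minPoolSize, but never wait for a maxConnecting slot:
    #    if no slot is free, end the Run and let the next Run retry.
    while (self._total_connection_count < self._min_pool_size
           and self._pending_connection_count < self._max_connecting
           and self._state == "ready"):
        self._populate_one_connection()   # see "Populating the Pool"
    # The Run ends here; the interval before the next Run is driver-defined.
```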

withConnection

A Pool SHOULD implement a scoped resource management mechanism idiomatic to their language to prevent Connections from not being checked in. Examples include Python's "with" statement and C#'s "using" statement. If implemented, drivers SHOULD use this method as the default method of checking out and checking in Connections.
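For example, in Python such a mechanism could be a context manager (a hedged sketch; the checkOut/checkIn method names are illustrative):

```python
import contextlib

@contextlib.contextmanager
def with_connection(pool):
    """Check out a Connection and guarantee it is checked back in."""
    connection = pool.check_out()
    try:
        yield connection
    finally:
        pool.check_in(connection)

# Usage: the Connection is returned to the pool even if the operation raises.
# with with_connection(pool) as connection:
#     connection.send_command({"ping": 1})
```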

Connection Pool Monitoring

All drivers that implement a connection pool MUST provide an API that allows users to subscribe to events emitted from the pool. If a user subscribes to Connection Monitoring events, these events MUST be emitted when specified in "Connection Pool Behaviors". Events SHOULD be created and subscribed to in a manner idiomatic to their language and driver.
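As an illustration only (not a required API), a driver might expose a listener interface like the following and let users register instances of it when constructing a client:

```python
class ConnectionPoolListener:
    """Illustrative subscriber interface; real drivers name these hooks idiomatically."""

    def pool_created(self, event): ...
    def pool_cleared(self, event): ...
    def connection_checked_out(self, event): ...
    def connection_checked_in(self, event): ...


class CheckedOutCounter(ConnectionPoolListener):
    """Example subscriber that tracks how many Connections are currently checked out."""

    def __init__(self):
        self.checked_out = 0

    def connection_checked_out(self, event):
        self.checked_out += 1

    def connection_checked_in(self, event):
        self.checked_out -= 1

# e.g. client = MongoClient(uri, event_listeners=[CheckedOutCounter()])  # PyMongo-style registration
```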

Events

See the Load Balancer Specification for details on the serviceId field.

/**
 *  Emitted when a Connection Pool is created
 */
interface PoolCreatedEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  Any non-default pool options that were set on this Connection Pool.
   */
  options: {...}
}

/**
 *  Emitted when a Connection Pool is marked as ready.
 */
interface PoolReadyEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;
}

/**
 *  Emitted when a Connection Pool is cleared
 */
interface PoolClearedEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   * The service id for which the pool was cleared in load balancing mode.
   * See load balancer specification for more information about this field.
   */
  serviceId: Optional<ObjectId>;

  /**
   * A flag indicating whether the pool forced interrupting "in use" connections as part of the clear.
   */
  interruptInUseConnections: Optional<Boolean>;
}

/**
 *  Emitted when a Connection Pool is closed
 */
interface PoolClosedEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;
}

/**
 *  Emitted when a Connection Pool creates a Connection object.
 *  NOTE: This does not mean that the Connection is ready for use.
 */
interface ConnectionCreatedEvent { 
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  The ID of the Connection
   */
  connectionId: int64;
}

/**
 *  Emitted when a Connection has finished its setup, and is now ready to use
 */
interface ConnectionReadyEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  The ID of the Connection
   */
  connectionId: int64;

  /**
   * The time it took to establish the connection.
   * In accordance with the definition of establishment of a connection
   * specified by `ConnectionPoolOptions.maxConnecting`,
   * it is the time elapsed between emitting a `ConnectionCreatedEvent`
   * and emitting this event as part of the same checking out.
   *
   * Naturally, when establishing a connection is part of checking out,
   * this duration is not greater than
   * `ConnectionCheckedOutEvent`/`ConnectionCheckOutFailedEvent.duration`.
   *
   * A driver MAY choose the type idiomatic to the driver.
   * If the type chosen does not convey units, e.g., `int64`,
   * then the driver MAY include units in the name, e.g., `durationMS`.
   */
  duration: Duration;
}

/**
 *  Emitted when a Connection Pool closes a Connection
 */
interface ConnectionClosedEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  The ID of the Connection
   */
  connectionId: int64;

  /**
   * A reason explaining why this Connection was closed.
   * Can be implemented as a string or enum.
   * Current valid values are:
   *   - "stale":           The pool was cleared, making the Connection no longer valid
   *   - "idle":            The Connection became stale by being available for too long
   *   - "error":           The Connection experienced an error, making it no longer valid
   *   - "poolClosed":      The pool was closed, making the Connection no longer valid
   */
  reason: string|Enum;
}

/**
 *  Emitted when the driver starts attempting to check out a Connection
 */
interface ConnectionCheckOutStartedEvent {
  /**
   * The ServerAddress of the Endpoint the pool is attempting
   * to connect to.
   */
  address: string;
}

/**
 *  Emitted when the driver's attempt to check out a Connection fails
 */
interface ConnectionCheckOutFailedEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  A reason explaining why Connection check out failed.
   *  Can be implemented as a string or enum.
   *  Current valid values are:
   *   - "poolClosed":      The pool was previously closed, and cannot provide new Connections
   *   - "timeout":         The Connection check out attempt exceeded the specified timeout
   *   - "connectionError": The Connection check out attempt experienced an error while setting up a new Connection
   */
  reason: string|Enum;

  /**
   * See `ConnectionCheckedOutEvent.duration`.
   */
  duration: Duration;
}

/**
 *  Emitted when the driver successfully checks out a Connection
 */
interface ConnectionCheckedOutEvent {
  /**
   *  The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  The ID of the Connection
   */
  connectionId: int64;

  /**
   * The time it took to check out the connection.
   * More specifically, the time elapsed between
   * emitting a `ConnectionCheckOutStartedEvent`
   * and emitting this event as part of the same checking out.
   *
   * Naturally, if a new connection was not created (`ConnectionCreatedEvent`)
   * and established (`ConnectionReadyEvent`) as part of checking out,
   * this duration is usually
   * not greater than `ConnectionPoolOptions.waitQueueTimeoutMS`,
   * but MAY occasionally be greater than that,
   * because a driver does not provide hard real-time guarantees.
   *
   * A driver MAY choose the type idiomatic to the driver.
   * If the type chosen does not convey units, e.g., `int64`,
   * then the driver MAY include units in the name, e.g., `durationMS`.
   */
  duration: Duration;
}

/**
 *  Emitted when the driver checks in a Connection back to the Connection Pool
 */
interface ConnectionCheckedInEvent {
  /**
   * The ServerAddress of the Endpoint the pool is attempting to connect to.
   */
  address: string;

  /**
   *  The ID of the Connection
   */
  connectionId: int64;
}

Connection Pool Logging

Please refer to the logging specification for details on logging implementations in general, including log levels, log components, handling of null values in log messages, and structured versus unstructured logging.

Drivers MUST support logging of connection pool information via the following types of log messages. These messages MUST be logged at Debug level and use the connection log component. These messages MUST be emitted when specified in "Connection Pool Behaviors".

The log messages are intended to match the information contained in the events above. Drivers MAY implement connection logging support via an event subscriber if it is convenient to do so.

The types used in the structured message definitions below are demonstrative, and drivers MAY use similar types instead so long as the information is present (e.g. a double instead of an integer, or a string instead of an integer if the structured logging framework does not support numeric types).

Common Fields

All connection log messages MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| serverHost | String | The hostname, IP address, or Unix domain socket path for the endpoint the pool is for. |
| serverPort | Int | The port for the endpoint the pool is for. Optional; not present for Unix domain sockets. When the user does not specify a port and the default (27017) is used, the driver SHOULD include it here. |

Pool Created Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Connection pool created" |
| maxIdleTimeMS | Int | The maxIdleTimeMS value for this pool. Optional; only required to include if the user specified a value. |
| minPoolSize | Int | The minPoolSize value for this pool. Optional; only required to include if the user specified a value. |
| maxPoolSize | Int | The maxPoolSize value for this pool. Optional; only required to include if the user specified a value. |
| maxConnecting | Int | The maxConnecting value for this pool. Optional; only required to include if the driver supports this option and the user specified a value. |
| waitQueueTimeoutMS | Int | The waitQueueTimeoutMS value for this pool. Optional; only required to include if the driver supports this option and the user specified a value. |
| waitQueueSize | Int | The waitQueueSize value for this pool. Optional; only required to include if the driver supports this option and the user specified a value. |
| waitQueueMultiple | Int | The waitQueueMultiple value for this pool. Optional; only required to include if the driver supports this option and the user specified a value. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection pool created for {{serverHost}}:{{serverPort}} using options maxIdleTimeMS={{maxIdleTimeMS}}, minPoolSize={{minPoolSize}}, maxPoolSize={{maxPoolSize}}, maxConnecting={{maxConnecting}}, waitQueueTimeoutMS={{waitQueueTimeoutMS}}, waitQueueSize={{waitQueueSize}}, waitQueueMultiple={{waitQueueMultiple}}

Pool Ready Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Connection pool ready" |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection pool ready for {{serverHost}}:{{serverPort}}

Pool Cleared Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Connection pool cleared" |
| serviceId | String | The hex string representation of the service ID which the pool was cleared for. Optional; only present in load balanced mode. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection pool for {{serverHost}}:{{serverPort}} cleared for serviceId {{serviceId}}

Pool Closed Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Connection pool closed" |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection pool closed for {{serverHost}}:{{serverPort}}

Connection Created Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Connection created" |
| driverConnectionId | Int64 | The driver-generated ID for the connection as defined in Connection. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection created: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}

Connection Ready Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Connection ready" |
| driverConnectionId | Int64 | The driver-generated ID for the connection as defined in Connection. |
| durationMS | Int64 | ConnectionReadyEvent.duration converted to milliseconds. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection ready: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}, established in={{durationMS}} ms

Connection Closed Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Connection closed" |
| driverConnectionId | Int64 | The driver-generated ID for the connection as defined in Connection. |
| reason | String | A string describing the reason the connection was closed. The following strings MUST be used for each possible reason as defined in Events above:<br>- Stale: "Connection became stale because the pool was cleared"<br>- Idle: "Connection has been available but unused for longer than the configured max idle time"<br>- Error: "An error occurred while using the connection"<br>- Pool closed: "Connection pool was closed" |
| error | Flexible | If reason is Error, the associated error. The type and format of this value is flexible; see the logging specification for details on representing errors in log messages. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection closed: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}. Reason: {{reason}}. Error: {{error}}

Connection Checkout Started Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Connection checkout started" |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Checkout started for connection to {{serverHost}}:{{serverPort}}

Connection Checkout Failed Message

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Connection checkout failed" |
| reason | String | A string describing the reason checkout failed. The following strings MUST be used for each possible reason as defined in Events above:<br>- Timeout: "Wait queue timeout elapsed without a connection becoming available"<br>- ConnectionError: "An error occurred while trying to establish a new connection"<br>- Pool closed: "Connection pool was closed" |
| error | Flexible | If reason is ConnectionError, the associated error. The type and format of this value is flexible; see the logging specification for details on representing errors in log messages. |
| durationMS | Int64 | ConnectionCheckOutFailedEvent.duration converted to milliseconds. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Checkout failed for connection to {{serverHost}}:{{serverPort}}. Reason: {{reason}}. Error: {{error}}. Duration: {{durationMS}} ms

Connection Checked Out

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Connection checked out" |
| driverConnectionId | Int64 | The driver-generated ID for the connection as defined in Connection. |
| durationMS | Int64 | ConnectionCheckedOutEvent.duration converted to milliseconds. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection checked out: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}, duration={{durationMS}} ms

Connection Checked In

In addition to the common fields defined above, this message MUST contain the following key-value pairs:

| Key | Suggested Type | Value |
| --- | -------------- | ----- |
| message | String | "Connection checked in" |
| driverConnectionId | Int64 | The driver-generated ID for the connection as defined in Connection. |

The unstructured form SHOULD be as follows, using the values defined in the structured format above to fill in placeholders as appropriate:

Connection checked in: address={{serverHost}}:{{serverPort}}, driver-generated ID={{driverConnectionId}}

Connection Pool Errors

A connection pool throws errors in specific circumstances. These Errors MUST be emitted by the pool. Errors SHOULD be created and dispatched in a manner idiomatic to the Driver and Language.

/**
 *  Thrown when the driver attempts to check out a
 *  Connection from a closed Connection Pool
 */
interface PoolClosedError {
  message: 'Attempted to check out a Connection from closed connection pool';
  address: <pool address>;
}

/**
 *  Thrown when the driver attempts to check out a
 *  Connection from a paused Connection Pool
 */
interface PoolClearedError extends RetryableError {
  message: 'Connection pool for <pool address> was cleared because another operation failed with: <original error which cleared the pool>';
  address: <pool address>;
}

/**
 *  Thrown when a driver times out when attempting to check out
 *  a Connection from a Pool
 */
interface WaitQueueTimeoutError {
  message: 'Timed out while checking out a Connection from connection pool';
  address: <pool address>;
}
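An illustrative Python rendering of the error interfaces above, with PoolClearedError flagged as retryable, might look like this (a sketch, not a prescribed class hierarchy):

```python
class PoolClosedError(Exception):
    def __init__(self, address):
        super().__init__("Attempted to check out a Connection from closed connection pool")
        self.address = address


class PoolClearedError(Exception):
    # Considered retryable, and MUST NOT mark the SDAM server state Unknown.
    retryable = True

    def __init__(self, address, cause):
        super().__init__(
            f"Connection pool for {address} was cleared because another "
            f"operation failed with: {cause}")
        self.address = address


class WaitQueueTimeoutError(Exception):
    def __init__(self, address):
        super().__init__("Timed out while checking out a Connection from connection pool")
        self.address = address
```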

Test Plan

See tests/README

Design Rationale

Why do we set minPoolSize across all members of a replicaSet, when most traffic will be against a Primary?

Currently, we are attempting to codify our current pooling behavior with minimal changes, and minPoolSize is currently uniform across all members of a replicaSet. This has the benefit of offsetting connection swarming during a Primary Step-Down, which will be further addressed in our Advanced Pooling Behaviors.

Why do we have separate ConnectionCreated and ConnectionReady events, but only one ConnectionClosed event?

ConnectionCreated and ConnectionReady each involve different state changes in the pool.

  • ConnectionCreated adds a new "pending" Connection, meaning the totalConnectionCount and pendingConnectionCount increase by one
  • ConnectionReady establishes that the Connection is ready for use, meaning the availableConnectionCount increases by one

ConnectionClosed indicates that the Connection is no longer a member of the pool, decrementing totalConnectionCount and potentially availableConnectionCount. After this point, the Connection is no longer a part of the pool. Further hypothetical events would not indicate a change to the state of the pool, so they are not specified here.

Why are waitQueueSize and waitQueueMultiple deprecated?

These options were originally only implemented in three drivers (Java, C#, and Python), and provided little value. While these fields would allow for faster diagnosis of issues in the connection pool, they would not actually prevent an error from occurring.

Additionally, these options have the effect of prioritizing older requests over newer requests, which is not necessarily the behavior that users want. They can also result in cases where queue access oscillates back and forth between full and not full. If a driver has a full waitQueue, then all requests for Connections will be rejected. If the client is continually spammed with requests, you could wind up with a scenario where as soon as the waitQueue is no longer full, it is immediately filled. It is not a favorable situation to be in, partially because it violates the fairness guarantee that the waitQueue normally provides.

Because of these issues, it does not make sense to go against driver mantras and provide an additional knob. We may eventually pursue an alternative configuration to address wait queue size in Advanced Pooling Behaviors.

Users that wish to have this functionality can achieve similar results by utilizing other methods to limit concurrency. Examples include implementing either a thread pool or an operation queue with a capped size in the user application. Drivers that need to deprecate waitQueueSize and/or waitQueueMultiple SHOULD refer users to these examples.

Why is waitQueueTimeoutMS optional for some drivers?

We are anticipating eventually introducing a single client-side timeout mechanism, making us hesitant to introduce another granular timeout control. Therefore, if a driver/language already has an idiomatic way to implement their timeouts, they should leverage that mechanism over implementing waitQueueTimeoutMS.

Why must populating the pool require the use of a background thread or async I/O?

Without the use of a background thread, the pool is populated with enough connections to satisfy minPoolSize during checkOut. Connections are established as part of populating the pool though, so if Connection establishment were done in a blocking fashion, the first operations after a clearing of the pool would experience unacceptably high latency, especially for larger values of minPoolSize. Thus, populating the pool must occur on a background thread (which is acceptable to block) or via the usage of non-blocking (async) I/O.

Why should closing a connection be non-blocking?

Because idle and perished Connections are cleaned up as part of checkOut, performing blocking I/O while closing such Connections would block application threads, introducing unnecessary latency. Once a Connection is marked as "closed", it will not be checked out again, so ensuring the socket is torn down does not need to happen immediately and can happen at a later time, either via async I/O or a background thread.

Why can the pool be paused?

The distinction between the "paused" state and the "ready" state allows the pool to determine whether or not the endpoint it is associated with is available or not. This enables the following behaviors:

  1. The pool can halt the creation of background connection establishments until the endpoint becomes available again. Without the "paused" state, the pool would have no way of determining when to begin establishing background connections again, so it would just continually attempt, and often fail, to create connections until minPoolSize was satisfied, even after repeated failures. This could unnecessarily waste resources both server and driver side.
  2. The pool can evict requests that enter the WaitQueue after the pool was cleared but before the server was in a known state again. Such requests can occur when a server is selected at the same time as it becomes marked as Unknown in highly concurrent workloads. Without the "paused" state, the pool would attempt to service these requests, since it would assume they were routed to the pool because its endpoint was available, not because of a race between SDAM and Server Selection. These requests would then likely fail with potentially high latency, again wasting resources both server and driver side.

Why not emit PoolCleared events and log messages when clearing a paused pool?

If a pool is already paused when it is cleared, that means it was previously cleared and no new connections have been created since then. Thus, clearing the pool in this case is essentially a no-op, so there is no need to notify any listeners that it has occurred. The generation is still incremented, however, to ensure future errors that caused the duplicate clear will stop attempting to clear the pool again. This situation is possible if the pool is cleared by the background thread after it encounters an error establishing a connection, but the ServerDescription for the endpoint was not updated accordingly yet.

Why does the pool need to support interrupting in use connections as part of its clear logic?

If an SDAM monitor has observed a network timeout, we assume that all connections, including "in use" connections, are no longer healthy. In some cases connections will fail to detect the network timeout fast enough. For example, a server request can hang at the OS level in a TCP retry loop for up to 17 minutes before failing. Therefore these connections MUST be proactively interrupted in the case of a server monitor network timeout. Requesting an immediate background thread run will speed up this process.

Why don't we configure TCP_USER_TIMEOUT?

Ideally, a reasonable TCP_USER_TIMEOUT could help detect stale connections as an alternative to interruptInUseConnections in Clear. Unfortunately this approach is platform dependent, and not every driver can configure it easily. For example, the C# driver can configure this socket option on Linux only with target frameworks of .NET 5.0 or higher. On macOS there is no direct equivalent for this option; an equivalent configuration may exist, but it would also require target frameworks of .NET 5.0 or higher. The advantage of using the Background Thread to manage perished connections is that it works regardless of environment setup.

Backwards Compatibility

As mentioned in Deprecated Options, some drivers currently implement the options waitQueueSize and/or waitQueueMultiple. These options will need to be deprecated and phased out of the drivers that have implemented them.

Reference Implementations

  • JAVA (JAVA-3079)
  • RUBY (RUBY-1560)

Future Development

SDAM

This specification does not dictate how SDAM Monitoring connections are managed. SDAM specifies that "A monitor SHOULD NOT use the client's regular Connection pool". Some possible solutions for this include:

  • Having each Endpoint representation in the driver create and manage a separate dedicated Connection for monitoring purposes
  • Having each Endpoint representation in the driver maintain a separate pool of maxPoolSize 1 for monitoring purposes.
  • Having each Pool maintain a dedicated Connection for monitoring purposes, with an API to expose that Connection.

Advanced Pooling Behaviors

This spec does not address all advanced pooling behaviors like predictive pooling or aggressive Connection creation. Future work may address this.

Add support for OP_MSG exhaustAllowed

Exhaust Cursors may require changes to how we close Connections in the future, specifically to add a way to close and remove from its pool a Connection which has unread exhaust messages.

Changelog

  • 2024-01-23: Migrated from reStructuredText to Markdown.

  • 2019-06-06: Add "connectionError" as a valid reason for ConnectionCheckOutFailedEvent

  • 2020-09-03: Clarify Connection states and definition. Require the use of a
    background thread and/or async I/O. Add tests to ensure ConnectionReadyEvents are fired after ConnectionCreatedEvents.

  • 2020-09-24: Introduce maxConnecting requirement

  • 2020-12-17: Introduce "paused" and "ready" states. Clear WaitQueue on pool clear.

  • 2021-01-12: Clarify "clear" method behavior in load balancer mode.

  • 2021-01-19: Require that timeouts be applied per the client-side operations
    timeout specification.

  • 2021-04-12: Adding in behaviour for load balancer mode.

  • 2021-06-02: Formalize the behavior of a Background Thread.

  • 2021-11-08: Make maxConnecting configurable.

  • 2022-04-05: Preemptively cancel in progress operations when SDAM heartbeats timeout.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2022-10-14: Add connection pool log messages and associated tests.

  • 2023-04-17: Fix duplicate logging test description.

  • 2023-08-04: Add durations to connection pool events.

  • 2023-10-04: Commit to the currently specified requirements regarding durations in events.


Unified Test Format

  • Status: Accepted
  • Minimum Server Version: N/A
  • Current Schema Version: 1.19.0

Abstract

This project defines a unified schema for YAML and JSON specification tests, which run operations against a MongoDB deployment. By conforming various spec tests to a single schema, drivers can implement a single test runner to execute acceptance tests for multiple specifications, thereby reducing maintenance of existing specs and implementation time for new specifications.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

This document tends to use "SHOULD" more frequently than other specifications, but mainly in the context of providing guidance on writing test files. This is discussed in more detail in Design Rationale.

Goals

This test format can be used to define tests for the following specifications:

This is not an exhaustive list. Specifications that are known to not be supported by this format may be discussed under Future Work.

Specification

Terms

Entity Any object or value that is indexed by a unique name and stored in the Entity Map. This will typically be a driver object (e.g. client, session) defined in createEntities but may also be a saved operation result. An exhaustive list of supported types is presented in Supported Entity Types. Entities are referenced by name throughout the test file (e.g. Entity Test Operations).

Internal MongoClient A MongoClient created specifically for use with internal test operations, such as inserting collection data before a test, performing special assertions during a test, or asserting collection data after a test.

Iterable This term is used by various specifications as the return type for operations that return a sequence of items, which may be iterated. For example, the CRUD spec uses this as the return value for find to permit API flexibility rather than stipulating that a cursor object be returned directly.

Server Compatibility

The unified test format is intended to only run against enterprise servers. Additionally, many tests make use of Server Fail Points, so servers MUST be launched with the enableTestCommands parameter enabled.

Schema Version

This specification and the Test Format follow semantic versioning. The version is primarily used to validate test files with a JSON schema and also allow test runners to determine whether a particular test file is supported.

New tests files SHOULD always be written using the latest major version of this specification; however, test files SHOULD be conservative in the minor version they specify (as noted in schemaVersion).

JSON Schema Validation

Each major or minor version that changes the Test Format SHALL have a corresponding JSON schema. When a new schema file is introduced, any existing schema files MUST remain in place since they may be needed for validation. For example: if an additive change is made to version 1.0 of the spec, the schema-1.0.json file will be copied to schema-1.1.json and modified accordingly. A new or existing test file using schemaVersion "1.0" would then be expected to validate against both schema files. Schema version bumps MUST be noted in the Changelog.

A particular minor version MUST be capable of validating any and all test files in that major version series up to and including the minor version. For example, schema-2.1.json should validate test files with schemaVersion "2.0" and "2.1", but would not be expected to validate files specifying "1.0", "2.2", or "3.0".

The JSON schema MUST remain consistent with the Test Format section. If and when a new major version is introduced, the Breaking Changes section MUST be updated.

Ajv MAY be used to programmatically validate both YAML and JSON files using the JSON schema. The JSON schema MUST NOT use syntax that is unsupported by this tool, which bears mentioning because there are multiple versions of the JSON schema specification.

Test Runner Support

Each test file defines a schemaVersion, which test runners will use to determine compatibility (i.e. whether and how the test file will be interpreted). Test files are considered compatible with a test runner if their schemaVersion is less than or equal to a supported version in the test runner, given the same major version component. For example:

  • A test runner supporting version 1.5.1 could execute test files with versions 1.0 and 1.5 but not 1.6 and 2.0.
  • A test runner supporting version 2.1 could execute test files with versions 2.0 and 2.1 but not 1.0 and 1.5.
  • A test runner supporting both versions 1.5.1 and 2.0 could execute test files with versions 1.4, 1.5, and 2.0, but not 1.6, 2.1, or 3.0.
  • A test runner supporting version 2.0.1 could execute test files with versions 2.0 and 2.0.1 but not 2.0.2 or 2.1. This example is provided for completeness, but test files SHOULD NOT need to refer to patch versions (as previously mentioned).

Test runners MUST NOT process incompatible files and MUST raise an error if they encounter an incompatible file (as discussed in Executing a Test File). Test runners MAY support multiple schema versions (as demonstrated in the example above).
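The compatibility rule can be sketched as follows (an illustrative helper, padding missing components with zeros so "1.5" compares as 1.5.0):

```python
def is_compatible(test_file_version: str, supported_versions: list[str]) -> bool:
    """A test file is runnable if its schemaVersion is <= some supported version
    that has the same major component."""
    def parse(version: str) -> list[int]:
        parts = [int(p) for p in version.split(".")]
        return parts + [0] * (3 - len(parts))     # "1.5" -> [1, 5, 0]

    file_version = parse(test_file_version)
    return any(
        file_version[0] == supported[0] and file_version <= supported
        for supported in map(parse, supported_versions)
    )

# is_compatible("1.5", ["1.5.1", "2.0"])  -> True
# is_compatible("1.6", ["1.5.1", "2.0"])  -> False
# is_compatible("2.1", ["1.5.1", "2.0"])  -> False
```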

Impact of Spec Changes on Schema Version

Backwards-breaking changes SHALL warrant a new major version. These changes include, but are not limited to:

  • Subtractive changes, such as removing a field, operation, or type of supported entity or event
  • Changing an existing field from optional to required
  • Introducing a new, required field in the test format
  • Significant changes to test file execution (not BC)

Backwards-compatible changes SHALL warrant a new minor version. These changes include, but are not limited to:

  • Additive changes, such as introducing a new Special Test Operation or type of supported entity or event
  • Changing an existing field from required to optional
  • Introducing a new, optional field in the test format
  • Minor changes to test file execution (BC)

Small fixes and internal spec changes (e.g. grammar, adding clarifying text to the spec) MAY warrant a new patch version; however, patch versions SHOULD NOT alter the structure of the test format and thus SHOULD NOT be relevant to test files (as noted in schemaVersion).

Entity Map

The entity map indexes arbitrary objects and values by unique names, so that they can be referenced from test constructs (e.g. operation.object). To ensure each test is executed in isolation, test runners MUST NOT share entity maps between tests. Most entities will be driver objects created by the createEntities directive during test setup, but the entity map may also be modified during test execution via the operation.saveResultAsEntity directive.

Test runners MAY choose to implement the entity map in a fashion most suited to their language, but implementations MUST enforce both uniqueness of entity names and referential integrity when fetching an entity. Test runners MUST raise an error if an attempt is made to store an entity with a name that already exists in the map and MUST raise an error if an entity is not found for a name or is found but has an unexpected type.

Test runners MUST provide a mechanism to retrieve entities from the entity map prior to the clearing of the entity map, as discussed in Executing a Test. There MUST be a way to retrieve an entity by its name (for example, to support retrieving the iteration count stored by the storeIterationsAsEntity option).

Test runners MAY restrict access to driver objects (e.g. MongoClient, ChangeStream) and only allow access to BSON types (see: Supported Entity Types). This restriction may be necessary if the test runner needs to ensure driver objects in its entity map are properly freed/destroyed between tests.

The entity map MUST be implemented in a way that allows for safe concurrent access, since a test may include multiple thread entities that all need to access the map concurrently. See entity_thread for more information on test runner threads.

Consider the following examples:

# Error due to a duplicate name (client0 was already defined)
createEntities:
  - client: { id: client0 }
  - client: { id: client0 }

# Error due to a missing entity (client1 is undefined)
createEntities:
  - client: { id: client0 }
  - session: { id: session0, client: client1 }

# Error due to an unexpected entity type (session instead of client)
createEntities:
  - client: { id: client0 }
  - session: { id: session0, client: client0 }
  - session: { id: session1, client: session0 }
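A sketch of an entity map that enforces these rules (unique names, presence, and expected type) might look like the following; the method names are illustrative:

```python
import threading

class EntityMap:
    def __init__(self):
        self._entities = {}
        self._lock = threading.Lock()   # thread entities may access the map concurrently

    def add(self, name, entity):
        with self._lock:
            if name in self._entities:
                raise ValueError(f"entity named {name!r} already exists")
            self._entities[name] = entity

    def get(self, name, expected_type=None):
        with self._lock:
            if name not in self._entities:
                raise KeyError(f"no entity named {name!r}")
            entity = self._entities[name]
        if expected_type is not None and not isinstance(entity, expected_type):
            raise TypeError(
                f"entity {name!r} has type {type(entity).__name__}, "
                f"expected {expected_type.__name__}")
        return entity
```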

Supported Entity Types

Test runners MUST support the following types of entities:

  • ChangeStream. Change stream entities are special in that they are not defined in createEntities but are instead created by using operation.saveResultAsEntity with a client_createChangeStream, database_createChangeStream, or collection_createChangeStream operation.

    Test files SHOULD NOT use a watch operation to create a change stream, as the implementation of that method may vary among drivers. For example, some implementations of watch immediately execute aggregate and construct the server-side cursor, while others may defer aggregate until the change stream object is iterated.

    See Cursor Operations for a list of operations.

  • FindCursor. These entities are not defined in createEntities but are instead created by using operation.saveResultAsEntity with a collection_createFindCursor operation.

    See Cursor Operations for a list of operations.

  • CommandCursor. These entities are not defined in createEntities but are instead created by using operation.saveResultAsEntity with a createCommandCursor operation.

    See Cursor Operations for a list of operations.

  • Event list. See storeEventsAsEntities. The event list MUST store BSON documents. The type of the list itself is not prescribed by this specification. Test runners MAY use a BSON array or a thread-safe list data structure to implement the event list.

  • All known BSON types and/or equivalent language types for the target driver. For the present version of the spec, the following BSON types are known: 0x01-0x13, 0x7F, 0xFF.

    Tests SHOULD NOT utilize deprecated types (e.g. 0x0E: Symbol), since they may not be supported by all drivers and could yield runtime errors (e.g. while loading a test file with an Extended JSON parser).

  • Test runner thread. An entity representing a "thread" that can be used to concurrently execute operations. Thread entities MUST be able to run concurrently with the main test runner thread and other thread entities, but they do not have to be implemented as actual OS threads (e.g. they can be goroutines or async tasks). See entity_thread_object for more information on how they are created.

This is an exhaustive list of supported types for the entity map. Test runners MUST raise an error if an attempt is made to store an unsupported type in the entity map.

Adding new entity types (including known BSON types) to this list will require a minor version bump to the spec and schema version. Removing entity types will require a major version bump. See Impact of Spec Changes on Schema Version for more information.

Test Format

Each specification test file can define one or more tests, which inherit some top-level configuration (e.g. namespace, initial data). YAML and JSON test files are parsed as an object by the test runner. This section defines the top-level keys for that object and links to various sub-sections for definitions of nested structures (e.g. individual test, operation).

Although test runners are free to process YAML or JSON files, YAML is the canonical format for writing tests. YAML files may be converted to JSON using a tool such as js-yaml.

Top-level Fields

The top-level fields of a test file are as follows:

  • description: Required string. The name of the test file.

    This SHOULD describe the common purpose of tests in this file and MAY refer to the filename (e.g. "updateOne-hint").

  • schemaVersion: Required string. Version of this specification with which the test file complies.

    Test files SHOULD be conservative when specifying a schema version. For example, if the latest schema version is 1.1 but the test file complies with schema version 1.0, the test file should specify 1.0.

    Test runners will use this to determine compatibility (i.e. whether and how the test file will be interpreted). The format of this string is defined in Version String; however, test files SHOULD NOT need to refer to specific patch versions since patch-level changes SHOULD NOT alter the structure of the test format (as previously noted in Schema Version).

  • runOnRequirements: Optional array of one or more runOnRequirement objects. List of server version and/or topology requirements for which the tests in this file can be run. If no requirements are met, the test runner MUST skip this test file.
  • createEntities: Optional array of one or more entity objects. List of entities (e.g. client, collection, session objects) that SHALL be created before each test case is executed.

    Test files SHOULD define entities in dependency order, such that all referenced entities (e.g. client) are defined before any of their dependent entities (e.g. database, session).

  • initialData: Optional array of one or more collectionData objects. Data that will exist in collections before each test case is executed.
  • tests: Required array of one or more test objects. List of test cases to be executed independently of each other.
  • _yamlAnchors: Optional object containing arbitrary data. This is only used to define anchors within the YAML files and MUST NOT be used by test runners.

runOnRequirement

A combination of server version and/or topology requirements for running the test(s).

The format of server version strings is defined in Version String. When comparing server version strings, each component SHALL be compared numerically. For example, "4.0.10" is greater than "4.0.9" and "3.6" and less than "4.2.0".

The structure of this object is as follows:

  • minServerVersion: Optional string. The minimum server version (inclusive) required to successfully run the tests. If this field is omitted, there is no lower bound on the required server version. The format of this string is defined in Version String.

  • maxServerVersion: Optional string. The maximum server version (inclusive) against which the tests can be run successfully. If this field is omitted, there is no upper bound on the required server version. The format of this string is defined in Version String.

  • topologies: Optional array of one or more strings. Server topologies against which the tests can be run successfully. Valid topologies are "single", "replicaset", "sharded", "load-balanced", and "sharded-replicaset" (i.e. sharded cluster backed by replica sets). If this field is omitted, there is no topology requirement for the test.

    When matching a "sharded-replicaset" topology, test runners MUST ensure that all shards are backed by a replica set. The process for doing so is described in Determining if a Sharded Cluster Uses Replica Sets. When matching a "sharded" topology, test runners MUST accept any type of sharded cluster (i.e. "sharded" implies "sharded-replicaset", but not vice versa).

    The "sharded-replicaset" topology type is deprecated. MongoDB 3.6+ requires that all shard servers be replica sets. Therefore, tests SHOULD use "sharded" instead of "sharded-replicaset" when targeting 3.6+ server versions in order to avoid unnecessary overhead.

    Note: load balancers were introduced in MongoDB 5.0. Therefore, any sharded cluster behind a load balancer implicitly uses replica sets for its shards.

  • serverless: Optional string. Whether or not the test should be run on Atlas Serverless instances. Valid values are "require", "forbid", and "allow". If "require", the test MUST only be run on Atlas Serverless instances. If "forbid", the test MUST NOT be run on Atlas Serverless instances. If omitted or "allow", this option has no effect.

    The test runner MUST be informed whether or not Atlas Serverless is being used in order to determine if this requirement is met (e.g. through an environment variable or configuration option).

    Note: the Atlas Serverless proxy imitates mongos, so the test runner is not capable of determining if Atlas Serverless is in use by issuing commands such as buildInfo or hello. Furthermore, connections to Atlas Serverless use a load balancer, so the topology will appear as "load-balanced".

  • serverParameters: Optional object of server parameters to check against. To check server parameters, drivers send a { getParameter: 1, <parameter>: 1 } command to the server using an internal MongoClient. Drivers MAY also choose to send a { getParameter: '*' } command and fetch all parameters at once. The result SHOULD be cached to avoid repeated calls to fetch the same parameter. Test runners MUST apply the rules specified in Flexible Numeric Comparisons when comparing values. If a server does not support a parameter, test runners MUST treat the comparison as not equal and skip the test. This includes errors that occur when fetching a single parameter using getParameter.

  • auth: Optional boolean. If true, the tests MUST only run if authentication is enabled. If false, tests MUST NOT run if authentication is enabled. If this field is omitted, there is no authentication requirement.

  • authMechanism: Optional string. Specifies an authentication mechanism that the server needs to support for the test. If set, tests MUST only run if the given string matches (case-insensitive) one of the strings in the authenticationMechanisms server parameter. If this field is omitted, there is no authentication mechanism requirement.

  • csfle: Optional boolean. If true, the tests MUST only run if the driver and server support Client-Side Field Level Encryption. CSFLE is supported when all of the following are true:

    • Server version is 4.2.0 or higher
    • Driver has libmongocrypt enabled
    • At least one of crypt_shared and/or mongocryptd is available

    If false, tests MUST NOT run if CSFLE is supported. If this field is omitted, there is no CSFLE requirement.

Test runners MAY evaluate these conditions in any order. For example, it may be more efficient to evaluate serverless or auth before communicating with a server to check its version.

entity

An entity (e.g. client, collection, session object) that will be created in the Entity Map before each test is executed.

This object MUST contain exactly one top-level key that identifies the entity type and maps to a nested object, which specifies a unique name for the entity (id key) and any other parameters necessary for its construction. Tests SHOULD use sequential names based on the entity type (e.g. "session0", "session1").

When defining an entity object in YAML, a node anchor SHOULD be created on the entity's id key. This anchor will allow the unique name to be referenced with an alias node later in the file (e.g. from another entity or operation object) and also leverage YAML's parser for reference validation.

The structure of this object is as follows:

  • client: Optional object. Defines a MongoClient object. In addition to the configuration defined below, test runners for drivers that implement connection pooling MUST track the number of connections checked out at any given time for the constructed MongoClient. This can be done using a single counter and CMAP events. Each ConnectionCheckedOutEvent should increment the counter and each ConnectionCheckedInEvent should decrement it.

    The structure of this object is as follows:

    • id: Required string. Unique name for this entity. The YAML file SHOULD define a node anchor for this field (e.g. id: &client0 client0).

    • uriOptions: Optional object. Additional URI options to apply to the test suite's connection string that is used to create this client. Any keys in this object MUST override conflicting keys in the connection string.

      Documentation for supported options may be found in the URI Options spec, with one notable exception: if readPreferenceTags is specified in this object, the key will map to an array of strings, each representing a tag set, since it is not feasible to define multiple readPreferenceTags keys in the object.

      Note also that when specifying directConnection as true, the connection string used to instantiate a client MUST only have a single seed and MUST NOT specify the replicaSet option. See the URI Options spec for more information.

      Any field in uriOptions may be a $$placeholder document and the test runner MUST support replacing the placeholder document with values loaded from the test environment. For example:

      uriOptions:
        authMechanism: "MONGODB-OIDC"
        authMechanismProperties:
          ENVIRONMENT: { $$placeholder: 1 }
      
    • useMultipleMongoses: Optional boolean. If true and the topology is a sharded cluster, the test runner MUST assert that this MongoClient connects to multiple mongos hosts (e.g. by inspecting the connection string). If false and the topology is a sharded cluster, the test runner MUST ensure that this MongoClient connects to only a single mongos host (e.g. by modifying the connection string).

      If this option is not specified and the topology is a sharded cluster, the test runner MUST NOT enforce any limit on the number of mongos hosts in the connection string and any tests using this client SHOULD NOT depend on a particular number of mongos hosts.

      This option SHOULD be set to true in test files if the resulting entity is used to conduct transactions against a sharded cluster. This is advised because connecting to multiple mongos servers is necessary to test session pinning.

      If the topology type is LoadBalanced and Atlas Serverless is not being used, the test runner MUST use one of the two load balancer URIs described in Initializing the Test Runner to configure the MongoClient. If useMultipleMongoses is true or unset, the test runner MUST use the URI of the load balancer fronting multiple servers. Otherwise, the test runner MUST use the URI of the load balancer fronting a single server.

      If the topology type is LoadBalanced and Atlas Serverless is being used, this option has no effect. This is because provisioning an Atlas Serverless instance yields a single URI (i.e. a load balancer fronting a single Atlas Serverless proxy).

      This option has no effect for topologies that are not sharded or load balanced.

    • ignoreCommandMonitoringEvents: Optional array of one or more strings. Command names for which the test runner MUST ignore any observed command monitoring events. The command(s) will be ignored in addition to configureFailPoint and any commands containing sensitive information (per the Command Logging and Monitoring spec) unless observeSensitiveCommands is true.

      Test files SHOULD NOT use this option unless one or more command monitoring events are specified in observeEvents.

    • observeSensitiveCommands: Optional boolean. If true, events associated with sensitive commands (per the Command Logging and Monitoring spec) will be observed for this client. Note that the command and replies for such events will already have been redacted by the driver. If false or not specified, events for commands containing sensitive information MUST be ignored. Authentication SHOULD be disabled when this property is true, i.e. auth should be false for each runOnRequirement. See rationale_observeSensitiveCommands.
    • storeEventsAsEntities: Optional array of one or more storeEventsAsEntity objects. Each object denotes an entity name and one or more events to be collected and stored in that entity. See storeEventsAsEntity for implementation details.

      Note: the implementation of storeEventsAsEntities is wholly independent from observeEvents and ignoreCommandMonitoringEvents.

      Example option value:

      storeEventsAsEntities:
        - id: client0_events
          events: [PoolCreatedEvent, ConnectionCreatedEvent, CommandStartedEvent]
      
    • observeLogMessages: Optional object where the key names are log components and the values are minimum log severity levels indicating which components to collect log messages for and what the minimum severity level of collected messages should be. Messages for unspecified components and/or with lower severity levels than those specified MUST be ignored by this client's log collector(s) and SHOULD NOT be included in test.expectLogMessages for this client.
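
      For example, the following value would collect debug-level command messages and info-level connection messages (an illustrative sketch; component names and severity level names are those defined in the logging specification):

      observeLogMessages:
        command: debug
        connection: info
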
    • serverApi: Optional serverApi object.
  • clientEncryption: Optional object. Defines a ClientEncryption object.

    The structure of this object is as follows:

    • id: Required string. Unique name for this entity. The YAML file SHOULD define a node anchor for this field (e.g. id: &clientEncryption0 clientEncryption0).

    • clientEncryptionOpts: Required document. A value corresponding to a ClientEncryptionOpts.

      Note: the tlsOptions document is intentionally omitted from the test format. However, drivers MAY internally configure TLS options as needed to satisfy the requirements of configured KMS providers.

      The structure of this document is as follows:

      • keyVaultClient: Required string. Client entity from which this ClientEncryption will be created. The YAML file SHOULD use an alias node for a client entity's id field (e.g. client: *client0).

      • keyVaultNamespace: Required string. The database and collection to use as the key vault collection for this clientEncryption. The namespace takes the form database.collection (e.g. keyVaultNamespace: keyvault.datakeys).

      • kmsProviders: Required document. The KMS providers to configure for this ClientEncryption, handled as follows:

        • If a KMS provider is not given, drivers MUST NOT configure it. This is to permit testing conditions where a required KMS provider is not configured.

        • If a KMS provider is given as an empty document (e.g. kmsProviders: { aws: {} }), drivers MUST configure the KMS provider without credentials to permit testing conditions where KMS credentials are needed.

        • If a KMS credentials field has a placeholder value (e.g. kmsProviders: { aws: { accessKeyId: { $$placeholder: 1 }, secretAccessKey: { $$placeholder: 1 } } }), drivers MUST replace the field with credentials that satisfy the operations required by the unified test files. Drivers MAY load the credentials from the environment or a configuration file as needed to satisfy the requirements of the given KMS provider and tests.

        • If a KMS credentials field is not given (e.g. the required field secretAccessKey is omitted in kmsProviders: { aws: { accessKeyId: { $$placeholder: 1 } } }), drivers MUST NOT include the field during KMS configuration. This is to permit testing conditions where required KMS credentials fields are not provided.

        • Otherwise, drivers MUST configure the KMS provider with the explicit value of the KMS credentials field given in the test file (e.g. kmsProviders: { aws: { accessKeyId: abc, secretAccessKey: def } }). This is to permit testing conditions where invalid KMS credentials are provided.

        Tests may also reference named KMS providers. KMS providers with the name name1 are expected to be configured exactly as the unnamed KMS providers. The aws:name2 KMS provider and aws:name1 KMS providers deliberately use separate AWS accounts that do not have permission to the other's keys.

        See the Client-Side Encryption test README for instructions to obtain test credentials.
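
    Putting these fields together, a clientEncryption entity using placeholder AWS credentials might be declared as follows (an illustrative sketch):

    - clientEncryption:
        id: &clientEncryption0 clientEncryption0
        clientEncryptionOpts:
          keyVaultClient: *client0
          keyVaultNamespace: keyvault.datakeys
          kmsProviders:
            aws:
              accessKeyId: { $$placeholder: 1 }
              secretAccessKey: { $$placeholder: 1 }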

  • database: Optional object. Defines a Database object.

    The structure of this object is as follows:

    • id: Required string. Unique name for this entity. The YAML file SHOULD define a node anchor for this field (e.g. id: &database0 database0).
    • client: Required string. Client entity from which this database will be created. The YAML file SHOULD use an alias node for a client entity's id field (e.g. client: *client0).
    • databaseName: Required string. Database name. The YAML file SHOULD define a node anchor for this field (e.g. databaseName: &database0Name foo).
    • databaseOptions: Optional collectionOrDatabaseOptions object.
  • collection: Optional object. Defines a Collection object.

    The structure of this object is as follows:

    • id: Required string. Unique name for this entity. The YAML file SHOULD define a node anchor for this field (e.g. id: &collection0 collection0).
    • database: Required string. Database entity from which this collection will be created. The YAML file SHOULD use an alias node for a database entity's id field (e.g. database: *database0).
    • collectionName: Required string. Collection name. The YAML file SHOULD define a node anchor for this field (e.g. collectionName: &collection0Name foo).
    • collectionOptions: Optional collectionOrDatabaseOptions object.
  • session: Optional object. Defines an explicit ClientSession object.

    The structure of this object is as follows:

    • id: Required string. Unique name for this entity. The YAML file SHOULD define a node anchor for this field (e.g. id: &session0 session0).

    • client: Required string. Client entity from which this session will be created. The YAML file SHOULD use an alias node for a client entity's id field (e.g. client: *client0).

    • sessionOptions: Optional object. Map of parameters to pass to MongoClient.startSession when creating the session. Supported options are defined in the following specifications:

      When specifying TransactionOptions for defaultTransactionOptions, the transaction options MUST remain nested under defaultTransactionOptions and MUST NOT be flattened into sessionOptions.

  • bucket: Optional object. Defines a Bucket object, as defined in the GridFS spec.

    The structure of this object is as follows:

    • id: Required string. Unique name for this entity. The YAML file SHOULD define a node anchor for this field (e.g. id: &bucket0 bucket0).
    • database: Required string. Database entity from which this bucket will be created. The YAML file SHOULD use an alias node for a database entity's id field (e.g. database: *database0).
    • bucketOptions: Optional object. Additional options used to construct the bucket object. Supported options are defined in the GridFS specification. The readConcern, readPreference, and writeConcern options use the same structure as defined in Common Options.
  • thread: Optional object. Defines a test runner "thread". Once the "thread" has been created, it should be idle and waiting for operations to be dispatched to it later on by runOnThread operations.

    The structure of this object is as follows:

    • id: Required string. Unique name for this entity. The YAML file SHOULD define a node anchor for this field (e.g. id: &thread0 thread0).
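
Putting the above entity types together, a typical set of entities might be declared as follows (an illustrative sketch; createEntities is assumed to be the top-level test file key under which entities are declared, and observeEvents is the client option referenced earlier in this section):

createEntities:
  - client:
      id: &client0 client0
      useMultipleMongoses: true
      observeEvents: [ commandStartedEvent ]
  - database:
      id: &database0 database0
      client: *client0
      databaseName: &database0Name test
  - collection:
      id: &collection0 collection0
      database: *database0
      collectionName: &collection0Name coll0
  - session:
      id: &session0 session0
      client: *client0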

storeEventsAsEntity

A list of one or more events that will be observed on a client and collectively stored within an entity. This object is used within storeEventsAsEntities.

The structure of this object is as follows:

  • id: Required string. Unique name for this entity.
  • events: Required array of one or more strings, which denote the events to be collected. Currently, only the following CMAP and command monitoring events MUST be supported:
    • PoolCreatedEvent
    • PoolReadyEvent
    • PoolClearedEvent
    • PoolClosedEvent
    • ConnectionCreatedEvent
    • ConnectionReadyEvent
    • ConnectionClosedEvent
    • ConnectionCheckOutStartedEvent
    • ConnectionCheckOutFailedEvent
    • ConnectionCheckedOutEvent
    • ConnectionCheckedInEvent
    • CommandStartedEvent
    • CommandSucceededEvent
    • CommandFailedEvent

For the specified entity name, the test runner MUST create the respective entity with a type of "event list", as described in Supported Entity Types. If the entity already exists (such as from a previous storeEventsAsEntity object) the test runner MUST raise an error.

The test runner MUST set up an event subscriber for each event named. The event subscriber MUST serialize the events it receives into a document, using the documented properties of the event as field names, and append the document to the list stored in the specified entity. Additionally, the following fields MUST be stored with each event document:

  • name: The name of the event (e.g. PoolCreatedEvent). The name of the event MUST be the name used in the respective specification that defines the event in question.
  • observedAt: The time, as the floating-point number of seconds since the Unix epoch, when the event was observed by the test runner.

The test runner MAY omit the command field for CommandStartedEvent and reply field for CommandSucceededEvent.

If an event field in the driver is of a type that does not directly map to a BSON type (e.g. Exception for the failure field of CommandFailedEvent) the test runner MUST convert values of that field to one of the BSON types. For example, a test runner MAY store the exception's error message string as the failure field of CommandFailedEvent.

If the specification defining an event permits deviation in field names, such as connectionId field for CommandStartedEvent, the test runner SHOULD use the field names used in the specification when serializing events to documents even if the respective field name is different in the driver's event object.

serverApi

Declares an API version for a client entity.

The structure of this object is as follows:

  • version: Required string. Test runners MUST fail if the given version string is not supported by the driver.

    Note: the format of this string is unrelated to Version String.

  • strict: Optional boolean.

  • deprecationErrors: Optional boolean.

See the Stable API spec for more details on these fields.

collectionData

List of documents corresponding to the contents of a collection. This structure is used by both initialData and test.outcome, which insert and read documents, respectively.

The structure of this object is as follows:

  • collectionName: Required string. See commonOptions_collectionName.
  • databaseName: Required string. See commonOptions_databaseName.
  • createOptions: Optional object. When used in initialData, these options MUST be passed to the create command when creating the collection. Test files MUST NOT specify writeConcern in this options document as that could conflict with the use of the majority write concern when the collection is created during test execution.
  • documents: Required array of objects. List of documents corresponding to the contents of the collection. This list may be empty.
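
For example, initialData might seed a single collection as follows (an illustrative sketch; the createOptions shown are arbitrary create command options):

initialData:
  - collectionName: *collection0Name
    databaseName: *database0Name
    createOptions:
      capped: true
      size: 4096
    documents:
      - { _id: 1, x: 11 }
      - { _id: 2, x: 22 }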

test

Test case consisting of a sequence of operations to be executed.

The structure of this object is as follows:

  • description: Required string. The name of the test.

    This SHOULD describe the purpose of this test (e.g. "insertOne is retried").

  • runOnRequirements: Optional array of one or more runOnRequirement objects. List of server version and/or topology requirements for which this test can be run. If specified, these requirements are evaluated independently and in addition to any top-level runOnRequirements. If no requirements in this array are met, the test runner MUST skip this test.

    These requirements SHOULD be more restrictive than those specified in the top-level runOnRequirements (if any) and SHOULD NOT be more permissive. This is advised because both sets of requirements MUST be satisfied in order for a test to be executed and more permissive requirements at the test-level could be taken out of context on their own.

  • skipReason: Optional string. If set, the test will be skipped. The string SHOULD explain the reason for skipping the test (e.g. JIRA ticket).
  • operations: Required array of one or more operation objects. List of operations to be executed for the test case.
  • expectEvents: Optional array of one or more expectedEventsForClient objects. For one or more clients, a list of events that are expected to be observed in a particular order.

    If a driver only supports configuring event listeners globally (for all clients), the test runner SHOULD associate each observed event with a client in order to perform these assertions.

    Tests SHOULD NOT specify multiple expectedEventsForClient objects for a single client entity with the same eventType field. For example, a test containing two expectedEventsForClient objects with the eventType set to cmap for both would either be redundant (if the events arrays were identical) or likely to fail (if the events arrays differed).

  • expectLogMessages: Optional array of one or more expectedLogMessagesForClient objects. For one or more clients, a list of log messages that are expected to be observed in a particular order.

    If a driver only supports configuring log collectors globally (for all clients), the test runner SHOULD associate each observed message with a client in order to perform these assertions. One possible implementation is to add a test-only option to MongoClient which enables the client to store its entity name and add the entity name to each log message to enable filtering messages by client.

    Tests SHOULD NOT specify multiple expectedLogMessagesForClient objects for a single client entity.

  • outcome: Optional array of one or more collectionData objects. Data that is expected to exist in collections after each test case is executed.

    The list of documents herein SHOULD be sorted ascendingly by the _id field to allow for deterministic comparisons. The procedure for asserting collection contents is discussed in Executing a Test.
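
A minimal test object tying these fields together might look like the following (an illustrative sketch; entity anchors are assumed to have been defined under the corresponding entities):

- description: "insertOne succeeds"
  operations:
    - name: insertOne
      object: *collection0
      arguments:
        document: { _id: 3 }
  expectEvents:
    - client: *client0
      events:
        - commandStartedEvent: { commandName: insert }
  outcome:
    - collectionName: *collection0Name
      databaseName: *database0Name
      documents:
        - { _id: 3 }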

operation

An operation to be executed as part of the test.

The structure of this object is as follows:

  • name: Required string. Name of the operation (e.g. method) to perform on the object.
  • object: Required string. Name of the object on which to perform the operation. This SHOULD correspond to either an entity name (for Entity Test Operations) or "testRunner" (for Special Test Operations). If the object is an entity, the YAML file SHOULD use an alias node for its id field (e.g. object: *collection0).
  • expectResult: Optional mixed type. A value corresponding to the expected result of the operation. This field may be a scalar value, a single document, or an array of values. Test runners MUST follow the rules in Evaluating Matches when processing this assertion.

    This field is mutually exclusive with expectError.

    This field SHOULD NOT be used for Special Test Operations (i.e. object: testRunner).

  • saveResultAsEntity: Optional string. If specified, the actual result returned by the operation (if any) will be saved with this name in the Entity Map. The test runner MUST raise an error if the name is already in use or if the result does not comply with Supported Entity Types.

    This field is mutually exclusive with expectError.

    This field SHOULD NOT be used for Special Test Operations (i.e. object: testRunner).
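
For example, an operation that saves its result for later use might be expressed as follows (an illustrative sketch; createFindCursor and its arguments are described under Collection Operations below):

- name: createFindCursor
  object: *collection0
  arguments:
    filter: {}
    batchSize: 2
  saveResultAsEntity: &cursor0 cursor0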

expectedError

One or more assertions for an error/exception, which is expected to be raised by an executed operation. At least one key is required in this object.

The structure of this object is as follows:

  • isError: Optional boolean. If true, the test runner MUST assert that an error was raised. This is primarily used when no other error assertions apply but the test still needs to assert an expected error. Test files MUST NOT specify false, as expectedError is only applicable when an operation is expected to raise an error.

  • isClientError: Optional boolean. If true, the test runner MUST assert that the error originates from the client (i.e. it is not derived from a server response). If false, the test runner MUST assert that the error does not originate from the client.

    Client errors include, but are not limited to: parameter validation errors before a command is sent to the server; network errors.

  • isTimeoutError: Optional boolean. If true, the test runner MUST assert that the error represents a timeout due to use of the timeoutMS option. If false, the test runner MUST assert that the error does not represent a timeout.

  • errorContains: Optional string. A substring of the expected error message (e.g. "errmsg" field in a server error document). The test runner MUST assert that the error message contains this string using a case-insensitive match.

    See bulkWrite for special considerations for BulkWriteExceptions.

  • errorCode: Optional integer. The expected "code" field in the server-generated error response. The test runner MUST assert that the error includes a server-generated response whose "code" field equals this value. In the interest of readability, YAML files SHOULD use a comment to note the corresponding code name (e.g. errorCode: 26 # NamespaceNotFound).

    Server error codes are defined in error_codes.yml.

    Test files SHOULD NOT assert error codes for client errors, as specifications do not define standardized codes for client errors.

  • errorCodeName: Optional string. The expected "codeName" field in the server-generated error response. The test runner MUST assert that the error includes a server-generated response whose "codeName" field equals this value using a case-insensitive comparison.

    See bulkWrite for special considerations for BulkWriteExceptions.

    Server error codes are defined in error_codes.yml.

    Test files SHOULD NOT assert error codes for client errors, as specifications do not define standardized codes for client errors.

  • errorLabelsContain: Optional array of one or more strings. A list of error label strings that the error is expected to have. The test runner MUST assert that the error contains all of the specified labels (e.g. using the hasErrorLabel method).

  • errorLabelsOmit: Optional array of one or more strings. A list of error label strings that the error is expected not to have. The test runner MUST assert that the error does not contain any of the specified labels (e.g. using the hasErrorLabel method).

  • writeErrors: Optional document. The write errors expected to be present in the error. The writeErrors document contains numeric keys representing the index of the write that failed and writeError object values. The test runner MUST assert that the error contains a writeError for each index present in writeErrors and MUST assert that the writeErrors match as root-level documents according to the rules in Evaluating Matches. The test runner MUST assert that the error does not contain any additional writeErrors. This field is only intended for use with the clientBulkWrite operation.

  • writeConcernErrors: Optional array of one or more objects. An ordered list of write concern errors expected to be present in the error. The test runner MUST assert that each writeConcernError in this list matches the writeConcernError present at the same index in the error's list of writeConcernErrors as a root-level document according to the rules in Evaluating Matches. The test runner MUST assert that the error does not contain any additional writeConcernErrors. This field is only intended for use with the clientBulkWrite operation.

  • errorResponse: Optional document. A value corresponding to the expected server response. The test runner MUST assert that the error includes a server response that matches this value as a root-level document according to the rules in Evaluating Matches.

    Note that some drivers may not be able to evaluate errorResponse for write commands (i.e. insert, update, delete) and bulk write operations. For example, a BulkWriteException is derived from potentially multiple server responses and may not provide direct access to a single response. Tests SHOULD avoid using errorResponse for such operations if possible; otherwise, affected drivers SHOULD skip such tests if necessary.

  • expectResult: Optional mixed type. This field follows the same rules as operation.expectResult and is only used in cases where the error includes a result (e.g. bulkWrite). If specified, the test runner MUST assert that the error includes a result and that it matches this value. If the result is optional (e.g. BulkWriteResult reported through the writeResult property of a BulkWriteException), this assertion SHOULD utilize the $$unsetOrMatches operator.
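
For example, an operation expected to fail with a duplicate key error might assert the following (an illustrative sketch; the errorCode comment follows the convention described above):

- name: insertOne
  object: *collection0
  arguments:
    document: { _id: 1 }
  expectError:
    isClientError: false
    errorCode: 11000 # DuplicateKey
    errorLabelsOmit: [ TransientTransactionError ]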

expectedEventsForClient

A list of events that are expected to be observed (in that order) for a client while executing operations.

The structure of each object is as follows:

  • client: Required string. Client entity on which the events are expected to be observed. See commonOptions_client.
  • eventType: Optional string. Specifies the type of the monitor which captured the events. Valid values are command for Command Monitoring events, cmap for CMAP events, and sdam for SDAM events. Defaults to command if omitted.
  • events: Required array of expectedEvent objects. List of events, which are expected to be observed (in this order) on the corresponding client while executing operations. If the array is empty, the test runner MUST assert that no events were observed on the client (excluding ignored events).
  • ignoreExtraEvents: Optional boolean. Specifies how the events array is matched against the observed events. If false, observed events after all specified events have matched MUST cause a test failure; if true, observed events after all specified events have been matched MUST NOT cause a test failure. Defaults to false.
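
For example, command monitoring assertions for a single client might look like the following (an illustrative sketch):

expectEvents:
  - client: *client0
    eventType: command
    events:
      - commandStartedEvent:
          commandName: insert
          databaseName: *database0Name
      - commandSucceededEvent:
          commandName: insert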

expectedEvent

An event (e.g. APM), which is expected to be observed while executing the test's operations.

This object MUST contain exactly one top-level key that identifies the event type and maps to a nested object, which contains one or more assertions for the event's properties.

Some event properties are omitted in the following structures because they cannot be reliably tested. Taking command monitoring events as an example, requestId and operationId are nondeterministic and types for connectionId and failure can vary by implementation.

The events allowed in an expectedEvent object depend on the value of eventType in the corresponding expectedEventsForClient object, which can have one of the following values:

expectedCommandEvent

The structure of this object is as follows:

  • commandStartedEvent: Optional object. Assertions for one or more CommandStartedEvent fields.

    The structure of this object is as follows:

    • command: Optional document. A value corresponding to the expected command document. Test runners MUST follow the rules in Evaluating Matches when processing this assertion.
    • commandName: Optional string. Test runners MUST assert that the command name matches this value.
    • databaseName: Optional string. Test runners MUST assert that the database name matches this value. The YAML file SHOULD use an alias node for this value (e.g. databaseName: *database0Name).
    • hasServiceId: Defined in hasServiceId.
    • hasServerConnectionId: Defined in hasServerConnectionId.
  • commandSucceededEvent: Optional object. Assertions for one or more CommandSucceededEvent fields.

    The structure of this object is as follows:

    • reply: Optional document. A value corresponding to the expected reply document. Test runners MUST follow the rules in Evaluating Matches when processing this assertion.
    • commandName: Optional string. Test runners MUST assert that the command name matches this value.
    • databaseName: Optional string. Test runners MUST assert that the database name matches this value. The YAML file SHOULD use an alias node for this value (e.g. databaseName: *database0Name).
    • hasServiceId: Defined in hasServiceId.
    • hasServerConnectionId: Defined in hasServerConnectionId.
  • commandFailedEvent: Optional object. Assertions for one or more CommandFailedEvent fields.

    The structure of this object is as follows:

    • commandName: Optional string. Test runners MUST assert that the command name matches this value.
    • databaseName: Optional string. Test runners MUST assert that the database name matches this value. The YAML file SHOULD use an alias node for this value (e.g. databaseName: *database0Name).
    • hasServiceId: Defined in hasServiceId.
    • hasServerConnectionId: Defined in hasServerConnectionId.
expectedCmapEvent
  • poolCreatedEvent: Optional object. If present, this object MUST be an empty document as all fields in this event are non-deterministic.
  • poolReadyEvent: Optional object. If present, this object MUST be an empty document as all fields in this event are non-deterministic.
  • poolClearedEvent: Optional object. Assertions for one or more PoolClearedEvent fields.

    The structure of this object is as follows:

    • hasServiceId: Defined in hasServiceId.
    • interruptInUseConnections: Optional boolean. If specified, test runners MUST assert that the field is set and matches this value.
  • poolClosedEvent: Optional object. If present, this object MUST be an empty document as all fields in this event are non-deterministic.
  • connectionCreatedEvent: Optional object. If present, this object MUST be an empty document as all fields in this event are non-deterministic.
  • connectionReadyEvent: Optional object. If present, this object MUST be an empty document as all fields in this event are non-deterministic.
  • connectionClosedEvent: Optional object. Assertions for one or more ConnectionClosedEvent fields.

    The structure of this object is as follows:

    • reason: Optional string. Test runners MUST assert that the reason in the published event matches this value. Valid values for this field are defined in the CMAP spec.
  • connectionCheckOutStartedEvent: Optional object. If present, this object MUST be an empty document as all fields in this event are non-deterministic.
  • connectionCheckOutFailedEvent: Optional object. Assertions for one or more ConnectionCheckOutFailedEvent fields.

    The structure of this object is as follows:

    • reason: Optional string. Test runners MUST assert that the reason in the published event matches this value. Valid values for this field are defined in the CMAP spec.
  • connectionCheckedOutEvent: Optional object. If present, this object MUST be an empty document as all fields in this event are non-deterministic.
  • connectionCheckedInEvent: Optional object. If present, this object MUST be an empty document as all fields in this event are non-deterministic.
expectedSdamEvent

The structure of this object is as follows:

  • serverDescriptionChangedEvent: Optional object. Assertions for one or more ServerDescriptionChangedEvent fields.

    The structure of this object is as follows:

    • previousDescription: Optional object. A value corresponding to the server description as it was before the change that triggered this event.
    • newDescription: Optional object. A value corresponding to the server description as it was after the change that triggered this event.

    The structure of a server description object (which the previousDescription and newDescription fields contain) is as follows:

    • type: Optional string. The type of the server in the description. Test runners MUST assert that the type in the published event matches this value. See SDAM: ServerType for a list of valid values.
  • serverHeartbeatStartedEvent: Optional object. Assertions for one or more ServerHeartbeatStartedEvent fields.

    The structure of this object is as follows:

    • awaited: Optional boolean. If specified, test runners MUST assert that the field is set and matches this value.
  • serverHeartbeatSucceededEvent: Optional object. Assertions for one or more ServerHeartbeatSucceededEvent fields.

    The structure of this object is as follows:

    • awaited: Optional boolean. If specified, test runners MUST assert that the field is set and matches this value.
  • serverHeartbeatFailedEvent: Optional object. Assertions for one or more ServerHeartbeatFailedEvent fields.

    The structure of this object is as follows:

    • awaited: Optional boolean. If specified, test runners MUST assert that the field is set and matches this value.
  • topologyDescriptionChangedEvent: Optional object. Assertions for one TopologyDescriptionChangedEvent object.

    The structure of this object is as follows:

    • previousDescription: Optional object. A value corresponding to the topology description as it was before the change that triggered the event.

    • newDescription: Optional object. A value corresponding to the topology description as it is after the change that triggered the event.

      Test runners SHOULD ignore any other fields than the previousDescription and newDescription fields.

      The structure of a topology description object (which the previousDescription and newDescription fields contain) is as follows:

      • type: Optional string. The type of the topology in the description. Test runners MUST assert that the type in the published event matches this value. See SDAM: TopologyType for a list of valid values.

      Test runners SHOULD ignore any other fields present on the previousDescription and newDescription fields of the captured topologyDescriptionChangedEvent.
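
For example, an SDAM assertion on a topology change might look like the following (an illustrative sketch; the type values are TopologyType strings defined in the SDAM specification):

expectEvents:
  - client: *client0
    eventType: sdam
    events:
      - topologyDescriptionChangedEvent:
          previousDescription:
            type: Unknown
          newDescription:
            type: ReplicaSetWithPrimary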

hasServiceId

This field is an optional boolean that specifies whether or not the serviceId field of an event is set. If true, test runners MUST assert that the field is set and is a non-empty BSON ObjectId (i.e. not all bytes of the ObjectId are 0). If false, test runners MUST assert that the field is not set or is an empty BSON ObjectId.

hasServerConnectionId

This field is an optional boolean that specifies whether or not the serverConnectionId field of an event is set. If true, test runners MUST assert that the field is set and is a positive Int32. If false, test runners MUST assert that the field is not set, or, if the driver uses a nonpositive Int32 value to indicate the field being unset, MUST assert that serverConnectionId is a nonpositive Int32.

expectedLogMessagesForClient

A list of log messages that are expected to be observed (in that order) for a client while executing operations.

The structure of each object is as follows:

  • client: Required string. Client entity for which the messages are expected to be observed. See commonOptions_client.
  • messages: Required array of expectedLogMessage objects. List of messages, which are expected to be observed (in this order) on the corresponding client while executing operations. If the array is empty, the test runner MUST assert that no messages were observed on the client. The driver MUST assert that the messages produced are an exact match, i.e. that the expected and actual message counts are the same and that there are no extra messages emitted by the client during the test run. Note: ignoreMessages and ignoreExtraMessages may exclude log messages from this evaluation.
  • ignoreMessages: Optional array of expectedLogMessage objects. Unordered set of messages, which MUST be ignored on the corresponding client while executing operations. The test runner MUST exclude all log messages from observed messages that match any of the messages in ignoreMessages array before messages evaluation. Matching rules used to match messages in ignoreMessages are identical to match rules used for messages matching.
  • ignoreExtraMessages: Optional boolean. Specifies how the messages array is matched against the observed logs. If false, observed logs after all specified logs have matched MUST cause a test failure; if true, observed logs after all specified logs have been matched MUST NOT cause a test failure. Defaults to false.

expectedLogMessage

A log message which is expected to be observed while executing the test's operations.

The structure of each object is as follows:

  • level: Required string. This MUST be one of the level names listed in
    log severity levels. This specifies the expected level for the log message and corresponds to the level used for the message in the specification that defines it. Note that since not all drivers will necessarily support all log levels, some drivers may need to map the specified level to the corresponding driver-supported level. Test runners MUST assert that the actual level matches this value.

  • component: Required string. This MUST be one of the component names listed
    in components. This specifies the expected component for the log message. Note that since naming variations are permitted for components, some drivers may need to map this to a corresponding language-specific component name. Test runners MUST assert that the actual component matches this value.

  • failureIsRedacted: Optional boolean. This field SHOULD only be specified when the log message data is expected to contain a failure value.

    When failureIsRedacted is present and its value is true, the test runner MUST assert that a failure is present and that the failure has been redacted according to the rules defined for error redaction in the command logging and monitoring specification.

    When false, the test runner MUST assert that a failure is present and that the failure has NOT been redacted.

    The exact form of these assertions and how thorough they are will vary based on the driver's chosen error representation in logs; e.g. drivers that use strings may only be able to assert on the presence/absence of substrings.

  • data: Required object. Contains key-value pairs that are expected to be attached to the log message. Test runners MUST assert that the actual data contained in the log message matches the expected data, and MUST treat the log message data as a root-level document.

    A suggested implementation approach is to decode data as a BSON document and serialize the data attached to each log message to a BSON document, and match those documents.

    Note that for drivers that do not implement structured logging, this requires designing logging internals such that data is first gathered in a structured form (e.g. a document or hashmap) which can be intercepted for testing purposes.
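
For example, a command logging assertion might look like the following (an illustrative sketch; the exact data fields attached to command log messages are defined in the command logging and monitoring specification):

expectLogMessages:
  - client: *client0
    messages:
      - level: debug
        component: command
        data:
          message: "Command started"
          commandName: insert
          databaseName: *database0Name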

collectionOrDatabaseOptions

Map of parameters used to construct a collection or database object.

The structure of this object is as follows:

  • readConcern: Optional object. See commonOptions_readConcern.
  • readPreference: Optional object. See commonOptions_readPreference.
  • writeConcern: Optional object. See commonOptions_writeConcern.

Common Options

This section defines the structure of common options that are referenced from various contexts in the test format. Comprehensive documentation for some of these types and their parameters may be found in the following specifications:

The structure of these common options is as follows:

  • collectionName: String. Collection name. The YAML file SHOULD use an alias node for a collection entity's collectionName field (e.g. collectionName: *collection0Name).
  • databaseName: String. Database name. The YAML file SHOULD use an alias node for a database entity's databaseName field (e.g. databaseName: *database0Name).
  • readConcern: Object. Map of parameters to construct a read concern.

    The structure of this object is as follows:

    • level: Required string.
  • readPreference: Object. Map of parameters to construct a read preference.

    The structure of this object is as follows:

    • mode: Required string.
    • tagSets: Optional array of objects.
    • maxStalenessSeconds: Optional integer.
    • hedge: Optional object.
  • client: String. Client entity name, which the test runner MUST resolve to a MongoClient object. The YAML file SHOULD use an alias node for a client entity's id field (e.g. client: *client0).
  • session: String. Session entity name, which the test runner MUST resolve to a ClientSession object. The YAML file SHOULD use an alias node for a session entity's id field (e.g. session: *session0).
  • writeConcern: Object. Map of parameters to construct a write concern.

    The structure of this object is as follows:

    • journal: Optional boolean.
    • w: Optional integer or string.
    • wtimeoutMS: Optional integer.

Version String

Version strings, which are used for schemaVersion and runOnRequirement, MUST conform to one of the following formats, where each component is a non-negative integer:

  • <major>.<minor>.<patch>
  • <major>.<minor> (<patch> is assumed to be zero)
  • <major> (<minor> and <patch> are assumed to be zero)

Any component other than major, minor, and patch MUST be discarded prior to comparing versions. This is necessary to ensure that spec tests run on pre-release versions of the MongoDB server. As an example, when checking if a server with the version 4.9.0-alpha4-271-g7d5cf02 passes the requirement for a test, only 4.9.0 is relevant for the comparison. When reading the server version from the buildInfo command reply, the three elements of the versionArray field MUST be used, and all other fields MUST be discarded for this comparison.
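
For example, assuming the minServerVersion and maxServerVersion runOnRequirement fields defined earlier in this specification, the following requirement would be satisfied by servers reporting 4.2.0, 4.4, or 4.9.0-alpha4-271-g7d5cf02:

runOnRequirements:
  - minServerVersion: "4.2"
    maxServerVersion: "4.9"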

Entity Test Operations

Entity operations correspond to an API method on a driver object. If operation.object refers to an entity name (e.g. "collection0") then operation.name is expected to reference an API method on that class.

Test files SHALL use camelCase when referring to API methods and parameters, even if the defining specifications use other forms (e.g. snake_case in GridFS). Test files SHOULD use the exact API method names defined in specifications for entity test operations. Test files MAY use a different descriptive name if a naming conflict occurs. For example, the name "clientBulkWrite" is used for the client-level bulk write operation to differentiate it from the collection-level bulk write operation.

This spec does not provide exhaustive documentation for all possible API methods that may appear in a test; however, the following sections discuss all supported entities and their operations in some level of detail. Special handling for certain operations is also discussed as needed.

Expressing Required and Optional Parameters

Some specifications group optional parameters for API methods under an options parameter (e.g. options: Optional<UpdateOptions> in the CRUD spec); however, driver APIs vary in how they accept options (e.g. Python's keyword/named arguments, session as either an option or required parameter depending on whether a language supports method overloading). Therefore, test files SHALL declare all required and optional parameters for an API method directly within operation.arguments (e.g. upsert for updateOne is not nested under an options key).

Special Handling for Arguments

If session is specified in operation.arguments, it is defined according to commonOptions_session. Test runners MUST resolve the session argument to a session entity before passing it as a parameter to any API method.

If readConcern, readPreference, or writeConcern are specified in operation.arguments, test runners MUST interpret them according to the corresponding definition in Common Options and MUST convert the value into the appropriate object before passing it as a parameter to any API method.

Converting Returned Model Objects to Documents

For operations that return a model object (e.g. BulkWriteResult for bulkWrite), the test runner MUST convert the model object to a document when evaluating expectResult or saveResultAsEntity. Similarly, for operations that may return iterables of model objects (e.g. DatabaseInfo for listDatabases), the test runner MUST convert the iterable to an array of documents when evaluating expectResult or saveResultAsEntity.

Iterating Returned Iterables

Unless otherwise stated by an operation below, test runners MUST fully iterate any iterable returned by an operation as part of that operation's execution. This is necessary to ensure consistent behavior among drivers, as discussed in collection_aggregate and find, and also ensures that error and event assertions can be evaluated consistently.

Client Operations

These operations and their arguments may be documented in the following specifications:

Client operations that require special handling or are not documented by an existing specification are described below.

close

Closes the client, i.e. closes its underlying connection pool(s) and ceases monitoring the topology. For languages that rely on built-in language mechanisms such as reference counting to automatically close/deinitialize clients once they go out of scope, this may require implementing an abstraction to allow a client entity's underlying client to be set to null. Because drivers do not consistently propagate errors encountered while closing a client, test files SHOULD NOT specify expectResult or expectError for this operation. Test files SHOULD NOT specify any operations for a client entity or any entity descended from it following a close operation on it, as driver behavior when an operation is attempted on a closed client or one of its descendant objects is not consistent.

createChangeStream

Creates a cluster-level change stream and ensures that the server-side cursor has been created.

This operation proxies the client's watch method and supports the same arguments and options. Test files SHOULD NOT use the client's watch operation directly for reasons discussed in ChangeStream. Test runners MUST ensure that the server-side cursor is created (i.e. aggregate is executed) as part of this operation and before the resulting change stream might be saved with operation.saveResultAsEntity.

Test runners MUST NOT iterate the change stream when executing this operation and test files SHOULD NOT specify operation.expectResult for this operation.

clientBulkWrite

These considerations only apply to the MongoClient.bulkWrite method. See bulkWrite for special considerations for MongoCollection.bulkWrite.

The models parameter for clientBulkWrite is documented as a list of WriteModel interfaces. Each WriteModel implementation (e.g. InsertOneModel) provides important context to the method, but that type information is not easily expressed in YAML and JSON. To account for this, test files MUST nest each WriteModel object in a single-key object, where the key identifies the request type (e.g. "insertOne") and its value is an object expressing the parameters, as in the following example:

arguments:
  models:
    - insertOne:
        document: { _id: 1, x: 1 }
    - replaceOne:
        filter: { _id: 2 }
        replacement: { x: 2 }
        upsert: true
    - updateOne:
        filter: { _id: 3 }
        update: { $set: { x: 3 } }
        upsert: true
    - updateMany:
        filter: { }
        update: { $inc: { x: 1 } }
    - deleteOne:
        filter: { x: 2 }
    - deleteMany:
        filter: { x: { $gt: 2 } }
  ordered: true

Because the insertResults, updateResults, and deleteResults may be absent or empty in the BulkWriteResult returned from a summary-only bulk write, the clientBulkWrite operation SHOULD use the $$unsetOrMatches operator for assertions on these fields when verboseResults is not set to true. This also applies to result objects defined in the expectResult field of expectedError.

The BulkWriteException thrown by MongoClient.bulkWrite contains an optional error field that stores a top-level error that occurred during the bulk write. Test runners MUST inspect the contents of this field when making assertions based on the contents of the errorCode and errorContains fields in expectedError.

BulkWriteException also contains writeErrors and writeConcernErrors fields that define the individual write errors and write concern errors that occurred during the bulk write. Unified tests SHOULD use writeErrors and writeConcernErrors in expectedError to assert on the contents of these fields. Test runners MUST NOT inspect the contents of these fields when making assertions based on any other fields defined in expectedError.

While operations typically raise an error or return a result, the MongoClient.bulkWrite operation may report both via the partialResult property of a BulkWriteException. In this case, the intermediary write result may be matched with expectedError_expectResult.

watch

This operation SHOULD NOT be used in test files. See client_createChangeStream.

ClientEncryption Operations

These operations and their arguments may be documented in the following specifications:

Operations that require sending and receiving KMS requests to encrypt or decrypt data keys may require appropriate KMS credentials to be loaded by the driver. Drivers MUST load appropriate KMS credentials (i.e. from the environment or a configuration file) when prompted by a test providing a placeholder value in a corresponding kmsProviders field as described under entity.clientEncryption.

See the Client-Side Encryption test README for instructions to obtain test credentials.

Drivers MUST be running the mock KMS KMIP server when evaluating tests that require KMS requests to a KMIP KMS provider.

Drivers MAY enforce a unique index on keyAltNames as described in the Client Side Field Level Encryption spec when running key management operations on the key vault collection. Although unified tests are written assuming the existence of the unique index, no unified test currently requires its implementation for correctness (e.g. no unified test currently attempts to create a data key with an existing keyAltName or add an existing keyAltName to a data key).

Database Operations

These operations and their arguments may be documented in the following specifications:

Database operations that require special handling or are not documented by an existing specification are described below.

aggregate

When executing an aggregate operation, the test runner MUST fully iterate the result. This will ensure consistent behavior between drivers that eagerly create a server-side cursor and those that do so lazily when iteration begins.

createChangeStream

Creates a database-level change stream and ensures that the server-side cursor has been created.

This operation proxies the database's watch method and supports the same arguments and options. Test files SHOULD NOT use the database's watch operation directly for reasons discussed in ChangeStream. Test runners MUST ensure that the server-side cursor is created (i.e. aggregate is executed) as part of this operation and before the resulting change stream might be saved with operation.saveResultAsEntity.

Test runners MUST NOT iterate the change stream when executing this operation and test files SHOULD NOT specify operation.expectResult for this operation.

listCollections

When executing a listCollections operation, the test runner MUST fully iterate the resulting cursor.

runCommand

Generic command runner.

This method does not inherit a read preference (per the Server Selection spec); however, readPreference may be specified as an argument.

The following arguments are supported:

  • command: Required document. The command to be executed.
  • commandName: Required string. The name of the command to run. This is used by languages that are unable to preserve the order of keys in the command argument when parsing YAML/JSON.
  • readPreference: Optional object. See commonOptions_readPreference.
  • session: Optional string. See commonOptions_session.
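
For example (an illustrative sketch using the ping command):

- name: runCommand
  object: *database0
  arguments:
    commandName: ping
    command: { ping: 1 }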

runCursorCommand

Generic cursor returning command runner.

This method does not inherit a read preference (per the Server Selection spec); however, readPreference may be specified as an argument.

This operation proxies the database's runCursorCommand method and supports the same arguments and options (note: handling for getMore options may vary by driver implementation).

When executing the provided command, the test runner MUST fully iterate the cursor. This will ensure consistent behavior between drivers that eagerly create a server-side cursor and those that do so lazily when iteration begins.

The following arguments are supported:

  • command: Required document. The command to be executed.
  • commandName: Required string. The name of the command to run. This is used by languages that are unable to preserve the order of keys in the command argument when parsing YAML/JSON.
  • readPreference: Optional object. See commonOptions_readPreference.
  • session: Optional string. See commonOptions_session.
  • batchSize: Optional positive integer value. Use this value to configure the batchSize option sent on subsequent getMore commands.
  • maxTimeMS: Optional non-negative integer value. Use this value to configure the maxTimeMS option sent on subsequent getMore commands.
  • comment: Optional BSON value. Use this value to configure the comment option sent on subsequent getMore commands.
  • cursorType: Optional string enum value, one of 'tailable' | 'tailableAwait' | 'nonTailable'. Use this value to configure the enum passed to the cursorType option.
  • timeoutMode: Optional string enum value, one of 'iteration' | 'cursorLifetime'. Use this value to configure the enum passed to the timeoutMode option.
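
For example (an illustrative sketch; find is used here only because it returns a cursor):

- name: runCursorCommand
  object: *database0
  arguments:
    commandName: find
    batchSize: 2
    command: { find: *collection0Name, batchSize: 2 }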

createCommandCursor

This operation proxies the database's runCursorCommand method and supports the same arguments and options (note: handling for getMore options may vary by driver implementation). Test runners MUST ensure that the server-side cursor is created (i.e. the command document has executed) as part of this operation and before the resulting cursor might be saved with operation.saveResultAsEntity. Test runners for drivers that lazily execute the command on the first iteration of the cursor MUST iterate the resulting cursor once. The result from this iteration MUST be used as the result for the first iteration operation on the cursor.

Test runners MUST NOT iterate the resulting cursor when executing this operation and test files SHOULD NOT specify operation.expectResult for this operation.

watch

This operation SHOULD NOT be used in test files. See database_createChangeStream.

Collection Operations

These operations and their arguments may be documented in the following specifications:

Collection operations that require special handling or are not documented by an existing specification are described below.

aggregate

When executing an aggregate operation, the test runner MUST fully iterate the result. This will ensure consistent behavior between drivers that eagerly create a server-side cursor and those that do so lazily when iteration begins.

bulkWrite

These considerations only apply to the MongoCollection.bulkWrite method. See clientBulkWrite for special considerations for MongoClient.bulkWrite.

The requests parameter for bulkWrite is documented as a list of WriteModel interfaces. Each WriteModel implementation (e.g. InsertOneModel) provides important context to the method, but that type information is not easily expressed in YAML and JSON. To account for this, test files MUST nest each WriteModel object in a single-key object, where the key identifies the request type (e.g. "insertOne") and its value is an object expressing the parameters, as in the following example:

arguments:
  requests:
    - insertOne:
        document: { _id: 1, x: 1 }
    - replaceOne:
        filter: { _id: 2 }
        replacement: { x: 2 }
        upsert: true
    - updateOne:
        filter: { _id: 3 }
        update: { $set: { x: 3 } }
        upsert: true
    - updateMany:
        filter: { }
        update: { $inc: { x: 1 } }
    - deleteOne:
        filter: { x: 2 }
    - deleteMany:
        filter: { x: { $gt: 2 } }
  ordered: true

Because the insertedIds field of BulkWriteResult is optional for drivers to implement, assertions for that field SHOULD utilize the $$unsetOrMatches operator.

While operations typically raise an error or return a result, the bulkWrite operation is unique in that it may report both via the writeResult property of a BulkWriteException. In this case, the intermediary write result may be matched with expectedError_expectResult. Because writeResult is optional for drivers to implement, such assertions SHOULD utilize the $$unsetOrMatches operator.

Additionally, BulkWriteException is unique in that it aggregates one or more server errors in its writeConcernError and writeErrors properties. When test runners evaluate expectedError assertions for errorContains and errorCodeName, they MUST examine the aggregated errors and consider any match therein to satisfy the assertion(s). Drivers that concatenate all write and write concern error messages into the BulkWriteException message MAY optimize the check for errorContains by examining the concatenated message. Drivers that expose code but not codeName through BulkWriteException MAY translate the expected code name to a number (see: error_codes.yml) and compare with code instead, but MUST raise an error if the comparison cannot be attempted (e.g. code is also not available, translation fails).

createChangeStream

Creates a collection-level change stream and ensures that the server-side cursor has been created.

This operation proxies the collection's watch method and supports the same arguments and options. Test files SHOULD NOT use the collection's watch operation directly for reasons discussed in ChangeStream. Test runners MUST ensure that the server-side cursor is created (i.e. aggregate is executed) as part of this operation and before the resulting change stream might be saved with operation.saveResultAsEntity.

Test runners MUST NOT iterate the change stream when executing this operation and test files SHOULD NOT specify operation.expectResult for this operation.

createFindCursor

This operation proxies the collection's find method and supports the same arguments and options. Test runners MUST ensure that the server-side cursor is created (i.e. a find command is executed) as part of this operation and before the resulting cursor might be saved with operation.saveResultAsEntity. Test runners for drivers that lazily execute the find command on the first iteration of the cursor MUST iterate the resulting cursor once. The result from this iteration MUST be used as the result for the first iteration operation on the cursor.

Test runners MUST NOT iterate the resulting cursor when executing this operation and test files SHOULD NOT specify operation.expectResult for this operation.

createSearchIndex

This operation proxies the collection's createSearchIndex helper with the same arguments.

Each createSearchIndex operation receives a SearchIndexModel. If a driver has chosen to implement the createSearchIndex(name: String, definition: Document) overload of createSearchIndex, then the SearchIndexModel should be parsed by the createSearchIndex unified test runner helper and the correct arguments should be passed to the driver's helper.

createSearchIndexes

This operation proxies the collection's createSearchIndexes helper with the same arguments.

dropSearchIndex

This operation proxies the collection's dropSearchIndex helper with the same arguments.

find

When executing a find operation, the test runner MUST fully iterate the result. This will ensure consistent behavior between drivers that eagerly create a server-side cursor and those that do so lazily when iteration begins.
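
As an illustrative (non-normative) example, a find operation whose result is fully iterated might look like the following (the collection alias and documents are assumed):

- name: find
  object: *collection0
  arguments:
    filter: { _id: { $gt: 1 } }
    sort: { _id: 1 }
  expectResult:
    - { _id: 2, x: 22 }
    - { _id: 3, x: 33 }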

findOneAndReplace and findOneAndUpdate

The returnDocument option for findOneAndReplace and findOneAndUpdate is documented as an enum with possible values "Before" and "After". Test files SHOULD express returnDocument as a string and test runners MUST raise an error if its value does not case-insensitively match either enum value.
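
For illustration, a hypothetical operation using this option might look like the following (the collection alias and documents are assumed):

- name: findOneAndUpdate
  object: *collection0
  arguments:
    filter: { _id: 1 }
    update: { $set: { x: 2 } }
    returnDocument: "After"
  expectResult: { _id: 1, x: 2 }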

insertMany

The CRUD spec documents insertMany as returning a BulkWriteResult. Because the insertedIds field of BulkWriteResult is optional for drivers to implement, assertions for that field SHOULD utilize the $$unsetOrMatches operator.
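
For illustration, a hypothetical assertion on insertedIds might look like the following (the collection alias, documents, and inserted IDs are assumed):

- name: insertMany
  object: *collection0
  arguments:
    documents:
      - { _id: 2 }
      - { _id: 3 }
  expectResult:
    insertedIds:
      $$unsetOrMatches: { "0": 2, "1": 3 }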

insertOne

The CRUD spec documents insertOne as returning an InsertOneResult; however, because all fields in InsertOneResult are optional, drivers are permitted to forgo it entirely and have insertOne return nothing (i.e. a void method). Tests asserting InsertOneResult SHOULD utilize the $$unsetOrMatches operator for both the result object and any optional fields within, as in the following example:

- name: insertOne
  object: *collection0
  arguments:
    document: { _id: 2 }
  expectResult:
    $$unsetOrMatches:
      insertedId: { $$unsetOrMatches: 2 }

listSearchIndexes

This operation proxies the collection's listSearchIndexes helper and returns the result of the cursor as a list.

updateSearchIndex

This operation proxies the collection's updateSearchIndex helper with the same arguments.

watch

This operation SHOULD NOT be used in test files. See collection_createChangeStream.

Session Operations

These operations and their arguments may be documented in the following specifications:

Session operations that require special handling or are not documented by an existing specification are described below.

withTransaction

The withTransaction operation's callback parameter is a function and not easily expressed in YAML/JSON. For ease of testing, this parameter is expressed as an array of operation objects (analogous to test.operations). Test runners MUST evaluate error and result assertions when executing these operations in the callback.

Test runners MUST allow any errors from the callback operation(s) to propagate to withTransaction, irrespective of ignoreResultAndError or expectError. For example, if a callback operation raises an error, it should be possible to assert that error at both the callback operation and withTransaction level.
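
For illustration, a hypothetical withTransaction operation with a single callback operation might look like the following (the session and collection aliases are assumed):

- name: withTransaction
  object: *session0
  arguments:
    callback:
      - name: insertOne
        object: *collection0
        arguments:
          session: *session0
          document: { _id: 1 }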

Bucket Operations

These operations and their arguments may be documented in the following specifications:

Bucket operations that require special handling or are not documented by an existing specification are described below.

download and downloadByName

These operations proxy the bucket's openDownloadStream and openDownloadStreamByName methods and support the same parameters and options, but return a string containing the stream's contents instead of the stream itself. Test runners MUST fully read the stream to yield the returned string. This is also necessary to ensure that any expected errors are raised (e.g. missing chunks). Test files SHOULD use $$matchesHexBytes in expectResult to assert the contents of the returned string.
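
For illustration, a hypothetical download operation might look like the following (the bucket alias, file ID, and hex contents are assumed):

- name: download
  object: *bucket0
  arguments:
    id: { $oid: "000000000000000000000005" }
  expectResult: { $$matchesHexBytes: "11207367" }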

downloadToStream and downloadToStreamByName

These operations SHOULD NOT be used in test files. See IO operations for GridFS streams in Future Work.

openDownloadStream and openDownloadStreamByName

These operations SHOULD NOT be used in test files. See download and downloadByName.

openUploadStream and openUploadStreamWithId

These operations SHOULD NOT be used in test files. See IO operations for GridFS streams in Future Work.

upload and uploadWithId

These operations proxy the bucket's uploadFromStream and uploadFromStreamWithId methods and support the same parameters and options with one exception: the source parameter is an object specifying hex bytes from which test runners MUST construct a readable stream for the underlying methods. The structure of source is as follows:

{ $$hexBytes: <string> }

The string MUST contain an even number of hexadecimal characters (case-insensitive) and MAY be empty. The test runner MUST raise an error if the structure of source or its string is malformed. The test runner MUST convert the string to a byte sequence denoting the stream's readable data (if any). For example, "12ab" would denote a stream with two bytes: "0x12" and "0xab".
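
For illustration, a hypothetical upload operation using such a source might look like the following (the bucket alias, filename, and hex contents are assumed; a fuller example appears later under $$matchesEntity):

- name: upload
  object: *bucket0
  arguments:
    source: { $$hexBytes: "12ab" }
    filename: "filename"
  expectResult: { $$type: "objectId" }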

uploadFromStream and uploadFromStreamWithId

These operations SHOULD NOT be used in test files. See upload and uploadWithId.

Cursor Operations

There are no defined APIs for change streams and cursors since the mechanisms for iteration may differ between synchronous and asynchronous drivers. To account for this, this section explicitly defines the supported operations for the ChangeStream, FindCursor, and CommandCursor entity types.

Test runners MUST ensure that the iteration operations defined in this section will not inadvertently skip the first document for a cursor. Albeit rare, this could happen if an operation were to blindly invoke next (or equivalent) on a cursor in a driver where newly created cursors are already positioned at their first element and the cursor had a non-empty firstBatch. Alternatively, some drivers may use a different iterator method for advancing a cursor to its first position (e.g. rewind in PHP).

iterateUntilDocumentOrError

Iterates the cursor until either a single document is returned or an error is raised. This operation takes no arguments. If expectResult is specified, it SHOULD be a single document.

Some specification sections (e.g. Iterating the Change Stream) caution drivers that implement a blocking mode of iteration (e.g. asynchronous drivers) not to iterate the cursor unnecessarily, as doing so could cause the test runner to block indefinitely. This should not be a concern for iterateUntilDocumentOrError as iteration only continues until either a document or error is encountered.
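
For illustration, a hypothetical iteration of a change stream entity might look like the following (the entity alias and event contents are assumed):

- name: iterateUntilDocumentOrError
  object: *changeStream0
  expectResult:
    operationType: insert
    fullDocument: { _id: 1, x: 11 }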

iterateOnce

Performs a single iteration of the cursor. If the cursor's current batch is empty, one getMore MUST be attempted to get more results. This operation takes no arguments. If expectResult is specified, it SHOULD be a single document.

Due to the non-deterministic nature of some cursor types (e.g. change streams on sharded clusters), test files SHOULD only use this operation to perform command monitoring assertions on the getMore command. Tests that perform assertions about the result of iteration should use iterateUntilDocumentOrError instead.

close

Closes the cursor. Because drivers do not consistently propagate errors from the killCursors command, test runners MUST suppress all errors when closing the cursor. Test files SHOULD NOT specify expectResult or expectError for this operation. To assert whether the killCursors command succeeded or failed, test files SHOULD use command monitoring assertions with commandSucceededEvent and commandFailedEvent events.
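
For illustration, a hypothetical close operation might look like the following (the change stream or cursor entity alias is assumed):

- name: close
  object: *changeStream0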

Special Test Operations

Certain operations do not correspond to API methods but instead represent special test operations (e.g. assertions). These operations are distinguished by operation.object having a value of "testRunner". The operation.name field will correspond to an operation defined below.

Special test operations return no result and are always expected to succeed. These operations SHOULD NOT be combined with expectError, expectResult, or saveResultAsEntity.

failPoint

The failPoint operation instructs the test runner to configure a fail point using a "primary" read preference using the specified client entity (fail points are not configured using an internal MongoClient).

The following arguments are supported:

  • failPoint: Required document. The configureFailPoint command to be executed.

  • client: Required string. See commonOptions_client.

    The client entity SHOULD specify false for useMultipleMongoses if this operation could be executed on a sharded topology (according to runOnRequirements or test.runOnRequirements). This is advised because server selection rules for mongos could lead to unpredictable behavior if different servers were selected for configuring the fail point and executing subsequent operations.

When executing this operation, the test runner MUST keep a record of the fail point so that it can be disabled after the test. The test runner MUST also ensure that the configureFailPoint command is excluded from the list of observed command monitoring events for this client (if applicable).

An example of this operation follows:

# Enable the fail point on the server selected with a primary read preference
- name: failPoint
  object: testRunner
  arguments:
    client: *client0
    failPoint:
      configureFailPoint: failCommand
      mode: { times: 1 }
      data:
        failCommands: ["insert"]
        closeConnection: true

targetedFailPoint

The targetedFailPoint operation instructs the test runner to configure a fail point on a specific mongos.

The following arguments are supported:

  • failPoint: Required document. The configureFailPoint command to be executed.

  • session: Required string. See commonOptions_session.

    The client entity associated with this session SHOULD specify true for useMultipleMongoses. This is advised because targeted fail points are intended to test mongos pinning, which warrants multiple mongoses.

The mongos on which to set the fail point is determined by the session argument (after resolution to a session entity). Test runners MUST error if the session is not pinned to a mongos server at the time this operation is executed.

If the driver exposes an API to target a specific server for a command, the test runner SHOULD use the client entity associated with the session to execute the configureFailPoint command. In this case, the test runner MUST also ensure that this command is excluded from the list of observed command monitoring events for this client (if applicable). If such an API is not available, but the test runner creates an internal MongoClient for each mongos, the test runner SHOULD use the internal MongoClient corresponding to the session's pinned server for this operation. Otherwise, test runners MUST create a new MongoClient that is directly connected to the session's pinned server for this operation. The new MongoClient instance MUST be closed once the command has finished executing.

A test utilizing targetedFailPoint SHOULD NOT be permitted to execute on a load balancer (according to runOnRequirements or test.runOnRequirements). This is advised because the configureFailPoint command does not use the session directly, which means there is no guarantee that the load balancer will direct the command to the intended mongos.

When executing this operation, the test runner MUST keep a record of both the fail point and pinned mongos server so that the fail point can be disabled on the same mongos server after the test.

An example of this operation follows:

# Enable the fail point on the mongos to which session0 is pinned
- name: targetedFailPoint
  object: testRunner
  arguments:
    session: *session0
    failPoint:
      configureFailPoint: failCommand
      mode: { times: 1 }
      data:
        failCommands: ["commitTransaction"]
        closeConnection: true

Since mongos pinning only applies when multiple mongoses are used, tests SHOULD NOT use targetedFailPoint unless the session's MongoClient is configured with useMultipleMongoses: true.

assertSessionTransactionState

The assertSessionTransactionState operation instructs the test runner to assert that the given session has a particular transaction state.

The following arguments are supported:

  • session: Required string. See commonOptions_session.
  • state: Required string. Expected transaction state for the session. Possible values are as follows: none, starting, in_progress, committed, and aborted.

An example of this operation follows:

- name: assertSessionTransactionState
  object: testRunner
  arguments:
    session: *session0
    state: in_progress

assertSessionPinned

The assertSessionPinned operation instructs the test runner to assert that the given session is pinned to a mongos server.

The following arguments are supported:

  • session: Required string. See commonOptions_session.

An example of this operation follows:

- name: assertSessionPinned
  object: testRunner
  arguments:
    session: *session0

assertSessionUnpinned

The assertSessionUnpinned operation instructs the test runner to assert that the given session is not pinned to a mongos server.

The following arguments are supported:

  • session: Required string. See commonOptions_session.

An example of this operation follows:

- name: assertSessionUnpinned
  object: testRunner
  arguments:
    session: *session0

assertDifferentLsidOnLastTwoCommands

The assertDifferentLsidOnLastTwoCommands operation instructs the test runner to assert that the last two CommandStartedEvents observed on the client have different lsid fields. This assertion is primarily used to test that dirty server sessions are discarded from the pool.

The following arguments are supported:

  • client: Required string. See commonOptions_client.

The test runner MUST fail this assertion if fewer than two CommandStartedEvents have been observed on the client or if either command does not include an lsid field.

An example of this operation follows:

- name: assertDifferentLsidOnLastTwoCommands
  object: testRunner
  arguments:
    client: *client0

assertSameLsidOnLastTwoCommands

The assertSameLsidOnLastTwoCommands operation instructs the test runner to assert that the last two CommandStartedEvents observed on the client have identical lsid fields. This assertion is primarily used to test that non-dirty server sessions are not discarded from the pool.

The following arguments are supported:

  • client: Required string. See commonOptions_client.

The test runner MUST fail this assertion if fewer than two CommandStartedEvents have been observed on the client or if either command does not include an lsid field.

An example of this operation follows:

- name: assertSameLsidOnLastTwoCommands
  object: testRunner
  arguments:
    client: *client0

assertSessionDirty

The assertSessionDirty operation instructs the test runner to assert that the given session is marked dirty.

The following arguments are supported:

  • session: Required string. See commonOptions_session.

An example of this operation follows:

- name: assertSessionDirty
  object: testRunner
  arguments:
    session: *session0

assertSessionNotDirty

The assertSessionNotDirty operation instructs the test runner to assert that the given session is not marked dirty.

The following arguments are supported:

  • session: Required string. See commonOptions_session.

An example of this operation follows:

- name: assertSessionNotDirty
  object: testRunner
  arguments:
    session: *session0

assertCollectionExists

The assertCollectionExists operation instructs the test runner to assert that the given collection exists in the database. The test runner MUST use an internal MongoClient for this operation.

The following arguments are supported:

  • collectionName: Required string. The name of the collection to check.
  • databaseName: Required string. The name of the database containing the collection.

An example of this operation follows:

- name: assertCollectionExists
  object: testRunner
  arguments:
    collectionName: *collection0Name
    databaseName:  *database0Name

Use a listCollections command to check whether the collection exists. Note that it is currently not possible to run listCollections from within a transaction.

assertCollectionNotExists

The assertCollectionNotExists operation instructs the test runner to assert that the given collection does not exist in the database. The test runner MUST use an internal MongoClient for this operation.

The following arguments are supported:

  • collectionName: Required string. The name of the collection to check.
  • databaseName: Required string. The name of the database containing the collection.

An example of this operation follows:

- name: assertCollectionNotExists
  object: testRunner
  arguments:
    collectionName: *collection0Name
    databaseName:  *database0Name

Use a listCollections command to check whether the collection exists. Note that it is currently not possible to run listCollections from within a transaction.

assertIndexExists

The assertIndexExists operation instructs the test runner to assert that an index with the given name exists on the collection. The test runner MUST use an internal MongoClient for this operation.

The following arguments are supported:

  • collectionName: Required string. The name of the collection on which the index is defined.
  • databaseName: Required string. The name of the database containing the collection.
  • indexName: Required string. The name of the index to check.

An example of this operation follows:

- name: assertIndexExists
  object: testRunner
  arguments:
    collectionName: *collection0Name
    databaseName:  *database0Name
    indexName: t_1

Use a listIndexes command to check whether the index exists. Note that it is currently not possible to run listIndexes from within a transaction.

assertIndexNotExists

The assertIndexNotExists operation instructs the test runner to assert that an index with the given name does not exist on the collection. The test runner MUST use an internal MongoClient for this operation.

The following arguments are supported:

  • collectionName: Required string. The name of the collection on which the index is defined.
  • databaseName: Required string. The name of the database containing the collection.
  • indexName: Required string. The name of the index to check.

An example of this operation follows:

- name: assertIndexNotExists
  object: testRunner
  arguments:
    collectionName: *collection0Name
    databaseName:  *database0Name
    indexName: t_1

Use a listIndexes command to check whether the index exists. Note that it is currently not possible to run listIndexes from within a transaction.

createEntities

The createEntities operation instructs the test runner to create the provided entities and store them in the current test's Entity Map.

  • entities: Required array of one or more entity objects. As with the file-level createEntities directive, test files SHOULD declare entities in dependency order, such that all referenced entities are defined before any of their dependent entities.

An example of this operation follows:

- name: createEntities
  object: testRunner
  arguments:
    entities:
      - client:
          id: &client0 client0
      - database:
          id: &database0 database0
          client: *client0
          databaseName: &database0Name test

loop

The loop operation executes sub-operations in a loop.

The following arguments are supported:

  • operations: Required array of operation objects. List of operations (henceforth referred to as sub-operations) to run on each loop iteration. Each sub-operation must be a valid operation as described in Entity Test Operations.

    Sub-operations SHOULD NOT include the loop operation.

    If, in the course of executing sub-operations, a sub-operation yields an error or failure, the test runner MUST NOT execute subsequent sub-operations in the same loop iteration. If storeErrorsAsEntity and/or storeFailuresAsEntity options are specified, the loop MUST store the error/failure accordingly and continue to the next iteration (i.e. the error/failure will not interrupt the test). If neither storeErrorsAsEntity nor storeFailuresAsEntity are specified, the loop MUST terminate and raise the error/failure (i.e. the error/failure will interrupt the test).

  • storeErrorsAsEntity: Optional string. If specified, the runner MUST capture errors arising during sub-operation execution and append a document with error information to the array stored in the specified entity.

    If this option is specified, the test runner MUST check the existence and the type of the entity with the specified name before executing the loop. If the entity does not exist, the test runner MUST create it with the type of BSON array. If the entity exists and is of type BSON array, the test runner MUST do nothing. If the entity exists and is of a different type, the test runner MUST raise an error.

    If this option is specified and storeFailuresAsEntity is not, failures MUST also be captured and appended to the array.

    Documents appended to the array MUST contain the following fields:

    • error: the textual description of the error encountered.
    • time: the number of (floating-point) seconds since the Unix epoch when the error was encountered.
  • storeFailuresAsEntity: Optional string. If specified, the runner MUST capture failures arising during sub-operation execution and append a document with failure information to the array stored in the specified entity.

    If this option is specified, the test runner MUST check the existence and the type of the entity with the specified name before executing the loop. If the entity does not exist, the test runner MUST create it with the type of BSON array. If the entity exists and is of type BSON array, the test runner MUST do nothing. If the entity exists and is of a different type, the test runner MUST raise an error.

    If this option is specified and storeErrorsAsEntity is not, errors MUST also be captured and appended to the array.

    Documents appended to the array MUST contain the following fields:

    • error: the textual description of the failure encountered.
    • time: the number of (floating-point) seconds since the Unix epoch when the failure was encountered.
  • storeSuccessesAsEntity: Optional string. If specified, the runner MUST keep track of the number of sub-operations that completed successfully, and store that number in the specified entity. For example, if the loop contains two sub-operations, and they complete successfully, each loop execution would increment the number of successes by two.

    If the entity of the specified name already exists, the test runner MUST raise an error.

  • storeIterationsAsEntity: Optional string. If specified, the runner MUST keep track of the number of iterations of the loop performed, and store that number in the specified entity. The number of loop iterations is counted irrespective of whether sub-operations within the iteration succeed or fail.

    If the entity of the specified name already exists, the test runner MUST raise an error.

A failure is when the result or outcome of an operation executed by the test runner differs from its expected outcome. For example, an expectResult assertion failing to match a BSON document or an expectError assertion failing to match an error message would be considered a failure. An error is any other type of error raised by the test runner. For example, an unsupported operation or inability to resolve an entity name would be considered an error.

This specification permits the test runner to report some failures as errors and some errors as failures. When the test runner stores errors and failures as entities it MAY classify conditions as errors and failures in the same way as it would when used in the driver's test suite. This includes reporting all errors as failures or all failures as errors.

If the test runner does not distinguish errors and failures in its reporting, it MAY report both conditions under either category, but it MUST report any given condition in at most one category.

The following termination behavior MUST be implemented by the test runner:

  • The test runner MUST provide a way to request termination of loops. This request will be made by the Atlas testing workload executor in response to receiving the termination signal from Astrolabe.
  • When the termination request is received, the test runner MUST stop looping. If the test runner is looping when the termination request is received, the current loop iteration MUST complete to its natural conclusion (success or failure). If the test runner is not looping when the termination request is received, it MUST NOT start any new loop iterations in either the current test or subsequent tests for the lifetime of the test runner.
  • The termination request MUST NOT affect non-loop operations, including any operations after the loop. The tests SHOULD NOT be written in such a way that the success or failure of operations that follow loops depends on how many loop iterations were performed.
  • Receiving the termination request MUST NOT by itself be considered an error or a failure by the test runner.

The exact mechanism by which the workload executor requests termination of the loop in the test runner, including the respective API, is left to the driver.

Tests SHOULD NOT include multiple loop operations (nested or sequential).

An example of this operation follows:

- name: loop
  object: testRunner
  arguments:
    storeErrorsAsEntity: errors
    storeFailuresAsEntity: failures
    storeSuccessesAsEntity: successes
    storeIterationsAsEntity: iterations
    operations:
      - name: find
        object: *collection0
        arguments:
          filter: { _id: { $gt: 1 }}
          sort: { _id: 1 }
        expectResult:
          - { _id: 2, x: 22 }
          - { _id: 3, x: 33 }

assertNumberConnectionsCheckedOut

The assertNumberConnectionsCheckedOut operation instructs the test runner to assert that the given number of connections are currently checked out for the specified client entity. This operation only applies to drivers that implement connection pooling and should be skipped for drivers that do not.

The following arguments are supported:

  • client: Required string. See commonOptions_client.
  • connections: Required integer. The number of connections expected to be checked out.

An example of this operation follows:

- name: assertNumberConnectionsCheckedOut
  object: testRunner
  arguments:
    client: *client0
    connections: 1

runOnThread

The runOnThread operation instructs the test runner to schedule an operation to be run on a given thread. The given thread MUST begin executing the operation immediately. runOnThread MUST NOT wait for the operation to complete. If any of the operation's test assertions fail, the entire test case MUST fail as well.

When writing test cases that use runOnThread, it's important to note that certain entities are not concurrency-safe (e.g. sessions, cursors) and therefore SHOULD NOT be used in operations across multiple thread entities.

The following arguments are supported:

  • thread: Required string. Thread entity on which this operation should be scheduled.
  • operation: Required operation object. The operation to schedule on the thread. This object must be a valid operation as described in Entity Test Operations.

An example of this operation follows:

- name: runOnThread
  object: testRunner
  arguments:
    thread: *thread0
    operation:
      name: insertOne
      object: *collection0
      arguments:
        document: { _id: 2 }
      expectResult:
        $$unsetOrMatches:
          insertedId: { $$unsetOrMatches: 2 }

waitForThread

The waitForThread operation instructs the test runner to notify the given thread that no more operations are forthcoming, wait for it to complete its last operation, and assert that it exited without any errors.

If the waitForThread operation is not satisfied after 10 seconds, this operation MUST cause a test failure.

The test.operations list SHOULD contain a waitForThread operation for each thread entity that the test creates.

The following arguments are supported:

  • thread: Required string. Thread entity that should be stopped and awaited for completion.

An example of this operation follows:

- name: waitForThread
  object: testRunner
  arguments:
    thread: *thread0

waitForEvent

The waitForEvent operation instructs the test runner to wait until the specified MongoClient has published a specific, matching event a given number of times. Note that this includes any events published before the waitForEvent operation started.

If the waitForEvent operation is not satisfied after 10 seconds, this operation MUST cause a test failure.

The following arguments are supported:

  • client: Required string. Client entity whose events the runner will be waiting for.
  • event: Required expectedEvent object. The event which the test runner is waiting to be observed on the provided client. The assertions laid out in expectedEvent MUST be used to determine if an observed event matches event. event SHOULD have an event type that was included in the client's observeEvents field.
  • count: Required integer. The number of matching events to wait for before resuming execution of subsequent operations.

For example, the following instructs the test runner to wait for at least one poolClearedEvent to be published:

- name: waitForEvent
  object: testRunner
  arguments:
    client: *client0
    event:
      poolClearedEvent: {}
    count: 1

assertEventCount

The assertEventCount operation instructs the test runner to assert the specified MongoClient has published a specific, matching event a given number of times so far in the test.

The following arguments are supported:

  • client: Required string. Client entity whose events the runner will be counting.
  • event: Required expectedEvent object. The event which the test runner will be counting on the provided client. The assertions laid out in expectedEvent MUST be used to determine if an observed event matches event. event SHOULD have an event type that was included in the client's observeEvents field.
  • count: Required integer. The exact number of matching events that client should have seen so far.

For example, the following instructs the test runner to assert that a single PoolClearedEvent was published:

- name: assertEventCount
  object: testRunner
  arguments:
    client: *client0
    event:
      poolClearedEvent: {}
    count: 1

recordTopologyDescription

The recordTopologyDescription operation instructs the test runner to retrieve the specified MongoClient's current TopologyDescription and store it in the Entity Map.

The following arguments are supported:

  • client: Required string. Client entity whose TopologyDescription will be recorded.
  • id: Required string. The name with which the TopologyDescription will be stored in the Entity Map.

For example:

- name: recordTopologyDescription
  object: testRunner
  arguments:
    client: *client0
    id: &postInsertTopology postInsertTopology

assertTopologyType

The assertTopologyType operation instructs the test runner to assert that the given TopologyDescription has a particular TopologyType.

The following arguments are supported:

  • topologyDescription: Required string. TopologyDescription entity whose TopologyType will be inspected.
  • topologyType: Required string. Expected TopologyType for the TopologyDescription. See SDAM: TopologyType for a list of possible values.

For example:

- name: assertTopologyType
  object: testRunner
  arguments:
    topologyDescription: *postInsertTopology
    topologyType: ReplicaSetWithPrimary

waitForPrimaryChange

The waitForPrimaryChange operation instructs the test runner to wait until the provided MongoClient discovers a different primary from the one in the provided TopologyDescription. If the provided TopologyDescription does not include a primary, then this operation will wait until the client discovers any primary.

The following arguments are supported:

  • client: Required string. See commonOptions_client.

    The client entity MUST be the same one from which priorTopologyDescription was derived. Test runners do not need to verify this.

  • priorTopologyDescription: Required string. The name of a TopologyDescription entity which will be used to determine if the primary has changed or not.

  • timeoutMS: Optional integer. The number of milliseconds to wait for the primary change before timing out and failing the test. If unspecified, a default timeout of 10 seconds MUST be used.

For example:

- name: waitForPrimaryChange
  object: testRunner
  arguments:
    client: *client0
    priorTopologyDescription: *postInsertTopology
    timeoutMS: 1000

wait

The wait operation instructs the test runner to sleep for a provided number of milliseconds.

The following arguments are supported:

  • ms: Required integer. The number of milliseconds for which the current test runner thread should sleep.

For example:

- name: wait
  object: testRunner
  arguments:
    ms: 1000

Special Placeholder Value

$$placeholder

Syntax:

{ field: { $$placeholder: 1 } }

This special key-value pair can be used anywhere the value for a key might be specified in a test file. It is intended to act as a placeholder value in contexts where the test runner cannot provide a definite value or may be expected to replace the placeholder with a value that cannot be specified by the test file (e.g. KMS provider credentials). The test runner MUST raise an error if a placeholder value is used in an unexpected context or a replacement cannot be made.

An example of using this placeholder value follows:

kmsProviders:
  aws:
    accessKeyId: { $$placeholder: 1 }
    secretAccessKey: { $$placeholder: 1 }

Note: the test runner is not required to validate the type or value of a $$placeholder field.

Evaluating Matches

Expected values in tests (e.g. operation.expectResult) are expressed as either relaxed or canonical Extended JSON.

The algorithm for matching expected and actual values is specified with the following pseudo-code:

function match (expected, actual):
  if expected is a document:
    // handle special operators (e.g. $$type)
    if first and only key of expected starts with "$$":
      execute any assertion(s) for the special operator
      return

    assert that actual is a document

    for every key/value in expected:
      // handle key-based operators (e.g. $$exists, $$unsetOrMatches)
      if value is a document and its first and only key starts with "$$":
        execute any assertion(s) for the special operator
        continue to the next iteration unless actual value must be matched

      assert that actual[key] exists
      assert that actual[key] matches value

    if expected is not the root document:
      assert that actual does not contain additional keys

    return

  if expected is an array:
    assert that actual is an array
    assert that actual and expected have the same number of elements

    for every index/value in expected:
      assert that actual[index] matches value

    return

  // expected is neither a document nor array
  assert that actual and expected are the same type, noting flexible numerics
  assert that actual and expected are equal

The rules for comparing documents and arrays are discussed in more detail in subsequent sections. When comparing types other than documents and arrays, test runners MAY adopt any of the following approaches to compare expected and actual values, as long as they are consistent:

  • Convert both values to relaxed or canonical Extended JSON and compare strings
  • Convert both values to BSON, and compare bytes
  • Convert both values to native representations, and compare accordingly

When comparing types that contain documents as internal properties (e.g. CodeWScope), the rules in Evaluating Matches do not apply and the documents MUST match exactly; however, test runners MUST permit variation in document key order or otherwise normalize the documents before comparison.

Flexible Numeric Comparisons

When comparing numeric types (excluding Decimal128), test runners MUST consider 32-bit, 64-bit, and floating point numbers to be equal if their values are numerically equivalent. For example, an expected value of 1 would match an actual value of 1.0 (e.g. ok field in a server response) but would not match 1.5.

Allowing Extra Fields in Root-level Documents

When matching root-level documents, test runners MUST permit the actual document to contain additional fields not present in the expected document. Examples of root-level documents include, but are not limited to:

For example, the following documents match:

expected: { x: 1 }
actual: { x: 1, y: 1 }

The inverse is not true. For example, the following documents do not match:

expected: { x: 1, y: 1 }
actual: { x: 1 }

Test runners MUST NOT permit additional fields in nested documents. For example, the following documents do not match:

expected: { x: { y: 1 } }
actual: { x: { y: 1, z: 1 } }

It may be helpful to think of expected documents as a form of query criteria. The intention behind this rule is that it is not always feasible or relevant for a test to exhaustively specify all fields in an expected document (e.g. cluster time in command for CommandStartedEvent).

When the expected value is an array, test runners MUST differentiate between an array of values, which may themselves be documents (e.g. distinct), and an array of root-level documents (e.g. find, aggregate). For example, the following array of documents would not match if returned by distinct, but would match if returned by find (after iterating the cursor):

expected: [ { x: 1 }, { x: 2 } ]
actual: [ { x: 1, y: 1 }, { x: 2, y: 2 } ]

Document Key Order Variation

When matching documents, test runners MUST NOT require keys in the expected and actual document to appear in the same order. For example, the following documents would match:

expected: { x: 1, y: 1 }
actual: { y: 1, x: 1 }

Arrays Must Contain the Same Number of Elements

When comparing arrays, expected and actual values MUST contain the same number of elements. For example, the following arrays corresponding to a distinct operation result would not match:

expected: [ 1, 2, 3 ]
actual: [ 1, 2, 3, 4 ]

Special Operators for Matching Assertions

When matching expected and actual values, an equality comparison is not always sufficient. For instance, a test file cannot anticipate what a session ID will be at runtime, but may still want to analyze the contents of an lsid field in a command document. To address this need, special operators can be used.

These operators are objects with a single key identifying the operator. Such keys are prefixed with $$ to ease detection of an operator (test runners need only inspect the first key of each object) and to differentiate the object from MongoDB query operators, which use a single $ prefix. The key will map to some value that influences the operator's behavior (if applicable).

When examining the structure of an expected value during a comparison, test runners MUST check if the value is an object whose first and only key starts with $$ and, if so, defer to the special logic defined in this section.

$$exists

Syntax:

{ field: { $$exists: <boolean> } }

This operator can be used anywhere the value for a key might be specified in an expected document. If true, the test runner MUST assert that the key exists in the actual document, irrespective of its value (e.g. a key with a null value would match). If false, the test runner MUST assert that the key does not exist in the actual document. This operator is modeled after the $exists query operator.

An example of this operator checking for a field's presence follows:

command:
  getMore: { $$exists: true }
  collection: *collectionName
  batchSize: 5

An example of this operator checking for a field's absence follows:

command:
  update: *collectionName
  updates: [ { q: {}, u: { $set: { x: 1 } } } ]
  ordered: true
  writeConcern: { $$exists: false }

$$type

Syntax:

{ $$type: <string> }
{ $$type: [ <string>, <string>, ... ] }

This operator can be used anywhere a matched value is expected (including expectResult). The test runner MUST assert that the actual value exists and matches one of the expected types, which correspond to the documented string types for the $type query operator.

An example of this operator follows:

command:
  getMore: { $$type: [ "int", "long" ] }
  collection: { $$type: "string" }

When the actual value is an array, test runners MUST NOT examine types of the array's elements. Only the type of the actual field SHALL be checked. This is admittedly inconsistent with the behavior of the $type query operator, but there is presently no need for this behavior in tests.

$$matchesEntity

Syntax, where entityName is a string:

{ $$matchesEntity: <entityName> }

This operator can be used to reference a BSON entity anywhere a matched value is expected (including expectResult). If the BSON entity is defined in the current test's Entity Map, the test runner MUST fetch that entity and assert that the actual value matches the entity using the standard rules in Evaluating Matches; otherwise, the test runner MUST raise an error for an undefined or mistyped entity. The YAML file SHOULD use an alias node for the entity name.

This operator is primarily used to assert identifiers for uploaded GridFS files.

An example of this operator follows:

operations:
  -
    object: *bucket0
    name: upload
    arguments:
      filename: "filename"
      source: { $$hexBytes: "12AB" }
    expectResult: { $$type: "objectId" }
    saveResultAsEntity: &objectid0 "objectid0"
  - object: *filesCollection
    name: find
    arguments:
      sort: { uploadDate: -1 }
      limit: 1
    expectResult:
      - _id: { $$matchesEntity: *objectid0 }

$$matchesHexBytes

Syntax, where hexBytes is an even number of hexadecimal characters (case-insensitive) and MAY be empty:

{ $$matchesHexBytes: <hexBytes> }

This operator can be used anywhere a matched value is expected (including expectResult) and the actual value is a string. The test runner MUST raise an error if the hexBytes string is malformed. This operator is primarily used to assert the results of download and downloadByName, which return stream contents as a string.

$$unsetOrMatches

Syntax:

{ $$unsetOrMatches: <anything> }

This operator can be used anywhere a matched value is expected (including expectResult), excluding an array element because Arrays Must Contain the Same Number of Elements. The test runner MUST assert that the actual value either does not exist or matches the expected value. Matching the expected value MUST use the standard rules in Evaluating Matches, which means that it may contain special operators.

This operator is primarily used to assert driver-optional fields from the CRUD spec (e.g. insertedId for InsertOneResult, writeResult for BulkWriteException).

An example of this operator used for a result's field follows:

expectResult:
  insertedId: { $$unsetOrMatches: 2 }

An example of this operator used for an entire result follows:

expectError:
  expectResult:
    $$unsetOrMatches:
      deletedCount: 0
      insertedCount: 2
      matchedCount: 0
      modifiedCount: 0
      upsertedCount: 0
      upsertedIds: { }

$$sessionLsid

Syntax:

{ $$sessionLsid: <sessionEntityName> }

This operator can be used anywhere a matched value is expected (including expectResult). If the session entity is defined in the current test's Entity Map, the test runner MUST assert that the actual value equals its logical session ID; otherwise, the test runner MUST raise an error for an undefined or mistyped entity. The YAML file SHOULD use an alias node for a session entity's id field (e.g. session: *session0).

An example of this operator follows:

command:
  ping: 1
  lsid: { $$sessionLsid: *session0 }

$$lte

Syntax:

{ $$lte: 5 }

This operator can be used anywhere a matched value is expected (including expectResult). The test runner MUST assert that the actual value is less than or equal to the specified value. Test runners MUST also apply the rules specified in Flexible Numeric Comparisons for this operator. For example, an expected value of 1 would match an actual value of 1.0 and 0.0 but would not match 1.1.
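
For illustration, a hypothetical command expectation might use this operator to bound a field such as maxTimeMS (the alias and values are assumed):

command:
  insert: *collectionName
  maxTimeMS: { $$lte: 1000 }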

$$matchAsDocument

Syntax:

{ $$matchAsDocument: <anything> }

This operator may be used anywhere a matched value is expected (including expectResult) and the actual value is an extended JSON string. The test runner MUST parse the actual value into a BSON document and match it against the expected value. Matching the expected value MUST use the standard rules in Evaluating Matches, which means that it may contain special operators. This operator is primarily used to assert on extended JSON commands and command replies included in log messages.

$$matchAsRoot

This operator may be used anywhere a matched value is expected (including expectResult) and the expected and actual values are documents. The test runner MUST treat the actual value as a root-level document as described in Evaluating Matches and match it against the expected value.
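
For illustration, the two operators above might be combined when asserting a command document embedded as an extended JSON string in a log message. A hypothetical, non-normative expectation follows (aliases and field names are assumed):

data:
  command:
    $$matchAsDocument:
      $$matchAsRoot:
        insert: *collectionName
        $db: *databaseName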

Test Runner Implementation

The sections below describe instructions for instantiating the test runner, loading each test file, and executing each test within a test file. Test runners MUST NOT share state created by processing a test file with the processing of subsequent test files, and likewise for tests within a test file.

Initializing the Test Runner

The test runner MUST be configurable with a connection string (or equivalent configuration), which will be used to initialize any internal MongoClient(s) and any client entities (in combination with other URI options). This specification is not prescriptive about how this information is provided. For example, it may be read from an environment variable or configuration file.

Create a new MongoClient, which will be used for internal operations (e.g. processing initialData and test.outcome). This is referred to elsewhere in the specification as the internal MongoClient. If this MongoClient would connect to multiple mongos nodes and the driver does not provide a way to target operations to specific servers, the test runner MAY construct internal MongoClients for each mongos.

Determine the server version and topology type using an internal MongoClient. This information will be used to evaluate any future runOnRequirement checks. Test environments SHOULD NOT use mixed version clusters, so it is not necessary to check multiple servers.

In addition to the aforementioned connection string, the test runner MUST also be configurable with two other connection strings (or equivalent configuration) that point to TCP load balancers: one fronting multiple servers and one fronting a single server. These will be used to initialize client entities when executing tests against a load balanced sharded cluster. If the topology type is LoadBalanced and Atlas Serverless is not being used, the test runner MUST error if either of these URIs is not provided. When testing against other topology types or Atlas Serverless, these URIs SHOULD NOT be provided and MUST be ignored if provided.

The test runner SHOULD terminate any open transactions (see: Terminating Open Transactions) using the internal MongoClient(s) before executing any tests.

Executing a Test File

The instructions in this section apply for each test file loaded by the test runner.

Test files, which may be YAML or JSON files, MUST be interpreted using an Extended JSON parser. The parser MUST accept relaxed and canonical Extended JSON (per Extended JSON: Parsers), as test files may use either.

Upon loading a file, the test runner MUST read the schemaVersion field and determine if the test file can be processed further. Test runners MAY support multiple versions and MUST NOT process incompatible files (as discussed in Test Runner Support). If a test file is incompatible, test runners MUST raise an error and MAY do so by reporting a test failure. Test runners MAY make an effort to infer the number of tests (and their descriptions) from an incompatible file and report a failure for each test.

If runOnRequirements is specified, the test runner MUST skip the test file unless one or more runOnRequirement objects are satisfied.

For each element in tests, follow the process in Executing a Test.

Executing a Test

The instructions in this section apply for each test occurring in a test file loaded by the test runner. After processing a test, test runners MUST reset any internal state that resulted from doing so. For example, the Entity Map created for one test MUST NOT be shared with another.

If at any point while executing this test an unexpected error is encountered or an assertion fails, the test runner MUST consider this test to have failed and SHOULD continue with the instructions in this section to ensure that the test environment is cleaned up (e.g. disable fail points, kill sessions) while also forgoing any additional assertions.

If test.skipReason is specified, the test runner MUST skip this test and MAY use the string value to log a message.

If test.runOnRequirements is specified, the test runner MUST skip the test unless one or more runOnRequirement objects are satisfied.

If initialData is specified, for each collectionData therein the test runner MUST set up the collection. All setup operations MUST use the internal MongoClient and a "majority" write concern. The test runner MUST first drop the collection. If a createOptions document is present, the test runner MUST execute a create command to create the collection with the specified options. The test runner MUST then insert the specified documents (if any). If no documents are present and createOptions is not set, the test runner MUST create the collection. If the topology is sharded, the test runner SHOULD use a single mongos for handling initialData to avoid possible runtime errors.

If the test might execute a transaction against a sharded deployment, the test runner MUST collect the latest cluster time from processing initialData. See MigrationConflict Errors on Sharded Clusters for more information.

Create a new Entity Map that will be used for this test. If createEntities is specified, the test runner MUST create each entity accordingly and add it to the map. If the topology is a sharded cluster, the test runner MUST handle useMultipleMongoses accordingly if it is specified for any client entities. If the topology type is LoadBalanced, client entities MUST be initialized with the appropriate load balancer connection string as discussed in useMultipleMongoses.

If the test might execute a transaction against a sharded deployment, the test runner MUST advance the cluster time of any session entities created during the test using the cluster time collected from processing initialData. See MigrationConflict Errors on Sharded Clusters for more information.

If the test might execute a distinct command within a sharded transaction, for each target collection the test runner SHOULD execute a non-transactional distinct command on each mongos server using an internal MongoClient. See StaleDbVersion Errors on Sharded Clusters for more information.

If the test might execute a configureFailPoint command, for each target client the test runner MAY specify a reduced value for heartbeatFrequencyMS (and minHeartbeatFrequencyMS if possible) to speed up SDAM recovery time and server selection after a failure; however, test runners MUST NOT do so for any client that specifies heartbeatFrequencyMS in its uriOptions.

For each client entity where observeEvents has been specified, the test runner MUST enable all event listeners necessary to collect the specified event types. Test runners MAY leave event listeners disabled for tests that do not assert events (i.e. tests that omit both test.expectEvents and special operations such as assertSameLsidOnLastTwoCommands).

For each client with command monitoring enabled, the test runner MUST ignore events for the following:

  • Any command(s) specified in ignoreCommandMonitoringEvents.
  • Any configureFailPoint commands executed for failPoint and targetedFailPoint operations.
  • Any commands containing sensitive information (per the Command Logging and Monitoring spec) unless observeSensitiveCommands is true. Note that drivers will redact commands and replies for sensitive commands. For hello and legacy hello, which are conditionally sensitive based on the presence of a speculativeAuthenticate field, the test runner may need to infer that the events are sensitive based on whether or not the command and reply documents are redacted (i.e. empty documents).

For each element in test.operations, follow the process in Executing an Operation. If an unexpected error is encountered or an assertion fails, the test runner MUST consider this test to have failed.

If any event listeners were enabled on any client entities, the test runner MUST now disable those event listeners.

If any fail points were configured, the test runner MUST now disable those fail points (on the same server) to avoid spurious failures in subsequent tests. For any fail points configured using targetedFailPoint, the test runner MUST disable the fail point on the same mongos server on which it was originally configured. See Disabling Fail Points for more information.

If test.expectEvents is specified, for each object therein the test runner MUST assert that the number and sequence of expected events match the number and sequence of actual events observed on the specified client. If the list of expected events is empty, the test runner MUST assert that no events were observed on the client. The process for matching events is described in expectedEvent.

If test.outcome is specified, for each collectionData therein the test runner MUST assert that the collection contains exactly the expected data. The test runner MUST query each collection using the internal MongoClient, an ascending sort order on the _id field (i.e. { _id: 1 }), a "primary" read preference, and a "local" read concern. When comparing collection data, the rules in Evaluating Matches do not apply and the documents MUST match exactly; however, test runners MUST permit variations in document key order or otherwise normalize the documents before comparison. If the list of documents is empty, the test runner MUST assert that the collection is empty.

Before clearing the entity map at the end of each test, the test runner MUST allow its entities to be accessed externally. The exact mechanism for facilitating this access is not prescribed by this specification, but drivers should be mindful of concurrency if applicable. As an example, the test runner MAY be configured with a callback method, which will be invoked at the end of each test and provided with the entity map (or an equivalent data structure). As previously discussed in Entity Map, test runners MAY restrict access to driver objects if necessary.

Clear the entity map for this test. For each ClientSession in the entity map, the test runner MUST end the session (e.g. call endSession). For each ChangeStream and FindCursor in the entity map, the test runner MUST close the cursor.

If the test started a transaction (i.e. executed a startTransaction or withTransaction operation), the test runner MUST terminate any open transactions (see: Terminating Open Transactions).

Proceed to the subsequent test.

Executing an Operation

The instructions in this section apply for each operation occurring in a test contained within a test file loaded by the test runner.

If at any point while executing an operation an unexpected error is encountered or an assertion fails, the test runner MUST consider the parent test to have failed and proceed from Executing a Test accordingly.

If operation.object is "testRunner", this is a special operation. If operation.name is defined in Special Test Operations, the test runner MUST execute the operation accordingly and, if successful, proceed to the next operation in the test; otherwise, the test runner MUST raise an error for an undefined operation. The test runner MUST keep a record of any fail points configured by special operations so that they may be disabled after the current test.

If operation.object is not "testRunner", this is an entity operation. If operation.object is defined in the current test's Entity Map, the test runner MUST fetch that entity and note its type; otherwise, the test runner MUST raise an error for an undefined entity. If operation.name does not correspond to a known operation for the entity type (per Entity Test Operations), the test runner MUST raise an error for an unsupported operation. Test runners MAY skip tests that include operations that are intentionally unimplemented (e.g. listCollectionNames).

Proceed with preparing the operation's arguments. If session is specified in operation.arguments, the test runner MUST resolve it to a session entity and MUST raise an error if the name is undefined or maps to an unexpected type. If a key in operation.arguments does not correspond to a known parameter/option for the operation, the test runner MUST raise an error for an unsupported argument.

Before executing the operation, the test runner MUST be prepared to catch a potential error from the operation (e.g. enter a try block). Proceed with executing the operation and capture its result or error.

Note that some operations require special handling, as discussed in Entity Test Operations. For example, model objects may need to be converted to documents (before matching or saving in the entity map) and returned iterables may need to be fully iterated.

If operation.ignoreResultAndError is true, the test runner MUST NOT make any assertions regarding the result or error of the operation and MUST proceed to the subsequent operation.

If operation.expectError is specified, the test runner MUST assert that the operation yielded an error; otherwise, the test runner MUST assert that the operation did not yield an error. If an error was expected, the test runner MUST evaluate any assertions in expectedError accordingly.

If operation.expectResult is specified, the test runner MUST assert that it matches the actual result of the operation according to the rules outlined in Evaluating Matches.

If operation.saveResultAsEntity is specified, the test runner MUST store the result in the current test's entity map using the specified name. If the operation did not return a result or the result does not comply with Supported Entity Types then the test runner MUST raise an error.

After asserting the operation's error and/or result and optionally saving the result, proceed to the subsequent operation.
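
Taken together, the steps above might be sketched as the following non-normative pseudocode. The helper functions (executeSpecialOperation, getEntityOperation, resolveArguments, assertExpectedError, assertMatches) are hypothetical placeholders for the behavior described in this section:

async function executeOperation(operation, entityMap) {
  if (operation.object === 'testRunner') {
    // Special operation: raise an error if operation.name is not defined in Special Test Operations.
    return executeSpecialOperation(operation);
  }
  const entity = entityMap.get(operation.object);                 // raise if the entity is undefined
  const method = getEntityOperation(entity, operation.name);      // raise if unsupported for this entity type
  const args = resolveArguments(operation.arguments, entityMap);  // e.g. resolve session names to session entities

  let result, error;
  try {
    result = await method(entity, args);
  } catch (e) {
    error = e;
  }

  if (operation.ignoreResultAndError === true) return;
  if (operation.expectError) {
    assertExpectedError(error, operation.expectError);   // raise if no error occurred or it does not match
  } else if (error) {
    throw error;                                          // unexpected error fails the parent test
  }
  if (operation.expectResult !== undefined) {
    assertMatches(result, operation.expectResult);        // per Evaluating Matches
  }
  if (operation.saveResultAsEntity) {
    entityMap.set(operation.saveResultAsEntity, result);  // raise if not a Supported Entity Type
  }
}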

Special Procedures

This section describes some procedures that may be referenced from earlier sections.

Terminating Open Transactions

Open transactions can cause tests to block indefinitely. When connected to MongoDB 3.6 or later, test runners SHOULD terminate all open transactions at the start of a test suite and after each failed test by killing all sessions in the cluster. Using the internal MongoClient(s), execute the killAllSessions command on either the primary or, if connected to a sharded cluster, each mongos server.

For example:

db.adminCommand({
  killAllSessions: []
});

The test runner MAY ignore the following command failures:

  • Interrupted(11601) to work around SERVER-38335.
  • Unauthorized(13) to work around SERVER-54216.
  • CommandNotFound(59) if the command is executed on a pre-3.6 server

Note that Atlas, by design, does not allow database users to kill sessions belonging to other users. This makes it impossible to guarantee that an existing transaction will not block test execution. To work around this, test runners SHOULD either ignore Unauthorized(13) command failures or avoid calling killAllSessions altogether when connected to Atlas (e.g. by detecting mongodb.net in the hostname or allowing the test runner to be configured externally).

Test runners MUST NOT execute killAllSessions when connected to Atlas Data Lake.

MigrationConflict Errors on Sharded Clusters

Following SERVER-82353, drivers may encounter MigrationConflict errors when executing transactions on sharded clusters due to an outdated cluster time. For example:

Transaction fe0b9a7d-4b64-4e38-aad2-4c7dd5ab0c2b - 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= -  - :1 was aborted on statement 0 due to: a non-retryable snapshot error :: caused by :: Encountered error from localhost:27217 during a transaction :: caused by :: Database transaction-tests has undergone a catalog change operation at time Timestamp(1708062645, 94) and no longer satisfies the requirements for the current transaction which requires Timestamp(1708062645, 93). Transaction will be aborted.

To work around this issue, a test runner MUST collect the latest cluster time from processing initialData, which is executed using the internal MongoClient, and use it to advance the cluster time of every session entity created during the test (whether by createEntities or the special test operation). This will ensure that the cluster time is gossiped to any other mongos hosts that may be used to execute a transaction during the test.

Depending on the test runner implementation, the cluster time may be collected in various ways. For example:

  • Execute a ping command against the same mongos host used for initialData and read the $clusterTime response field
  • Execute all operations for initialData using an explicit session and read its cluster time after the last operation.
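
As a non-normative sketch of the first approach in the mongo shell (the session variable below stands in for each session entity created by the test runner, and advanceClusterTime is the session method defined by the Driver Sessions specification):

// Run against the same mongos host used for initialData and capture the cluster time.
const reply = db.adminCommand({ ping: 1 });
const clusterTime = reply.$clusterTime;

// Later, for every session entity created by createEntities or the special test operation:
session.advanceClusterTime(clusterTime);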

This logic is only necessary for tests that might execute a transaction against a sharded deployment (i.e. test.operations includes startTransaction or withTransaction and the topology is sharded or load-balanced). To ease the implementation, test runners MAY apply this logic to every test.

StaleDbVersion Errors on Sharded Clusters

When a shard receives its first command that contains a databaseVersion, the shard returns a StaleDbVersion error and mongos retries the operation. In a sharded transaction, mongos does not retry these operations and instead returns the error to the client. For example:

Command distinct failed: Transaction aa09e296-472a-494f-8334-48d57ab530b6:1 was aborted on statement 0 due to: an error from cluster data placement change :: caused by :: got stale databaseVersion response from shard sh01 at host localhost:27217 :: caused by :: don't know dbVersion.

To work around this limitation, test runners MUST execute a non-transactional distinct command on each mongos server before running any test that might execute distinct within a transaction. To ease the implementation, test runners MAY execute distinct before every test.
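
For example, in the mongo shell connected to each mongos in turn (the database and collection names below are placeholders for the namespace under test):

db.getSiblingDB('test').runCommand({ distinct: 'coll', key: '_id' });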

Test runners can remove this workaround once SERVER-39704 is resolved, after which point mongos should retry the operation transparently. The distinct command is the only command allowed in a sharded transaction that uses the databaseVersion concept, so it is the only command affected.

Server Fail Points

Many tests utilize the configureFailPoint command to trigger server-side errors such as dropped connections or command errors. Tests can configure fail points using the special failPoint or targetedFailPoint operations.

This internal command is not documented in the MongoDB manual (pending DOCS-10784); however, there is scattered documentation available on the server wiki (The "failCommand" Fail Point) and employee blogs (e.g. Intro to Fail Points, Testing Network Errors With MongoDB). Documentation can also be gleaned from JIRA tickets (e.g. SERVER-35004, SERVER-35083). This specification does not aim to provide comprehensive documentation for all fail points available for driver testing, but some fail points are documented in Fail Points Commonly Used in Tests.

Configuring Fail Points

The configureFailPoint command is executed on the admin database and has the following structure:

db.adminCommand({
    configureFailPoint: <string>,
    mode: <string|object>,
    data: <object>
});

The value of configureFailPoint is a string denoting the fail point to be configured (e.g. "failCommand").

The mode option is a generic fail point option and may be assigned a string or object value. The string values "alwaysOn" and "off" may be used to enable or disable the fail point, respectively. An object may be used to specify either times or skip, which are mutually exclusive:

  • { times: <integer> } may be used to limit the number of times the fail point may trigger before transitioning to "off".
  • { skip: <integer> } may be used to defer the first trigger of a fail point, after which it will transition to "alwaysOn".

The data option is an object that may be used to specify any options that control the particular fail point's behavior.

In order to use configureFailPoint, the undocumented enableTestCommands server parameter must be enabled via either the configuration file or a command line option (e.g. --setParameter enableTestCommands=1); it cannot be enabled at runtime via the setParameter command. This parameter should already be enabled for most configuration files in mongo-orchestration.

Disabling Fail Points

A fail point may be disabled like so:

db.adminCommand({
    configureFailPoint: <string>,
    mode: "off"
});

Fail Points Commonly Used in Tests

failCommand

The failCommand fail point allows the client to force the server to return an error for commands listed in the data.failCommands field. Additionally, this fail point is documented in the server wiki: The failCommand Fail Point.

The failCommand fail point was introduced in mongod 4.0.0 (SERVER-34551) and mongos 4.1.5 (SERVER-35518); however, the fail point was not usable for testing on mongos until version 4.1.7 (SERVER-34943).

The failCommand fail point may be configured like so:

db.adminCommand({
    configureFailPoint: "failCommand",
    mode: <string|object>,
    data: {
      failCommands: [<string>, ...],
      closeConnection: <boolean>,
      errorCode: <integer>,
      errorLabels: [<string>, ...],
      writeConcernError: <object>,
      appName: <string>,
      blockConnection: <boolean>,
      blockTimeMS: <integer>,
    }
});

failCommand supports the following data options, which may be combined if desired:

  • failCommands: Required array of strings. Lists the command names to fail.
  • closeConnection: Optional boolean, which defaults to false. If true, the command will not be executed, the connection will be closed, and the client will see a network error.
  • errorCode: Optional integer, which is unset by default. If set, the command will not be executed and the specified command error code will be returned as a command error.
  • errorLabels: Optional array of strings. If set, this list overrides the server's normal behavior for adding error labels. An empty array may be used to prevent the server from adding error labels to the response. New in server 4.3.1 (SERVER-43941).
  • appName: Optional string, which is unset by default. If set, the fail point will only apply to connections for MongoClients created with this appname. New in server 4.4.0-rc2 (SERVER-47195).
  • blockConnection: Optional boolean, which defaults to false. If true, the server should block the affected commands for blockTimeMS. New in server 4.3.4 and backported to 4.2.9 (SERVER-41070).
  • blockTimeMS: Optional integer, which is required when blockConnection is true. The number of milliseconds for which the affected commands should be blocked. New in server 4.3.4 and backported to 4.2.9 (SERVER-41070).
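
As an illustrative (non-normative) example, the following configuration causes the next two find or aggregate commands issued by a client with appName "utr-test" to fail with error code 91 (ShutdownInProgress):

db.adminCommand({
    configureFailPoint: "failCommand",
    mode: { times: 2 },
    data: {
        failCommands: ["find", "aggregate"],
        errorCode: 91,
        appName: "utr-test"
    }
});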

onPrimaryTransactionalWrite

The onPrimaryTransactionalWrite fail point allows the client to force a network error before the server would return a write result to the client. The fail point also allows controlling whether the server will successfully commit the write via its failBeforeCommitExceptionCode option. Keep in mind that the fail point only triggers for transaction writes (i.e. write commands including txnNumber and lsid fields).

The onPrimaryTransactionalWrite fail point was introduced in mongod 3.6.0 (SERVER-29606) to facilitate testing of retryable writes. It is only supported on mongod and cannot be used with sharded clusters.

onPrimaryTransactionalWrite supports the following data options:

  • closeConnection: Optional boolean, which defaults to true. If true, the connection on which the write is executed will be closed before a result can be returned.
  • failBeforeCommitExceptionCode: Optional integer, which is unset by default. If set, the specified exception code will be thrown and the write will not be committed. If unset, the write will be allowed to commit.
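
For example, the following illustrative configuration closes the connection for the next such write while still allowing that write to commit:

db.adminCommand({
    configureFailPoint: "onPrimaryTransactionalWrite",
    mode: { times: 1 },
    data: { closeConnection: true }
});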

Determining if a Sharded Cluster Uses Replica Sets

When connected to a mongos server, the test runner can query the config.shards collection. Each shard in the cluster is represented by a document in this collection. If the shard is backed by a single server, the host field will contain a single host. If the shard is backed by a replica set, the host field will contain the name of the replica set followed by a forward slash and a comma-delimited list of hosts.
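
For example, in the mongo shell:

// Each document in config.shards describes one shard.
db.getSiblingDB('config').shards.find().forEach(function (shard) {
    // A replica set shard looks like "rs0/host1:27017,host2:27017";
    // a standalone shard contains only "host1:27017".
    const usesReplicaSet = shard.host.indexOf('/') !== -1;
    print(shard._id + ' uses a replica set: ' + usesReplicaSet);
});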

Note: MongoDB 3.6+ requires that all shard servers be replica sets.

Design Rationale

This specification was primarily derived from the test formats used by the Transactions and CRUD specs, which have served as models for other specs.

This specification commonly uses "SHOULD" when providing guidance on writing test files. While this may appear contradictory to the driver mantra preferring "MUST", it is intentional. Some of this guidance addresses style (e.g. adding comments, using YAML anchors) and cannot be enforced with a JSON schema. Other guidance needs to be purposefully ignored in order to test the test runner implementation (e.g. defining entities out of order to trigger runtime errors). The specification does prefer "MUST" in other contexts, such as discussing parts of the test file format that are enforceable by the JSON schema or the test runner implementation.

Why can't observeSensitiveCommands be true when authentication is enabled?

When running against authentication-enabled clusters, the events observed by a client will always begin with auth-related commands (e.g. authenticate, saslStart, saslContinue) because the MongoClient will need to authenticate a connection before issuing the first command in the test specification. Since the exact form of the authentication command event will depend on whether authentication is enabled, as well as the auth mechanism in use, it is not possible to anticipate the command monitoring output and perform the appropriate assertions. Consequently, we have restricted use of this property to situations where authentication is disabled on the server. This allows tests to explicitly test sensitive commands via the runCommand helper.

Breaking Changes

This section is reserved for future use. Any breaking changes to the test format SHOULD be described here in detail for historical reference, in addition to any shorter description that may be added to the Changelog.

Future Work

Assert expected log messages

When drivers support standardized logging, the test format may need to support assertions for messages expected to be logged while executing operations. Since log messages are strings, this may require an operator to match regex patterns within strings. Additionally, the test runner may need to support ignoring extra log output, similar to ignoreExtraEvents.

Target failPoint by read preference

The failPoint operation currently uses a "primary" read preference. To date, no spec has needed behavior to configure a fail point on a non-primary node. If the need does arise, failPoint can be enhanced to support a readPreference argument.

IO operations for GridFS streams

Original GridFS spec tests refer to "upload", "download", and "download_by_name" methods, which allow the tests to abstract stream IO and either upload a byte sequence or assert a downloaded byte sequence. These operations correspond to the download, downloadByName, upload, and uploadWithId Bucket Operations.

In order to support methods such as downloadToStream, openUploadStream, and openUploadStreamWithId, test runners would need to represent streams as entities and support IO operations to directly read from and write to a stream entity. This may not be worth the added complexity if the existing operations provide adequate test coverage for GridFS implementations.

Support Client-side Encryption integration tests

Supporting client-side encryption spec tests will require the following changes to the test format:

  • json_schema will need to be specified when creating a collection, via either the collection entity definition or initialData.
  • key_vault_data can be expressed via initialData
  • autoEncryptOpts will need to be specified when defining a client entity. Preparation of this field may require reading AWS credentials from environment variables.

The process for executing tests should not require significant changes, but test files will need to express a dependency on mongocryptd.

Incorporate referenced entity operations into the schema version

The Schema Version is not impacted by changes to operations defined in other specs and referenced in Entity Test Operations (e.g. find for CRUD). The operation.name and operation.arguments fields are loosely defined in the JSON schema as string and object types, respectively.

Ideally, all operations (and their arguments) would be enforced by the JSON schema and any changes to operations would affect the schema version accordingly. For example, a new find option would warrant a minor version bump both for the CRUD spec and this spec and its schema.

As discussed in Executing an Operation, test runners MUST raise errors for unsupported operations and arguments. This is a concession until such time that better processes can be established for versioning other specs and collating spec changes developed in parallel or during the same release cycle.

Changelog

  • 2024-05-08: Schema version 1.21.
    Add writeErrors and writeConcernErrors fields to expectedError for the client-level bulk write API.

  • 2024-04-15: Note that when directConnection is set to true test runners should only provide a single seed.

  • 2024-03-25: Schema version 1.20.
    Add previousDescription and newDescription assertions to topologyDescriptionChangedEvent when checking events with expectEvents

  • 2024-03-11: Note that killAllSessions should not be executed on Atlas Data Lake

  • 2024-02-27: Document onPrimaryTransactionalWrite fail point, which is used in retryable writes tests.

  • 2024-02-23: Require test runners to gossip cluster time from internal MongoClient to each session entity.

  • 2024-02-14: Clarify that errors raised from callback operations should always propagate to withTransaction.

  • 2024-02-12: Clarify that targetedFailPoint should only be used when useMultipleMongoses is true and not on load-balanced topologies.

  • 2024-02-06: Migrated from reStructuredText to Markdown.

  • 2024-01-17: Schema version 1.19.
    Add authMechanism to runOnRequirement and require that uriOptions supports placeholder documents.

  • 2024-01-11: Schema version 1.18.
    Allow named KMS providers in kmsProviders. Note location of Client-Side Encryption test credentials.

  • 2024-01-03: Document server version requirements for errorLabels and
    blockConnection options for failCommand fail point.

  • 2023-10-04: Schema version 1.17.
    Add serverHeartbeatStartedEvent, serverHeartbeatSucceededEvent, and serverHeartbeatFailedEvent for asserting on SDAM server heartbeat events.

  • 2023-09-25: Clarify that the UTR is intended to be run against enterprise servers.

  • 2023-07-18: Schema version 1.16.
    Add ignoreMessages and ignoreExtraMessages fields to expectedLogMessagesForClient section.

  • 2023-06-26: runOnRequirement.csfle should check for crypt_shared and/or
    mongocryptd.

  • 2023-06-13: Schema version 1.15.
    Add databaseName field to CommandFailedEvent and CommandSucceededEvent.

  • 2023-05-26: Schema version 1.14.
    Add topologyDescriptionChangedEvent.

  • 2023-05-17: Add runCursorCommand and createCommandCursor operations.
    Added commandCursor entity type which can be used with existing cursor operations.

  • 2023-05-12: Deprecate "sharded-replicaset" topology type. Note that server 3.6+
    requires replica sets for shards, which is also relevant to load balanced topologies.

  • 2023-04-13: Remove readConcern and writeConcern options from runCommand operation.

  • 2023-02-24: Fix typo in the description of the $$matchAsRoot matching operator.

  • 2022-10-17: Add description of a close operation for client entities.

  • 2022-10-14: Schema version 1.13.
    Add support for logging assertions via the observeLogMessages field for client entities, along with a new top-level field expectLogMessages containing expectedLogMessagesForClient objects. Add new special matching operators to enable command logging assertions, $$matchAsDocument and $$matchAsRoot.

  • 2022-10-14: Schema version 1.12.
    Add errorResponse to expectedError.

  • 2022-10-05: Remove spec front matter, add "Current Schema Version" field, and
    reformat changelog. Add comment to remind editors to note schema version bumps in changelog updates (where applicable).

  • 2022-09-02: Schema version 1.11.
    Add interruptInUseConnections field to poolClearedEvent

  • 2022-07-28: Schema version 1.10.
    Add support for thread entities (runOnThread, waitForThread), TopologyDescription entities (recordTopologyDescription, waitForPrimaryChange, assertTopologyType), testRunner event assertion operations (waitForEvent, assertEventCount), expected SDAM events, and the wait operation.

  • 2022-07-27: Retroactively note schema version bumps in the changelog and
    require doing so for future changes.

  • 2022-07-11: Update Future Work to reflect that support for ignoring extra
    observed events was added in schema version 1.7.

  • 2022-06-16: Require server 4.2+ for csfle: true.

  • 2022-05-10: Add reference to Client Side Encryption spec under
    ClientEncryption Operations.

  • 2022-04-27: Schema version 1.9.
    Added createOptions field to initialData, introduced a new timeoutMS field in collectionOrDatabaseOptions, and added an isTimeoutError field to expectedError. Also introduced the createEntities operation.

  • 2022-04-27: Schema version 1.8.
    Add runOnRequirement.csfle.

  • 2022-04-26: Add clientEncryption entity and $$placeholder syntax.

  • 2022-04-22: Revise useMultipleMongoses and "Initializing the Test Runner"
    for Atlas Serverless URIs using a load balancer fronting a single proxy.

  • 2022-03-01: Schema version 1.7.
    Add ignoreExtraEvents field to expectedEventsForClient.

  • 2022-02-24: Rename Versioned API to Stable API

  • 2021-08-30: Schema version 1.6.
    Add hasServerConnectionId field to commandStartedEvent, commandSucceededEvent and commandFailedEvent.

  • 2021-08-30: Test runners may create an internal MongoClient for each mongos.
    Better clarify how internal MongoClients may be used. Clarify that drivers creating an internal MongoClient for each mongos should use those clients for targetedFailPoint operations.

  • 2021-08-23: Allow runOnRequirement conditions to be evaluated in any order.

  • 2021-08-09: Updated all existing schema files to require at least one element
    in test.expectEvents if specified.

  • 2021-07-29: Note that events for sensitive commands will have redacted
    commands and replies when using observeSensitiveCommands, and how that affects conditionally sensitive commands such as hello and legacy hello.

  • 2021-07-01: Note that expectError.expectResult should use
    $$unsetOrMatches when the result is optional.

  • 2021-06-09: Schema version 1.5.
    Added an observeSensitiveCommands property to the client entity type.

  • 2021-05-17: Ensure old JSON schema files remain in place

  • 2021-04-19: Schema version 1.4.
    Introduce serverless runOnRequirement.

  • 2021-04-12: Schema version 1.3.
    Added a FindCursor entity type. Defined a set of cursor operations. Added an auth property to runOnRequirements and modified the topologies property to accept load-balanced. Added CMAP events to the possible event types for expectedEvent. Add assertNumberConnectionsCheckedOut operation. Add ignoreResultAndError operation option.

  • 2021-04-08: List additional error codes that may be ignored when calling
    killAllSessions and note that the command should not be called when connected to Atlas.

  • 2021-03-22: Split serverApi into its own section. Note types for loop
    operation arguments. Clarify how loop iterations are counted for storeIterationsAsEntity.

  • 2021-03-10: Clarify that observedAt field measures time in seconds for
    storeEventsAsEntities.

  • 2021-03-09: Clarify which components of a version string are relevant for
    comparisons.

  • 2021-03-04: Change storeEventsAsEntities from a map to an array of
    storeEventsAsEntity objects.

  • 2021-03-01: Schema version 1.2.
    Added storeEventsAsEntities option for client entities and loop operation, which is needed for Atlas Driver Testing.

  • 2020-12-23: Clarify how JSON schema is renamed for new minor versions.

  • 2020-11-06: Schema version 1.1.
    Added serverApi option for client entities, _yamlAnchors property to define values for later use in YAML tests, and serverParameters property for runOnRequirements.

Atlas Data Lake Tests

Introduction

The YAML and JSON files in this directory are platform-independent tests that drivers can use to assert compatibility with Atlas Data Lake. These tests utilize the Unified Test Format.

Several prose tests, which are not easily expressed in YAML, are also presented in this file. Those tests will need to be manually implemented by each driver.

Test Considerations

Running these integration tests will require a running mongohoused with data available in its test.driverdata collection. See the ADL directory in drivers-evergreen-tools and 10gen/mongohouse README for more information.

The test runner for Atlas Data Lake testing MUST NOT drop the collection and/or database under test. In contrast to most other tests, which insert their own data fixtures into an empty collection, the data for these tests is specified in the mongohoused configuration file.

Additionally, the test runner MUST NOT execute killAllSessions (see: Terminating Open Transactions) when connected to Atlas Data Lake.

Prose Tests

The following tests MUST be implemented to fully test compatibility with Atlas Data Lake.

1. Support for killCursors command

Test that the driver properly constructs and issues a killCursors command to Atlas Data Lake. For this test, configure an APM listener on a client and execute a query on the test.driverdata collection that will leave a cursor open on the server (e.g. specify batchSize=2 for a query that would match 3+ documents). Drivers MAY iterate the cursor if necessary to execute the initial find command but MUST NOT iterate further to avoid executing a getMore.

Observe the CommandSucceededEvent event for the find command and extract the cursor's ID and namespace from the response document's cursor.id and cursor.ns fields, respectively. Destroy the cursor object and observe a CommandStartedEvent and CommandSucceededEvent for the killCursors command. Assert that the cursor ID and target namespace in the outgoing command match the values from the find command's CommandSucceededEvent. When matching the namespace, note that the killCursors field will contain the collection name and the database may be inferred from either the $db field or accessed via the CommandStartedEvent directly. Finally, assert that the killCursors CommandSucceededEvent indicates that the expected cursor was killed in the cursorsKilled field.

Note: this test assumes that drivers only issue a killCursors command internally when destroying a cursor that may still exist on the server. If a driver constructs and issues killCursors commands in other ways (e.g. public API), this test MUST be adapted to test all such code paths.

2. Connect without authentication

Test that the driver can establish a connection with Atlas Data Lake without authentication. For these tests, create a MongoClient using a valid connection string without auth credentials and execute a ping command.

3. Connect with authentication

Test that the driver can establish a connection with Atlas Data Lake with authentication. For these tests, create a MongoClient using a valid connection string with SCRAM-SHA-1 and credentials from the drivers-evergreen-tools ADL configuration and execute a ping command. Repeat this test using SCRAM-SHA-256.

Changelog

  • 2024-03-08: Convert legacy ADL tests to unified format. Convert test README from reStructuredText to Markdown.

  • 2022-10-05: Add spec front matter

  • 2020-07-15: Link to CRUD test runner implementation and note that the collection under test must not be dropped before each test.

MongoDB Driver Performance Benchmarking

  • Status: Accepted
  • Minimum Server Version: N/A

Abstract

This document describes a standard benchmarking suite for MongoDB drivers.

Overview

Name and purpose

Driver performance will be measured by the MongoDB Driver Performance Benchmark (AKA "DriverBench"). It will provide both "horizontal" insights into how individual language driver performance evolves over time and "vertical" insights into relative performance of different drivers.

We do expect substantial performance differences between language families (e.g. static vs. dynamic or compiled vs. virtual-machine-based). However, we still expect "vertical" comparison within language families to expose outlier behavior that might be amenable to optimization.

Task Hierarchy

The benchmark consists of a number of micro-benchmark tasks arranged into groups of increasing complexity. This allows us to better isolate areas within drivers that are faster or slower.

  • BSON -- BSON encoding/decoding tasks, to explore BSON codec efficiency
  • Single-Doc -- single-document insertion and query tasks, to explore basic wire protocol efficiency
  • Multi-Doc -- multi-document insertion and query tasks, to explore batch-write and cursor chunking efficiency
  • Parallel -- multi-process/thread ETL tasks, to explore concurrent operation efficiency

Measurement

In addition to timing data, all micro-benchmark tasks will be measured in terms of "megabytes/second" (MB/s) of documents processed, with higher scores being better. (In this document, "megabyte" refers to the SI decimal unit, i.e. 1,000,000 bytes.) This makes cross-benchmark comparisons easier.

To avoid various types of measurement skew, tasks will be measured over numerous iterations. Each iteration will have a "scale" -- the number of similar operations performed -- that will vary by task. The final score for a task will be the median score of the iterations. Other quantiles will be recorded for diagnostic analysis.

Data sets

Data sets will vary by micro-benchmark. In some cases, the data set will be a synthetically generated document inserted repeatedly (with different _id fields) to construct an overall corpus of documents. In other cases, data sets will be synthetic line-delimited JSON files or mock binary files.

Composite scores

Micro-benchmark scores will be combined into a composite for each weight class ("BSONBench", "SingleBench", etc.) and for read and write operations ("ReadBench" and "WriteBench"). The read and write scores will be combined into an aggregate composite score ("DriverBench"). The compositing formula in the DriverBench uses simple averages with equal weighting.

Versioning

DriverBench will have vX.Y versioning. Minor updates and clarifications will increment "Y" and should have little impact on score comparison. Major changes, such as changing score weights, MongoDB version tested against, or hardware used, will increment "X" to indicate that older version scores are unlikely to be comparable.

Benchmark execution phases and measurement

All micro-benchmark tasks will be conducted via a number of iterations. Each iteration will be timed and will generally include a large number of individual driver operations.

We break up the measurement this way to better isolate the benchmark from external volatility. If we consider the problem of benchmarking an operation over many iterations, such as 100,000 document insertions, we want to avoid two extreme forms of measurement:

  • measuring a single insertion 100,000 times -- in this case, the timing code is likely to be a greater proportion of executed code, which could routinely evict the insertion code from CPU caches or mislead a JIT optimizer and throw off results
  • measuring 100,000 insertions one time -- in this case, the longer the timer runs, the higher the likelihood that an external event occurs that affects the time of the run

Therefore, we choose a middle ground:

  • measuring the same 1000 insertions over 100 iterations -- each timing run includes enough operations that insertion code dominates timing code; unusual system events are likely to affect only a fraction of the 100 timing measurements

With 100 timings of inserting the same 1000 documents, we build up a statistical distribution of the operation timing, allowing a more robust estimate of performance than a single measurement. (In practice, the number of iterations could exceed 100, but 100 is a reasonable minimum goal.)

Because a timing distribution is bounded by zero on one side, taking the mean would allow large positive outlier measurements to skew the result substantially. Therefore, for the benchmark score, we use the median timing measurement, which is robust in the face of outliers.

Each benchmark is structured into discrete setup/execute/teardown phases. Phases are as follows, with specific details given in a subsequent section:

  • setup -- (ONCE PER MICRO-BENCHMARK) something to do once before any benchmarking, e.g. construct a client object, load test data, insert data into a collection, etc.
  • before task -- (ONCE PER ITERATION) something to do before every task iteration, e.g. drop a collection, or reload test data (if the test run modifies it), etc.
  • do task -- (ONCE PER ITERATION) smallest amount of code necessary to execute the task; e.g. insert 1000 documents one by one into the database, or retrieve 1000 documents of test data from the database, etc.
  • after task -- (ONCE PER ITERATION) something to do after every task iteration (if necessary)
  • teardown -- (ONCE PER MICRO-BENCHMARK) something done once after all benchmarking is complete (if necessary); e.g. drop the test database

The wall-clock execution time of each "do task" phase will be recorded. We use wall clock time to model user experience and as a lowest-common denominator across languages and threading models. Iteration timing should be done with a high-resolution monotonic timer (or best language approximation).

Unless otherwise specified, the number of iterations to measure per micro-benchmark is variable:

  • iterations should loop for at least 1 minute cumulative execution time
  • iterations should stop after 100 iterations or 5 minutes cumulative execution time, whichever is shorter

This balances measurement stability with a timing cap to ensure all micro-benchmarks can complete in a reasonable time. Languages with JIT compilers may do warm up iterations for which timings are discarded.
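
One reading of this policy can be sketched in JavaScript as follows. runTask and now are hypothetical helpers: runTask executes one iteration's "do task" phase and now returns monotonic time in seconds:

// Collect per-iteration wall-clock timings: loop for at least 1 minute,
// and stop after 100 iterations or 5 minutes cumulative time, whichever comes first.
function measureTask(runTask, now) {
  const timings = [];
  let elapsed = 0;
  while (elapsed < 60 || (timings.length < 100 && elapsed < 300)) {
    const start = now();
    runTask();
    const duration = now() - start;
    timings.push(duration);
    elapsed += duration;
  }
  return timings;
}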

For each micro-benchmark, the 10th, 25th, 50th, 75th, 90th, 95th, 98th and 99th percentiles will be recorded using the following algorithm:

  • Given a 0-indexed array A of N iteration wall clock times
  • Sort the array into ascending order (i.e. shortest time first)
  • Let the index i for percentile p in the range [1,100] be defined as: i = int(N * p / 100) - 1

N.B. This is the Nearest Rank algorithm, chosen for its utter simplicity given that it needs to be implemented identically across multiple languages for every driver.

The 50th percentile (i.e. the median) will be used for score composition. Other percentiles will be stored for visualizations and analysis (e.g. a "candlestick" chart showing benchmark volatility over time).

Each task will have defined for it an associated size in megabytes (MB). The score for micro-benchmark composition will be the task size in MB divided by the median wall clock time.
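
A non-normative sketch of the Nearest Rank percentile selection and score calculation described above, where timings is the array of per-iteration wall-clock times in seconds and taskSizeMB is the task's dataset size in MB:

// Nearest Rank: for percentile p in [1,100] over N sorted timings, i = int(N * p / 100) - 1.
function percentile(timings, p) {
  const sorted = timings.slice().sort(function (a, b) { return a - b; });
  const i = Math.floor(sorted.length * p / 100) - 1;
  return sorted[Math.max(0, i)];   // clamp to 0 for very small N (a guard the formula leaves implicit)
}

// Micro-benchmark score: task size in MB divided by the median wall-clock time.
function taskScore(timings, taskSizeMB) {
  return taskSizeMB / percentile(timings, 50);
}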

Micro-benchmark definitions

Datasets are available in the data directory adjacent to this spec.

Note: The term "LDJSON" means "line-delimited JSON", which should be understood to mean a collection of UTF-8 encoded JSON documents (without embedded CR or LF characters), separated by a single LF character. (Some Internet definitions of line-delimited JSON use CRLF delimiters, but this benchmark uses only LF.)

BSON micro-benchmarks

Datasets are in the 'extended_bson' tarball.

BSON tests focus on BSON encoding and decoding; they are client-side only and do not involve any transmission of data to or from the benchmark server. When appropriate, data sets will be stored on disk as extended strict JSON. For drivers that don't support extended JSON, a BSON analogue will be provided as well.

BSON micro-benchmarks include:

  • Flat BSON Encoding and Flat BSON Decoding -- shallow documents with only common BSON field types
  • Deep BSON Encoding and Deep BSON Decoding -- deeply nested documents with only common BSON field types
  • Full BSON Encoding and Full BSON Decoding -- shallow documents with all possible BSON field types

Flat BSON Encoding

Summary: This benchmark tests driver performance encoding documents with top level key/value pairs involving the most commonly-used BSON types.

Dataset: The dataset, designated FLAT_BSON (disk file 'flat_bson.json'), will be synthetically generated and consist of an extended JSON document with a single _id key with an object ID value plus 24 top level key/value pairs of each of the following types: string, Int32, Int64, Double, Boolean. (121 total key/value pairs) Keys will be random ASCII strings of length 8. String data will be random ASCII strings of length 80.

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (7531 bytes) times 10,000 operations, which equals 75,310,000 bytes or 75.31 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Load the FLAT_BSON dataset into memory as a language-appropriate document type. For languages like C without a document type, the raw JSON string for each document should be used instead. |
| Before task | n/a |
| Do task | Encode the FLAT_BSON document to a BSON byte-string. Repeat this 10,000 times. |
| After task | n/a |
| Teardown | n/a |

Flat BSON Decoding

Summary: This benchmark tests driver performance decoding documents with top level key/value pairs involving the most commonly-used BSON types.

Dataset: The dataset, designated FLAT_BSON, will be synthetically generated and consist of an extended JSON document with a single _id key with an object ID value plus 24 top level key/value pairs of each of the following types: string, Int32, Int64, Double, Boolean. (121 total key/value pairs) Keys will be random ASCII strings of length 8. String data will be random ASCII strings of length 80.

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (7531 bytes) times 10,000 operations, which equals 75,310,000 bytes or 75.31 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Load the FLAT_BSON dataset into memory as a language-appropriate document type. For languages like C without a document type, the raw JSON string for each document should be used instead. Encode it to a BSON byte-string. |
| Before task | n/a |
| Do task | Decode the BSON byte-string to a language-appropriate document type. Repeat this 10,000 times. For languages like C without a document type, decode to extended JSON instead. |
| After task | n/a |
| Teardown | n/a |

Deep BSON Encoding

Summary: This benchmark tests driver performance encoding documents with deeply nested key/value pairs involving subdocuments, strings, integers, doubles and booleans.

Dataset: The dataset, designated DEEP_BSON (disk file 'deep_bson.json'), will be synthetically generated and consist of an extended JSON document representing a balanced binary tree of depth 6, with "left" and "right" keys at each level containing a sub-document until the final level, which will contain a random ASCII string of length 8 (126 total key/value pairs).

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (1964 bytes) times 10,000 operations, which equals 19,640,000 bytes or 19.64 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Load the DEEP_BSON dataset into memory as a language-appropriate document type. For languages like C without a document type, the raw JSON string for each document should be used instead. |
| Before task | n/a |
| Do task | Encode the DEEP_BSON document to a BSON byte-string. Repeat this 10,000 times. |
| After task | n/a |
| Teardown | n/a |

Deep BSON Decoding

Summary: This benchmark tests driver performance decoding documents with deeply nested key/value pairs involving subdocuments, strings, integers, doubles and booleans.

Dataset: The dataset, designated DEEP_BSON, will be synthetically generated and consist of an extended JSON document representing a balanced binary tree of depth 6, with "left" and "right" keys at each level containing a sub-document until the final level, which will contain a random ASCII string of length 8 (126 total key/value pairs).

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (1964 bytes) times 10,000 operations, which equals 19,640,000 bytes or 19.64 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Load the DEEP_BSON dataset into memory as a language-appropriate document type. For languages like C without a document type, the raw JSON string for each document should be used instead. Encode it to a BSON byte-string. |
| Before task | n/a |
| Do task | Decode the BSON byte-string to a language-appropriate document type. Repeat this 10,000 times. For languages like C without a document type, decode to extended JSON instead. |
| After task | n/a |
| Teardown | n/a |

Full BSON Encoding

Summary: This benchmark tests driver performance encoding documents with top level key/value pairs involving the full range of BSON types.

Dataset: The dataset, designated FULL_BSON (disk file 'full_bson.json'), will be synthetically generated and consist of an extended JSON document with a single _id key with an object ID value plus 6 each of the following types: string, double, Int64, Int32, boolean, minkey, maxkey, array, binary data, UTC datetime, regular expression, Javascript code, Javascript code with context, and timestamp. (91 total keys.) Keys (other than _id) will be random ASCII strings of length 8. String values will be random ASCII strings with length 80.

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (5734 bytes) times 10,000 operations, which equals 57,340,000 bytes or 57.34 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Load the FULL_BSON dataset into memory as a language-appropriate document type. For languages like C without a document type, the raw JSON string for each document should be used instead. |
| Before task | n/a |
| Do task | Encode the FULL_BSON document to a BSON byte-string. Repeat this 10,000 times. |
| After task | n/a |
| Teardown | n/a |

Full BSON Decoding

Summary: This benchmark tests driver performance decoding documents with top level key/value pairs involving the full range of BSON types.

Dataset: The dataset, designated FULL_BSON, will be synthetically generated and consist of an extended JSON document with a single _id key with an object ID value plus 6 each of the following types: string, double, Int64, Int32, boolean, minkey, maxkey, array, binary data, UTC datetime, regular expression, Javascript code, Javascript code with context, and timestamp. (91 total keys.) Keys (other than _id) will be random ASCII strings of length 8. String values will be random ASCII strings with length 80.

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (5734 bytes) times 10,000 operations, which equals 57,340,000 bytes or 57.34 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Load the FULL_BSON dataset into memory as a language-appropriate document type. For languages like C without a document type, the raw JSON string for each document should be used instead. Encode it to a BSON byte-string. |
| Before task | n/a |
| Do task | Decode the BSON byte-string to a language-appropriate document type. Repeat this 10,000 times. For languages like C without a document type, decode to extended JSON instead. |
| After task | n/a |
| Teardown | n/a |

Single-Doc Benchmarks

Datasets are in the 'single_and_multi_document' tarball.

Single-doc tests focus on single-document read and write operations. They are designed to give insights into the efficiency of the driver's implementation of the basic wire protocol.

The data will be stored as strict JSON with no extended types.

Single-doc micro-benchmarks include:

  • Run command
  • Find one by ID
  • Small doc insertOne
  • Large doc insertOne

Run command

Summary: This benchmark tests driver performance sending a command to the database and reading a response.

Dataset: n/a

Dataset size: While there is no external dataset, for score calculation purposes use 130,000 bytes (10,000 x the size of a BSON {hello:true} command).

N.B. We use {hello:true} rather than {hello:1} to ensure a consistent command size.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Construct whatever language-appropriate objects (Database, etc.) would be required to send a command. |
| Before task | n/a |
| Do task | Run the command {hello:true} 10,000 times, reading (and discarding) the result each time. |
| After task | n/a |
| Teardown | n/a |

Find one by ID

Summary: This benchmark tests driver performance sending an indexed query to the database and reading a single document in response.

Dataset: The dataset, designated TWEET (disk file 'tweet.json'), consists of a sample tweet stored as strict JSON.

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (1622 bytes) times 10,000 operations, which equals 16,220,000 bytes or 16.22 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Drop the 'perftest' database. Load the TWEET document into memory as a language-appropriate document type (or JSON string for C). Construct a Collection object for the 'corpus' collection to use for querying. Insert the document 10,000 times to the 'perftest' database in the 'corpus' collection using sequential _id values. (1 to 10,000) |
| Before task | n/a |
| Do task | For each of the 10,000 sequential _id numbers, issue a find command for that _id on the 'corpus' collection and retrieve the single-document result. |
| After task | n/a |
| Teardown | Drop the 'perftest' database. |

Small doc insertOne

Summary: This benchmark tests driver performance inserting a single, small document to the database.

Dataset: The dataset, designated SMALL_DOC (disk file 'small_doc.json'), consists of a JSON document with an encoded length of approximately 250 bytes.

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (275 bytes) times 10,000 operations, which equals 2,750,000 bytes or 2.75 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Drop the 'perftest' database. Load the SMALL_DOC dataset into memory as a language-appropriate document type (or JSON string for C). |
| Before task | Drop the 'corpus' collection. Create an empty 'corpus' collection with the 'create' command. Construct a Collection object for the 'corpus' collection to use for insertion. |
| Do task | Insert the document with the insertOne CRUD method. DO NOT manually add an _id field; leave it to the driver or database. Repeat this 10,000 times. |
| After task | n/a |
| Teardown | Drop the 'perftest' database. |

Large doc insertOne

Summary: This benchmark tests driver performance inserting a single, large document to the database.

Dataset: The dataset, designated LARGE_DOC (disk file 'large_doc.json'), consists of a JSON document with an encoded length of approximately 2,500,000 bytes.

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (2,731,089 bytes) times 10 operations, which equals 27,310,890 bytes or 27.31 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Drop the 'perftest' database. Load the LARGE_DOC dataset into memory as a language-appropriate document type (or JSON string for C). |
| Before task | Drop the 'corpus' collection. Create an empty 'corpus' collection with the 'create' command. Construct a Collection object for the 'corpus' collection to use for insertion. |
| Do task | Insert the document with the insertOne CRUD method. DO NOT manually add an _id field; leave it to the driver or database. Repeat this 10 times. |
| After task | n/a |
| Teardown | Drop the 'perftest' database. |

Multi-Doc Benchmarks

Datasets are in the 'single_and_multi_document' tarball.

Multi-doc benchmarks focus on multiple-document read and write operations. They are designed to give insight into the efficiency of the driver's implementation of bulk/batch operations such as bulk writes and cursor reads.

Multi-doc micro-benchmarks include:

  • Find many and empty the cursor
  • Small doc bulk insert
  • Large doc bulk insert
  • GridFS upload
  • GridFS download

Find many and empty the cursor

Summary: This benchmark tests driver performance retrieving multiple documents from a query.

Dataset: The dataset, designated TWEET, consists of a sample tweet stored as strict JSON.

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (1622 bytes) times 10,000 operations, which equals 16,220,000 bytes or 16.22 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Drop the 'perftest' database. Load the TWEET dataset into memory as a language-appropriate document type (or JSON string for C). Construct a Collection object for the 'corpus' collection to use for querying. Insert the document 10,000 times to the 'perftest' database in the 'corpus' collection. (Let the driver generate _ids). |
| Before task | n/a |
| Do task | Issue a find command on the 'corpus' collection with an empty filter expression. Retrieve (and discard) all documents from the cursor. |
| After task | n/a |
| Teardown | Drop the 'perftest' database. |

Small doc bulk insert

Summary: This benchmark tests driver performance inserting multiple, small documents to the database.

Dataset: The dataset, designated SMALL_DOC, consists of a JSON document with an encoded length of approximately 250 bytes.

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (275 bytes) times 10,000 operations, which equals 2,750,000 bytes or 2.75 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Drop the 'perftest' database. Load the SMALL_DOC dataset into memory as a language-appropriate document type (or JSON string for C). |
| Before task | Drop the 'corpus' collection. Create an empty 'corpus' collection with the 'create' command. Construct a Collection object for the 'corpus' collection to use for insertion. |
| Do task | Do an ordered 'insert_many' with 10,000 copies of the document. DO NOT manually add an _id field; leave it to the driver or database. |
| After task | n/a |
| Teardown | Drop the 'perftest' database. |

Large doc bulk insert

Summary: This benchmark tests driver performance inserting multiple, large documents to the database.

Dataset: The dataset, designated LARGE_DOC, consists of a JSON document with an encoded length of approximately 2,500,000 bytes.

Dataset size: For score purposes, the dataset size for a task is the size of the single-document source file (2,731,089 bytes) times 10 operations, which equals 27,310,890 bytes or 27.31 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Drop the 'perftest' database. Load the LARGE_DOC dataset into memory as a language-appropriate document type (or JSON string for C). |
| Before task | Drop the 'corpus' collection. Create an empty 'corpus' collection with the 'create' command. Construct a Collection object for the 'corpus' collection to use for insertion. |
| Do task | Do an ordered 'insert_many' with 10 copies of the document. DO NOT manually add an _id field; leave it to the driver or database. |
| After task | n/a |
| Teardown | Drop the 'perftest' database. |

GridFS upload

Summary: This benchmark tests driver performance uploading a GridFS file from memory.

Dataset: The dataset, designated GRIDFS_LARGE (disk file 'gridfs_large.bin'), consists of a single file containing about 50 MB of random data. We use a large file to ensure multiple database round-trips even if chunks are sent in batches.

Dataset size: For score purposes, the dataset size for a task is the size of the source file (52,428,800 bytes) times 1 operation or 52.43 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Drop the 'perftest' database. Load the GRIDFS_LARGE file as a string or other language-appropriate type for binary octet data. |
| Before task | Drop the default GridFS bucket. Insert a 1-byte file into the bucket. (This ensures the bucket collections and indices have been created.) Construct a GridFSBucket object to use for uploads. |
| Do task | Upload the GRIDFS_LARGE data as a GridFS file. Use whatever upload API is most natural for each language (e.g. open_upload_stream(), write the data to the stream and close the stream). |
| After task | n/a |
| Teardown | Drop the 'perftest' database. |

GridFS download

Summary: This benchmark tests driver performance downloading a GridFS file to memory.

Dataset: The dataset, designated GRIDFS_LARGE, consists of a single file containing about 50 MB of random data. We use a large file to ensure multiple database round-trips even if chunks are sent in batches.

Dataset size: For score purposes, the dataset size for a task is the size of the source file (52,428,800 bytes) times 1 operation or 52.43 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Drop the 'perftest' database. Upload the GRIDFS_LARGE file to the default gridFS bucket with the name "gridfstest". Record the _id of the uploaded file. |
| Before task | Construct a GridFSBucket object to use for downloads. |
| Do task | Download the "gridfstest" file by its _id. Use whatever download API is most natural for each language (e.g. open_download_stream(), read from the stream into a variable). Discard the downloaded data. |
| After task | n/a |
| Teardown | Drop the 'perftest' database. |

Parallel

Datasets are in the 'parallel' tarball.

Parallel tests simulate ETL operations from disk to database or vice-versa. They are designed to be implemented using a language's preferred approach to concurrency and thus stress how drivers handle concurrency. These intentionally involve overhead above and beyond the driver itself to simulate -- however loosely -- the sort of "real-world" pressures that drivers would be under during concurrent operation.

They are intended for directional indication of which languages perform best for this sort of pseudo-real-world activity, but are not intended to represent real-world performance claims.

Drivers teams are expected to treat these as a competitive "shoot-out" to surface optimal ETL patterns for each language (e.g. multi-thread, multi-process, asynchronous I/O, etc.).

Parallel micro-benchmarks include:

  • LDJSON multi-file import
  • LDJSON multi-file export
  • GridFS multi-file upload
  • GridFS multi-file download

LDJSON multi-file import

Summary: This benchmark tests driver performance importing documents from a set of LDJSON files.

Dataset: The dataset, designated LDJSON_MULTI (disk directory 'ldjson_multi'), consists of 100 LDJSON files, each containing 5,000 JSON documents. Each document should be about 1000 bytes.

Dataset size: For score purposes, the dataset size for a task is the total size of all source files: 565,000,000 bytes or 565 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Drop the 'perftest' database. |
| Before task | Drop the 'corpus' collection. Create an empty 'corpus' collection with the 'create' command. |
| Do task | Do an unordered insert of all 500,000 documents in the dataset into the 'corpus' collection as fast as possible. Data must be loaded from disk during this phase. Concurrency is encouraged. |
| After task | n/a |
| Teardown | Drop the 'perftest' database. |

LDJSON multi-file export

Summary: This benchmark tests driver performance exporting documents to a set of LDJSON files.

Dataset: The dataset, designated LDJSON_MULTI, consists of 100 LDJSON files, each containing 5,000 JSON documents. Each document should be about 1000 bytes.

Dataset size: For score purposes, the dataset size for a task is the total size of all source files: 565,000,000 bytes or 565 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Drop the 'perftest' database. Drop the 'corpus' collection. Do an unordered insert of all 500,000 documents in the dataset into the 'corpus' collection. |
| Before task | Construct whatever objects, threads, etc. are required for exporting the dataset. |
| Do task | Dump all 500,000 documents in the dataset into 100 LDJSON files of 5,000 documents each as fast as possible. Data must be completely written/flushed to disk during this phase. Concurrency is encouraged. The order and distribution of documents across files does not need to match the original LDJSON_MULTI files. |
| After task | n/a |
| Teardown | Drop the 'perftest' database. |

GridFS multi-file upload

Summary: This benchmark tests driver performance uploading files from disk to GridFS.

Dataset: The dataset, designated GRIDFS_MULTI (disk directory 'gridfs_multi'), consists of 50 files, each of 5MB. This file size corresponds roughly to the output of a (slightly dated) digital camera. Thus the task approximates uploading 50 "photos".

Dataset size: For score purposes, the dataset size for a task is the total size of all source files: 262,144,000 bytes or 262.144 MB.

| Phase | Description |
| ----------- | ----------- |
| Setup | Construct a MongoClient object. Drop the 'perftest' database. |
| Before task | Drop the default GridFS bucket in the 'perftest' database. Construct a GridFSBucket object for the default bucket in 'perftest' to use for uploads. Insert a 1-byte file into the bucket (to initialize indexes). |
| Do task | Upload all 50 files in the GRIDFS_MULTI dataset (reading each from disk). Concurrency is encouraged. |
| After task | n/a |
| Teardown | Drop the 'perftest' database. |

GridFS multi-file download

Summary: This benchmark tests driver performance downloading files from GridFS to disk.

Dataset: The dataset, designated GRIDFS_MULTI, consists of 50 files, each of 5MB. This file size corresponds roughly to the output of a (slightly dated) digital camera. Thus the task approximates downloading 50 "photos".

Dataset size: For score purposes, the dataset size for a task is the total size of all source files: 262,144,000 bytes or 262.144 MB.

| Phase       | Description |
| ----------- | ----------- |
| Setup       | Construct a MongoClient object. Drop the 'perftest' database. Construct a temporary directory for holding downloads. Drop the default GridFS bucket in the 'perftest' database. Upload the 50 file dataset to the default GridFS bucket in 'perftest'. |
| Before task | Delete all files in the temporary folder for downloads. Construct a GridFSBucket object to use for downloads from the default bucket in 'perftest'. |
| Do task     | Download all 50 files in the GRIDFS_MULTI dataset, saving each to a file in the temporary folder for downloads. Data must be completely written/flushed to disk during this phase. Concurrency is encouraged. |
| After task  | n/a |
| Teardown    | Drop the 'perftest' database. |

Composite score calculation

Every micro-benchmark has a score expressed in megabytes (1,000,000 bytes) per second: the micro-benchmark's dataset size, given in each section above, divided by the 50th percentile (median) of the measured wall-clock times.

From these micro-benchmarks, the following composite scores must be calculated:

| Composite Name | Compositing formula |
| -------------- | ------------------- |
| BSONBench      | Average of all BSON micro-benchmarks |
| SingleBench    | Average of all Single-doc micro-benchmarks, except "Run Command" |
| MultiBench     | Average of all Multi-doc micro-benchmarks |
| ParallelBench  | Average of all Parallel micro-benchmarks |
| ReadBench      | Average of "Find one", "Find many and empty cursor", "GridFS download", "LDJSON multi-file export", and "GridFS multi-file download" micro-benchmarks |
| WriteBench     | Average of "Small doc insertOne", "Large doc insertOne", "Small doc bulk insert", "Large doc bulk insert", "GridFS upload", "LDJSON multi-file import", and "GridFS multi-file upload" micro-benchmarks |
| DriverBench    | Average of ReadBench and WriteBench |

At least for this first DriverBench version, scores are combined with simple averages. In addition, the BSONBench scores do not factor into the overall DriverBench scores, as encoding and decoding are inherent in all other tasks.
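As a rough illustration of the arithmetic (not a prescribed implementation), the per-benchmark and composite scores could be computed as follows; the sample timings are made up:

# Hypothetical sketch of per-benchmark and composite score calculation.
from statistics import median


def benchmark_score(dataset_size_bytes, sampled_times_secs):
    """MB/sec: dataset size divided by the median of the sampled wall-clock times."""
    return (dataset_size_bytes / 1_000_000) / median(sampled_times_secs)


def composite(scores):
    """Composites are simple averages of their member micro-benchmark scores."""
    return sum(scores) / len(scores)


# e.g. LDJSON multi-file import: 565 MB dataset, median time 4.1 s -> ~137.8 MB/s
print(benchmark_score(565_000_000, [4.3, 4.1, 3.9]))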

Benchmark platform, configuration and environments

Benchmark Client

TBD: spec Amazon instance size; describe in general terms how language clients will be run independently; same AWS zone as server

All operations must be run with write concern "w:1".

Benchmark Server

TBD: spec Amazon instance size; describe configuration (e.g. no auth, journal, pre-alloc sizes?, WT with compression to minimize disk I/O impact?); same AWS zone as client

Score Server

TBD: spec system to hold scores over time

Datasets

TBD: generated datasets should be parked in S3 or a similar location for retrieval by URL

Changelog

  • 2024-01-22: Migrated from reStructuredText to Markdown.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2021-04-06: Update run command test to use 'hello' command

  • 2016-08-13:

    • Update corpus files to allow much greater compression of data
    • Updated LDJSON corpus size to reflect revisions to the test data
    • Published data files on GitHub and updated instructions on how to find datasets
    • RunCommand and query benchmark can create collection objects during setup rather than before task. (No change on actual benchmark.)
  • 2016-01-06:

    • Clarify that 'bulk insert' means 'insert_many'
    • Clarify that "create a collection" means using the 'create' command
    • Add omitted "upload files" step to setup for GridFS multi-file download; also clarify that steps should be using the default bucket in the 'perftest' database
  • 2015-12-23:

    • Rename benchmark names away from MMA/weight class names
    • Split BSON encoding and decoding micro-benchmarks
    • Rename BSON micro-benchmarks to better match dataset names
    • Move "Run Command" micro-benchmark out of composite
    • Reduced amount of data held in memory and sent to/from the server to decrease memory pressure and increase number of iterations in a reasonable time (e.g. file sizes and number of documents in certain datasets changed)
    • Create empty collections/indexes during the 'before' phase when appropriate
    • Updated data set sizes to account for changes in the source file structure/size

BSON Corpus

  • Status: Accepted
  • Minimum Server Version: N/A

Abstract

The official BSON specification does not include test data, so this pseudo-specification describes tests for BSON encoding and decoding. It also includes tests for MongoDB's "Extended JSON" specification (hereafter abbreviated as extjson).

Meta

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Motivation for Change

To ensure correct operation, we want drivers to implement identical tests for important features. BSON (and extjson) are critical for correct operation and data exchange, but historically had no common test corpus. This pseudo-specification provides such tests.

Goals

  • Provide machine-readable test data files for BSON and extjson encoding and decoding.
  • Cover all current and historical BSON types.
  • Define test data patterns for three cases:
    • conversion/roundtrip,
    • decode errors, and
    • parse errors.

Non-Goals

  • Replace or extend the official BSON spec at http://bsonspec.org.
  • Provide a formal specification for extjson.

Specification

The specification for BSON lives at http://bsonspec.org. The extjson format specification is here.

Test Plan

This test plan describes a general approach for BSON testing. Future BSON specifications (such as for new types like Decimal128) may specialize or alter the approach described below.

Description of the BSON Corpus

This BSON test data corpus consists of a JSON file for each BSON type, plus a top.json file for testing the overall, enclosing document and a multi-type.json file for testing a document with all BSON types. There is also a multi-type-deprecated.json that includes deprecated keys.

Top level keys

  • description: human-readable description of what is in the file
  • bson_type: hex string of the first byte of a BSON element (e.g. "0x01" for type "double"); this will be the synthetic value "0x00" for "whole document" tests like top.json.
  • test_key: (optional) name of a field in a single-BSON-type valid test case that contains the data type being tested.
  • valid (optional): an array of validity test cases (see below).
  • decodeErrors (optional): an array of decode error cases (see below).
  • parseErrors (optional): an array of type-specific parse error cases (see below).
  • deprecated (optional): this field will be present (and true) if the BSON type has been deprecated (i.e. Symbol, Undefined and DBPointer)

Validity test case keys

Validity test cases include 'canonical' forms of BSON and Extended JSON that are deemed equivalent and may provide additional cases or metadata for additional assertions. For each case, keys include:

  • description: human-readable test case label.
  • canonical_bson: an (uppercase) big-endian hex representation of a BSON byte string. Be sure to mangle the case as appropriate in any roundtrip tests.
  • canonical_extjson: a string containing a Canonical Extended JSON document. Because this is itself embedded as a string inside a JSON document, characters like quote and backslash are escaped.
  • relaxed_extjson: (optional) a string containing a Relaxed Extended JSON document. Because this is itself embedded as a string inside a JSON document, characters like quote and backslash are escaped.
  • degenerate_bson: (optional) an (uppercase) big-endian hex representation of a BSON byte string that is technically parseable, but not in compliance with the BSON spec. Be sure to mangle the case as appropriate in any roundtrip tests.
  • degenerate_extjson: (optional) a string containing an invalid form of Canonical Extended JSON that is still parseable according to type-specific rules. (For example, "1e100" instead of "1E+100".)
  • converted_bson: (optional) an (uppercase) big-endian hex representation of a BSON byte string. It may be present for deprecated types. It represents a possible conversion of the deprecated type to a non-deprecated type, e.g. symbol to string.
  • converted_extjson: (optional) a string containing a Canonical Extended JSON document. Because this is itself embedded as a string inside a JSON document, characters like quote and backslash are escaped. It may be present for deprecated types and is the Canonical Extended JSON representation of converted_bson.
  • lossy: (optional) boolean; present (and true) iff canonical_bson can't be represented exactly with extended JSON (e.g. NaN with a payload).

Decode error case keys

Decode error cases provide an invalid BSON document or field that should result in an error. For each case, keys include:

  • description: human-readable test case label.
  • bson: an (uppercase) big-endian hex representation of an invalid BSON string that should fail to decode correctly.

Parse error case keys

Parse error cases are type-specific and represent some input that can not be encoded to the bson_type under test. For each case, keys include:

  • description: human-readable test case label.
  • string: a text or numeric representation of an input that can't be parsed to a valid value of the given type.

Extended JSON encoding, escaping and ordering

Because the canonical_extjson and other Extended JSON fields are embedded in a JSON document, all their JSON metacharacters are escaped. Control characters and non-ASCII codepoints are represented with \uXXXX. Note that this means that the corpus JSON will appear to have double-escaped characters \\uXXXX. This is by design to ensure that the Extended JSON fields remain printable ASCII without embedded null characters to ensure maximum portability to different language JSON or extended JSON decoders.

There are legal differences in JSON representation that may complicate testing for particular codecs. The JSON in the corpus may not resemble the JSON generated by a codec, even though they represent the same data. Some known differences include:

  • JSON only requires certain characters to be escaped but allows any character to be escaped.
  • The JSON format is unordered and whitespace (outside of strings) is not significant.

Implementations using these tests MUST normalize JSON comparisons however necessary for effective comparison.
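One possible normalization, shown here as a Python sketch rather than a requirement, is to parse both strings with a plain JSON parser and compare the resulting values, which removes insignificant whitespace, key ordering, and character-escaping differences:

# Hypothetical normalization for comparing Extended JSON strings.
import json


def json_equal(a: str, b: str) -> bool:
    # Parsing removes insignificant whitespace and unifies character escapes.
    return json.loads(a) == json.loads(b)


assert json_equal('{"a": "\\u00e9"}', '{ "a" : "é" }')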

Language-specific differences

Some programming languages may not be able to represent or transmit all types accurately. In such cases, implementations SHOULD ignore (or modify) any tests which are not supported on that platform.

Testing validity

To test validity of a case in the valid array, we consider up to five possible representations:

  • Canonical BSON (denoted herein as "cB") -- fully valid, spec-compliant BSON
  • Degenerate BSON (denoted herein as "dB") -- invalid but still parseable BSON (bad array keys, regex options out of order)
  • Canonical Extended JSON (denoted herein as "cEJ") -- A string format based on the JSON standard that emphasizes type preservation at the expense of readability and interoperability.
  • Degenerate Extended JSON (denoted herein as "dEJ") -- An invalid form of Canonical Extended JSON that is still parseable. (For example, "1e100" instead of "1E+100".)
  • Relaxed Extended JSON (denoted herein as "rEJ") -- A string format based on the JSON standard that emphasizes readability and interoperability at the expense of type preservation.

Not all input types will exist for a given test case.

There are two forms of BSON/Extended JSON codecs: ones that have a language-native "intermediate" representation and ones that do not.

For a codec without an intermediate representation (i.e. one that translates directly from BSON to JSON or back), the following assertions MUST hold (function names are for clarity of illustration only):

  • for cB input:
    • bson_to_canonical_extended_json(cB) = cEJ
    • bson_to_relaxed_extended_json(cB) = rEJ (if rEJ exists)
  • for cEJ input:
    • json_to_bson(cEJ) = cB (unless lossy)
  • for dB input (if it exists):
    • bson_to_canonical_extended_json(dB) = cEJ
    • bson_to_relaxed_extended_json(dB) = rEJ (if rEJ exists)
  • for dEJ input (if it exists):
    • json_to_bson(dEJ) = cB (unless lossy)
  • for rEJ input (if it exists):
    • bson_to_relaxed_extended_json( json_to_bson(rEJ) ) = rEJ

For a codec that has a language-native representation, we want to test both conversion and round-tripping. For these codecs, the following assertions MUST hold (function names are for clarity of illustration only):

  • for cB input:
    • native_to_bson( bson_to_native(cB) ) = cB
    • native_to_canonical_extended_json( bson_to_native(cB) ) = cEJ
    • native_to_relaxed_extended_json( bson_to_native(cB) ) = rEJ (if rEJ exists)
  • for cEJ input:
    • native_to_canonical_extended_json( json_to_native(cEJ) ) = cEJ
    • native_to_bson( json_to_native(cEJ) ) = cB (unless lossy)
  • for dB input (if it exists):
    • native_to_bson( bson_to_native(dB) ) = cB
  • for dEJ input (if it exists):
    • native_to_canonical_extended_json( json_to_native(dEJ) ) = cEJ
    • native_to_bson( json_to_native(dEJ) ) = cB (unless lossy)
  • for rEJ input (if it exists):
    • native_to_relaxed_extended_json( json_to_native(rEJ) ) = rEJ

Implementations MAY test assertions in an implementation-specific manner.
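For example, the cB assertions for a codec with a native representation might look roughly like this in Python, using PyMongo's bson and json_util helpers (other drivers will have analogous APIs; jnorm is the JSON normalization discussed earlier):

# Hypothetical sketch of the cB assertions for a codec with a native representation.
import json

import bson
from bson.json_util import CANONICAL_JSON_OPTIONS, RELAXED_JSON_OPTIONS, dumps


def jnorm(s):
    # Normalize JSON strings before comparison (see "Extended JSON encoding, escaping and ordering").
    return json.loads(s)


def check_canonical_bson(case):
    cB = bytes.fromhex(case["canonical_bson"])
    native = bson.decode(cB)                     # bson_to_native(cB)

    # native_to_bson( bson_to_native(cB) ) = cB
    assert bson.encode(native) == cB

    # native_to_canonical_extended_json( bson_to_native(cB) ) = cEJ
    cEJ = dumps(native, json_options=CANONICAL_JSON_OPTIONS)
    assert jnorm(cEJ) == jnorm(case["canonical_extjson"])

    # native_to_relaxed_extended_json( bson_to_native(cB) ) = rEJ (if rEJ exists)
    if "relaxed_extjson" in case:
        rEJ = dumps(native, json_options=RELAXED_JSON_OPTIONS)
        assert jnorm(rEJ) == jnorm(case["relaxed_extjson"])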

Testing decode errors

The decodeErrors cases represent BSON documents that are sufficiently incorrect that they can't be parsed even with liberal interpretation of the BSON schema (e.g. reading arrays with invalid keys is possible, even though technically invalid, so they are not decodeErrors).

Drivers SHOULD test that each case results in a decoding error. Implementations MAY test assertions in an implementation-specific manner.

Testing parsing errors

The interpretation of parseErrors is type-specific. The structure of test cases within parseErrors is described in Parse error case keys.

Drivers SHOULD test that each case results in a parsing error (e.g. parsing Extended JSON, constructing a language type). Implementations MAY test assertions in an implementation-specific manner.

Top-level Document (type 0x00)

For type "0x00" (i.e. top-level documents), the string field contains input for an Extended JSON parser. Drivers MUST parse the Extended JSON input using an Extended JSON parser and verify that doing so yields an error. Drivers that parse Extended JSON into language types instead of directly to BSON MAY need to additionally convert the resulting language type(s) to BSON to expect an error.

Drivers SHOULD also parse the Extended JSON input using a regular JSON parser (not an Extended JSON one) and verify the input is parsed successfully. This serves to verify that the parseErrors test cases are testing Extended JSON-specific error conditions and that they do not have, for example, unintended syntax errors.

Note: due to the generic nature of these tests, they may also be used to test Extended JSON parsing errors for various BSON types appearing within a document.
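A sketch of both halves of this check in Python, using PyMongo's json_util as the Extended JSON parser (the helper name is illustrative):

# Hypothetical sketch of a type 0x00 parseErrors check.
import json

import pytest
from bson import json_util


def check_top_level_parse_error(case):
    # Parsing with an Extended JSON parser MUST yield an error ...
    with pytest.raises(Exception):
        json_util.loads(case["string"])

    # ... while a plain JSON parser accepts the input, proving the case
    # exercises an Extended JSON-specific condition rather than a syntax error.
    json.loads(case["string"])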

Binary (type 0x05)

For type "0x05" (i.e. binary), the rules for handling parseErrors are the same as those for Top-level Document (type 0x00).

Decimal128 (type 0x13)

For type "0x13" (i.e. Decimal128), the string field contains input for a Decimal128 parser that converts string input to a binary Decimal128 value (e.g. Decimal128 constructor). Drivers MUST assert that these strings cannot be successfully converted to a binary Decimal128 value and that parsing the string produces an error.

Deprecated types

The corpus files for deprecated types are provided for informational purposes. Implementations MAY ignore or modify them to match legacy treatment of deprecated types. The converted_bson and converted_extjson fields MAY be used to test conversion to a standard type or MAY be ignored.

Prose Tests

The following tests have not yet been automated, but MUST still be tested.

1. Prohibit null bytes in null-terminated strings when encoding BSON

The BSON spec uses null-terminated strings to represent document field names and regex components (i.e. pattern and flags/options). Drivers MUST assert that null bytes are prohibited in the following contexts when encoding BSON (i.e. creating raw BSON bytes or constructing BSON-specific type classes):

  • Field name within a root document
  • Field name within a sub-document
  • Pattern for a regular expression
  • Flags/options for a regular expression

Depending on how drivers implement BSON encoding, they MAY expect an error when constructing a type class (e.g. BSON Document or Regex class) or when encoding a language representation to BSON (e.g. converting a dictionary, which might allow null bytes in its keys, to raw BSON bytes).
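As a sketch of the field-name cases, PyMongo, for example, rejects the null byte while encoding to raw BSON; the regular-expression cases would be asserted analogously with the driver's Regex type class:

# Hypothetical sketch of the null-byte prohibition for field names (PyMongo).
import bson
import pytest
from bson.errors import InvalidDocument


def test_null_byte_in_field_names_is_rejected():
    with pytest.raises(InvalidDocument):
        bson.encode({"a\x00b": 1})              # field name within a root document
    with pytest.raises(InvalidDocument):
        bson.encode({"doc": {"a\x00b": 1}})     # field name within a sub-document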

Implementation Notes

A tool for visualizing BSON

The test directory includes a Perl script bsonview, which will decompose and highlight elements of a BSON document. It may be used like this:

echo "0900000010610005000000" | perl bsonview -x

Notes for certain types

Array

Arrays can have degenerate BSON if the array indexes are not set as "0", "1", etc.

Boolean

The only valid values are 0 and 1. Values other than 0 and 1 MUST be interpreted as errors rather than as "true" values.

Binary

The Base64 encoded text in the extended JSON representation MUST be padded.

Code

There are multiple ways to encode Unicode characters as a JSON document. Individual implementers may need to normalize provided and generated extended JSON before comparison.

Decimal

NaN with payload can't be represented in extended JSON, so such conversions are lossy.

Double

There is not yet a way to represent Inf, -Inf or NaN in extended JSON. Even if a $numberDouble is added, it is unlikely to support special values with payloads, so such doubles would be lossy when converted to extended JSON.

String representation of doubles is fairly unportable so it's hard to provide a single string that all platforms/languages will generate. Testers may need to normalize/modify the test cases.

String

There are multiple ways to encode Unicode characters as a JSON document. Individual implementers may need to normalize provided and generated extended JSON before comparison.

DBPointer

This type is deprecated. The provided converted form (converted_bson) represents them as DBRef documents, but such conversion is outside the scope of this spec.

Symbol

This type is deprecated. The provided converted form converts these to strings, but such conversion is outside the scope of this spec.

Undefined

This type is deprecated. The provided converted form converts these to Null, but such conversion is outside the scope of this spec.

Reference Implementation

The Java, C# and Perl drivers.

Design Rationale

Use of extjson

Testing conversion requires an "input" and an "output". With a BSON string as both input and output, we can only test that it roundtrips correctly; we can't test that the decoded value visible to the language is correct.

For example, a pathological encoder/decoder could invert Boolean true and false during decoding and encoding. The BSON would roundtrip but the program would see the wrong values.

Therefore, we need a separate, semantic description of the contents of a BSON string in a machine readable format. Fortunately, we already have extjson as a means of doing so. The extended JSON strings contained within the tests adhere to the Extended JSON Specification.

Repetition across cases

Some validity cases may result in duplicate assertions across cases, particularly if the degenerate_bson field is different in different cases, but the canonical_bson field is the same. This is by design so that each case stands alone and can be confirmed to be internally consistent via the assertions. This makes for easier and safer test case development.

Changelog

  • 2024-01-22: Migrated from reStructuredText to Markdown.

  • 2023-06-14: Add decimal128 Extended JSON parse tests for clamped zeros with
    very large exponents.

  • 2022-10-05: Remove spec front matter and reformat changelog.

  • 2021-09-09: Clarify error expectation rules for parseErrors.

  • 2021-09-02: Add spec and prose tests for prohibiting null bytes in
    null-terminated strings within document field names and regular expressions. Clarify type-specific rules for parseErrors.

  • 2017-05-26: Revised to be consistent with Extended JSON spec 2.0: valid case
    fields have changed, as have the test assertions.

  • 2017-01-23: Added multi-type.json to test encoding and decoding all BSON
    types within the same document. Amended all extended JSON strings to adhere to the Extended JSON Specification. Modified the "Use of extjson" section of this specification to note that canonical extended JSON is now used.

  • 2016-11-14: Removed "invalid flags" BSON Regexp case.

  • 2016-10-25: Added a "non-alphabetized flags" case to the BSON Regexp corpus
    file; decoders must be able to read non-alphabetized flags, but encoders must emit alphabetized flags. Added an "invalid flags" case to the BSON Regexp corpus file.

Connections Survive Primary Step Down Tests

These tests can be used to verify a driver's compliance with server discovery and monitoring requirements with respect to handling "not primary" and "node is shutting down" error responses from the server.


These tests apply only to replica set topologies.

Server Fail Point

See: [Server Fail Point](../../transactions/tests#server-fail-point) in the Transactions spec test suite.

Disabling Fail Point after Test Execution

After each test that configures a fail point, drivers should disable the `failCommand` fail point to avoid spurious failures in subsequent tests. The fail point may be disabled like so:

    db.runCommand({
        configureFailPoint: "failCommand",
        mode: "off"
    });

Consideration when using serverStatus

Drivers executing `serverStatus`_ for connection assertions MUST take their own
connection into account when making these calculations. Such drivers SHOULD
execute `serverStatus`_ using a separate client that is not under test.


Tests


Test setup

For each test, make sure the following steps have been completed before running the actual test:

- Create a ``MongoClient`` with ``retryWrites=false``
- Create a collection object from the ``MongoClient``, using ``step-down`` for the database and collection name.
- Drop the test collection, using ``writeConcern`` "majority".
- Execute the "create" command to recreate the collection, using ``writeConcern``
  "majority".

The driver should implement the following tests:

getMore Iteration

This test requires a replica set with server version 4.2 or higher.

Perform the following operations:

- Insert 5 documents into a collection with a majority write concern.
- Start a find operation on the collection with a batch size of 2, and
  retrieve the first batch of results.
- Send a ``{replSetFreeze: 0}`` command to any secondary and verify that the
  command succeeded. This command will unfreeze the secondary and ensure that
  it will be eligible to be elected immediately.
- Send a ``{replSetStepDown: 30, force: true}`` command to the current primary and verify that
  the command succeeded.
- Retrieve the next batch of results from the cursor obtained in the find
  operation, and verify that this operation succeeded.
- If the driver implements the `CMAP`_ specification, verify that no new `PoolClearedEvent`_ has been
  published. Otherwise verify that `connections.totalCreated`_ in `serverStatus`_ has not changed.
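A rough Python/PyMongo sketch of these steps follows; the host names, ports and replica set name are illustrative, and CMAP event counting is omitted:

# Hypothetical sketch of the getMore Iteration test (PyMongo); hosts are illustrative.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0", retryWrites=False)
coll = client["step-down"].get_collection(
    "step-down", write_concern=WriteConcern("majority"))

coll.insert_many([{"i": i} for i in range(5)])     # 5 documents, w:majority
cursor = coll.find({}, batch_size=2)
first_batch = [next(cursor), next(cursor)]         # retrieve the first batch

# Unfreeze a secondary so it is immediately electable, then step down the primary.
secondary = MongoClient("localhost:27018", directConnection=True)
secondary.admin.command("replSetFreeze", 0)
primary = MongoClient("localhost:27017", directConnection=True)
primary.admin.command("replSetStepDown", 30, force=True)

# The next getMore on the original cursor must still succeed, and no
# PoolClearedEvent should have been published (checked via the test harness).
rest = list(cursor)
assert len(first_batch) + len(rest) == 5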


Not Primary - Keep Connection Pool

This test requires a replica set with server version 4.2 or higher.

- Set the following fail point: ``{configureFailPoint: "failCommand", mode: {times: 1},
  data: {failCommands: ["insert"], errorCode: 10107}}``
- Execute an insert into the test collection of a ``{test: 1}``
  document.
- Verify that the insert failed with an operation failure with error code 10107.
- Execute an insert into the test collection of a ``{test: 1}``
  document and verify that it succeeds.
- If the driver implements the `CMAP`_ specification, verify that no new `PoolClearedEvent`_ has been
  published. Otherwise verify that `connections.totalCreated`_ in `serverStatus`_ has not changed.
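A rough Python sketch of these steps, where client and coll come from the test setup above and pool_cleared_events stands in for whatever CMAP event counting the harness already provides:

# Hypothetical sketch of the "Not Primary - Keep Connection Pool" steps (PyMongo).
import pytest
from pymongo.errors import OperationFailure

client.admin.command({
    "configureFailPoint": "failCommand",
    "mode": {"times": 1},
    "data": {"failCommands": ["insert"], "errorCode": 10107},
})

with pytest.raises(OperationFailure) as exc:
    coll.insert_one({"test": 1})                 # the first insert fails ...
assert exc.value.code == 10107

coll.insert_one({"test": 1})                     # ... the next one succeeds

# On 4.2+ the pool is not cleared: no new PoolClearedEvent should be observed.
assert pool_cleared_events == 0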



Not Primary - Reset Connection Pool

This test requires a replica set with server version 4.0.


- Set the following fail point: ``{configureFailPoint: "failCommand", mode: {times: 1},
  data: {failCommands: ["insert"], errorCode: 10107}}``
- Execute an insert into the test collection of a ``{test: 1}``
  document.
- Verify that the insert failed with an operation failure with error code 10107.
- If the driver implements the `CMAP`_ specification, verify that a `PoolClearedEvent`_
  has been published
- Execute an insert into the test collection of a ``{test: 1}``
  document and verify that it succeeds.
- If the driver does NOT implement the `CMAP`_ specification, use the `serverStatus`_
  command to verify `connections.totalCreated`_ has increased by 1.


Shutdown in progress - Reset Connection Pool

This test should be run on all server versions >= 4.0.

Perform the following operations on a client configured to NOT retry writes:

  • Set the following fail point: {configureFailPoint: "failCommand", mode: {times: 1}, data: {failCommands: ["insert"], errorCode: 91}}
  • Execute an insert into the test collection of a {test: 1} document.
  • Verify that the insert failed with an operation failure with error code 91.
  • If the driver implements the CMAP_ specification, verify that a PoolClearedEvent_ has been published
  • Execute an insert into the test collection of a {test: 1} document and verify that it succeeds.
  • If the driver does NOT implement the CMAP_ specification, use the serverStatus_ command to verify connections.totalCreated_ has increased by 1.

Interrupted at shutdown - Reset Connection Pool


This test should be run on all server versions >= 4.0.

Perform the following operations on a client configured to NOT retry writes:

- Set the following fail point: ``{configureFailPoint: "failCommand", mode: {times: 1},
  data: {failCommands: ["insert"], errorCode: 11600}}``
- Execute an insert into the test collection of a ``{test: 1}``
  document.
- Verify that the insert failed with an operation failure with error code 11600.
- If the driver implements the `CMAP`_ specification, verify that a `PoolClearedEvent`_
  has been published
- Execute an insert into the test collection of a ``{test: 1}``
  document and verify that it succeeds.
- If the driver does NOT implement the `CMAP`_ specification, use the `serverStatus`_
  command to verify `connections.totalCreated`_ has increased by 1.



Questions and Answers

Do we need to wait for re-election after the first test?

Since test setup requires creation of a collection, a primary must exist, so subsequent tests will block in server selection until a primary is available again.

Why do tests check for a successful insert operation in addition to checking that the pool was updated appropriately?


Ensuring that we can run a successful insert after the primary steps down and without needing to recreate the
``MongoClient`` serves to test the resiliency of drivers in the event of a failover/election. Even though checking for
a successful insert operation does not directly test functionality introduced in this specification, it is a
straightforward way to test driver resiliency against a live replica set undergoing an election. This testing
methodology is in contrast to the one adopted by the SDAM spec tests that rely entirely on mocking with no actual
server communication.


.. _CMAP: ../../connection-monitoring-and-pooling/connection-monitoring-and-pooling.md
.. _PoolClearedEvent: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.md#events
.. _serverStatus: https://www.mongodb.com/docs/manual/reference/command/serverStatus
.. _connections.totalCreated: https://www.mongodb.com/docs/manual/reference/command/serverStatus/#serverstatus.connections.totalCreated

FaaS Automated Testing

  • Status:
  • Minimum Server Version: 3.6

Abstract

This specification is about the ability for drivers to automate tests for "Functions as a Service" from continuous integration.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Specification

Terms

FaaS

"Function as a Service", such as AWS Lambda.

Implementing Automated FaaS Tests

AWS Lambda

This section describes the required setup of an AWS Lambda function and the steps needed to automate the deployment and execution of the function in Evergreen.

Local Execution

Prerequisites

For the initial local setup the following are required:

  • The docker daemon running on the local machine.
  • The AWS SAM CLI

AWS access MUST be configured, either through $HOME/.aws/credentials or with the following environment variables:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_REGION - Set to us-east-1
  • MONGODB_URI - The local MongoDB instance

Project Initialization

Create the new project via SAM and follow the prompts:

sam init

For the template, select "AWS Quick Start Template".

Which template source would you like to use?
  1 - AWS Quick Start Templates
  2 - Custom Template Location
Choice: 1

For the quick start template, select "Hello World Example".

Choose an AWS Quick Start application template
  1 - Hello World Example
  2 - Multi-step workflow
  3 - Serverless API
  4 - Scheduled task
  5 - Standalone function
  6 - Data processing
  7 - Infrastructure event management
  8 - Hello World Example With Powertools
  9 - Serverless Connector Hello World Example
  10 - Multi-step workflow with Connectors
  11 - Lambda EFS example
  12 - DynamoDB Example
  13 - Machine Learning
Template: 1

When prompted for language if the driver language is not Python, select "N".

Use the most popular runtime and package type? (Python and zip) [y/N]: n

Then select the runtime for your driver:

Which runtime would you like to use?
  1 - aot.dotnet7 (provided.al2)
  2 - dotnet6
  3 - dotnet5.0
  4 - dotnetcore3.1
  5 - go1.x
  6 - go (provided.al2)
  7 - graalvm.java11 (provided.al2)
  8 - graalvm.java17 (provided.al2)
  9 - java11
  10 - java8.al2
  11 - java8
  12 - nodejs18.x
  13 - nodejs16.x
  14 - nodejs14.x
  15 - nodejs12.x
  16 - python3.9
  17 - python3.8
  18 - python3.7
  19 - ruby2.7
  20 - rust (provided.al2)
Runtime: 12

Select Zip package type:

What package type would you like to use?
  1 - Zip
  2 - Image
Package type: 1

Then follow the remaining prompts for the driver language to finish setup. Drivers MAY choose to also enable X-Ray tracing and CloudWatch Application Insights during these next steps.

NOTE - If the driver wants to skip prompts in the setup it can provide defaults to the sam init command. Example:

sam init --name my-hello-world-app \
    --app-template "hello-world" \
    --runtime go1.x \
    --package-type Zip

Function Setup

In the newly created project directory modify the template.yaml file:

Change default timeout to 30 seconds:

Globals:
  Function:
    Timeout: 30

Add a root parameter for the MongoDB connection string:

Parameters:
  MongoDbUri:
    Type: String
    Description: The MongoDB connection string.

Replace all instances of HelloWorld in the template.yaml with MongoDB, then modify the root Resources config to add the MONGODB_URI environment variable reference and change the CodeUri to mongodb/. Then rename the hello-world directory to mongodb. Do not change the Handler and Runtime properties.

Resources:
  MongoDBFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: mongodb/
      Environment:
        Variables:
          MONGODB_URI: !Ref MongoDbUri

If the generated template contains Resources.Events.CatchAll.Properties.Path, change it to /mongodb; if it also contains Resources.Handler, modify that to mongodb as well.

Resources:
  Events:
    CatchAll:
      Properties:
        Path: /mongodb
  Handler: mongodb

If possible, install the current driver under test into the lambda environment, to avoid having to release the driver in order to test features or catch regressions. See docs on https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-zip.html for how to create a .zip file deployment with dependencies.

Start the local MongoDB instance. If using Docker Desktop on macOS, set MONGODB_URI=mongodb://host.docker.internal:27017 in order for the function to be able to access the host port.

Run the function locally from the same directory where the template.yaml resides:

sam build
sam local invoke --parameter-overrides "MongoDbUri=${MONGODB_URI}"

NOTE "127.0.0.1" in the MONGODB_URI MUST be replaced with "host.docker.internal" to test a local MongoDB deployment. If "host.docker.internal" does not work (can occur on M1 machines), drivers MAY choose to use a [bridged docker container](https://docs.docker.com/network/bridge/) to test locally.

Implementing the Function

Drivers MUST setup the function as would be done in their appropriate language. In the function implementation the driver MUST:

  • Create a MongoClient that points to MONGODB_URI.
  • Add listeners for the following monitoring events: ServerHeartbeatStarted, ServerHeartbeatFailed, CommandSucceeded, CommandFailed, ConnectionCreated, ConnectionClosed.
  • Drivers MUST perform a single insert and then a single delete of the inserted document to force write operations on the primary node.
  • Drivers MUST record the durations and counts of the heartbeats, the durations of the commands, as well as keep track of the number of open connections, and report this information in the function response as JSON.
  • Drivers MUST assert no ServerHeartbeat events contain the awaited=True flag to confirm that the streaming protocol is disabled (DRIVERS-2578).
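A condensed Python/PyMongo sketch of such a handler is shown below; the listener classes and the response field names are illustrative, and connection counting via a ConnectionPoolListener is elided for brevity:

# Hypothetical sketch of the Lambda handler (PyMongo); response fields are illustrative.
import json
import os

from pymongo import MongoClient, monitoring


class CommandTimer(monitoring.CommandListener):
    def __init__(self):
        self.durations, self.failures = [], 0

    def started(self, event):
        pass

    def succeeded(self, event):
        self.durations.append(event.duration_micros)

    def failed(self, event):
        self.failures += 1


class HeartbeatTimer(monitoring.ServerHeartbeatListener):
    def __init__(self):
        self.durations, self.failures = [], 0

    def started(self, event):
        pass

    def succeeded(self, event):
        # The streaming protocol must be disabled in FaaS (DRIVERS-2578).
        assert not event.awaited
        self.durations.append(event.duration)

    def failed(self, event):
        self.failures += 1


commands, heartbeats = CommandTimer(), HeartbeatTimer()
client = MongoClient(os.environ["MONGODB_URI"], event_listeners=[commands, heartbeats])


def lambda_handler(event, context):
    coll = client.faas.test
    result = coll.insert_one({"hello": "world"})     # one write to the primary ...
    coll.delete_one({"_id": result.inserted_id})     # ... then delete the inserted document
    return {
        "statusCode": 200,
        "body": json.dumps({
            "commandDurationsMicros": commands.durations,
            "commandFailures": commands.failures,
            "heartbeatDurations": heartbeats.durations,
            "heartbeatFailures": heartbeats.failures,
        }),
    }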

Running in Continuous Integration

Running in CI requires Evergreen to be setup to assume the appropriate role in AWS and then execute the script in drivers-evergreen-tools with the required environment variables. An explanation of the required environment is as follows:

| Name                          | Description |
| ----------------------------- | ----------- |
| LAMBDA_AWS_ROLE_ARN           | The role ARN to assume |
| TEST_LAMBDA_DIRECTORY         | The lambda function directory |
| DRIVERS_TOOLS                 | Location of drivers-evergreen-tools |
| DRIVERS_ATLAS_PUBLIC_API_KEY  | The Atlas public API key |
| DRIVERS_ATLAS_PRIVATE_API_KEY | The Atlas private API key |
| DRIVERS_ATLAS_LAMBDA_USER     | The Atlas cluster user name |
| DRIVERS_ATLAS_LAMBDA_PASSWORD | The Atlas cluster user password |
| DRIVERS_ATLAS_GROUP_ID        | The driver's Atlas group id |
| LAMBDA_STACK_NAME             | The driver's Lambda stack name |
| AWS_REGION                    | The function AWS region |
| AWS_ACCESS_KEY_ID             | Assume role automatically sets this |
| AWS_SECRET_ACCESS_KEY         | Assume role automatically sets this |
| AWS_SESSION_TOKEN             | Assume role automatically sets this |

Supported Evergreen variants that have the AWS SAM CLI installed:

  • ubuntu2204
  • ubuntu1804
  • ubuntu1804-workstation
  • ubuntu2204-arm64
  • ubuntu2004-arm64
  • ubuntu1804-arm64
  • rhel90
  • rhel80
  • rhel84
  • rhel90-selinux
  • rhel80-selinux
  • rhel90-arm64
  • rhel82-arm64

This is an example task group in the Evergreen config that accomplishes this, using subprocess.exec to execute scripts that call the drivers-evergreen-tools functions inside of it for setup, teardown, and execution:

tasks:
  - name: "test-aws-lambda-deployed"
    commands:
      - func: "install dependencies"
      - command: ec2.assume_role
        params:
          role_arn: ${LAMBDA_AWS_ROLE_ARN}
          duration_seconds: 3600
      - command: subprocess.exec
        params:
          working_dir: src
          binary: bash
          add_expansions_to_env: true
          args:
            - ${DRIVERS_TOOLS}/.evergreen/aws_lambda/run-deployed-lambda-aws-tests.sh
          env:
            TEST_LAMBDA_DIRECTORY: ${PROJECT_DIRECTORY}/test/lambda
            AWS_REGION: us-east-1
task_groups:
  - name: test_aws_lambda_task_group
    setup_group:
      - func: fetch source
      - command: subprocess.exec
        params:
          working_dir: src
          binary: bash
          add_expansions_to_env: true
          args:
            - ${DRIVERS_TOOLS}/.evergreen/atlas/setup-atlas-cluster.sh
      - command: expansions.update
        params:
          file: src/atlas-expansion.yml
    teardown_group:
      - command: subprocess.exec
        params:
          working_dir: src
          binary: bash
          add_expansions_to_env: true
          args:
            - ${DRIVERS_TOOLS}/.evergreen/atlas/teardown-atlas-cluster.sh
    setup_group_can_fail_task: true
    setup_group_timeout_secs: 1800
    tasks:
      - test-aws-lambda-deployed

Drivers MUST run the function on a single variant in Evergreen, in order to not potentially hit the Atlas API rate limit. The variant itself MUST have the SAM CLI installed.

Description of the behaviour of run-deployed-lambda-aws-tests.sh:

  • Builds the Lambda function locally
  • Deploys the Lambda function to AWS.
  • Queries for the Lambda function ARN.
  • Invokes the Lambda function cold and frozen.
  • Initiates a primary failover of the cluster in Atlas.
  • Calls the frozen lambda function again.
  • Deletes the Lambda function.

Changelog

  • 2024-02-27: Migrated from reStructuredText to Markdown.
  • 2023-08-21: Drivers MUST assert that the streaming protocol is disabled in the Lambda function.
  • 2023-08-17: Fixed URI typo, added host note, increase assume role duration.
  • 2023-06-22: Updated evergreen configuration to use task groups.
  • 2023-04-14: Added list of supported variants, added additional template config.

Atlas Serverless Tests

  • Status: Accepted
  • Minimum Server Version: N/A



Introduction

This file describes a subset of existing tests that drivers MUST use to assert compatibility with Atlas Serverless.

Serverless Configuration

These tests MUST be run against a live Atlas Serverless instance. A new instance MUST be created each time the test suite is run, and that instance MUST be used for all of the tests required by this specification. Once the tests are finished, the instance MUST be deleted regardless of the outcome of the tests. The serverless directory in the drivers-evergreen-tools repository_ contains scripts for creating and deleting Atlas Serverless instances, and the config.yml contains an example Evergreen configuration that uses them to run the tests. It can take up to 15 minutes or so to provision a new Atlas Serverless instance, so it is recommended to create one manually via the scripts in drivers-evergreen-tools that can be reused for the initial implementation of the tests before moving to Evergreen patches.

.. _serverless directory in the drivers-evergreen-tools repository: https://github.com/mongodb-labs/drivers-evergreen-tools/tree/master/.evergreen/serverless

Drivers MUST use the create-instance.sh_ script in drivers-evergreen-tools to create a new Atlas Serverless instance for Evergreen tasks. The script writes a URI for the newly created instance to a YAML expansions file:

.. _create-instance.sh: https://github.com/mongodb-labs/drivers-evergreen-tools/blob/master/.evergreen/serverless/create-instance.sh

  • SERVERLESS_URI An SRV connection string to a load balancer fronting a single Atlas Serverless proxy.

The expansions.update Evergreen command can be used to read this file and copy the URI into a SERVERLESS_URI environment variable.

Test Runner Configuration

All tests MUST be run with wire protocol compression and authentication enabled.

In contrast to the Load Balancer testing_, which has separate URIs for load balancers fronting a single or multiple servers, there is only a single URI for Atlas Serverless testing (i.e. SERVERLESS_URI).

.. _Load Balancer testing: ../load-balancers/tests/README.md

The TXT record for SERVERLESS_URI already specifies loadBalanced=true so drivers need not add that.

SERVERLESS_URI does not include authentication credentials. Drivers MUST specify a username and password (see: Required Variables_) when connecting to SERVERLESS_URI.

Drivers MUST use SERVERLESS_URI to configure both internal clients and clients under test (as described in the Unified Test Format spec_).

.. _Unified Test Format spec: ../unified-test-format/unified-test-format.md

Required Variables


Managing the Atlas Serverless instances and connecting to them requires a few
variables to be specified. The variables marked "private" are confidential and
MUST be specified as private Evergreen variables or used only in private
Evergreen projects. If using a public Evergreen project, xtrace MUST be disabled
when using these variables to help prevent accidental leaks.

- ``${SERVERLESS_DRIVERS_GROUP}``: Contains the ID of the Atlas group dedicated
  to drivers testing of Atlas Serverless. The backing multi-tenant MongoDB (MTM)
  MUST have the ``SINGLE_TARGET_SERVERLESS_DEPLOYMENT`` feature flag enabled
  (`CLOUDP-117288`_).

- ``${SERVERLESS_API_PUBLIC_KEY}``: The public key required to use the Atlas API
  for managing Atlas Serverless instances.

- ``${SERVERLESS_API_PRIVATE_KEY}``: (private) The private key required to use
  the Atlas API for managing Atlas Serverless instances.

- ``${SERVERLESS_ATLAS_USER}``: (private) The SCRAM username used to
  authenticate to any Atlas Serverless instance created in the drivers testing
  Atlas group.

- ``${SERVERLESS_ATLAS_PASSWORD}``: (private) The SCRAM password used to
  authenticate to any Atlas Serverless instance created in the drivers testing
  Atlas group.

.. _CLOUDP-117288: https://jira.mongodb.org/browse/CLOUDP-117288

Existing Spec Tests

Unified spec tests from all specifications MUST be run against Atlas Serverless.
Since schema version 1.4, unified tests can specify Atlas Serverless
compatibility in their ``runOnRequirements``.

Any prose and legacy spec tests defined in the following specifications MUST be
included in a driver's Atlas Serverless testing suite:

- CRUD
- Load Balancer
- Retryable Reads
- Retryable Writes
- Sessions
- Transactions (excluding convenient API)

  - Note: the killAllSessions command is not supported on Serverless, so the
    transactions tests may hang if an individual test leaves a transaction open
    when it finishes (`CLOUDP-84298`_).
- Versioned/Stable API
- Client Side Encryption

  - Drivers MUST test with a version of the ``crypt_shared`` shared library that
    matches the MongoDB Server version running in Serverless.
    See `Using crypt_shared <https://github.com/mongodb/specifications/tree/e761591616849d9b507287811e77f7a359fb9587/source/client-side-encryption/tests#using-crypt-shared>`_.

.. _CLOUDP-84298: https://jira.mongodb.org/browse/CLOUDP-84298

Note that the legacy JSON/YAML test formats for these specifications were
updated to include a new ``runOnRequirement`` specifically for Atlas Serverless
testing. To ensure these requirements are enforced properly, the runner MUST be
informed that it is running against an Atlas Serverless instance through some
indicator (e.g. an environment variable).

Note that since Atlas Serverless testing uses a load balancer fronting a single
Atlas Serverless proxy, the connection string cannot be modified for
``useMultipleMongoses``. Drivers SHOULD ignore ``useMultipleMongoses`` for
purposes of Atlas Serverless testing.

Other Tests

Any other existing tests for cursor behavior that a driver may have implemented
independently of any spec requirements SHOULD also be included in the driver's
Atlas Serverless testing suite. Note that change streams are not supported by the
proxy, so their spec and prose tests MUST be skipped.


Changelog

  • 2022-10-05: Add spec front matter
  • 2022-04-22: Testing uses a load balancer fronting a single proxy.
  • 2021-08-25: Update tests for load balanced serverless instances.