Extended JSON

  • Status: Accepted
  • Minimum Server Version: N/A

Abstract

MongoDB Extended JSON is a string format for representing BSON documents. This specification defines the canonical format for representing each BSON type in the Extended JSON format. Thus, a tool that implements Extended JSON will be able to parse the output of any tool that emits Canonical Extended JSON. It also defines a Relaxed Extended JSON format that improves readability at the expense of type information preservation.

META

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Naming

Acceptable naming deviations should fall within the basic style of the language. For example, CanonicalExtendedJSON would be a name in Java, where camel-case method names are used, but in Ruby canonical_extended_json would be acceptable.

Terms

Type wrapper object - a JSON value consisting of an object with one or more $-prefixed keys that collectively encode a BSON type and its corresponding value using only JSON value primitives.

Extended JSON - A general term for one of many string formats based on the JSON standard that describes how to represent BSON documents in JSON using standard JSON types and/or type wrapper objects. This specification gives a formal definition to variations of such a format.

Relaxed Extended JSON - A string format based on the JSON standard that describes BSON documents. Relaxed Extended JSON emphasizes readability and interoperability at the expense of type preservation.

Canonical Extended JSON - A string format based on the JSON standard that describes BSON documents. Canonical Extended JSON emphasizes type preservation at the expense of readability and interoperability.

Legacy Extended JSON - A string format based on the JSON standard that describes a BSON document. The Legacy Extended JSON format does not describe a specific, standardized format, and many tools, drivers, and libraries implement Extended JSON in conflicting ways.

Specification

Extended JSON Format

The Extended JSON grammar extends the JSON grammar as defined in section 2 of the JSON specification by augmenting the possible JSON values as defined in Section 3. This specification defines two formats for Extended JSON:

  • Canonical Extended JSON
  • Relaxed Extended JSON

An Extended JSON value MUST conform to one of these two formats as described in the table below.

Notes on grammar

  • Key order:
    • Keys within Canonical Extended JSON type wrapper objects SHOULD be emitted in the order described.
    • Keys within Relaxed Extended JSON type wrapper objects are unordered.
  • Terms in italics represent types defined elsewhere in the table or in the JSON specification.
  • JSON numbers (as defined in Section 6 of the JSON specification) include both integer and floating point types. For the purpose of this document, we define the following subtypes:
    • Type integer means a JSON number without frac or exp components; this is expressed in the JSON spec grammar as [minus] int.
    • Type non-integer means a JSON number that is not an integer; it must include either a frac or exp component or both.
    • Type pos-integer means a non-negative JSON number without frac or exp components; this is expressed in the JSON spec grammar as int.
  • A hex string is a JSON string that contains only hexadecimal digits [0-9a-f]. It SHOULD be emitted lower-case, but MUST be read in a case-insensitive fashion.
  • < Angle brackets > detail the contents of a value, including type information.
  • [Square brackets] specify a type constraint that restricts the specification to a particular range or set of values.

Conversion table

BSON 1.1 Type or ConventionCanonical Extended JSON FormatRelaxed Extended JSON Format
ObjectId{"$oid": < ObjectId bytes as 24-character, big-endian hex string > }< Same as Canonical Extended JSON >
Symbol{"$symbol": string }< Same as Canonical Extended JSON >
Stringstring< Same as Canonical Extended JSON >
Int32{"$numberInt": < 32-bit signed integer as a string > }integer
Int64{"$numberLong": < 64-bit signed integer as a string > }integer
Double [finite]{"$numberDouble": < 64-bit signed floating point as a decimal string > }non-integer
Double [non-finite]{"$numberDouble": < One of the strings: "Infinity", "-Infinity", or "NaN". > }< Same as Canonical Extended JSON >
Decimal128{"$numberDecimal": < decimal as a string1 > }< Same as Canonical Extended JSON >
Binary{"$binary": {"base64": < base64-encoded (with padding as =) payload as a string > , "subType": < BSON binary type as a one- or two-character hex string > }}< Same as Canonical Extended JSON >
Code{"$code": string }< Same as Canonical Extended JSON >
CodeWScope{"$code": string , "$scope": Document }< Same as Canonical Extended JSON >
Documentobject (with Extended JSON extensions)< Same as Canonical Extended JSON >
Timestamp{"$timestamp": {"t": pos-integer , "i": pos-integer }}< Same as Canonical Extended JSON >
Regular Expression{"$regularExpression": {pattern: string , "options": < BSON regular expression options as a string or ""2 > }}< Same as Canonical Extended JSON >
DBPointer{"$dbPointer": {"$ref": < namespace3 as a string > , "$id": ObjectId }}< Same as Canonical Extended JSON >
Datetime [year from 1970 to 9999 inclusive]{"$date": {"$numberLong": < 64-bit signed integer giving millisecs relative to the epoch, as a string > }}{"$date": ISO-8601 Internet Date/Time Format as described in RFC-33394 with maximum time precision of milliseconds5 as a string }
Datetime [year before 1970 or after 9999]{"$date": {"$numberLong": < 64-bit signed integer giving millisecs relative to the epoch, as a string > }}< Same as Canonical Extended JSON >
DBRef6

Note: this is not technically a BSON type, but it is a common convention.
{"$ref": < collection name as a string > , "$id": < Extended JSON for the id > }

If the generator supports DBRefs with a database component, and the database component is nonempty:

{"$ref": < collection name as a string > , "$id": < Extended JSON for the id > , "$db": < database name as a string > }

DBRefs may also have other fields, which MUST appear after $id and $db (if supported).
< Same as Canonical Extended JSON >
MinKey{"$minKey": 1}< Same as Canonical Extended JSON >
MaxKey{"$maxKey": 1}< Same as Canonical Extended JSON >
Undefined{"$undefined": true}< Same as Canonical Extended JSON >
Arrayarray< Same as Canonical Extended JSON >
Booleantrue or false< Same as Canonical Extended JSON >
Nullnull< Same as Canonical Extended JSON >

Representation of Non-finite Numeric Values

Following the Extended JSON format for the Decimal128 type, non-finite numeric values are encoded as follows:

ValueString
Positive InfinityInfinity
Negative Infinity-Infinity
NaN (all variants)NaN

For example, a BSON floating-point number with a value of negative infinity would be encoded as Extended JSON as follows:

{"$numberDouble": "-Infinity"}

Parsers

An Extended JSON parser (hereafter just "parser") is a tool that transforms an Extended JSON string into another representation, such as BSON or a language-native data structure.

By default, a parser MUST accept values in either Canonical Extended JSON format or Relaxed Extended JSON format as described in this specification. A parser MAY allow users to restrict parsing to only Canonical Extended JSON format or only Relaxed Extended JSON format.

A parser MAY also accept strings that adhere to other formats, such as Legacy Extended JSON formats emitted by old versions of mongoexport or other tools, but only if explicitly configured to do so.

A parser that accepts Legacy Extended JSON MUST be configurable such that a JSON text of a MongoDB query filter containing the regex query operator can be parsed, e.g.:

{ "$regex": {
    "$regularExpression" : { "pattern": "foo*", "options": "" }
    },
    "$options" : "ix"
}

or:

{ "$regex": {
    "$regularExpression" : { "pattern": "foo*", "options": "" }
    }
}

A parser that accepts Legacy Extended JSON MUST be configurable such that a JSON text of a MongoDB query filter containing the type query operator can be parsed, e.g.:

{ "zipCode" : { $type : 2 } }

or:

{ "zipCode" : { $type : "string" } }

A parser SHOULD support at least 200 levels of nesting in an Extended JSON document but MAY set other limits on strings it can accept as defined in section 9 of the JSON specification.

When parsing a JSON object other than the top-level object, the presence of a $-prefixed key indicates the object could be a type wrapper object as described in the Extended JSON Conversion table. In such a case, the parser MUST follow these rules, unless configured to allow Legacy Extended JSON, in which case it SHOULD follow these rules:

  • Parsers MUST NOT consider key order as having significance. For example, the document {"$code": "function(){}", "$scope": {}} must be considered identical to {"$scope": {}, "$code": "function(){}"}.

  • If the parsed object contains any of the special keys for a type in the Conversion table (e.g. "$binary", "$timestamp") then it must contain exactly the keys of the type wrapper. Any missing or extra keys constitute an error.

    DBRef is the lone exception to this rule, as it is only a common convention and not a proper type. An object that resembles a DBRef but fails to fully comply with its structure (e.g. has $ref but missing $id) MUST be left as-is and MUST NOT constitute an error.

  • If the keys of the parsed object exactly match the keys of a type wrapper in the Conversion table, and the values of the parsed object have the correct type for the type wrapper as described in the Conversion table, then the parser MUST interpret the parsed object as a type wrapper object of the corresponding type.

  • If the keys of the parsed object exactly match the keys of a type wrapper in the Conversion table, but any of the values are of an incorrect type, then the parser MUST report an error.

  • If the $-prefixed key does not match a known type wrapper in the Conversion table, the parser MUST NOT raise an error and MUST leave the value as-is. See Restrictions and limitations for additional information.

Special rules for parsing JSON numbers

The Relaxed Extended JSON format uses JSON numbers for several different BSON types. In order to allow parsers to use language-native JSON decoders (which may not distinguish numeric type when parsing), the following rules apply to parsing JSON numbers:

  • If the number is a non-integer, parsers SHOULD interpret it as BSON Double.
  • If the number is an integer, parsers SHOULD interpret it as being of the smallest BSON integer type that can represent the number exactly. If a parser is unable to represent the number exactly as an integer (e.g. a large 64-bit number on a 32-bit platform), it MUST interpret it as a BSON Double even if this results in a loss of precision. The parser MUST NOT interpret it as a BSON String containing a decimal representation of the number.

Special rules for parsing $uuid fields

As per the UUID specification, Binary subtype 3 or 4 are used to represent UUIDs in BSON. Consequently, UUIDs are handled as per the convention described for the Binary type in the Conversion table, e.g. the following document written with the MongoDB Python Driver:

{"Binary": uuid.UUID("c8edabc3-f738-4ca3-b68d-ab92a91478a3")}

is transformed into the following (newlines and spaces added for readability):

{"Binary": {
    "$binary": {
        "base64": "yO2rw/c4TKO2jauSqRR4ow==",
        "subType": "04"}
    }
}

[!NOTE] The above described type conversion assumes that UUID representation is set to STANDARD. See the UUID specification for more information about UUID representations.

While this transformation preserves BSON subtype information (since UUIDs can be represented as BSON subtype 3 or 4), base64-encoding is not the standard way of representing UUIDs and using it makes comparing these values against textual representations coming from platform libraries difficult. Consequently, we also allow UUIDs to be represented in extended JSON as:

{"$uuid": <canonical textual representation of a UUID>}

The rules for generating the canonical string representation of a UUID are defined in RFC 4122 Section 3. Use of this format result in a more readable extended JSON representation of the UUID from the previous example:

{"Binary": {
    "$uuid": "c8edabc3-f738-4ca3-b68d-ab92a91478a3"
    }
}

Parsers MUST interpret the $uuid key as BSON Binary subtype 4. Parsers MUST accept textual representations of UUIDs that omit the URN prefix (usually urn:uuid:). Parsers MAY also accept textual representations of UUIDs that omit the hyphens between hex character groups (e.g. c8edabc3f7384ca3b68dab92a91478a3).

Generators

An Extended JSON generator (hereafter just "generator") produces strings in an Extended JSON format.

A generator MUST allow users to produce strings in either the Canonical Extended JSON format or the Relaxed Extended JSON format. If generators provide a default format, the default SHOULD be the Relaxed Extended JSON format.

A generator MAY be capable of exporting strings that adhere to other formats, such as Legacy Extended JSON formats.

A generator SHOULD support at least 100 levels of nesting in a BSON document.

Transforming BSON

Given a BSON document (e.g. a buffer of bytes meeting the requirements of the BSON specification), a generator MUST use the corresponding JSON values or Extended JSON type wrapper objects for the BSON type given in the Extended JSON Conversion table for the desired format. When transforming a BSON document into Extended JSON text, a generator SHOULD emit the JSON keys and values in the same order as given in the BSON document.

Transforming Language-Native data

Given language-native data (e.g. type primitives, container types, classes, etc.), if there is a semantically-equivalent BSON type for a given language-native type, a generator MUST use the corresponding JSON values or Extended JSON type wrapper objects for the BSON type given in the Extended JSON Conversion table for the desired format. For example, a Python datetime object must be represented the same as a BSON datetime type. A generator SHOULD error if a language-native type has no semantically-equivalent BSON type.

Format and Method Names

The following format names SHOULD be used for selecting formats for generator output:

  • canonicalExtendedJSON (references Canonical Extended JSON as described in this specification)
  • relaxedExtendedJSON (references Relaxed Extended JSON as described in this specification)
  • legacyExtendedJSON (if supported: references Legacy Extended JSON, with implementation-defined behavior)

Generators MAY use these format names as part of function/method names or MAY use them as arguments or constants, as needed.

If a generator provides a generic to_json or to_extended_json method, it MUST default to producing Relaxed Extended JSON or MUST be deprecated in favor of a spec-compliant method.

Restrictions and limitations

Extended JSON is designed primarily for testing and human inspection of BSON documents. It is not designed to reliably round-trip BSON documents. One fundamental limitation is that JSON objects are inherently unordered and BSON objects are ordered.

Further, Extended JSON uses $-prefixed keys in type wrappers and has no provision for escaping a leading $ used elsewhere in a document. This means that the Extended JSON representation of a document with $-prefixed keys could be indistinguishable from another document with a type wrapper with the same keys.

Extended JSON formats SHOULD NOT be used in contexts where $-prefixed keys could exist in BSON documents (with the exception of the DBRef convention, which is accounted for in this spec).

Test Plan

Drivers, tools, and libraries can test their compliance to this specification by running the tests in version 2.0 and above of the BSON Corpus Test Suite.

Examples

Canonical Extended JSON Example

Consider the following document, written with the MongoDB Python Driver:

{
    "_id": bson.ObjectId("57e193d7a9cc81b4027498b5"),
    "String": "string",
    "Int32": 42,
    "Int64": bson.Int64(42),
    "Double": 42.42,
    "Decimal": bson.Decimal128("1234.5"),
    "Binary": uuid.UUID("c8edabc3-f738-4ca3-b68d-ab92a91478a3"),
    "BinaryUserDefined": bson.Binary(b'123', 80),
    "Code": bson.Code("function() {}"),
    "CodeWithScope": bson.Code("function() {}", scope={}),
    "Subdocument": {"foo": "bar"},
    "Array": [1, 2, 3, 4, 5],
    "Timestamp": bson.Timestamp(42, 1),
    "RegularExpression": bson.Regex("foo*", "xi"),
    "DatetimeEpoch": datetime.datetime.utcfromtimestamp(0),
    "DatetimePositive": datetime.datetime.max,
    "DatetimeNegative": datetime.datetime.min,
    "True": True,
    "False": False,
    "DBRef": bson.DBRef(
        "collection", bson.ObjectId("57e193d7a9cc81b4027498b1"), database="database"),
    "DBRefNoDB": bson.DBRef(
        "collection", bson.ObjectId("57fd71e96e32ab4225b723fb")),
    "Minkey": bson.MinKey(),
    "Maxkey": bson.MaxKey(),
    "Null": None
}

The above document is transformed into the following (newlines and spaces added for readability):

{
    "_id": {
        "$oid": "57e193d7a9cc81b4027498b5"
    },
    "String": "string",
    "Int32": {
        "$numberInt": "42"
    },
    "Int64": {
        "$numberLong": "42"
    },
    "Double": {
        "$numberDouble": "42.42"
    },
    "Decimal": {
        "$numberDecimal": "1234.5"
    },
    "Binary": {
        "$binary": {
            "base64": "yO2rw/c4TKO2jauSqRR4ow==",
            "subType": "04"
        }
    },
    "BinaryUserDefined": {
        "$binary": {
            "base64": "MTIz",
            "subType": "80"
        }
    },
    "Code": {
        "$code": "function() {}"
    },
    "CodeWithScope": {
        "$code": "function() {}",
        "$scope": {}
    },
    "Subdocument": {
        "foo": "bar"
    },
    "Array": [
        {"$numberInt": "1"},
        {"$numberInt": "2"},
        {"$numberInt": "3"},
        {"$numberInt": "4"},
        {"$numberInt": "5"}
    ],
    "Timestamp": {
        "$timestamp": { "t": 42, "i": 1 }
    },
    "RegularExpression": {
        "$regularExpression": {
            "pattern": "foo*",
            "options": "ix"
        }
    },
    "DatetimeEpoch": {
        "$date": {
            "$numberLong": "0"
        }
    },
    "DatetimePositive": {
        "$date": {
            "$numberLong": "253402300799999"
        }
    },
    "DatetimeNegative": {
        "$date": {
            "$numberLong": "-62135596800000"
        }
    },
    "True": true,
    "False": false,
    "DBRef": {
        "$ref": "collection",
        "$id": {
            "$oid": "57e193d7a9cc81b4027498b1"
        },
        "$db": "database"
    },
    "DBRefNoDB": {
        "$ref": "collection",
        "$id": {
            "$oid": "57fd71e96e32ab4225b723fb"
        }
    },
    "Minkey": {
        "$minKey": 1
    },
    "Maxkey": {
        "$maxKey": 1
    },
    "Null": null
}

Relaxed Extended JSON Example

In Relaxed Extended JSON, the example document is transformed similarly to Canonical Extended JSON, with the exception of the following keys (newlines and spaces added for readability):

{
    ...
    "Int32": 42,
    "Int64": 42,
    "Double": 42.42,
    ...
    "DatetimeEpoch": {
        "$date": "1970-01-01T00:00:00.000Z"
    },
    ...
}

Motivation for Change

There existed many Extended JSON parser and generator implementations prior to this specification that used conflicting formats, since there was no agreement on the precise format of Extended JSON. This resulted in problems where the output of some generators could not be consumed by some parsers.

MongoDB drivers needed a single, standard Extended JSON format for testing that covers all BSON types. However, there were BSON types that had no defined Extended JSON representation. This spec primarily addresses that need, but provides for slightly broader use as well.

Design Rationale

Of Relaxed and Canonical Formats

There are various use cases for expressing BSON documents in a text rather that binary format. They broadly fall into two categories:

  • Type preserving: for things like testing, where one has to describe the expected form of a BSON document, it's helpful to be able to precisely specify expected types. In particular, numeric types need to differentiate between Int32, Int64 and Double forms.
  • JSON-like: for things like a web API, where one is sending a document (or a projection of a document) that only uses ordinary JSON type primitives, it's desirable to represent numbers in the native JSON format. This output is also the most human readable and is useful for debugging and documentation.

The two formats in this specification address these two categories of use cases.

Of Parsers and Generators

Parsers need to accept any valid Extended JSON string that a generator can produce. Parsers and generators are permitted to accept and output strings in other formats as well for backwards compatibility.

Acceptable nesting depth has implications for resource usage so unlimited nesting is not permitted.

Generators support at least 100 levels of nesting in a BSON document being transformed to Extended JSON. This aligns with MongoDB's own limitation of 100 levels of nesting.

Parsers support at least 200 levels of nesting in Extended JSON text, since the Extended JSON language can double the level of apparent nesting of a BSON document by wrapping certain types in their own documents.

Of Canonical Type Wrapper Formats

Prior to this specification, BSON types fell into three categories with respect to Legacy Extended JSON:

  1. A single, portable representation for the type already existed.
  2. Multiple representations for the type existed among various Extended JSON generators, and those representations were in conflict with each other or with current portability goals.
  3. No Legacy Extended JSON representation existed.

If a BSON type fell into category (1), this specification just declares that form to be canonical, since all drivers, tools, and libraries already know how to parse or output this form. There are two exceptions:

RegularExpression

The form {"$regex: <string>, $options: <string>"} has until this specification been canonical. The change to {"$regularExpression": {pattern: <string>, "options": <string>"}} is motivated by a conflict between the previous canonical form and the $regex MongoDB query operator. The form specified here disambiguates between the two, such that a parser can accept any MongoDB query filter, even one containing the $regex operator.

Binary

The form {"$binary": "AQIDBAU=", "$type": "80"} has until this specification been canonical. The change to {"$binary": {"base64": "AQIDBAU=", "subType": "80"}} is motivated by a conflict between the previous canonical form and the $type MongoDB query operator. The form specified here disambiguates between the two, such that a parser can accept any MongoDB query filter, even one containing the $type operator.

Reconciled type wrappers

If a BSON type fell into category (2), this specification selects a new common representation for the type to be canonical. Conflicting formats were gathered by surveying a number of Extended JSON generators, including the MongoDB Java Driver (version 3.3.0), the MongoDB Python Driver (version 3.4.0.dev0), the MongoDB Extended JSON module on NPM (version 1.7.1), and each minor version of mongoexport from 2.4.14 through 3.3.12. When possible, we set the "strict" option on the JSON codec. The following BSON types had conflicting Extended JSON representations:

Binary

Some implementations write the Extended JSON form of a Binary object with a strict two-hexadecimal digit subtype (e.g. they output a leading 0 for subtypes < 16). However, the NPM mongodb-extended-json module and Java driver use a single hexadecimal digit to represent subtypes less than 16. This specification makes both one- and two-digit representations acceptable.

Code

Mongoexport 2.4 does not quote the Code value when writing out the extended JSON form of a BSON Code object. All other implementations do so. This spec canonicalises the form where the Javascript code is quoted, since the latter form adheres to the JSON specification and the former does not. As an additional note, the NPM mongodb-extended-json module uses the form {"code": "<javascript code>"}, omitting the dollar sign ($) from the key. This specification does not accommodate the eccentricity of a single library.

CodeWithScope

In addition to the same variants as BSON Code types, there are other variations when turning CodeWithScope objects into Extended JSON. Mongoexport 2.4 and 2.6 omit the scope portion of CodeWithScope if it is empty, making the output indistinguishable from a Code type. All other implementations include the empty scope. This specification therefore canonicalises the form where the scope is always included. The presence of $scope is what differentiates Code from CodeWithScope.

Datetime

Mongoexport 2.4 and the Java driver always transform a Datetime object into an Extended JSON string of the form {"$date": <ms since epoch>}. This form has the problem of a potential loss of precision or range on the Datetimes that can be represented. Mongoexport 2.6 transforms Datetime objects into an extended JSON string of the form {"$date": <ISO-8601 date string in local time>}for dates starting at or after the Unix epoch (UTC). Dates prior to the epoch take the form {"$date": {"$numberLong": "<ms since epoch>"}}. Starting in version 3.0, mongoexport always turns Datetime objects into strings of the form {"$date": <ISO-8601 date string in UTC>}. The NPM mongodb-extended-json module does the same. The Python driver can also transform Datetime objects into strings like {"$date": {"$numberLong": "<ms since epoch>"}}. This specification canonicalises this form, since this form is the most portable. In Relaxed Extended JSON format, this specification provides for ISO-8601 representation for better readability, but limits it to a portable subset, from the epoch to the end of the largest year that can be represented with four digits. This should encompass most typical use of dates in applications.

DBPointer

Mongoexport 2.4 and 2.6 use the form{"$ref": <namespace>, "$id": <hex string>}. All other implementations studied include the canonical ObjectId form:{"$ref": <namespace>, "$id": {"$oid": <hex string>}}. Neither of these forms are distinguishable from that of DBRef, so this specification creates a new format: {"$dbPointer": {"$ref": <namespace>, "$id": {"$oid": <hex string>}}}.

Newly-added type wrappers .

If a BSON type fell into category (3), above, this specification creates a type wrapper format for the type. The following new Extended JSON type wrappers are introduced by this spec:

  • $dbPointer- See above.

  • $numberInt - This is used to preserve the "int32" BSON type in Canonical Extended JSON. Without using $numberInt, this type will be indistinguishable from a double in certain languages where the distinction does not exist, such as Javascript.

  • $numberDouble - This is used to preserve the doubletype in Canonical Extended JSON, as some JSON generators might omit a trailing ".0" for integral types.

    It also supports representing non-finite values like NaN or Infinity which are prohibited in the JSON specification for numbers.

  • $symbol - The use of the $symbol key preserves the symbol type in Canonical Extended JSON, distinguishing it from JSON strings.

Reference Implementation

Canonical Extended JSON format reference implementation needs to be updated

PyMongo implements the Canonical Extended JSON format, which must be chosen by selecting the right option on the JSONOptions object::

from bson.json_util import dumps, DatetimeRepresentation, CANONICAL_JSON_OPTIONS    
dumps(document, json_options=CANONICAL_JSON_OPTIONS)  

Relaxed Extended JSON format reference implementation is TBD

Implementation Notes

JSON File Format

Some applications like mongoexport may wish to write multiple Extended JSON documents to a single file. One way to do this is to list each JSON document one-per-line. When doing this, it is important to ensure that special characters like newlines are encoded properly (e.g.n).

Duplicate Keys

The BSON specification does not prohibit duplicate key names within the same BSON document, but provides no semantics for the interpretation of duplicate keys. The JSON specification says that names within an object should be unique, and many JSON libraries are incapable of handling this scenario. This specification is silent on the matter, so as not to conflict with a future change by either specification.

Future Work

This specification will need to be amended if future BSON types are added to the BSON specification.

Q&A

Q. Why was version 2 of the spec necessary?

A. After Version 1 was released, several stakeholders raised concerns that not providing an option to output BSON numbers as ordinary JSON numbers limited the utility of Extended JSON for common historical uses. We decided to provide a second format option and more clearly distinguish the use cases (and limitations) inherent in each format.

Q. My BSON parser doesn't distinguish every BSON type. Does my Extended JSON generator need to distinguish these types?

A. No. Some BSON parsers do not emit a unique type for each BSON type, making round-tripping BSON through such libraries impossible without changing the document. For example, a DBPointer will be parsed into a DBRef by PyMongo. In such cases, a generator must emit the Extended JSON form for whatever type the BSON parser emitted. It does not need to preserve type information when that information has been lost by the BSON parser.

Q. How can implementations which require backwards compatibility with Legacy Extended JSON, in which BSON regular expressions were represented with $regex, handle parsing of extended JSON test representing a MongoDB query filter containing the $regex operator?

A. An implementation can handle this in a number of ways: - Introduce an enumeration that determines the behavior of the parser. If the value is LEGACY, it will parse $regexand not treat $regularExpression specially, and if the value is CANONICAL, it will parse $regularExpression and not treat $regex specially. - Support both legacy and canonical forms in the parser without requiring the application to specify one or the other. Making that work for the $regex query operator use case will require that the rules set forth in the 1.0.0 version of this specification are followed for $regex; specifically, that a document with a $regex key whose value is a JSON object should be parsed as a normal document and not reported as an error.

Q. How can implementations which require backwards compatibility with Legacy Extended JSON, in which BSON binary values were represented like {"$binary": "AQIDBAU=", "$type": "80"}, handle parsing of extended JSON test representing a MongoDB query filter containing the $typeoperator?

A. An implementation can handle this in a number of ways:

Introduce an enumeration that determines the behavior of the parser. If the value is LEGACY, it will parse the new binary form and not treat the legacy one specially, and if the value is CANONICAL, it will parse the new form and not treat the legacy form specially. - Support both legacy and canonical forms in the parser without requiring the application to specify one or the other. Making that work for the $type query operator use case will require that the rules set forth in the 1.0.0 version of this specification are followed for $type; specifically, that a document with a $type key whose value is an integral type, or a document with a $type key but without a $binary key, should be parsed as a normal document and not reported as an error.

Q. Sometimes I see the term "extjson" used in other specifications. Is "extjson" related to this specification?

A. Yes, "extjson" is short for "Extended JSON".

Changelog

  • 2024-05-29: Migrated from reStructuredText to Markdown.
  • 2022-10-05: Remove spec front matter and reformat changelog.
  • 2021-05-26:
    • Remove any mention of extra dollar-prefixed keys being prohibited in a DBRef. MongoDB 5.0 and compatible drivers no longer enforce such restrictions.
    • Objects that resemble a DBRef without fully complying to its structure should be left as-is during parsing. - 2020-09-01: Note that $-prefixed keys not matching a known type MUST be left as-is when parsing. This is patch-level change as this behavior was already required in the BSON corpus tests ("Document with keys that start with $").
  • 2020-09-08:
    • Added support for parsing $uuid fields as BSON Binary subtype 4.
    • Changed the example to using the MongoDB Python Driver. It previously used the MongoDB Java Driver. The new example excludes the following BSON types that are unsupported in Python - Symbol,SpecialFloat,DBPointer, and Undefined. Transformations for these types are now only documented in the Conversion table
  • 2017-07-20:
    • Bumped specification to version 2.0.

    • Added "Relaxed" format.

    • Changed BSON timestamp type wrapper back to {"t": *int*, "i": *int*} for backwards compatibility. (The change in v1 to unsigned 64-bit string was premature optimization)

    • Changed BSON regular expression type wrapper to {"$regularExpression": {pattern: *string*, "options": *string*"}}.

    • Changed BSON binary type wrapper to {"$binary": {"base64": <base64-encoded payload as a *string*>, "subType": <BSON binary type as a one- or two-character *hex string*>}}

    • Added "Restrictions and limitations" section.

    • Clarified parser and generator rules.

  • 2017-02-01: Initial specification version 1.0.
1

This MUST conform to the Decimal128 specification

2

BSON Regular Expression options MUST be in alphabetical order.

5

Fractional seconds SHOULD have exactly 3 decimal places if the fractional part is non-zero. Otherwise, fractional seconds SHOULD be omitted if zero.