BSON Corpus
- Status: Accepted
- Minimum Server Version: N/A
Abstract
The official BSON specification does not include test data, so this pseudo-specification describes tests for BSON
encoding and decoding. It also includes tests for MongoDB's "Extended JSON" specification (hereafter abbreviated as
extjson).
Meta
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Motivation for Change
To ensure correct operation, we want drivers to implement identical tests for important features. BSON (and extjson)
are critical for correct operation and data exchange, but historically had no common test corpus. This
pseudo-specification provides such tests.
Goals
- Provide machine-readable test data files for BSON and extjson encoding and decoding.
- Cover all current and historical BSON types.
- Define test data patterns for three cases:
  - conversion/roundtrip,
  - decode errors, and
  - parse errors.
Non-Goals
- Replace or extend the official BSON spec at https://bsonspec.org.
- Provide a formal specification for extjson.
Specification
The specification for BSON lives at https://bsonspec.org. The extjson format is specified in the Extended JSON
Specification.
Test Plan
This test plan describes a general approach for BSON testing. Future BSON specifications (such as for new types like Decimal128) may specialize or alter the approach described below.
Description of the BSON Corpus
This BSON test data corpus consists of a JSON file for each BSON type, plus a top.json file for testing the overall,
enclosing document and a multi-type.json file for testing a document with all BSON types. There is also a
multi-type-deprecated.json that includes deprecated keys.
Top level keys
- description: human-readable description of what is in the file.
- bson_type: hex string of the first byte of a BSON element (e.g. "0x01" for type "double"); this will be the synthetic value "0x00" for "whole document" tests like top.json.
- test_key: (optional) name of a field in a single-BSON-type valid test case that contains the data type being tested.
- valid (optional): an array of validity test cases (see below).
- decodeErrors (optional): an array of decode error cases (see below).
- parseErrors (optional): an array of type-specific parse error cases (see below).
- deprecated (optional): this field will be present (and true) if the BSON type has been deprecated (i.e. Symbol, Undefined and DBPointer).
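As a non-normative illustration, the sketch below reads the top-level keys of a corpus-shaped document. The sample data is hand-written rather than copied from the corpus, the helper name read_bson_type is hypothetical, and treating the keys not marked "(optional)" as required is an assumption.

```python
import json

# Keys the list above does not mark "(optional)"; treating them as required is an assumption.
REQUIRED_KEYS = {"description", "bson_type"}


def read_bson_type(corpus: dict) -> int:
    """Check the required top-level keys and return bson_type as an integer."""
    missing = REQUIRED_KEYS - set(corpus)
    if missing:
        raise ValueError(f"missing top-level keys: {sorted(missing)}")
    # bson_type is a hex string such as "0x01"; int(..., 16) accepts the 0x prefix.
    return int(corpus["bson_type"], 16)


# Hypothetical sample shaped like a corpus file, not copied from the real corpus.
sample = json.loads("""
{
  "description": "Double type",
  "bson_type": "0x01",
  "test_key": "d",
  "valid": [],
  "decodeErrors": []
}
""")

type_byte = read_bson_type(sample)
# The deprecated key is only present (and true) for deprecated types.
is_deprecated = sample.get("deprecated", False)
```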
Validity test case keys
Validity test cases include 'canonical' forms of BSON and Extended JSON that are deemed equivalent and may provide additional cases or metadata for additional assertions. For each case, keys include:
- description: human-readable test case label.
- canonical_bson: an (uppercase) big-endian hex representation of a BSON byte string. Be sure to mangle the case as appropriate in any roundtrip tests.
- canonical_extjson: a string containing a Canonical Extended JSON document. Because this is itself embedded as a string inside a JSON document, characters like quote and backslash are escaped.
- relaxed_extjson: (optional) a string containing a Relaxed Extended JSON document. Because this is itself embedded as a string inside a JSON document, characters like quote and backslash are escaped.
- degenerate_bson: (optional) an (uppercase) big-endian hex representation of a BSON byte string that is technically parseable, but not in compliance with the BSON spec. Be sure to mangle the case as appropriate in any roundtrip tests.
- degenerate_extjson: (optional) a string containing an invalid form of Canonical Extended JSON that is still parseable according to type-specific rules. (For example, "1e100" instead of "1E+100".)
- converted_bson: (optional) an (uppercase) big-endian hex representation of a BSON byte string. It may be present for deprecated types. It represents a possible conversion of the deprecated type to a non-deprecated type, e.g. symbol to string.
- converted_extjson: (optional) a string containing a Canonical Extended JSON document. Because this is itself embedded as a string inside a JSON document, characters like quote and backslash are escaped. It may be present for deprecated types and is the Canonical Extended JSON representation of converted_bson.
- lossy (optional) -- boolean; present (and true) iff canonical_bson can't be represented exactly with extended JSON (e.g. NaN with a payload).
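The hex fields can be turned into bytes and back with standard library calls. The following is a minimal sketch, assuming a hand-built sample value (the document {"d": 1.0}); it also shows the case adjustment needed when comparing against the uppercase hex in the corpus.

```python
# Hand-built sample: the document {"d": 1.0} as uppercase hex.
canonical_bson = "10000000016400000000000000F03F00"

# bytes.fromhex accepts either letter case.
raw = bytes.fromhex(canonical_bson)

# The first four bytes are the little-endian int32 total document length.
declared_length = int.from_bytes(raw[:4], "little")

# The byte after the length prefix is the element type (0x01 is "double").
element_type = raw[4]

# bytes.hex() emits lowercase, so uppercase it before comparing with the corpus.
roundtrip_hex = raw.hex().upper()
```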
Decode error case keys
Decode error cases provide an invalid BSON document or field that should result in an error. For each case, keys include:
- description: human-readable test case label.
- bson: an (uppercase) big-endian hex representation of an invalid BSON string that should fail to decode correctly.
Parse error case keys
Parse error cases are type-specific and represent some input that cannot be encoded to the bson_type under test. For
each case, keys include:
- description: human-readable test case label.
- string: a text or numeric representation of an input that can't be parsed to a valid value of the given type.
Extended JSON encoding, escaping and ordering
Because the canonical_extjson and other Extended JSON fields are embedded in a JSON document, all their JSON
metacharacters are escaped. Control characters and non-ASCII codepoints are represented with \uXXXX. Note that this
means that the corpus JSON will appear to have double-escaped characters \\uXXXX. This is by design: it keeps the
Extended JSON fields in printable ASCII without embedded null characters, for maximum portability to different
language JSON or extended JSON decoders.
There are legal differences in JSON representation that may complicate testing for particular codecs. The JSON in the corpus may not resemble the JSON generated by a codec, even though they represent the same data. Some known differences include:
- JSON only requires certain characters to be escaped but allows any character to be escaped.
- JSON object members are unordered, and whitespace (outside of strings) is not significant.
Implementations using these tests MUST normalize JSON comparisons however necessary for effective comparison.
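One possible normalization, shown here only as a sketch, is to compare parsed values instead of JSON text. The corpus fragment below is hypothetical, and same_json is an illustrative helper name. Python dict equality ignores key order, which matches the second bullet above; an implementation that treats member order as significant would need a different comparison.

```python
import json


def same_json(left: str, right: str) -> bool:
    """Compare two JSON texts by parsed value, ignoring escapes, key order and whitespace."""
    return json.loads(left) == json.loads(right)


# Hypothetical corpus fragment as it would appear on disk (note the doubled backslash).
corpus_text = r'{"canonical_extjson": "{\"a\" : \"\\u00e9\"}"}'

# After the first parse, the field holds Extended JSON text with a single \u00e9 escape.
extjson_text = json.loads(corpus_text)["canonical_extjson"]

# A codec might emit the same data unescaped and with different spacing.
generated = '{"a":"\u00e9"}'
matches = same_json(extjson_text, generated)
```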
Language-specific differences
Some programming languages may not be able to represent or transmit all types accurately. In such cases, implementations SHOULD ignore (or modify) any tests which are not supported on that platform.
Testing validity
To test validity of a case in the valid array, we consider up to five possible representations:
- Canonical BSON (denoted herein as "cB") -- fully valid, spec-compliant BSON
- Degenerate BSON (denoted herein as "dB") -- invalid but still parseable BSON (e.g. bad array keys, regex options out of order)
- Canonical Extended JSON (denoted herein as "cEJ") -- A string format based on the JSON standard that emphasizes type preservation at the expense of readability and interoperability.
- Degenerate Extended JSON (denoted herein as "dEJ") -- An invalid form of Canonical Extended JSON that is still parseable. (For example, "1e100" instead of "1E+100".)
- Relaxed Extended JSON (denoted herein as "rEJ") -- A string format based on the JSON standard that emphasizes readability and interoperability at the expense of type preservation.
Not all input types will exist for a given test case.
There are two forms of BSON/Extended JSON codecs: ones that have a language-native "intermediate" representation and ones that do not.
For a codec without an intermediate representation (i.e. one that translates directly from BSON to JSON or back), the following assertions MUST hold (function names are for clarity of illustration only):
- for cB input:
  - bson_to_canonical_extended_json(cB) = cEJ
  - bson_to_relaxed_extended_json(cB) = rEJ (if rEJ exists)
- for cEJ input:
  - json_to_bson(cEJ) = cB (unless lossy)
- for dB input (if it exists):
  - bson_to_canonical_extended_json(dB) = cEJ
  - bson_to_relaxed_extended_json(dB) = rEJ (if rEJ exists)
- for dEJ input (if it exists):
  - json_to_bson(dEJ) = cB (unless lossy)
- for rEJ input (if it exists):
  - bson_to_relaxed_extended_json( json_to_bson(rEJ) ) = rEJ
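The assertions above for a codec without an intermediate representation could be driven by a small harness like the following sketch. The three callables stand in for a driver's own conversion functions; their names, the helper names and the failure labels are illustrative assumptions, not part of any driver API. JSON outputs are compared by parsed value, which is one possible normalization.

```python
import json
from typing import Callable, List


def _same_json(left: str, right: str) -> bool:
    return json.loads(left) == json.loads(right)


def check_direct_case(
    case: dict,
    bson_to_cej: Callable[[bytes], str],
    bson_to_rej: Callable[[bytes], str],
    json_to_bson: Callable[[str], bytes],
) -> List[str]:
    """Return a label for every assertion that failed for one validity case."""
    failures: List[str] = []
    cB = bytes.fromhex(case["canonical_bson"])
    cEJ = case["canonical_extjson"]
    rEJ = case.get("relaxed_extjson")
    lossy = case.get("lossy", False)

    # cB and (if present) dB must both convert to the canonical and relaxed forms.
    bson_inputs = [("cB", cB)]
    if "degenerate_bson" in case:
        bson_inputs.append(("dB", bytes.fromhex(case["degenerate_bson"])))
    for label, data in bson_inputs:
        if not _same_json(bson_to_cej(data), cEJ):
            failures.append(f"{label} -> cEJ")
        if rEJ is not None and not _same_json(bson_to_rej(data), rEJ):
            failures.append(f"{label} -> rEJ")

    # cEJ and (if present) dEJ must both convert to cB, unless the case is lossy.
    json_inputs = [("cEJ", cEJ)]
    if "degenerate_extjson" in case:
        json_inputs.append(("dEJ", case["degenerate_extjson"]))
    if not lossy:
        for label, text in json_inputs:
            if json_to_bson(text) != cB:
                failures.append(f"{label} -> cB")

    # rEJ only has to survive a round trip through BSON.
    if rEJ is not None and not _same_json(bson_to_rej(json_to_bson(rEJ)), rEJ):
        failures.append("rEJ -> BSON -> rEJ")
    return failures
```

Returning a list of labels instead of raising on the first mismatch lets a test runner report every failed assertion for a case at once.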
For a codec that has a language-native representation, we want to test both conversion and round-tripping. For these codecs, the following assertions MUST hold (function names are for clarity of illustration only):
- for cB input:
  - native_to_bson( bson_to_native(cB) ) = cB
  - native_to_canonical_extended_json( bson_to_native(cB) ) = cEJ
  - native_to_relaxed_extended_json( bson_to_native(cB) ) = rEJ (if rEJ exists)
- for cEJ input:
  - native_to_canonical_extended_json( json_to_native(cEJ) ) = cEJ
  - native_to_bson( json_to_native(cEJ) ) = cB (unless lossy)
- for dB input (if it exists):
  - native_to_bson( bson_to_native(dB) ) = cB
- for dEJ input (if it exists):
  - native_to_canonical_extended_json( json_to_native(dEJ) ) = cEJ
  - native_to_bson( json_to_native(dEJ) ) = cB (unless lossy)
- for rEJ input (if it exists):
  - native_to_relaxed_extended_json( json_to_native(rEJ) ) = rEJ
Implementations MAY test assertions in an implementation-specific manner.
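For a codec with a language-native representation, the same idea applies with five conversions instead of three. The sketch below is one implementation-specific way to run the assertions; NativeCodec and its field names are hypothetical placeholders for a driver's real functions, and JSON outputs are again compared by parsed value.

```python
import json
from typing import Any, Callable, List, NamedTuple


class NativeCodec(NamedTuple):
    """Hypothetical bundle of the five conversions named in the list above."""
    bson_to_native: Callable[[bytes], Any]
    native_to_bson: Callable[[Any], bytes]
    json_to_native: Callable[[str], Any]
    native_to_cej: Callable[[Any], str]
    native_to_rej: Callable[[Any], str]


def _same_json(left: str, right: str) -> bool:
    return json.loads(left) == json.loads(right)


def check_native_case(case: dict, codec: NativeCodec) -> List[str]:
    """Return a label for every assertion that failed for one validity case."""
    failures: List[str] = []
    cB = bytes.fromhex(case["canonical_bson"])
    cEJ = case["canonical_extjson"]
    rEJ = case.get("relaxed_extjson")
    lossy = case.get("lossy", False)

    def expect(ok: bool, label: str) -> None:
        if not ok:
            failures.append(label)

    # cB input: round trip to BSON and convert to both JSON forms.
    native = codec.bson_to_native(cB)
    expect(codec.native_to_bson(native) == cB, "cB -> native -> cB")
    expect(_same_json(codec.native_to_cej(native), cEJ), "cB -> native -> cEJ")
    if rEJ is not None:
        expect(_same_json(codec.native_to_rej(native), rEJ), "cB -> native -> rEJ")

    # cEJ and dEJ input: both must yield cEJ, and cB unless the case is lossy.
    json_inputs = [("cEJ", cEJ)]
    if "degenerate_extjson" in case:
        json_inputs.append(("dEJ", case["degenerate_extjson"]))
    for label, text in json_inputs:
        native = codec.json_to_native(text)
        expect(_same_json(codec.native_to_cej(native), cEJ), f"{label} -> native -> cEJ")
        if not lossy:
            expect(codec.native_to_bson(native) == cB, f"{label} -> native -> cB")

    # dB input: re-encoding must produce the canonical bytes.
    if "degenerate_bson" in case:
        dB = bytes.fromhex(case["degenerate_bson"])
        expect(codec.native_to_bson(codec.bson_to_native(dB)) == cB, "dB -> native -> cB")

    # rEJ input: only a relaxed round trip is required.
    if rEJ is not None:
        native = codec.json_to_native(rEJ)
        expect(_same_json(codec.native_to_rej(native), rEJ), "rEJ -> native -> rEJ")
    return failures
```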
Testing decode errors
The decodeErrors cases represent BSON documents that are sufficiently incorrect that they can't be parsed even with
liberal interpretation of the BSON schema (e.g. reading arrays with invalid keys is possible, even though technically
invalid, so they are not decodeErrors).
Drivers SHOULD test that each case results in a decoding error. Implementations MAY test assertions in an implementation-specific manner.
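A decode-error loop might look like the sketch below. The decode_envelope function is a toy stand-in for a real decoder: it checks only the outer length prefix and the trailing null byte, which is far less than a driver would check. The sample cases are hypothetical, and catching the broad Exception type is a simplification; a driver would catch its own decoding error type.

```python
from typing import Callable, List


def decode_envelope(data: bytes) -> bytes:
    """Toy decoder: validate the document envelope and return the element bytes."""
    if len(data) < 5:
        raise ValueError("document shorter than the 5-byte minimum")
    declared = int.from_bytes(data[:4], "little", signed=True)
    if declared != len(data):
        raise ValueError(f"length prefix {declared} does not match {len(data)} bytes")
    if data[-1] != 0:
        raise ValueError("document is not terminated by a null byte")
    return data[4:-1]


def unexpectedly_decoded(decode_errors: List[dict], decode: Callable[[bytes], object]) -> List[str]:
    """Return the descriptions of decodeErrors cases that did NOT raise."""
    passed = []
    for case in decode_errors:
        data = bytes.fromhex(case["bson"])
        try:
            decode(data)
        except Exception:
            continue  # the expected outcome for a decodeErrors case
        passed.append(case["description"])
    return passed


# Hypothetical cases in the shape described under "Decode error case keys".
samples = [
    {"description": "length prefix too long", "bson": "0600000000"},
    {"description": "missing terminator", "bson": "0500000001"},
]
leftovers = unexpectedly_decoded(samples, decode_envelope)
```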
Testing parsing errors
The interpretation of parseErrors is type-specific. The structure of test cases within parseErrors is described in
Parse error case keys.
Drivers SHOULD test that each case results in a parsing error (e.g. parsing Extended JSON, constructing a language type). Implementations MAY test assertions in an implementation-specific manner.
Top-level Document (type 0x00)
For type "0x00" (i.e. top-level documents), the string field contains input for an Extended JSON parser. Drivers MUST
parse the Extended JSON input using an Extended JSON parser and verify that doing so yields an error. Drivers that parse
Extended JSON into language types instead of directly to BSON MAY need to additionally convert the resulting language
type(s) to BSON in order to observe an error.
Drivers SHOULD also parse the Extended JSON input using a regular JSON parser (not an Extended JSON one) and verify the
input is parsed successfully. This serves to verify that the parseErrors test cases are testing Extended JSON-specific
error conditions and that they do not have, for example, unintended syntax errors.
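The plain-JSON check can be sketched with a standard JSON parser. The example input is hypothetical. Note that Python's json module accepts NaN and Infinity literals by default, which are not standard JSON, so this sketch rejects them explicitly through parse_constant.

```python
import json


def _reject_constant(name: str):
    # Called by json.loads for NaN, Infinity and -Infinity, which plain JSON does not allow.
    raise ValueError(f"non-standard JSON constant: {name}")


def is_plain_json(text: str) -> bool:
    """True if text parses with a regular (non-Extended) JSON parser."""
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# Hypothetical parseErrors input: well-formed JSON that an Extended JSON parser should reject.
candidate = '{"a": {"$numberLong": "not-a-number"}}'
syntactically_valid = is_plain_json(candidate)
```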
Note: due to the generic nature of these tests, they may also be used to test Extended JSON parsing errors for various BSON types appearing within a document.
Binary (type 0x05)
For type "0x05" (i.e. binary), the rules for handling parseErrors are the same as those for
Top-level Document (type 0x00).
Decimal128 (type 0x13)
For type "0x13" (i.e. Decimal128), the string field contains input for a Decimal128 parser that converts string input
to a binary Decimal128 value (e.g. Decimal128 constructor). Drivers MUST assert that these strings cannot be
successfully converted to a binary Decimal128 value and that parsing the string produces an error.
Deprecated types
The corpus files for deprecated types are provided for informational purposes. Implementations MAY ignore or modify them
to match legacy treatment of deprecated types. The converted_bson and converted_extjson fields MAY be used to test
conversion to a standard type or MAY be ignored.
Prose Tests
The following tests have not yet been automated, but MUST still be tested.
1. Prohibit null bytes in null-terminated strings when encoding BSON
The BSON spec uses null-terminated strings to represent document field names and regex components (i.e. pattern and flags/options). Drivers MUST assert that null bytes are prohibited in the following contexts when encoding BSON (i.e. creating raw BSON bytes or constructing BSON-specific type classes):
- Field name within a root document
- Field name within a sub-document
- Pattern for a regular expression
- Flags/options for a regular expression
Depending on how drivers implement BSON encoding, they MAY expect an error when constructing a type class (e.g. BSON Document or Regex class) or when encoding a language representation to BSON (e.g. converting a dictionary, which might allow null bytes in its keys, to raw BSON bytes).
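As an illustration of the encoding-time check, the sketch below rejects null bytes when writing a null-terminated string. The function names are hypothetical and this is not a full BSON encoder. Sorting the regex options follows the changelog note that encoders must emit alphabetized flags.

```python
def encode_cstring(text: str) -> bytes:
    """Encode text as a BSON cstring (UTF-8 bytes followed by a null terminator)."""
    data = text.encode("utf-8")
    if b"\x00" in data:
        raise ValueError("null byte is not allowed in a null-terminated string")
    return data + b"\x00"


def encode_regex_payload(pattern: str, options: str) -> bytes:
    """Encode the two cstrings that make up a regular expression value."""
    # Both components are checked for embedded nulls by encode_cstring.
    return encode_cstring(pattern) + encode_cstring("".join(sorted(options)))


field_name = encode_cstring("a")
regex_payload = encode_regex_payload("abc", "xi")
```

The same encode_cstring check covers field names in root documents and sub-documents, since both are written as null-terminated strings.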
Implementation Notes
A tool for visualizing BSON
The test directory includes a Perl script, bsonview, which will decompose and highlight elements of a BSON document. It
may be used like this:
echo "0900000010610005000000" | perl bsonview -x
Notes for certain types
Array
Arrays can have degenerate BSON if the array indexes are not set as "0", "1", etc.
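A decoder that wants to tell canonical arrays from degenerate ones could compare the decoded keys with the expected index strings, as in this small sketch (the function name is illustrative):

```python
from typing import Sequence


def has_canonical_array_keys(keys: Sequence[str]) -> bool:
    """True if keys are exactly "0", "1", ... in order, as canonical BSON arrays require."""
    return list(keys) == [str(index) for index in range(len(keys))]


canonical = has_canonical_array_keys(["0", "1", "2"])
degenerate = has_canonical_array_keys(["0", "2"])
```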
Boolean
The only valid values are 0 and 1. Other non-zero numbers MUST be interpreted as errors rather than "true" values.
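In code, this rule means a decoder should not fall back on truthiness. A minimal sketch, with an illustrative function name:

```python
def decode_boolean(value_byte: int) -> bool:
    """Decode the single payload byte of a BSON boolean, rejecting anything but 0 or 1."""
    if value_byte == 0:
        return False
    if value_byte == 1:
        return True
    raise ValueError(f"invalid boolean byte: {value_byte:#04x}")
```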
Binary
The Base64 encoded text in the extended JSON representation MUST be padded.
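For reference, Python's standard encoder always emits padding, and a padded-only check can be written as follows. This is a sketch; is_padded_base64 is an illustrative name.

```python
import base64
import binascii


def is_padded_base64(text: str) -> bool:
    """True if text is valid Base64 whose length is a multiple of four."""
    if len(text) % 4 != 0:
        return False
    try:
        base64.b64decode(text, validate=True)
    except binascii.Error:
        return False
    return True


# Two input bytes encode to three Base64 characters plus one "=" of padding.
encoded = base64.b64encode(b"\xff\xff").decode("ascii")
```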
Code
There are multiple ways to encode Unicode characters as a JSON document. Individual implementers may need to normalize provided and generated extended JSON before comparison.
Decimal
NaN with payload can't be represented in extended JSON, so such conversions are lossy.
Double
There is not yet a way to represent Inf, -Inf or NaN in extended JSON. Even if a $numberDouble is added, it is
unlikely to support special values with payloads, so such doubles would be lossy when converted to extended JSON.
String representation of doubles is fairly unportable so it's hard to provide a single string that all platforms/languages will generate. Testers may need to normalize/modify the test cases.
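One way to normalize, offered only as a sketch, is to compare the parsed numeric values of two double strings instead of the strings themselves. The function name is illustrative. This treats any two NaNs as equal and distinguishes 0.0 from -0.0, which may or may not suit a given test suite.

```python
import math


def same_double_text(left: str, right: str) -> bool:
    """Compare two textual doubles by value, so "1e100" and "1E+100" are equal."""
    x, y = float(left), float(right)
    if math.isnan(x) or math.isnan(y):
        return math.isnan(x) and math.isnan(y)
    # copysign separates 0.0 from -0.0, which compare equal with ==.
    return x == y and math.copysign(1.0, x) == math.copysign(1.0, y)
```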
String
There are multiple ways to encode Unicode characters as a JSON document. Individual implementers may need to normalize provided and generated extended JSON before comparison.
DBPointer
This type is deprecated. The provided converted form (converted_bson) represents DBPointer values as DBRef documents,
but such conversion is outside the scope of this spec.
Symbol
This type is deprecated. The provided converted form converts these to strings, but such conversion is outside the scope of this spec.
Undefined
This type is deprecated. The provided converted form converts these to Null, but such conversion is outside the scope of this spec.
Reference Implementation
The Java, C# and Perl drivers.
Design Rationale
Use of extjson
Testing conversion requires an "input" and an "output". With a BSON string as both input and output, we can only test that it roundtrips correctly -- we can't test that the decoded value visible to the language is correct.
For example, a pathological encoder/decoder could invert Boolean true and false during decoding and encoding. The BSON would roundtrip but the program would see the wrong values.
Therefore, we need a separate, semantic description of the contents of a BSON string in a machine readable format. Fortunately, we already have extjson as a means of doing so. The extended JSON strings contained within the tests adhere to the Extended JSON Specification.
Repetition across cases
Some validity cases may result in duplicate assertions across cases, particularly if the degenerate_bson field is
different in different cases, but the canonical_bson field is the same. This is by design so that each case stands
alone and can be confirmed to be internally consistent via the assertions. This makes for easier and safer test case
development.
Changelog
- 2024-01-22: Migrated from reStructuredText to Markdown.
- 2023-06-14: Add decimal128 Extended JSON parse tests for clamped zeros with very large exponents.
- 2022-10-05: Remove spec front matter and reformat changelog.
- 2021-09-09: Clarify error expectation rules for parseErrors.
- 2021-09-02: Add spec and prose tests for prohibiting null bytes in null-terminated strings within document field names and regular expressions. Clarify type-specific rules for parseErrors.
- 2017-05-26: Revised to be consistent with Extended JSON spec 2.0: valid case fields have changed, as have the test assertions.
- 2017-01-23: Added multi-type.json to test encoding and decoding all BSON types within the same document. Amended all extended JSON strings to adhere to the Extended JSON Specification. Modified the "Use of extjson" section of this specification to note that canonical extended JSON is now used.
- 2016-11-14: Removed "invalid flags" BSON Regexp case.
- 2016-10-25: Added a "non-alphabetized flags" case to the BSON Regexp corpus file; decoders must be able to read non-alphabetized flags, but encoders must emit alphabetized flags. Added an "invalid flags" case to the BSON Regexp corpus file.